Friday, December 16, 2005

Pocket Query World Scan

Well, I need to get this started. I've just about finished my third scan of all the caches in the world. I started the latest scan on September 10th, and it will finish on December 23rd when my queued pocket queries complete, which is pretty good. There are currently 220,836 active caches according to the Getting Started page. That means that if you ran 5 pocket queries every day, each near the maximum of 500 caches, it would take 442 queries over 89 days. I actually took 523 queries and 105 days. Of course there is no way to get complete coverage with queries that are pegged at the 500-cache limit, so while my 523 queries could have returned 261,500 caches, they didn't. Also, there are almost 600 new caches a day on average, so each successive world scan takes longer. I also added queries to pick up new caches, so that's another reason my scan took longer than the minimum.
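The minimum-effort arithmetic above is easy to check. This little sketch just redoes the division with the figures from the Getting Started page; the numbers are from this post, not from any live feed:

```python
import math

# Figures quoted above from the Getting Started page at the time of writing.
active_caches = 220_836
caches_per_query = 500   # pocket query cache limit
queries_per_day = 5      # daily pocket query allowance

min_queries = math.ceil(active_caches / caches_per_query)
min_days = math.ceil(min_queries / queries_per_day)

print(min_queries, min_days)  # prints: 442 89
```

My actual 523 queries and 105 days run about 18% over that floor, which is the cost of overlap, growth, and catch-up queries.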

That also doesn't count caches that have been archived since the last time I retrieved them. There's no way to get archived caches in a pocket query, so I've manually downloaded an average of nearly 300 single-cache gpx files a day. I currently have 24,381 archived caches in my GSAK database. I use GSAK's 'Last GPX' field to find caches that weren't picked up by a pocket query when they should have been. Nearly all of those are archived caches, and I can use the GSAK split screen to download the single gpx files for the archived caches as I identify them. I do that rather than just mark them as archived because it picks up the archive log entries that explain what happened to the cache.
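The 'Last GPX' trick amounts to a simple filter: any cache whose last pocket-query refresh predates the current scan's start is a candidate for a manual download. Here is a minimal sketch of that idea in Python, using a made-up stand-in for the GSAK records (the cache codes and dates are illustrative, not an export format):

```python
from datetime import date

# Hypothetical stand-in for GSAK records: each carries the 'Last GPX'
# date that GSAK updates whenever a pocket query touches the cache.
caches = [
    {"code": "GC57",   "last_gpx": date(2005, 12, 10)},
    {"code": "GC4FAC", "last_gpx": date(2005, 9, 1)},   # missed by the scan
]

scan_start = date(2005, 9, 10)

# Any cache the current scan should have refreshed but didn't is a
# candidate for a manual single-cache gpx download (likely archived).
stale = [c["code"] for c in caches if c["last_gpx"] < scan_start]
print(stale)  # prints: ['GC4FAC']
```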

My GSAK database is huge, and Clyde's recent enhancements to GSAK have been absolutely necessary to manage it. There's no way to download all the logs for the caches in one's database, since a gpx entry is limited to 5 logs. More than 5 geocachers can find and log a cache in a single day, which is the finest granularity available. The Getting Started page gives the number of logs in the past week, with 81,199 most recently listed. I currently have over 3 million logs in my database.

I refine my techniques for each world scan, and I'm now completing one in less than half the time the earlier scans took. The first world scan attempted to overlay the earth with 500-mile-radius circles. That's hard to do, and I couldn't come up with a general solid-geometry method to determine where the centers should be. It ended up being ad hoc and unsatisfactory. I had to use date ranges to reduce the number of caches in each query to below the 500 limit. The biggest problem with that method, however, is that the circles had to overlap to avoid dead spots between them. The overlap meant that many caches were contained in more than one query, which greatly increased the number of queries needed to cover the world. To finish a world scan in the minimum number of queries, no cache can appear in more than one query.

My second scan of the world created queries by country or by state. Again I needed to partition the areas into date ranges to stay under the 500-cache limit per query. I discovered that I could bundle countries or states with few caches into a single query by multiple selection in the pull-down list: hold the shift key on a second selection to include all the items between it and the first, and use the control key to toggle the selection of an item. When I included too many areas and had to partition them into a lot of date ranges, I found that I frequently couldn't get even close to the 500 limit, because the number of caches added in a single day was a significant fraction of it. I also overlooked the fact that Australia, Belgium, Canada, and New Zealand appear both in the state list and as individual countries, so I got those caches twice in the scan. I did notice that I didn't want to include the 'none' state when I was also using the country list; that would have meant over 40,000 duplicated caches, including all of the other countries twice. The one limitation I could see is that there could be caches outside all of the states and countries. I've decided to assume that there can't be too many of those and to ignore them.
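The date-range partitioning I keep describing can be sketched as a greedy walk over placement dates: keep extending the range until one more day would push the running total past the 500-cache limit, then close the range and start a new one. The daily counts below are made up for illustration:

```python
# Greedy date-range partitioning under the 500-cache pocket query limit.
def partition(daily_counts, limit=500):
    """daily_counts: list of (date_string, caches_placed_that_day) in order."""
    ranges, start, total = [], 0, 0
    for i, (_, n) in enumerate(daily_counts):
        if total + n > limit:
            # Close the current range just before it would overflow.
            ranges.append((daily_counts[start][0], daily_counts[i - 1][0], total))
            start, total = i, 0
        total += n
    ranges.append((daily_counts[start][0], daily_counts[-1][0], total))
    return ranges

# Illustrative: ten days with 120 placements each.
days = [(f"2005-09-{d:02}", 120) for d in range(1, 11)]
print(partition(days))
```

With 120 caches placed per day, each range tops out at 480 of the 500 allowed, which is exactly the effect I ran into: when the daily count is a sizable fraction of the limit, you can't get close to 500.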

By the time I was finishing my second world scan, I was starting to understand the principles I needed for an optimum scan. Before my third scan of the world, I used my GSAK database to split the areas into regions with roughly equal numbers of caches. That way I could even out the number of caches across regions. I experimented with various region sizes and determined that a good fit would be regions of about 10,000 caches. That meant I would need 25 regions for the world and that each region would take about 20 queries and 5 days. The query for the latest dates in each region then includes roughly 30 days, and the maximum number of days that gave fewer than 500 caches averaged reasonably close to the 500-cache limit. California itself is over twice that size, and Germany is half again the average. I also couldn't include states and countries in the same query, so there ended up being some compromises.
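The region-splitting step can be sketched as a greedy packing of areas into roughly 10,000-cache bins, with oversized areas like California and Germany getting regions of their own. The area counts below are illustrative stand-ins, not my actual GSAK figures:

```python
# Greedily pack areas into regions of about `target` caches each.
def pack_regions(area_counts, target=10_000):
    regions, current, size = [], [], 0
    for area, count in area_counts:
        if count >= target:
            regions.append([area])     # oversized areas get their own region
            continue
        if size + count > target and current:
            regions.append(current)    # close the current region
            current, size = [], 0
        current.append(area)
        size += count
    if current:
        regions.append(current)
    return regions

# Illustrative counts only; California really is oversized, the rest are guesses.
areas = [("Alabama", 2500), ("Alaska", 900), ("Alberta", 2100),
         ("Arizona", 4200), ("California", 21000), ("Colorado", 4800)]
print(pack_regions(areas))
```

A greedy pass over an alphabetical list like this also explains why my real regions below run alphabetically from WS3a to WS3z.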

Here are regions that I used for the third world scan:

  • WS3aStateList Alabama, Alaska, Alberta, Antwerpen, Arizona

  • WS3bStateList Arkansas, Australian Capital Territory, Brabant wallon, British Columbia

  • WS3cStateList California

  • WS3dStateList Colorado, Connecticut, Delaware, District of Columbia, Florida

  • WS3eStateList Georgia, Hainaut, Hawaii, Idaho, Illinois

  • WS3fStateList Indiana, Iowa, Kansas, Kentucky

  • WS3gStateList Liege, Limburg, Louisiana, Luxembourg, Maine, Manitoba, Maryland, Massachusetts

  • WS3hStateList Michigan, Minnesota, Mississippi

  • WS3iStateList Missouri, Montana, Namur, Nebraska, Nevada, New Brunswick, New Hampshire

  • WS3jStateList New Jersey, New Mexico, New South Wales, New York

  • WS3kStateList Newfoundland, North Carolina, North Dakota, North Island, Northern Territory, Northwest Territories, Nova Scotia, Nunavut

  • WS3lStateList Ohio, Oklahoma

  • WS3mStateList Ontario, Oost-Vlaanderen, Oregon

  • WS3nStateList Pennsylvania, Prince Edward Island, Quebec

  • WS3oStateList Queensland, Rhode Island, Saskatchewan, South Australia, South Carolina, South Dakota, South Island, Tasmania, Tennessee

  • WS3pStateList Texas

  • WS3qStateList Utah, Vermont, Victoria, Virginia, Vlaams-Brabant

  • WS3rStateList Washington, West Virginia, West-Vlaanderen

  • WS3sStateList Western Australia, Wisconsin, Wyoming, Yukon Territory

  • WS3tWorldList Afghanistan to Finland; except Australia, Belgium, Canada

  • WS3uWorldList France to Norway; except Germany, New Zealand

  • WS3vWorldList Germany

  • WS3wWorldList Oman to Sweden

  • WS3xWorldList Switzerland to Zimbabwe; except United Kingdom

  • WS3yWorldList United Kingdom

  • WS3zWorldList (spare)

As an example, here are my first 15 queries:

09/10/2005: Saturday
WS3aStateList Alabama, Alaska, Alberta, Antwerpen, Arizona
001 WS3aStateList 01-01-2000 thru 04-20-2002 GC57 Total Records: 492
002 WS3aStateList 04-21-2002 thru 10-15-2002 GC4FAC Total Records: 499
003 WS3aStateList 10-16-2002 thru 03-16-2003 GC9CBD Total Records: 498
004 WS3aStateList 03-17-2003 thru 07-18-2003 GCE8E1 Total Records: 495
005 WS3aStateList 07-19-2003 thru 11-21-2003 GCGFYH Total Records: 492

09/11/2005: Sunday
006 WS3aStateList 11-22-2003 thru 03-04-2004 GCH8M7 Total Records: 497
007 WS3aStateList 03-05-2004 thru 05-22-2004 GCHVE0 Total Records: 491
008 WS3aStateList 05-23-2004 thru 08-04-2004 GCJH0W Total Records: 494
009 WS3aStateList 08-05-2004 thru 10-24-2004 GCK6FT Total Records: 497
010 WS3aStateList 10-25-2004 thru 01-14-2005 GCKY7Z Total Records: 494

09/12/2005: Monday
011 WS3aStateList 01-15-2005 thru 03-04-2005 GCMJWG Total Records: 499
012 WS3aStateList 03-04-2005 thru 04-12-2005 GCN00A Total Records: 498
013 WS3aStateList 04-13-2005 thru 05-25-2005 GCNFCN Total Records: 498
014 WS3aStateList 05-26-2005 thru 07-05-2005 GCP4M7 Total Records: 490
015 WS3aStateList 07-06-2005 thru 08-12-2005 GCPKJR Total Records: 492

In addition to the query number, my region name, and the date range, I record the first cache id returned by the query and the number of records at the time I create the query. Notice the acceleration in the growth rate over time: later queries fill the 500-cache limit with fewer and fewer days in the date range.
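That acceleration can be put in numbers straight from the query log above. Comparing query 001 with query 015 (the date spans and record counts are taken from the table; the rate computation is mine):

```python
from datetime import date

# (start, end, records) for WS3aStateList queries 001 and 015, from the log above.
spans = [
    (date(2000, 1, 1), date(2002, 4, 20), 492),   # query 001
    (date(2005, 7, 6), date(2005, 8, 12), 492),   # query 015
]

rates = []
for start, end, records in spans:
    days = (end - start).days + 1   # inclusive date range
    rates.append(records / days)
    print(f"{records} caches over {days} days = {records / days:.2f} per day")
```

The same ~492-cache query that once spanned 841 days of placements now spans only 38, a placement rate more than 20 times higher in this region.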

I'll continue this discussion in subsequent articles. I'm hoping to interest others in commenting on my blog entries and in building a statistical model describing the geocaching growth phenomenon.