Thursday, June 15, 2006

World Scan 5 Half Way Point (roughly)

Well, after 227 additional pocket queries, I have 322,811 caches in the database with 4,697,078 logs. Of those caches, 274,194 are currently active; more precisely, that is the number I haven't yet proven to be archived. The new scheme is working well: I run the oldest-date queries first, then look for caches that have a placed date before the start date of the last-run pocket query, a last gpx date before the beginning of the world scan, and no archived flag. GSAK does run slowly with such large databases, so I need to be careful about what I run. I'm finding about 90 newly discovered archived caches a day. By the time I finish the fifth world scan sometime in August, I will have identified all of the caches archived before May.

About once a week, I run queries to pick up all of the new caches. I can usually stay within about a week of the current date. For caches placed within the prior week, I can't estimate how many will have been approved by the time the query runs, so I can't reliably keep the query count below 500. If any query reaches 500, I have no way of telling how many caches I missed, or which ones.
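
To make the inference concrete, here is a minimal sketch of that filter in Python, with hypothetical field names standing in for the corresponding GSAK columns; in practice GSAK applies this through its filter dialog rather than code:

    from datetime import date

    # A cache is a candidate for manual archived-checking when it was placed
    # before the start date of the last pocket query run over its region, its
    # last gpx refresh predates the current world scan, and it is not already
    # marked archived.
    def archived_candidates(caches, last_pq_start, scan_start):
        return [c for c in caches
                if c["placed"] < last_pq_start   # the query should have covered it
                and c["last_gpx"] < scan_start   # but nothing has refreshed it
                and not c["archived"]]           # and it isn't already known archived

    caches = [
        {"code": "GC57",   "placed": date(2000, 5, 3),  "last_gpx": date(2006, 5, 1), "archived": False},
        {"code": "GCH8M7", "placed": date(2003, 12, 1), "last_gpx": date(2006, 2, 9), "archived": False},
    ]
    # Only GCH8M7 qualifies here: its record went stale during the current scan.
    print(archived_candidates(caches, date(2006, 4, 15), date(2006, 4, 1)))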

Monday, April 10, 2006

Database Maintenance Strategy for Fifth World Scan

I completed my fourth scan of the world and then ran a more current update on the caches between home and northern California and Nevada for my upcoming trip. In preparing to start my fifth scan of the world, I thought I'd try to improve my pocket query strategy.

In the fourth scan, I emphasized getting information about archived caches which I had largely ignored in the third scan. To do that I had to have queries covering all dates from January 1, 2000 to as close to the current date as possible for each of the 25 partitions of the world listed in the "Pocket Query World Scan" entry. For the current size of the database that took just over 3 months. I have over 250,000 caches with over 4 million logs.

I wanted to catch up on identifying the archived caches and manually download the final gpx file for each of them. However, that meant I didn't keep current on new caches: I obtained new caches only in the placed date ranges at the end of each partition. I was three months behind on the oldest partition and a month and a half behind on the median partition. At any point in time, the only partition I was even nearly current on was the one I had just finished.

There are five classes of caches that need to be considered: new caches; caches that have been received in a pocket query and have not yet been archived; caches that have been archived since the last time they were received in a pocket query; older archived caches; and caches that were archived before they could be received in any pocket query. For this discussion, I choose to ignore caches that were once active but have returned to the not-yet-approved state. There are a few of these that appear in my database as not archived yet have a very stale 'Last GPX Date'; I get the "Sorry, you cannot view this cache listing" error when I try to access them for an individual gpx file update.

In order to make sure that my database has a record of as many caches as possible, I need to run pocket queries that capture them as they are approved. New caches that have not yet been approved are not available to pocket queries; they only become visible once approved. Usually I have to wait about a week after caches are placed before I can count on most of them being approved. At the current time, roughly 350 new caches are placed every day and reliably show up in pocket queries within a week. If I ran a pocket query every day for all of the caches placed one week earlier, I could ask for just one placed date and stay under the limit; a query that matches more than 500 caches returns only 500 of them, effectively at random. If I ran it every other day and asked for two placed dates, I would expect about 700 new caches and would miss about 200 of them.
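
The arithmetic behind that trade-off is simple enough to sketch; the 350-per-day rate is my current observation and the 500 cap is the pocket query limit:

    DAILY_RATE = 350   # new caches approved per day (my current estimate)
    PQ_LIMIT = 500     # maximum caches one pocket query returns

    for days_per_query in (1, 2, 3):
        expected = DAILY_RATE * days_per_query
        missed = max(0, expected - PQ_LIMIT)
        print(days_per_query, "placed date(s): expect", expected, "- miss about", missed)
    # 1 placed date(s): expect 350 - miss about 0
    # 2 placed date(s): expect 700 - miss about 200
    # 3 placed date(s): expect 1050 - miss about 550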

By partitioning the world into three equally sized zones, I would expect just over 100 new caches in each zone each day. By experimentation, I determined that I could select multiple states/provinces to cover all the caches in three roughly equal groups. The 'None' value covers caches in countries that are not included elsewhere in the state list. The ranges are states beginning with A-M, N, and O-Z; the N range also includes the 'None' value.

That gives me the flexibility to ask for 3 or 4 days' worth of caches in a zone and have a query nearer the 500 maximum. I create the query, run a preview to test how many caches will be returned, and mark it to run once and then be deleted. I select the states/provinces for the zone and adjust the range of 'placed during' days to get as close to 500 caches as possible. If more than 500 caches were ever placed and approved in one zone on a single day, I'd have to temporarily split the zone. I'll post a comment on this article after I've started, to let you know how well the strategy is working. This strategy uses about one of my five daily queries for the first class of caches, leaving four for other uses.
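
Here is a rough sketch of that sizing logic, assuming the zone boundaries and the per-zone rate described above; the function names are mine, not GSAK's or Groundspeak's:

    PQ_LIMIT = 500
    ZONE_RATE = 120  # rough new caches per zone per day (just over 100)

    def zone_for(state):
        """Assign a state/province to one of the three alphabetical zones."""
        first = state[0].upper()
        if "A" <= first <= "M":
            return "A-M"
        if first == "N":       # the special 'None' value lands here too
            return "N (plus 'None')"
        return "O-Z"

    def days_per_query(rate=ZONE_RATE, limit=PQ_LIMIT):
        """How many placed dates fit in one query before the cap truncates it."""
        return limit // rate

    print(zone_for("California"), "|", zone_for("None"), "|", zone_for("Queensland"))
    print(days_per_query())  # 4 days of one zone stays near, but under, 500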

The second type of pocket query will be like those I used on the fourth world scan, using its 25 partitions, only at a slower rate to leave pocket queries for other uses. This lets me update caches that have not been archived since the last time they were received in a pocket query, and infer which caches have been archived so that I can download the final individual gpx files for them. Archived caches remain in my database with a 'Last GPX Date' older than that of similar active caches. I don't need to worry about updating caches that I know have been archived, because if any were ever reactivated I would receive them the next time that cache range is scanned. About a hundred caches are newly archived every day. The scanning process will also catch any new caches that slip past the first strategy. I can run these queries at a slower pace, as long as the scan still covers caches faster than new ones are added. I plan to use some of the extra pocket queries to prepare for actual cache outings.
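
The pace constraint is worth checking with rough numbers, all of them my estimates from above:

    SPARE_QUERIES_PER_DAY = 4   # five daily queries, minus one for new caches
    CACHES_PER_QUERY = 500      # each scan query runs near the cap
    NEW_CACHES_PER_DAY = 350    # current approval rate
    DATABASE_SIZE = 250_000     # rough number of unarchived caches to keep refreshed

    coverage_per_day = SPARE_QUERIES_PER_DAY * CACHES_PER_QUERY   # 2,000 refreshed/day
    net_progress = coverage_per_day - NEW_CACHES_PER_DAY          # 1,650/day of real gain
    print(round(DATABASE_SIZE / net_progress))                    # a full cycle in ~152 days

As long as that net progress stays positive, the scan eventually laps the database; the only cost of a slower pace is a longer cycle.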

Caches that were archived before they could be received in any pocket query present a separate problem. They will never appear in a pocket query. To get information about these caches I do a manual scan of all the possible waypoint codes that aren't already in my database. This is a laborious process and I've only done it for a few of the earliest placed caches.
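
For the earliest caches the code space is at least enumerable. Here is a sketch, assuming the oldest waypoint codes are simply 'GC' followed by an uppercase hexadecimal number; the 'known' set stands in for the waypoints already in my GSAK database:

    def early_codes(limit=0x10000):
        # Candidate early waypoint codes, assuming 'GC' + hex numbering.
        for n in range(1, limit):
            yield "GC%X" % n

    known = {"GC57", "GC4FAC"}  # stand-in for my GSAK waypoint list

    # Codes never seen in any pocket query: candidates for a manual lookup.
    # Some will be archived caches; some were never assigned and return errors.
    missing = [code for code in early_codes(0x60) if code not in known]
    print(missing[:8])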

Until next time, this is Nudecacher.

Friday, March 10, 2006

"Sorry, you cannot view this cache listing until it has been published."

This is just a short note to point out an anomaly that I've noticed for some time. Groundspeak returns a few caches in pocket queries with reasonable coordinates and descriptions that later cannot be viewed on the website; they return the "Sorry, you cannot view ..." message. So it appears that either the pocket query generator incorrectly includes them in the first place, or it is possible to change the status of a cache from published back to not-yet-reviewed after one receives it in a pocket query.

Here are the examples from my GSAK database that I received in pocket queries between June and October 2005:

  • GCD456, 51.919883, 4.488217, Manhattan Aan De Maas by Lonny The Blue Randonneur (2/1)

  • GCGBG8, 52.755033, 7.024767, 't Aole Compas (heulemaol verneit) by Team Dauwtrappers (1/1)

  • GCH2AD, 29.766617, -95.346850, DRAWBRIDGES of HOUSTON by PARKERPLUS (2.5/2)

  • GCM8QB, 51.970867, 4.123383, Op weg naar de grote zee (3) by Stekelsteef (1/1)

  • GCMHW7, 34.961300, -81.017017, Cash-in-the-park by Parrolet and son (1.5/1)

  • GCNRRD, 47.352567, -117.804917, Pope George Ringo 1 by Whiskeyriver (1.5/1)

  • GCP75N, 36.197150, -78.857317, Creatures of Hill Forest by paraclete (3/4)

  • GCPKEM, 41.156567, -112.139133, Hooper Cache by low land clan 2 (2/1.5)

  • GCPM6R, 39.740217, -105.140567, Blue Star Memorial by sanru5 (1.5/1)

  • GCPV42, 40.845600, -112.182217, Wreck of the Hesperis by Streight Arrow (1/4.5)

  • GCPVR4, 38.431983, -122.700133, Under the Fair Catalpas by Moozer (2.5/1)

  • GCPXDY, 43.377400, 2.952267, La Capitelle d'Andre by Les Patates Glandoises (1.5/2)

  • GCQ3D9, 39.625917, -75.709667, The Redeemer by NeoGeoATHF (2/2)

  • GCQ599, 39.134233, -106.554017, John's Bright Idea by black_jack (2/4)

  • GCQ939, 44.308417, -79.312617, Jif # 1 by DaisyField & Petals (1/1.5)


This anomaly shows up in my world scan procedures when I'm manually retrieving the archived caches: it leaves caches in my GSAK database that have not been archived but carry old 'Last GPX' dates. It's not a problem for me, but it reveals either a bug or a feature in the geocaching.com system. I'm planning to study the archived caches to model the expected life of a cache.

Friday, January 13, 2006

Effect of archived caches on World Geocaching Model

Ultimately I want to develop a formula for the number of active caches on any date as one part of my World Geocaching Model. It will be a statistical estimate tracking the growth of geocaching; it will show when geocaching behavior changes and let us determine the factors affecting it, such as season or day of week. To a first approximation, the number of active caches on a day is the number on the prior day plus the new caches approved since then. That's not quite right, though: some caches are archived, making them inactive, and a few previously archived caches may be resurrected.
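
Written out, the bookkeeping is a simple daily recurrence; the numbers below are illustrative stand-ins for the quantities the model will have to estimate:

    def active_caches(active_yesterday, approved, archived, unarchived):
        # active(d) = active(d-1) + approved(d) - archived(d) + unarchived(d)
        # approved: caches newly approved that day; archived: caches retired;
        # unarchived: previously archived caches resurrected (usually near zero).
        return active_yesterday + approved - archived + unarchived

    print(active_caches(200_000, approved=350, archived=100, unarchived=1))  # 200251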

Pocket queries also occasionally return gpx records of caches that have not yet been approved. I haven't figured out how or when this happens, but when I go to these caches from GSAK I get a message saying that I can't view them until they are published. My GSAK database has information about them that must have come from an earlier pocket query, unless caches can go from the approved state back to the not-yet-approved state, which I wouldn't have expected. Perhaps I'll explore the issue in more detail in a future note.

In this note I want to talk about caches that are archived. Pocket queries never return archived caches; they have to be downloaded manually. I use GSAK to probe potentially archived caches and download the single-cache gpx files to get the final logs that explain what happened to each cache.

One's GSAK database entries for archived caches present several problems. First, if a cache is approved and then archived before it comes up in the pocket query scan sequence, no record of it will ever appear in the GSAK database. Normally I only refresh existing cache entries from within GSAK for caches that I suspect may have been archived. A couple of times I have sorted my database in waypoint order and then manually brought up the caches with the apparently missing waypoints. This is a slow and tedious process, so I've only done it for a few of the very early caches. Some of the missing waypoints belong to caches that were archived before I could get them in a pocket query; others have never been used for a cache in Groundspeak's system, and those return an error.

To use GSAK to find archived caches, I sort the database on the last gpx date. The caches previously marked as archived end up intermixed with caches that look active but sort ahead of the caches whose last gpx date is from the just-completed scan. If those apparently active caches had not been archived, the pocket queries in the scan would have returned them and updated their last gpx dates. I call these stale caches. I then use the split window view in GSAK to view the detail for each stale cache, download the individual cache gpx file for each into a directory, and load all of the gpx files into GSAK at once. Once this has completed, I reexamine the database for stale caches; if I missed any, I just repeat. Sometimes I filter the database on a state or country that has just been scanned with my pocket queries to form a subset that is easier to look at.
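
The loop amounts to repeating a stale-cache sweep until it comes up empty. A sketch, with made-up helper names standing in for the manual GSAK steps:

    def find_stale(db, scan_start):
        # Not marked archived, yet no pocket query in the just-completed scan
        # refreshed the record: the definition of a stale cache.
        return [c for c in db if not c["archived"] and c["last_gpx"] < scan_start]

    def refresh_stale(db, scan_start, fetch_gpx, load_gpx_batch):
        # fetch_gpx downloads one individual cache gpx file (the manual
        # split-window step); load_gpx_batch loads a directory of gpx files
        # into the database at once, updating last_gpx and archived flags.
        for _ in range(5):  # a few passes catch anything missed the first time
            stale = find_stale(db, scan_start)
            if not stale:
                break
            files = [fetch_gpx(c["code"]) for c in stale]
            load_gpx_batch(db, files)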

My World Geocaching Model will need to account for archiving stale caches and will want to model the factors that affect the length of life from approval to archiving of caches.

The archived cache identification process is quite time intensive. Sometimes geocaching.com has too much activity to return cache pages in a timely manner, and I wait until later, when there aren't so many geocachers placing demands on the system. When special circumstances let me identify a group of newly archived caches and update their gpx records early, that leaves fewer caches for later sessions. Event caches frequently are like this, and we had a one-time mass archiving of all locationless caches on January 1st.

Once the gpx for an archived cache is loaded, the record should never change again. Should a cache be unarchived for any reason, the world scan pocket queries will eventually pick it back up. It is possible that such reactivated caches will push the number of caches in a pocket query over the 500 limit; in that case a few caches will not be updated on that scan. They will be discovered as still active and refreshed when I do the stale cache process.

Well so much for tonight, I'll post more later. Until next time, happy caching.

Nudecacher


Friday, December 16, 2005

Pocket Query World Scan

Well, I need to get this started. I've just finished my third scan of all the caches in the world. I started the latest scan on September 10th and will finish on December 23rd when my queued pocket queries complete, which is pretty good. There are currently 220,836 active caches according to the Getting Started page. That means that if you could run 5 pocket queries every day with near the maximum of 500 caches in each, it would take 442 queries over 89 days. I actually took 523 queries and 105 days. Of course there is no way to get complete coverage with queries that are pegged at the 500-cache limit, so while my 523 queries could have returned 261,500 caches, they didn't. Also, there are almost 600 new caches a day on average, so each successive world scan takes longer. And I added queries to pick up new caches along the way, which is another reason my scan took longer than the minimum.
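
The minimum-effort arithmetic, for the record:

    ACTIVE_CACHES = 220_836  # from the Getting Started page
    PQ_LIMIT = 500           # caches per pocket query
    QUERIES_PER_DAY = 5      # the daily pocket query allowance

    min_queries = -(-ACTIVE_CACHES // PQ_LIMIT)    # ceiling division: 442 queries
    min_days = -(-min_queries // QUERIES_PER_DAY)  # 89 days at the full budget
    print(min_queries, min_days)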

That also doesn't count caches that have been archived since the last time I retrieved them. There's no way to get archived caches in a pocket query, so I've manually downloaded an average of nearly 300 single-cache gpx files a day. I currently have 24,381 archived caches in my GSAK database. I use GSAK's 'Last GPX' field to find caches that weren't picked up by a pocket query when they should have been; nearly all of those are archived caches, and I use the GSAK split screen to download the single gpx files for them as I identify them. I do that rather than just mark them as archived because it picks up the archive log entries that explain what happened to the cache.

My GSAK database is huge, and Clyde's recent enhancements to GSAK have been absolutely necessary to manage it. There is no way to download all the logs for the caches in one's database, since a gpx entry is limited to 5 logs and more than 5 geocachers can find and log a cache in a single day, which is the finest granularity available. The Getting Started page gives the number of logs in the past week, most recently 81,199. I currently have over 3 million logs in my database.

I refine my techniques with each world scan, and I'm now completing one in less than half the time the earlier ones took. The first world scan attempted to overlay the earth with 500-mile-radius circles. That's hard to do, and I couldn't come up with a general solid geometry method to determine where the centers should go; it ended up being ad hoc and unsatisfactory. I had to use date ranges to keep the number of caches in each query below the 500 limit. The biggest problem with that method, however, is that the circles had to overlap to avoid dead spaces between them. The overlap meant that many caches were contained in more than one query, which greatly extended the number of queries needed to cover the world. To finish a world scan in the minimum number of queries, no cache can appear in more than one query.

My second scan of the world created queries by country or by state. Again I needed to partition the areas using date ranges to stay under the 500-cache limit per query. I discovered that I could bunch countries or states with few caches into a single query by multiple selection in the pull-down list: hold the shift key on a second selection to include all the items between it and the first, and use the control key to toggle the selection of an individual item. When I included too many areas and had to partition them into a lot of date ranges, I found that frequently I couldn't get even close to the 500 limit, because the number of caches added in a single day was a significant fraction of it. I also ignored the fact that Australia, Belgium, Canada, and New Zealand appear both in the state list and as individual countries, so I got those caches twice in the scan. I did notice that I didn't want to include the 'None' state when I was also using the country list; that would be over 40,000 duplicated caches, counting all of the other countries twice. The one limitation I could see is that there could be caches outside all of the states and countries. I've decided to assume there can't be too many of those and to ignore them.

By the time I was finishing my second world scan, I was starting to understand the principles needed for an optimum scan. Before my third scan I used my GSAK database to split the areas into regions with roughly equal numbers of caches, evening out the number of caches per region. I experimented with various region sizes and determined that a good fit would be regions of about 10,000 caches. That meant I would need 25 regions for the world, and that each region would take about 20 queries and 5 days. The query for the latest dates in each region then covers roughly 30 days, and the maximum number of days that stayed under 500 caches averaged reasonably close to the 500-cache limit. California itself is over twice that size, and Germany is half again the average. I also couldn't include states and countries in the same query, so there end up being some compromises.
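
The grouping itself is a greedy pass over the areas in alphabetical order, sketched here with illustrative counts; my real inputs were the per-area cache counts from my GSAK database:

    TARGET = 10_000  # caches per region

    def group_regions(area_counts, target=TARGET):
        # Greedily pack alphabetically ordered (area, count) pairs into
        # regions of roughly `target` caches. Oversized areas such as
        # California or Germany become regions of their own.
        regions, current, total = [], [], 0
        for area, count in area_counts:
            if count >= target:
                if current:
                    regions.append(current)
                    current, total = [], 0
                regions.append([area])
                continue
            if current and total + count > target:
                regions.append(current)
                current, total = [], 0
            current.append(area)
            total += count
        if current:
            regions.append(current)
        return regions

    # Illustrative counts only, not my real figures:
    areas = [("Alabama", 2500), ("Alaska", 900), ("Alberta", 2200),
             ("Antwerpen", 700), ("Arizona", 4100), ("California", 21000)]
    print(group_regions(areas))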

Here are regions that I used for the third world scan:

  • WS3aStateList Alabama, Alaska, Alberta, Antwerpen, Arizona

  • WS3bStateList Arkansas, Australian Capital Territory, Brabant wallon, British Columbia

  • WS3cStateList California

  • WS3dStateList Colorado, Connecticut, Delaware, District of Columbia, Florida

  • WS3eStateList Georgia, Hainaut, Hawaii, Idaho, Illinois

  • WS3fStateList Indiana, Iowa, Kansas, Kentucky

  • WS3gStateList Liege, Limburg, Louisiana, Luxembourg, Maine, Manitoba, Maryland, Massachusetts

  • WS3hStateList Michigan, Minnesota, Mississippi

  • WS3iStateList Missouri, Montana, Namur, Nebraska, Nevada, New Brunswick, New Hampshire

  • WS3jStateList New Jersey, New Mexico, New South Wales, New York

  • WS3kStateList Newfoundland, North Carolina, North Dakota, North Island, Northern Territory, Northwest Territories, Nova Scotia, Nunavut

  • WS3lStateList Ohio, Oklahoma

  • WS3mStateList Ontario, Oost-Vlaanderen, Oregon

  • WS3nStateList Pennsylvania, Prince Edward Island, Quebec

  • WS3oStateList Queensland, Rhode Island, Saskatchewan, South Australia, South Carolina, South Dakota, South Island, Tasmania, Tennessee

  • WS3pStateList Texas

  • WS3qStateList Utah, Vermont, Victoria, Virginia, Vlaams-Brabant

  • WS3rStateList Washington, West Virginia, West-Vlaanderen

  • WS3sStateList Western Australia, Wisconsin, Wyoming, Yukon Territory

  • WS3tWorldList Afghanistan to Finland; except Australia, Belgium, Canada

  • WS3uWorldList France to Norway; except Germany, New Zealand

  • WS3vWorldList Germany

  • WS3wWorldList Oman to Sweden

  • WS3xWorldList Switzerland to Zimbabwe; except United Kingdom

  • WS3yWorldList United Kingdom

  • WS3zWorldList (spare)



As an example, here are my first 15 queries:

09/10/2005: Saturday
WS3aStateList Alabama, Alaska, Alberta, Antwerpen, Arizona
001 WS3aStateList 01-01-2000 thru 04-20-2002 GC57 Total Records: 492
002 WS3aStateList 04-21-2002 thru 10-15-2002 GC4FAC Total Records: 499
003 WS3aStateList 10-16-2002 thru 03-16-2003 GC9CBD Total Records: 498
004 WS3aStateList 03-17-2003 thru 07-18-2003 GCE8E1 Total Records: 495
005 WS3aStateList 07-19-2003 thru 11-21-2003 GCGFYH Total Records: 492

09/11/2005: Sunday
006 WS3aStateList 11-22-2003 thru 03-04-2004 GCH8M7 Total Records: 497
007 WS3aStateList 03-05-2004 thru 05-22-2004 GCHVE0 Total Records: 491
008 WS3aStateList 05-23-2004 thru 08-04-2004 GCJH0W Total Records: 494
009 WS3aStateList 08-05-2004 thru 10-24-2004 GCK6FT Total Records: 497
010 WS3aStateList 10-25-2004 thru 01-14-2005 GCKY7Z Total Records: 494

09/12/2005: Monday
011 WS3aStateList 01-15-2005 thru 03-04-2005 GCMJWG Total Records: 499
012 WS3aStateList 03-04-2005 thru 04-12-2005 GCN00A Total Records: 498
013 WS3aStateList 04-13-2005 thru 05-25-2005 GCNFCN Total Records: 498
014 WS3aStateList 05-26-2005 thru 07-05-2005 GCP4M7 Total Records: 490
015 WS3aStateList 07-06-2005 thru 08-12-2005 GCPKJR Total Records: 492

In addition to the query number, my region name, and the date range, I record the first cache id returned by the query and the number of records at the time I create the query. Notice the acceleration in the growth rate over time: later queries fill the 500-cache limit with fewer and fewer days in the date range.

I'll continue this discussion in subsequent articles. I'm hoping to interest others in commenting on my blog entries and in helping build a statistical model describing the geocaching growth phenomenon.

Nudecacher

Friday, November 18, 2005

Introduction

Nudecacher created his persona on the geocaching site as an educational tool promoting body acceptance and tolerance toward nudists. Nudecacher is just like any other cacher, except he performs the activities while nude and posts appropriate pictures with his logs to prove it. The evidence of the effort's success over the past two and a half years is the goodwill email and the support in the geocaching forums received from geocachers all over the world.

Nudecacher has been studying the global geocaching phenomenon. This blog will contain random entries at sporadic, infrequent intervals. Initial entries are planned to work out estimates of the size and scope of worldwide geocaching activity and to contemplate its significance. The Groundspeak geocaching.com site allows premium members to download pocket queries containing geocache waypoints, to avoid having to manually enter them into their GPS. We'll create strategies to use these to form our estimates.