At the NYC OGP meeting, I demoed my thesis site. Since there isn’t a video of it, here’s a write-up covering the same material. There are also some slides available.
The goal of my Masters thesis is to make all the world’s spatial data accessible. This goal is accomplished by expanding OpenGeoPortal in two significant ways. First, spatial data files are discovered via web crawls and then ingested. Second, the ability to preview and download layers without needing OGC protocols was developed. This expanded version of OpenGeoPortal is on the web at http://WorldWideGeoWeb.com.
Data on WorldWideGeoWeb.com was discovered by crawling the web, relying exclusively on HTTP GET requests. This is the same technique used by Google and other search engines. The WorldWideGeoWeb crawler can be instructed to crawl a specific site. Sites are searched for links to zip files. The ingest code retrieves and unzips these files. If they contain a shapefile, the bounding box is determined using the shp and prj files. Any metadata file is also parsed. Information about each discovered layer is ingested into WorldWideGeoWeb’s Solr instance.
The user can add any of these Westchester layers to the cart and download them. The zip files are transferred directly from the Westchester server to the browser. Clipping the data or converting it to another format are not supported.
There are significant limitations with the current version of the software. Most notable is its inability to deal with large shapefiles. Currently, shapefiles over one megabyte can cause the browser to hang. Search results are color-coded to advise the user. Green layers are small and should preview quickly. Yellow layers are larger but should preview without too much delay. Layers in red represent shapefiles over a megabyte and should not be previewed. Even these large layers can be downloaded easily downloaded, just not previewed.
My thesis code is not production quality.
During a crawl, only spatial resources in shapefiles are discovered and their associated metadata must be in FGDC or ISO19115. It would be trivial to add support for KML and KMZ files. Support for other file formats and metadata standards could also be integrated.
Crawling based OGC protocols such as Get Capabilities and CSW could be added.
The ranking of the search results could be based on the “Page Rank” of the page that linked to the zip file.
Semi-spatial data such as web pages about places could be ingested and searched spatially.
I now work for Voyager Search (http://voyagersearch.com/). We are investigating how some of these ideas could be incorporated into their existing products. The code created for my thesis has been released under the GPL.
Some data an organization provides may be more critical or widely used. This data could be available via an OGC compliant server while other, less critical data is made available only via HTTP GET.