Author Archives: SpacemanSteve

Thesis Update

One of the goals of my Masters Thesis is to make the web’s spatial data resources discoverable from a single portal.  It builds on the earlier work of the entire OpenGeoPortal team.  Two ideas supporting the thesis are discussed in the following paragraphs.  They are: crawling the web for spatial data and parsing and rendering shapefiles in a web browser.  It should be noted that during the development of the OGC protocols, the approaches discussed below would not have been possible.  The early web held many more limitations.  It is only due to recent enhances in browser performance and advances in the HTML5 standard that allow these new approaches to be prototyped and analyzed.

If you would like to talk about or criticize these ideas, you can contact me at Stephen.McDonald@tufts.edu.

Crawling the web for spatial data

Research universities employ OpenGeoPortal (OGP) to make their GIS repositories easily accessible.  OGP’s usefulness is partially a function of the data sets it makes available.  OGP, like most GIS portals, relies on a somewhat manual ingest process.  Metadata must be created, verified and ingested into an OGP instance.  This process helps assure very high quality metadata.  However this manual nature limits the rate at which data can be ingested.  If the goal is to build an index of the world’s spatial resources, a partially manual ingest process will not scale.  Instead it must be extremely automated.  To achieve this goal, OGP Ingest now supports data discovery via web crawling.

Ingest via small-scale crawling

The easiest way for an organization to make their spatial data available on the web is with zip files.  Typically, each zip file contains an individual shp file, an shx file, an xml file with metadata and any other files related to the data layer.  For example, the UN’s Office for the Coordination of Humanitarian Affairs put their data on the web in this fashion (http://cod.humanitarianresponse.info/).  OGP’s ingest process can automatically harvest these kind of spatial data sites.  Via the OGP Ingest administration user interface, an OGP administrator enters the URL of a web site holding zipped shape files.  The OGP crawler processes the site searching all its pages for HTML anchor links to zip files.  Each individual zip file is analyzed for spatial resources.  OGP ingest code iterates over the elements of each zip file to determine if it has a spatial resource.  The user specified metadata quality standard is applied to the discovered resources to determine if they should be ingested.  This allows an information rich site with potentially thousands of data layers to be completely and easily ingested simply by providing the URL of the site.

OGP’s ingest crawl code is built on Crawler4j.  This crawl library follows robots.txt directives.  The number of http requests per second is limited so as to not significantly increase the load on the data server.  The crawler identifies itself as OpenGeoPortal (Tufts, http://www.OpenGeoPortal.org).

Ingest via large-scale crawling

Using Crawler4j works well for crawling a single site with a few hundred pages.  However, the web is an extremely large data store and well beyond the crawl capabilities of a single server.  Instead, it must be crawled using a large cluster of collaborating servers.  A proof-of-concept system capable of crawling large portions of the web was developed using Hadoop, Amazon Web Services and CommonCrawl.org’s database.  It successfully demonstrated the ability to crawl a massive number of web pages for spatial data.

Crawling based ingest versus traditional ingest

Crawling for data and metadata has several advantages over a more traditional ingest strategy.  Since crawling often finds both data and metadata, the URL for the actual data is known and it can be used for data preview and downloading.  The bounding box and other attributes of the metadata can be verified against the actual data during ingest.  A hashcode of the shapefile can be computed and used to identify duplicate data.  A hashcode of the different elements in the shapefile can be used to identify derivative data that share a common geometry file.  Crawling builds on an extensive technology base.  Small scale crawling builds on Crawler4j, large scale crawling can economically leverage large clusters via AWS Electric MapReduce and CommonCrawl.org’s database of web pages.  Data providers need only stand up an Apache web server, they can limit crawling of data by maintaining a robots.txt file.  Under appropriate circumstances, metadata can be enhanced using information scrapped from HTML pages.  The importance of a layer can be estimated in part by the page rank of the layer linking to the zip file and the overall site rank of the data server.

Browser based rendering of spatial data

Crawling for spatial data makes it searchable in OGP.  However that is not enough.  OGP also provides the ability to preview and download data layers.  Traditionally, these preview functions are provided by the OGP client code making OGC WMS requests to a map data server like GeoServer.  GeoServer processes the shape files and delivers the corresponding requested image tiles.  Each organization providing GIS data layers serves them using an instance of an OGC compliant data server (GeoServer) often with some additional supporting spatial data infrastructure (e.g., ESRI SDE or PostGIS).  However, data discovered by an OGP crawl resided on an HTTP server and are not known to be accessible via OGC protocols.  To make these layers previewable, code was developed to completely parse and render shapefiles in the browser.   This works in coordination with a server side proxy.  When the user requests a layer preview, the URL for the discovered zip file is retrieved from the repository of ingested layers.  OGP browser code makes an ajax request to the OGP server passing the URL of the zip file.  The OGP server uses an HTTP GET request to create a local cache copy of the zip file from the original data server.  Then it responds to the ajax request with name of the shape file.  The client browser then requests the shp and spx files from the OGP server, parses them as binary data and uses OpenLayers to render the result.  The parsing is provided by Thomas Lahoda’s ShapeFile JavaScript library.  It makes XMLHttpRequests for the shapefile elements and parses the result as a binary data stream.  After parsing, a shapefile is rendered by dynamically creating a new OpenLayers layer object and populating it with features.  Note that the cache copy on the OGP server is needed because JavaScript code can not make a cross-domain XMLHTTPRequest for a data file.

Advantages of browser based parsing and rendering

Browser based rendering of shapefiles has very different performance characteristics from traditional image tile rendering.  This performance has not yet been analyzed.  It is anticipated the performance for files over a couple megabytes will be slower because the entire zip file must be copied first to the OGP server acting as a proxy and then on to the client’s web browser.  For very large data files, from dozens to hundreds of megabytes, the data transfer time would make the client side parsing and rendering impractical.  Naturally, as networks and brwosers get faster, the time for this will decrease.   There are a variety of approaches that might improve performance for large files.  For example, one could use a persistent cache and create the local cache copy either during the discovery crawl phase or not delete the cache file after an initial request.  One could increase performance for especially large shape files by generating a pyramid of shape files at different levels of detail when it is put in the cache and having the client request the appropriate level.

There are several advantages to moving the parsing and rendering of shape files to the client.  First, the technology stack of the data server is simplified.  Traditionally, an institution serving spatial data might use GeoServer as well as ESRI SDE on top of MS SQL all running on Tomcat and Apache.  With client side parsing and rendering, an institution can serve spatial data using only Apache.  This simplified server infrastructure leads to reduced operational overhead.  The traditional approach requires publishing layers in GeoServer and keeping GeoServer synchronized with the spatial data infrastructure behind it.  The client parsing and rendering solution requires only keeping an HTML page with links to the zip files up to date.  A second advantage is scalability.  The traditional approach requires an application data server to convert the shape files to image tiles.  This centralizes a computationally significant task.  Whenever the data is needed at a different scale or with different color attributes, OGC requests are created and server load is generated.  When all parsing and rendering is done on the client, the data server is simply providing zip files.  This allows it to serve many more users.  As OGP requires scaling or changing a layer’s color it does not need to contact the server.  Since the browser has the entire shape file, it can locally compute any needed transform.  For efficiency, these transformations can leverage the client’s GPU.  Finally, since any transform it possible it becomes possible to create a fully featured map building application in a browser.

Performance Measurement

Preliminary tests were run using the “Ground Cover Of East Timor” layer from the UN’s Food And Agriculture Organization.  The zipped shape file is at http://www.fao.org:80/geonetwork/srv/en/resources.get?id=1285&fname=1285.zip&access=private.  This file was used for initial testing because it is small and also easy to find using OGP.  During this test, a tomcat server was run on a MacBook Pro laptop running over the Tufts University wireless network in Somerville, MA, USA.  A Chrome browser instance connected to the OGP server using localhost.  The FAO server is located in Rome, Italy.

Zip file size: 702k
Time to create local cache copy (on localhost): 2.9  to 3.9 seconds
Unzipped shp file size: 920k
Unzipped shx file size: 828bytes
Time to transfer shp and shx files to browser and parse them: .3 seconds
Time to render in OpenLayers: 1.3 seconds
Total time: 4.5 seconds to 5.5 seconds

The following screenshot shows the “Ground Cover Of East Timor” layer previewed in OpenGeoPortal.  The Chrome Developer Tools panel at the bottom show the requests made by the browser.  The first is an ajax request to create a local copy of the shape file on the OGP server.  The server code makes an HTTP GET request to FAO’s server in Rome for the zip file, unzips the contents and returns name(s) of the shapefile(s).  The next two requests in the network tab show the XMLHttpRequests calls from Thomas Lahoda’s shapefile library.  This demonstrates the client is directly processing shapefiles rather than making WMS requests.

When parsing and rendering spatial data such as shapefiles in the browser, there is an initial time penalty for transferring all the original, raw data to the browser.  This initial time is potentially offset by handling all future scaling events within the browser instead of making tile requests over the network.  The profile for the potential higher initial load time versus the repeated requests for tiles may change as crawl based ingest makes spatial resources from servers around the globe available in a single portal.  In the above example, only a single HTTP GET request was issued to FAO’s Rome based server rather than repeated WMS requests to it.

April Boston Meeting

Stacey Maples from the Yale University Map Department got the ball rolling for a regional, one day OGP  meeting. We’re meeting on April 13th at MIT.  This is a regional meeting for schools in the northeast that are starting to set up a production system.  Our agenda is now available.

 

On Line Technical Chat (2012-02-22)

A bunch of OGP developers are getting together on Wednesday, February 22th at 4pm Eastern Time. If you want to join us, email Steve McDonald (Stephen.McDonald@tufts.edu). Here’s our agenda:

  • Welcome
    • Introduce yourself and what you’re up to
  • OGP 1.1 Release
    • Server URLs: Config file changes and Solr Location field
    • GeoServer proxy for restricted layers
    • URL shortening
    • Save Image
    • Improved “Loading” indicator
    • Improved client side login, added logout
    • Download clean up
    • More POM file dependencies
    • Harvard PIN Authentication
    • Harvard’s link to a layer, sending arguments to OGP
    • Searching by solr keywords (e.g., LayerId:Tufts.Buildings)
    • Local JavaScript file loading (institutionSpecificJavaScript in ogpConfig.json needs documentation)
    • Shard evaluation support
    • OpenGeoIngest: Location field, test data generation
  • OpenGeoPortal 2012
    • April 13th all day meeting in Boston
  • Search
  • Future Plans
    • Raster ingest support

 

Larger Scale Search

Can OpenGeoPortal’s search scale to meet the communities future needs?  To answer this question, I investigated scaling both vertically and horizontally.

What if a Solr instance had to handle many more layers?  Most OGP Solr instances currently provide access to under 10,000 GIS layers.  Searches typically execute in under 10 milliseconds.  To test with more layers, I wrote code to create new layers by processing my existing layers and changing their LayerId field.  With this approach I created a Solr instance containing over 60,000 layers.  Running on a development VM with 1.5 GB of RAM and only a single core, search times increased to 200 milliseconds.  This is sufficient and provides very quick searches.

Having a single Solr instance wil eventually become a performance limitation.  When there are too many layers for a single machine to search quickly, it is important to distribute the search query across multiple machines.   Fortunately, this functionality is built in to Solr.  The metadata is broken into multiple pieces called “shards”, each shard might hold, for example, 150,000 layer.   Each shard runs on its own Solr instance, on its own server.  A query is sent to one Solr instance.  This Solr instance shares the query with all the other shards, combines all the search results and returns them.  I have run tests where OGP connects to two Solr instances.  To maximize network latency, I used one at server Tufts and one at Berkeley.  Although the time perform a search slowed to 200 milliseconds, that is still fast enough.

Based on these tests, I estimate a single, very powerful server can probably handle over 100,000 layers.  Perhaps it could handle twice that.  When the limit of a single machine is reached, we can scale out to multiple shards running on multiple servers.  If I had to, I imagine I could build a Solr repository to handled half a million layers.

On-Line Tech Talk Notes

Thanks to the 12 people from 8 universities that joined our inaugural OGP Tech Talk.

Berkeley’s project is to add OGP support for layers with a bounding box but that are not georeferenced was well received.  To support the effort, Tufts will add a new field to the Solr schema.  This will be in the OGP point release created at the end of January.  Support for non-georeferenced could also help us deal with collection level data.  The roll-out of this change to the Tufts production server will be coordinated other OGP sites relying on the Tufts Solr instance.

Tufts will initiate a discussion about moving from setting map server locations in a config file to using the existing “location” field in the Solr schema.

Getting OGP authentication to work with central authentication servers is a challenge.  We need to document thoroughly Harvard’s solution and understand the issues with Berkeley’s system.  This issue is critical because authentication is required for access to licensed/restricted data.  Dave Siegel at Harvard has started to document his solution at http://code.google.com/p/opengeoportal/wiki/PINAuthentication .  Berkeley will be sending code that demonstrates their authentication issue for troubleshooting.

There is very broad interest in shared cloud based OGP resources.  To quote MIT, “instead of figuring out how and when to harvest metadata from many sites/systems, we’d like to explore all sites contributing metadata to one shared site that can be easily searched by all
instanced of OpenGeoPortal”.  Tufts will take the lead by generating a beta proposal and starting a discussion within the community.  The current OGP functionality of each institution controlling whose data they see will be preserved.  It will likely require metadata standards.  We envision administration access similar to the OGP source repository; staff from multiple institutions will have commit privileges.   Such a system will eventually require we handle cross-institution de-duplication.

We will likely hold another OGP Tech Talk meeting at the end of January.  It may use a telephone based conference call to provide high quality audio and Adobe Connect for chat and screen sharing.  Please let Steve know if that would be a problem for you.

Additionally, feel free to use the mailing list for questions, meeting agenda items and announcements.

On-line Technical Chat

A bunch of OGP developers are getting together on Wednesday, January 11th at 4pm Eastern Time.  If you want to join us, email Steve McDonald (Stephen.McDonald@tufts.edu).  Here’s our agenda:

  • Status Update – what people are doing
  • Berkeley’s Project
  • Solr Schema Changes
  • Questions – Protected content programming

Here’s the details on Berkeley’s project:

Here at Berkeley, we have about 40,000 scanned maps, mostly from topographic map sets. These maps have to be georeferenced, which is a long process, but we can assign bounding boxes to each file/sheet. We would like to make these maps available to our patrons for searching and visualizing while we are working on georeferencing them. We have managed to get non-georeferenced images into Geoserver, so we can use the same server and access method for the non-georeferenced images as for the georeferenced ones.

PROPOSAL

We propose that we add a new key value pair to the Solr record. The key would be ‘georeferenced’ and the values would be true or false.

The Solr record would have a bounding box, but the bounding box would be approximate.

A layer whose Solr record ‘georeferenced’ field has a value of false would behave in every way like a regular layer except that
on preview, the layer would open in a new openlayer window.

Since the URL for accessing a non-georeferenced layer is different from a regular WMS call, there will also be some code that will deal with that, but it can easily work off the ‘georeferenced’ metadata tag. (here is an example

<a href=”http://linuxdev.lib.berkeley.edu:8080/geoserver/UCB/wms?service=WMS&version=1.1.0&request=GetMap&layers=UCB:images&CQL_FILTER=PATH=’furtwangler/17076013_02_015a.tif’&styles=&bbox=0.0,-65536.0,65536.0,0.0&width=512&height=512&srs=EPSG:404000&format=application/openlayers“>17076013_02_015a.tif</a>

notice the CQL_FILTER argument. This is a working URL, though the Geoserver is on a development system and may go down without notice.)

DEVELOPMENT

Since we have a pressing need for this modification, we would of course do the development and make the code available as an OGP module.