Crawling Rockland Site

Data from the Rockland County (NY) GIS site is now available at http://www.WorldWideGeoWeb.com.  The following image shows some of the search results after a crawl-based ingest of the Rockland data site:

RocklandSearchResults

Even when no web services are available for a layer, these search results can still be previewed. The following image shows the Rockland boundaries for State Senate:

RocklandStateSenate

Here’s a screenshot with two separate bus routes previewed.

RocklandBusPreview

On the Rockland County site, many bus routes are stored in a single zip file.  These are individually searchable and previewable on WorldWideGeoWeb.

Directed crawling of a new site presents new challenges and reveals limitations in the existing code.  Several changes were needed to successfully crawl the Rockland, NY data page at https://geopower.jws.com/rockland/DataPage.jsp.

The Rockland site does not contain direct links to zip files.  A typical link to a data file is https://geopower.jws.com/rockland/DownloadData.jsp?pck_oid=2464.  The crawl code was changed to follow links to servlets as well as links to plain zip files.
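As a rough illustration of the idea (not the actual crawl code), the following Python sketch collects data links from a page by accepting either a `.zip` path or a link to a download servlet.  The class name `DownloadLinkParser` and its `servlet_name` parameter are hypothetical; only the `DownloadData.jsp` name and the example URL come from the Rockland site.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class DownloadLinkParser(HTMLParser):
    """Collect absolute URLs of anchors that point at data downloads.

    Hypothetical helper: a link counts as a data link if its path ends
    in .zip or names the download servlet (assumed: DownloadData.jsp).
    """

    def __init__(self, base_url, servlet_name="DownloadData.jsp"):
        super().__init__()
        self.base_url = base_url
        self.servlet_name = servlet_name
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and self._is_data_link(href):
            # Resolve relative hrefs against the crawled page's URL.
            self.links.append(urljoin(self.base_url, href))

    def _is_data_link(self, href):
        # Look only at the path, so query strings such as
        # ?pck_oid=2464 do not hide the servlet name.
        path = urlparse(href).path
        return path.lower().endswith(".zip") or path.endswith(self.servlet_name)


def find_data_links(page_html, base_url):
    parser = DownloadLinkParser(base_url)
    parser.feed(page_html)
    return parser.links
```

Checking the URL path rather than the whole href is the key design choice here: a servlet link carries its identifier in the query string, so a simple "ends with .zip" test on the full link would never match it.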

The Rockland metadata files contain minimal, often cryptic titles; for example, “monsey2” and “TZX”.  These titles are not sufficient for search.  Fortunately, they can be augmented with information scraped from the crawled web page.  Specifically, the text of the anchor tag linking to the DownloadData.jsp servlet is appended to the title field in the XML metadata file.  This creates user-friendly titles, for example, “monsey2: TOR Bus Routes” and “TZX: Tappan Zee Express Bus Route”.
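The title augmentation can be sketched in a few lines of Python.  This is an illustration under assumptions, not the ingest code itself: it assumes the metadata has a `title` element somewhere below the root, that the anchor text has already been extracted from the page, and the function name `augment_title` is made up for the example.

```python
import xml.etree.ElementTree as ET


def augment_title(metadata_xml, anchor_text):
    """Append scraped anchor text to the metadata title.

    Assumption: the title lives in the first <title> element found
    below the root; real metadata may need a more specific path.
    """
    root = ET.fromstring(metadata_xml)
    title = root.find(".//title")
    if title is None:
        # No title element to augment; return the input unchanged.
        return metadata_xml
    original = (title.text or "").strip()
    # Keep the original cryptic title as a prefix so it stays searchable.
    title.text = f"{original}: {anchor_text}" if original else anchor_text
    return ET.tostring(root, encoding="unicode")
```

For example, passing metadata whose title is “TZX” together with the anchor text “Tappan Zee Express Bus Route” yields a title of “TZX: Tappan Zee Express Bus Route”.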

The Rockland site contains zip files that hold multiple shapefiles.  For example, the file TOR.zip contains 6 separate bus routes, each in a separate shapefile.  Each shapefile is ingested as a separate entity so it can be independently searched and previewed.  The 28 links on the Rockland data page expand into 47 searchable, previewable spatial resources.  Note that the OpenGeoPortal download operation pulls down the entire zip file from the Rockland server, not the individual shapefiles.
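Splitting one download into several layers starts with listing the shapefiles inside the archive.  The Python sketch below shows one way to do that with the standard library; the function name `list_shapefile_layers` is hypothetical, and it assumes each layer is identified by the base name of its `.shp` member.

```python
import io
import zipfile
from pathlib import PurePosixPath


def list_shapefile_layers(zip_bytes):
    """Return the sorted layer names of all shapefiles in a zip archive.

    Assumption: one layer per .shp member; sidecar files (.dbf, .shx,
    .prj, metadata XML) share the same base name and are not listed.
    """
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return sorted(
            PurePosixPath(name).stem
            for name in archive.namelist()
            if name.lower().endswith(".shp")
        )
```

Each returned name could then be ingested as its own searchable entity, while the download link for every one of them still points at the original zip file.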

Since the Rockland site only supports secure connections, the ingest code was enhanced to support HTTPS.