Getting documents into Solr

NOTE: These are some remarks on ways to import records into Solr that might be helpful to others. –Chris

If you download the ogpIngest tool from our GitHub repository, you can import records directly from our Solr instance at http://geodata.tufts.edu/solr using the ‘Ingest Records from Remote Solr Instance’ page, even if you are not using ogpIngest to ingest your own records. This is the tool I used to move records from our old Solr 1.4 instance to our Solr 4.0 instance. I also use it quite a bit to grab records to put into test Solr instances.

Alternatively, you can produce a csv data dump of current Tufts records like so:

http://geodata.tufts.edu/solr/select?q=Institution:Tufts&rows=10000&wt=csv

For MassGIS records, use q=Institution:MassGIS instead.

Of course, you can use whatever query parameters you wish. For example, you could append +AND+!Access:Restricted to the query to filter out restricted layers, or +AND+!DataType:"Paper Map"+AND+!DataType:Raster to filter out scanned maps and rasters. The value rows=10000 is arbitrary: it just needs to be equal to or larger than the number of matching documents, because Solr returns only 10 rows by default.
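If you build these export URLs often, a small helper can save some hand-editing. The following is only a sketch in Python: the function name build_csv_export_url and its exclude argument are made up for illustration, and it assumes the field names used above (Institution, Access, DataType).

```python
from urllib.parse import urlencode

# Solr select endpoint taken from the example URL above.
SOLR_SELECT_URL = "http://geodata.tufts.edu/solr/select"


def build_csv_export_url(institution, exclude=(), rows=10000,
                         base_url=SOLR_SELECT_URL):
    """Return a select URL that dumps one institution's records as CSV.

    exclude is a sequence of (field, value) pairs to filter out.
    This helper is hypothetical, not part of ogpIngest or Solr.
    """
    clauses = ["Institution:" + institution]
    for field, value in exclude:
        # Wrap values containing spaces in quotes so Solr reads a phrase.
        if " " in value:
            value = '"' + value + '"'
        clauses.append("!" + field + ":" + value)
    query = " AND ".join(clauses)
    # urlencode percent-escapes the query and turns spaces into "+".
    return base_url + "?" + urlencode({"q": query, "rows": rows, "wt": "csv"})
```

Calling build_csv_export_url("Tufts", exclude=[("Access", "Restricted"), ("DataType", "Paper Map")]) should give a URL equivalent to the hand-written ones above, just with the special characters percent-encoded.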

The resulting CSV document can be imported into Solr using curl, by posting it to the /update handler.

See: http://wiki.apache.org/solr/UpdateCSV

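The CSV request handler should be enabled by default in Solr 4.3. The main thing I see to watch out for is duplicate records: adding &overwrite=true tells Solr to replace existing documents that have the same unique key. As far as I know this is already the default, but it does no harm to set it explicitly. Alternatively, delete your current records for an institution before you import the new ones.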
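For anyone who would rather script this than use curl, here is a rough Python equivalent of both options (upload with overwrite, or delete first). Treat it as a sketch under assumptions: the base URL http://localhost:8983/solr is a placeholder for your own instance, the function names are invented for this note, and it assumes your /update handler selects the CSV loader from a text/csv content type.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen
from xml.sax.saxutils import escape

# Hypothetical local Solr base URL; adjust for your own install and core.
SOLR_URL = "http://localhost:8983/solr"


def build_csv_update_request(csv_bytes, solr_url=SOLR_URL):
    """Build a POST request that sends CSV data to the /update handler."""
    params = urlencode({"overwrite": "true"})
    return Request(
        solr_url + "/update?" + params,
        data=csv_bytes,
        # Assumption: /update picks the CSV loader from this content type.
        headers={"Content-Type": "text/csv; charset=utf-8"},
    )


def build_delete_request(query, solr_url=SOLR_URL):
    """Build a POST request that deletes every document matching query."""
    body = "<delete><query>" + escape(query) + "</query></delete>"
    return Request(
        solr_url + "/update",
        data=body.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )


def send(request):
    """Send a prepared request and return the HTTP status code."""
    with urlopen(request) as response:
        return response.status
```

To clear out one institution before re-importing, something like send(build_delete_request("Institution:Tufts")) followed by send(build_csv_update_request(data)) should do it, where data holds the bytes of the CSV dump. Neither call commits, so the changes will not be visible until a commit is issued.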

One limitation is that there is no way to provide document or field index-time boosts with the CSV format. Many indices do not use that feature, however, and we’re not currently doing any index-time boosts.

So this method may not work well for an index that relies on index-time boosts, but it is fine for ours.

Don’t forget to commit and optimize after you’re done.
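The commit and optimize steps can be scripted too. This sketch assumes the same placeholder base URL (http://localhost:8983/solr) and relies on the commit=true and optimize=true parameters of the /update handler; the function names are my own.

```python
from urllib.request import urlopen

# Hypothetical local Solr base URL; adjust for your own install and core.
SOLR_URL = "http://localhost:8983/solr"


def maintenance_urls(solr_url=SOLR_URL):
    """Return the commit URL followed by the optimize URL."""
    return [
        solr_url + "/update?commit=true",
        solr_url + "/update?optimize=true",
    ]


def commit_and_optimize(solr_url=SOLR_URL):
    """Request a commit, then an optimize, in that order."""
    for url in maintenance_urls(solr_url):
        with urlopen(url) as response:
            # Read the body so the request completes before the next one.
            response.read()
```

Optimizing can take a while on a large index, so it is probably best run once at the end rather than after every import.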