CSpace-Django webapps - setting up Solr4

Ed. note: This page is in the process of being edited. The first portion has been reviewed and tested. Further down the page, things get rougher. Any of it might change. Please use with caution.

Introduction

Solr is a powerful, blazingly fast, full-featured search engine built on top of Lucene. The UC Berkeley CollectionSpace deployment team has chosen Solr to power its Django-based webapps.

>>> See PAHMA-874 for some background on this subject.

You may wish to read up on Solr before diving in. Here is a good place to start: http://lucene.apache.org/solr/

Basic Solr

CollectionSpace webapps use version 4 of Solr ("Solr4"). You can quickly get a plain version of Solr4 running on your computer to see it at work. On most computers, especially Linux and Mac systems, you can execute the following commands in a Terminal window to install it:

# Plan to put your solr installation in some suitable place.
# For now your home dir is fine; it can be moved lock, stock, and barrel later.
cd ~ 
curl -O http://archive.apache.org/dist/lucene/solr/4.4.0/solr-4.4.0.tgz
tar -xzvf  solr-4.4.0.tgz
# Rename the folder 'solr4' for simplicity's sake
mv solr-4.4.0 solr4
cd solr4
# the "example" code is already working and ready to go in the vanilla distribution.
cd example
java -jar start.jar

At this point you should be able to visit:

http://localhost:8983/solr/

and see the admin interface. Play around if you like; when you are done, hit ^C in the window where you started Solr to stop the server.
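
While the server is running, you can also exercise the REST API directly. The vanilla example core is named collection1, and a match-all query against it should return a JSON response with numFound=0 (nothing has been indexed yet):

curl 'http://localhost:8983/solr/collection1/select?q=*:*&wt=json'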

NOTE: On the IST VMs, port 8983 is normally blocked at the firewall, so you cannot reach the admin interface from outside; if you can "remote in" to the machine, you may be able to use a browser there to access this page.
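
One common workaround, assuming you have ssh access to the VM, is to tunnel the port to your local machine and browse http://localhost:8983/solr/ from there (the user and host below are placeholders):

ssh -N -L 8983:localhost:8983 youruser@your-vm.example.edu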

Soup-to-Nuts: Installing, Configuring, and Loading Data into Solr4 (CollectionSpace version)

There are a variety of ways to load data into your Solr4 instance. Most of them involve POSTing data to the Solr4 instance's REST API via HTTP.

The following instructions assume that the Solr4 instance is running on a Linux (RedHat 6) machine, and that the data is being uploaded from the same server (i.e. "localhost"). It is, of course, possible to access the Solr4 instance from another system (if the firewall rules permit), in which case the process will be a bit different.

The instructions should work just fine with most versions of Mac OS X.


We will use the Solr4 instance now in development for UC Berkeley's Phoebe A. Hearst Museum of Anthropology (PAHMA) as the model for how to install Solr4 and upload data into it. The process involves creating an installation directory, downloading and installing Solr4, and modifying the configuration to accommodate the PAHMA CollectionSpace dataset.

The customization for PAHMA requires components from the UC Berkeley CollectionSpace Deployment GitHub Tools repository, so we will be making a local clone of that repository. (The Tools repository contains the Solr schema file, along with scripts for extracting data from PAHMA's CollectionSpace database and importing that data into the Solr core.)

The steps are as follows:

  • Check the environment to ensure we can do the install
  • Install and test Solr4 (essentially the same steps as above)
  • Configure Solr4 for the target institution (PAHMA)
  • Obtain the PAHMA data extract and load it into Solr4
  • Verify that the data loaded correctly

Check first to see if Solr is already installed and running

Before installing Solr, be sure that Solr is not already running on the server. Look for a java process running start.jar; if one is active, some version of Solr is installed and running. You may certainly stop this server and start over, but first verify that you won't be ruining someone else's work if you do so.

The same process can be used to stop Solr if you have previously "daemonized" it (i.e. started with nohup .... &).

# look for a process running Jetty... probably this is Solr
ps aux | grep start.jar

# sample response
root     18699 62.3  4.1 2648100 162360 pts/1  Sl+  19:07   0:06 java -Xmx512m -jar start.jar
root     18730  0.0  0.0 103252   860 pts/1    S+   19:07   0:00 grep start.jar

# kill the process with pid 18699; you'll need sudo if it is running as root.
sudo kill 18699

Solr installation and customization

The installation follows this plan:

  • Find a suitable location to install Solr4. On IST VMs, this is /usr/local/share. On other systems, it can be almost anywhere you like (provided you have write access to the directory, of course!). These instructions mostly assume Solr4's home is /usr/local/share. So... change to the directory which will be Solr4's home:

    cd /usr/local/share
    
    # or
     
    cd ~
  • Get the solr4 tarball, untar it and install the application in a solr4/ directory within the share/ directory. This is just as shown above. 

    wget    http://archive.apache.org/dist/lucene/solr/4.4.0/solr-4.4.0.tgz
    #or, if you don't have wget installed:
    curl -O http://archive.apache.org/dist/lucene/solr/4.4.0/solr-4.4.0.tgz
    #
    tar -xzvf solr-4.4.0.tgz
    # Record the version number for future reference.
    echo solr-4.4.0 > solr-4.4.0/VERSION
    # Rename the folder 'solr4' for simplicity's sake
    mv solr-4.4.0 solr4
    # Delete the tarball to keep our directory clean
    rm solr-4.4.0.tgz
    cd solr4
    
  • For now, we are using all the goodies that come with the vanilla example datastore, so we simply copy the provided example directory and add the PAHMA-specific stuff (schema, etc.) on top of it:

    cp -r example/ pahma   # Use your CollectionSpace tenant name here if you have one, rather than 'pahma'
    cd pahma   # Again, use your tenant name instead of 'pahma'
    
    # Clean up some unnecessary directories
    rm -rf exampledocs/
    rm -rf example-DIH/
    rm -rf example-schemaless/
    # NB: here we are making a "single core" Solr4 deployment...
    cd solr
    mv collection1/ pahma-metadata  # your tenant name instead of 'pahma'
    cd pahma-metadata/  # your tenant name instead of 'pahma'
    # tell Solr the name of our core (you could use vi instead... ;-)
    perl -i -pe 's/collection1/pahma-metadata/g' core.properties
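    # verify: core.properties should now read "name=pahma-metadata"
    cat core.properties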
  • Finally, get PAHMA-specific stuff (schema, etc.) from the GitHub repository and merge it into the Solr configuration:

    cd /home/developers/  # or wherever you want to keep the local clone of the repository
    git clone https://github.com/cspace-deployment/Tools.git
    # if you installed Solr4 in /usr/local/share as described above, this copy command will work. Otherwise, adjust accordingly.
    cp -r Tools/datasources/ucb/multicore/pahma/metadata/conf/* /usr/local/share/solr4/pahma/solr/pahma-metadata/conf/
    
  • Now we can start the solr instance we have just created, using the built-in jetty server:

    cd /usr/local/share/solr4/pahma  # use your tenant name
    # start single core solr instance (under jetty) in the background
    nohup java -Xmx512m -jar start.jar &

This last command should have started solr running "daemonized" in the background. Let's check...

  • You can visit:

    http://localhost:8983/solr/

     

  • or, if this URL is not accessible: open a new shell and ssh to your server. Then run:

    ps aux | grep start.jar
    tail /usr/local/share/solr4/pahma/nohup.out

The ps aux command should reveal a running java process executing start.jar:

root     18699 62.3  4.1 2648100 162360 pts/1  Sl+  19:07   0:06 java -Xmx512m -jar start.jar
root     18730  0.0  0.0 103252   860 pts/1    S+   19:07   0:00 grep start.jar


The final lines of the log file nohup.out should include output along the lines of:

3963 [searcherExecutor-4-thread-1] INFO  org.apache.solr.core.SolrCore  - QuerySenderListener sending requests to Searcher@4655af22 main{StandardDirectoryReader(segments_1:1)}
3969 [coreLoadExecutor-3-thread-1] INFO  org.apache.solr.core.CoreContainer  - registering core: pahma-metadata
3971 [main] INFO  org.apache.solr.servlet.SolrDispatchFilter  - user.dir=/usr/local/share/solr4/pahma
3972 [main] INFO  org.apache.solr.servlet.SolrDispatchFilter  - SolrDispatchFilter.init() done
4008 [main] INFO  org.eclipse.jetty.server.AbstractConnector  - Started SocketConnector@0.0.0.0:8983
4021 [searcherExecutor-4-thread-1] INFO  org.apache.solr.core.SolrCore  - [pahma-metadata] webapp=null path=null params={event=firstSearcher&q=static+firstSearcher+warming+in+solrconfig.xml&distrib=false} hits=0 status=0 QTime=51
4022 [searcherExecutor-4-thread-1] INFO  org.apache.solr.core.SolrCore  - QuerySenderListener done.
4022 [searcherExecutor-4-thread-1] INFO  org.apache.solr.handler.component.SpellCheckComponent  - Loading spell index for spellchecker: default
4022 [searcherExecutor-4-thread-1] INFO  org.apache.solr.handler.component.SpellCheckComponent  - Loading spell index for spellchecker: wordbreak
4023 [searcherExecutor-4-thread-1] INFO  org.apache.solr.core.SolrCore  - [pahma-metadata] Registered new searcher Searcher@4655af22 main{StandardDirectoryReader(segments_1:1)}
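
You can also ask the new core directly whether it is healthy. Assuming you kept the ping handler configured in the example solrconfig.xml, the following should return a response with status OK:

curl 'http://localhost:8983/solr/pahma-metadata/admin/ping?wt=json'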

Loading and Reloading (CollectionSpace) Data into Solr

You now have a running, customized, but empty Solr core.

Here at UCB, Solr is loaded with a denormalized, exported subset of the data in the CollectionSpace database. The data is in CSV format, and multi-valued fields appear throughout. Solr4's CSV update handler is used to POST the (rather large) input file to the server.
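
To make the shape of the input concrete, here is a hypothetical two-row fragment (the field names and values are invented): columns are pipe-separated, and a multi-valued field such as blobs_ss packs its values into one column, comma-separated. That is what the f.blobs_ss.split and f.blobs_ss.separator parameters in the load command below are for; the _s/_ss suffixes map onto Solr dynamic fields (single- and multi-valued strings).

id|objname_s|objcount_s|blobs_ss
a1b2-c3d4|basket|1|csid-one,csid-two
e5f6-a7b8|arrow|3|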

The instructions in this section assume you simply want to load the existing data for your institution into a Solr4 datastore; once you have configured your Solr4 instance as described above, you need only obtain the latest dump and load it into your instance.

If you want to learn how to create a Solr datasource "from scratch", jump ahead to the section "Customizing your Solr datasource".

These instructions assume:

  • You have access to the server where the nightly .csv file dumps are created, or you can get the data from a trusted crony.
  • "Access" means that you can scp the file to the server you want to upload the data to (contact John, Glen, or Rick to get access, or the file, if you can't get it yourself).
  • The conventions for the location and naming of the .csv dumps for your institution conform to the practices for the existing UCB deployments (i.e. files live in /home/developers/<institution>, with filenames of the form 4solr.<institution>.metadata.csv)
  • You have enough disk space to hold the .csv dump and the loaded solr datastore. This seems to be around a gigabyte or two for the existing UCB deployments.
  • Solr4 is running and configured as described above.

The following two commands perform the two steps: copy the latest dump to your machine, then POST it to Solr.

# this works for the pahma dataset and for ucjeps. You may need to make your own data for other cases...
scp dev.cspace.berkeley.edu:/home/developers/pahma/4solr.pahma.metadata.csv .
# post the data to the solr server.
time curl 'http://localhost:8983/solr/pahma-metadata/update/csv?commit=true&header=true&trim=true&separator=%7C&f.blobs_ss.split=true&f.blobs_ss.separator=,' --data-binary @4solr.pahma.metadata.csv -H 'Content-type:text/plain; charset=utf-8'

Be patient! The two commands above may each take a few minutes to complete!
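
Once the POST completes, a quick query confirms that documents arrived; numFound in the response should match the number of data rows in the .csv file:

curl 'http://localhost:8983/solr/pahma-metadata/select?q=*:*&rows=0&wt=json'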

If you wish to reload a solr core that already has data in it (e.g. if you want to get more recent data), consider the following:

  • You should probably "zap" the existing datastore first. While the new data will overwrite (and replace) the old data, CSIDs are used as keys in Solr, and these are not really permanent identifiers in CSpace, so unhappy persistence of old data is possible. Clearing things out requires two POSTs from the command line (or a browser, depending); see the section below, "Deleting Data from Solr".
  • If there are new fields or field definitions, you may need to update your Solr schema. Presumably whoever made the new .csv file also made a revised schema.xml and checked it into GitHub.

Deleting Data From Solr

See, e.g. http://wiki.apache.org/solr/FAQ#How_can_I_delete_all_documents_from_my_index.3F

You can also delete individual records (see, e.g., http://lucene.apache.org/solr/tutorial.html).
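
For example, a minimal sketch of deleting a single document by its unique key (the id value here is made up):

curl "http://localhost:8983/solr/pahma-metadata/update?commit=true" --data '<delete><id>some-csid</id></delete>' -H 'Content-type:text/xml; charset=utf-8'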

# clear out ALL the existing data.
curl "http://localhost:8983/solr/pahma-metadata/update" --data '<delete><query>*:*</query></delete>' -H 'Content-type:text/xml; charset=utf-8'  
curl "http://localhost:8983/solr/pahma-metadata/update" --data '<commit/>' -H 'Content-type:text/xml; charset=utf-8'

Customizing your Solr datasource

Tweaking the Current Paradigm

We are working on automating this process: the installdatasource.sh script in GitHub will be the means of doing this. Stay tuned for details.

As before, the instructions below assume you're running on a "standard" IS&T RHEL6 server.

Now that you have a working Solr datasource, with lots of juicy data in it, you may find yourself wondering how you might augment what's in there, or fix problems.

Assuming that you are working from an existing datastore, it is a relatively simple matter to add or remove fields and push these changes into your datastore.

In general terms, the process is as follows:

  • Modify the SQL statements used to extract the appropriate data from your CollectionSpace instance and make a .csv file
  • If you're going to use the Solr4 CSV API, you'll need to specify the names and types of the fields in the header line of the .csv file (see the example header sketched after this list). For simplicity, at UCB we use the "dynamic field" capability of Solr to name the fields at load time, as this reduces the amount of specification required in the schema.xml file. Note, however, that this expediency normally does NOT eliminate the need to customize the schema.xml file for your data. You would be wise to read up on the details of how that API works at: http://wiki.apache.org/solr/UpdateCSV
  • (If you're not, you'll need to write or find something to POST JSON or XML to another REST endpoint...)
  • Depending on what data you are working with, you may need to edit solrconfig.xml or schema.xml. If the modifications only add data and you can accept the search properties of the input field specification, then you may be able to get away without making modifications. However, if you want to do faceted search on the fields or to create variations that have interesting properties, you'll need to do a bit of editing.
  • POST the data to your running solr server.
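
As a concrete illustration of such a header line, here is a hypothetical example (the field names are invented, but the _s, _ss, _dt, and _i suffixes are real dynamic-field patterns from the example schema.xml, denoting string, multi-valued string, date, and integer fields respectively):

id|objname_s|objdate_dt|objcount_i|materials_ss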

Enjoy the speed and ease of accessing your data under Solr!

A good way to understand how to wrangle your own data for Solr is to study the existing approach used for the UCB deployments.  Three deployments (UCJEPS, PAHMA, and UCBG) have Solr datasources in some stage of development and use, and all use the same basic approach described above.

Below is an explanation of how the nightly Solr update for PAHMA works. Hopefully, understanding the operation of this script will give the reader some insight into how Solr is being used and what is possible. The process is implemented in a single shell script, called makeCsv2.sh:

https://github.com/cspace-deployment/Tools/blob/master/datasources/pahma/makeCsv2.sh

What it does is the following:

  • Runs a query to extract 31 columns of "object-level metadata".
  • Munges that file to ensure that all the rows have exactly 31 columns, etc. (the data is a bit dirty coming from CSpace)
  • Runs a query to extract media information for the objects.
  • Runs a short Perl script that attaches the media info to the object metadata.
  • Replaces the existing header with a special "solr-aware" header
  • Imports the file into Solr via the .csv API
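
Schematically, and only schematically (the command names, file names, and SQL files below are illustrative, not the actual contents of makeCsv2.sh), the pipeline looks like this:

# 1. extract object metadata and media info from the CSpace database
#    (-A -t -F'|' = unaligned, tuples-only, pipe-separated psql output)
psql -U reader -d pahma_domain -A -t -F'|' -f object_metadata.sql -o objects.csv
psql -U reader -d pahma_domain -A -t -F'|' -f media.sql -o media.csv
# 2. keep only rows with exactly the expected 31 columns (the data is a bit dirty)
awk -F'|' 'NF == 31' objects.csv > objects.clean.csv
# 3. attach the media info to the object metadata
perl attachMedia.pl objects.clean.csv media.csv > combined.csv
# 4. prepend the special "solr-aware" header, then POST via the csv API
cat solr_header.txt combined.csv > 4solr.pahma.metadata.csv
curl 'http://localhost:8983/solr/pahma-metadata/update/csv?commit=true&header=true&separator=%7C' --data-binary @4solr.pahma.metadata.csv -H 'Content-type:text/plain; charset=utf-8'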

Defining the solr schema

The schema files for the UCB deployments live in the Tools repository; the UCJEPS metadata schema is a representative example:

https://github.com/cspace-deployment/Tools/blob/master/datasources/ucb/multicore/ucjeps/metadata/conf/schema.xml
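
The schemas lean heavily on Solr's dynamic fields. As a minimal sketch, declarations like the following (which appear in much this form in the vanilla example schema.xml) are what let *_s and *_ss column names in the .csv header map onto typed Solr fields:

<dynamicField name="*_s"  type="string" indexed="true" stored="true"/>
<dynamicField name="*_ss" type="string" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="*_dt" type="date"   indexed="true" stored="true"/>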

What is solrconfig.xml?