UCJEPS Large Volume Data Loading Notes

Week of Jan 31 - Feb 4:

There was a problem with the loan out status vocabulary as declared in cspace-config.xml. Vocabularies declared in that file cannot have refnames, because the short identifier cannot be parsed from the refname. Once that problem was identified, the vocabularies loaded through the Java client without difficulty.

Loading the organization names and personal names was slow, but not unacceptable. Loading 8,779 personal names took just under an hour, and 49,455 organizations took approximately 6 hours. No errors were reported during that process. 369,943 collection objects were loaded in just over 55 hours. The loading rate averaged 0.48 seconds per record during the first 5 hours, then slowed to 0.6 seconds per record. It did not slow further, but remained more or less constant for the rest of the loading period.
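As a rough cross-check of the figures above, the overall seconds-per-record rate for each load can be worked out from the record counts and durations. This is only a sketch: the durations are the approximate ones quoted in these notes ("just under an hour", "approximately 6 hours", "just over 55 hours"), so the results are estimates, and the function name is my own.

```python
# Record counts and approximate durations (in hours) quoted in these notes.
loads = {
    "personal names": (8_779, 1.0),         # "just under an hour"
    "organizations": (49_455, 6.0),         # "approximately 6 hours"
    "collection objects": (369_943, 55.0),  # "just over 55 hours"
}


def seconds_per_record(records: int, hours: float) -> float:
    """Average wall-clock seconds spent per record over the whole load."""
    return hours * 3600.0 / records


for label, (records, hours) in loads.items():
    rate = seconds_per_record(records, hours)
    print(f"{label}: {rate:.2f} s/record")
```

The overall collection object figure comes out between the early rate (0.48) and the later rate (0.6), which is what one would expect from an average over the whole run.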

The JVM memory pools were monitored in the JMX Console, but the results were ambiguous and, without more information about how to interpret them, I don't consider that information useful. Using OS tools, I could see that free system memory dropped below 15 megabytes - a slim margin, but it was almost enough. The process hiccoughed only once: a single error message indicated a timeout, followed by a retry.

With (nearly) 400,000 records loaded, the "My CollectionSpace" page takes 70 seconds to load, but searches are surprisingly fast, and displaying an object record is also quick, on the order of 1-2 seconds.

Week of Feb 21 - Feb 25:

Note that some keywords have changed in the relations_common table: collection-object became collectionobjects, and loanout became loansout. If the data being loaded does not reflect this change, relations between collection objects and loans will not appear in the UI.
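One way to bring older load data in line with the renamed keywords is to rewrite the document type values before loading. The sketch below assumes each relation is held as a dict keyed by column name (documenttype1 and documenttype2, as in relations_common); the dict representation, the function name, and the sample ids are hypothetical and not part of the actual loader.

```python
# Old -> new document type keywords, as described in the note above.
RENAMED_DOCTYPES = {
    "collection-object": "collectionobjects",
    "loanout": "loansout",
}


def update_doctypes(row: dict) -> dict:
    """Return a copy of a relation row with old doctype keywords replaced.

    The row layout (a dict keyed by column name) is an assumption made
    for illustration only.
    """
    updated = dict(row)
    for column in ("documenttype1", "documenttype2"):
        if column in updated:
            value = updated[column]
            # Values that are not in the rename map are left unchanged.
            updated[column] = RENAMED_DOCTYPES.get(value, value)
    return updated


# Hypothetical sample row; the ids are placeholders.
old_row = {
    "documentid1": "obj-1",
    "documenttype1": "collection-object",
    "documentid2": "loan-1",
    "documenttype2": "loanout",
}
print(update_doctypes(old_row))
```

Rows that already use the new keywords pass through unchanged, so the function can be applied to a mixed data set.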

Some indexes were added and seemed to help performance, though it is difficult to get an accurate metric because of the query caching used by the MySQL server. I added an index to the hierarchy table on the name column, another index to the collectionobjects_common table on the objectnumber column, and two indexes to the relations_common table - one on the two columns ( documentid1, documenttype1 ) and another on the columns ( documentid2, documenttype2 ).
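For reference, the four indexes described above can be written as CREATE INDEX statements. The sketch below runs them against an in-memory SQLite database as a stand-in for the MySQL server, purely so the statements can be exercised. The table and column names come from these notes; the index names, the column types, and the cut-down table definitions are assumptions. On the real MySQL schema, a column of a TEXT type would also need a prefix length in the index definition.

```python
import sqlite3

# Index names are made up for illustration; tables and columns are from the notes.
INDEX_STATEMENTS = [
    "CREATE INDEX hierarchy_name_idx ON hierarchy (name)",
    "CREATE INDEX collectionobjects_objectnumber_idx "
    "ON collectionobjects_common (objectnumber)",
    "CREATE INDEX relations_doc1_idx "
    "ON relations_common (documentid1, documenttype1)",
    "CREATE INDEX relations_doc2_idx "
    "ON relations_common (documentid2, documenttype2)",
]


def build_demo_db() -> sqlite3.Connection:
    """Create cut-down stand-in tables and apply the four indexes."""
    conn = sqlite3.connect(":memory:")
    # Only the indexed columns are modelled; the types are assumptions.
    conn.execute("CREATE TABLE hierarchy (id TEXT, name TEXT)")
    conn.execute(
        "CREATE TABLE collectionobjects_common (id TEXT, objectnumber TEXT)"
    )
    conn.execute(
        "CREATE TABLE relations_common ("
        "id TEXT, documentid1 TEXT, documenttype1 TEXT, "
        "documentid2 TEXT, documenttype2 TEXT)"
    )
    for statement in INDEX_STATEMENTS:
        conn.execute(statement)
    return conn


conn = build_demo_db()
index_names = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'index' ORDER BY name"
    )
]
print(index_names)
```

The two composite indexes on relations_common list the id column first, following the column order given in the notes.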

A persistent problem is the way some pre-installed vocabularies are implemented through the cspace-config file: they do not accept, or work properly with, refnames.