2010 Leadership Grant

Deploying Collections

This is a top-level page for documentation related to web applications ("webapps") that UC Berkeley has developed for its CollectionSpace deployments. As of version 4.4 of CollectionSpace, these webapps are a (loosely integrated) part of CollectionSpace.

Extending CollectionSpace with Django-based webapps

The UC Berkeley CollectionSpace deployment team has extended the functionality of CollectionSpace using lightweight web applications built within the Python-based Django framework. Berkeley has created a reusable Django project that authenticates against CollectionSpace, providing a starting point for further app development. With this code, available in GitHub, Berkeley has simplified the creation of custom CollectionSpace webapps.

At this point, UC Berkeley has produced a dozen such webapps. Screenshots for a few of them are shown below:

  • Public "search portals" for:
    • The University and Jepson Herbaria collections. The UCJEPS search portal opens up the herbaria's vast collection of plant specimens to researchers and other interested parties.
    • The Berkeley Art Museum art collection. The BAMPFA search portal opens up the art museum's vast collection to researchers and other interested parties.
    • The UC Botanical Garden's collection. The UCBG search portal opens up the garden's vast collection of plants to researchers and other interested parties.
    • The Hearst Museum of Anthropology. The PAHMA search portal opens up the museum's vast collection to researchers and other interested parties.
  • A point-and-click interface to a variety of "non-CSpace" reports for the UC Botanical Garden. (login required)
  • The Bulk Media Uploader ("BMU"), a quick-and-easy interface for PAHMA to upload images and associate them with collection objects. (login required)
  • A quick-and-easy generalized search interface (to be used by PAHMA and UCBG initially).

 

...

 

...

A few details about these webapps

The search portals query a Solr datastore that holds a nightly export of data from CollectionSpace. The speed of Solr and the denormalized nature of the datastore allow the portals to return 1,000s of records in under a second.
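
As a rough illustration of the kind of request a portal might send to Solr, the sketch below builds a select URL from a set of field/value pairs. The core URL and the field name are hypothetical and are not taken from the Berkeley code; only the `q`, `rows`, and `wt` parameters are standard Solr.

```python
from urllib.parse import urlencode

# Hypothetical Solr core URL, for illustration only.
SOLR_SELECT_URL = "http://localhost:8983/solr/example-core/select"


def build_solr_query(terms, rows=1000):
    """Build a Solr select URL from a dict of field name -> search value."""
    # Quote each value so that multi-word terms are treated as phrases,
    # escaping any double quotes inside the value.
    clauses = [
        '%s:"%s"' % (field, value.replace('"', '\\"'))
        for field, value in sorted(terms.items())
    ]
    params = {
        "q": " AND ".join(clauses) if clauses else "*:*",  # match all if empty
        "rows": rows,   # number of records to return
        "wt": "json",   # ask Solr for a JSON response
    }
    return SOLR_SELECT_URL + "?" + urlencode(params)


# Example with a hypothetical field name.
url = build_solr_query({"genus_s": "Phlox"}, rows=10)
```

Because the datastore is denormalized, a query like this needs no joins, which is one reason the portals can respond quickly.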

Included within the search portal is a second webapp called imageserver. The imageserver makes authenticated requests to the UCJEPS CSpace instance to access specimen images. For each search result returned, i.e. for each specimen (cataloging) record that matches the query parameters, a REST Services call retrieves images from related media handling records.

The Botanical Garden's iReports webapp is a "standard component" which will eventually become part of the basic suite. It provides a means to access those iReports installed on an institution's CSpace server that require parameters CSpace proper cannot provide (i.e. non-CSID values such as text input).

The "Bulk Media Upload" webapp addresses a long-standing need to be able to upload batches of image files and connect them to collection objects. While the implementation supporting PAHMA requires adherence to specific conventions (e.g. the image files must be named using the exact value for the Museum Number of the object), the actual application is tiny and has been applied to (and customized for) other deployments.
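
The naming convention can be illustrated with a small sketch that derives a museum number from each file name and groups the files accordingly. This is not the BMU code: the function name and the sample museum numbers are hypothetical, and the sketch assumes the museum number is simply the file name without its extension.

```python
import os
from collections import defaultdict


def group_by_museum_number(filenames):
    """Map each museum number to the image files named after it."""
    groups = defaultdict(list)
    for name in filenames:
        # Assumption: the file name minus its extension is the museum number.
        museum_number, _ext = os.path.splitext(os.path.basename(name))
        groups[museum_number].append(name)
    return dict(groups)


# Hypothetical batch of uploaded files.
batch = group_by_museum_number(["1-100.jpg", "uploads/1-100.tif", "2-7.jpg"])
```

Each key could then be looked up against the collection object records, and files whose key matches no object could be reported back to the uploader.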

Background

...

Functionality requirements

...

Features

  • High performance: the webapps are lightweight, fast applications with a small memory footprint. Having said that, it is the app author's responsibility to see that those characteristics are maintained within any particular app.
  • Security: Webapp authors are mostly responsible for security; for example, if SQL is used, the app author must prevent SQL injection.
  • At UCB, webapps run under https, and this is generally a good practice. (This is configured in the webserver, generally Apache.)
  • Applications can require authentication with CSpace credentials. Others (e.g. public portals) require no login, but can (and do) have contextual features that are available to authenticated users.
  • Searching:
    • Is supported using Solr, Postgres, or (in principle) NXQL queries as appropriate.
    • Performance needs to be considered in how queries are done: Postgres is generally too slow to support public access, and public access to the Postgres database, even if read-only and restricted, should be considered dangerous.
    • Some webapps implement hierarchical searching, a feature not available in CSpace proper (e.g., "find all specimens within the genus Phlox", "find all artifacts from Colombia").
    • Term completion or type-ahead is implemented for all Solr fields and a number of Postgres fields (e.g. authority terms, museum numbers, etc.).
  • Images:
    • Images may or may not be made publicly accessible (depending on the webapp).
    • The imageserver supports limited constraints on image access (e.g. different derivatives may be restricted to authenticated users or hidden altogether.)
    • The imageserver supports file-based caching, which can reduce the load on CSpace servers (which are slow to render images).
  • Save results as data file: list results can be downloaded in (.csv) files for further processing with spreadsheets or other software.
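
To make the SQL injection point under Security concrete, here is a minimal sketch of a parameterized query. It uses Python's built-in sqlite3 module as a stand-in so that it runs on its own; the table and column names are hypothetical and do not reflect the CSpace schema. With a Postgres driver such as psycopg2 the placeholder is `%s` rather than `?`, but the principle is the same.

```python
import sqlite3


def find_by_museum_number(conn, museum_number):
    """Look up objects by museum number using a bound parameter."""
    # The ? placeholder makes the driver bind the value as data, so user
    # input is never interpolated into the SQL text itself.
    cur = conn.execute(
        "SELECT objectnumber, title FROM objects WHERE objectnumber = ?",
        (museum_number,),
    )
    return cur.fetchall()


# In-memory stand-in database with one hypothetical record.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE objects (objectnumber TEXT, title TEXT)")
conn.execute("INSERT INTO objects VALUES ('1-100', 'Basket')")

rows = find_by_museum_number(conn, "1-100")
hostile = find_by_museum_number(conn, "' OR '1'='1")
```

The second call passes a classic injection string; because it is bound as a value, it matches no rows instead of altering the query.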

Design, Implementation, and Deployment Considerations

  • Should web apps run on the CollectionSpace application server or on separate VMs?
  • Should apps query the Nuxeo database directly or build out a snapshot elsewhere (i.e. pSQL, NXQL, or Solr)?
  • Django has an ORM. This has not been implemented as a means to access the CSpace Postgres database directly. So far, access is via raw SQL. There seems to be some significant discussion of the advantages and disadvantages (e.g. vs. SQLAlchemy).
  • If performing SQL queries directly, credentials need to be proxied and secured.  Postgres views can provide some isolation of the data.
  • In the Portals, large search results are paginated, and there is a maximum limit. What do other sites do?
  • For our first prototype application, should we demonstrate hierarchical searching, or should we start with the simplest scenario?
  • When there are multiple images related to a collection object, should we show only the "preferred" image (PAHMA customization?) or show them in some order with a prev-next widget?
  • There are several "helper apps" that developers may wish to use: the imageserver, of course, to access images and other media; the suggest apps for obtaining term-matching / autosuggest values; and so on.
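
As an illustration of pagination with a maximum limit, the sketch below converts a page number into the `start` and `rows` values that Solr uses for paging, while refusing to go past an overall cap. The page size and cap are hypothetical values, not the ones used in the Berkeley portals.

```python
MAX_RESULTS = 2000  # hypothetical overall cap on retrievable results
PAGE_SIZE = 50      # hypothetical number of results per page


def page_window(page, page_size=PAGE_SIZE, max_results=MAX_RESULTS):
    """Return (start, rows) for a 1-based page number, honoring the cap."""
    page = max(1, page)  # treat page 0 or negative pages as the first page
    start = (page - 1) * page_size
    if start >= max_results:
        # The requested page lies entirely beyond the cap: return nothing.
        return (max_results, 0)
    # Trim the last page so start + rows never exceeds the cap.
    rows = min(page_size, max_results - start)
    return (start, rows)
```

With these values, `page_window(1)` asks for the first 50 records, and any page that starts at or beyond record 2000 yields zero rows.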

...