UC Berkeley web applications

This is the top-level page for documentation on the web applications ("webapps") that UC Berkeley has developed for its CollectionSpace deployments. As of version 4.4 of CollectionSpace, these webapps are a (loosely integrated) part of CollectionSpace.

Extending CollectionSpace with Django-based webapps

The UC Berkeley CollectionSpace deployment team has extended the functionality of CollectionSpace using lightweight web applications built within the Python-based Django framework. Berkeley has created a reusable Django project that authenticates against CollectionSpace, providing a starting point for further app development. With this code, available in GitHub, Berkeley has simplified the creation of custom CollectionSpace webapps.

At this point, UC Berkeley has produced a dozen such webapps. Screenshots for a few of them are shown below:

  • Public "search portals" for:
    • The University and Jepson Herbaria collections. The UCJEPS search portal opens up the herbaria's vast collection of plant specimens to researchers and other interested parties.
    • The Berkeley Art Museum art collection. The BAMPFA search portal opens up the art museum's vast collection to researchers and other interested parties.
    • The UC Botanical Garden's collection. The UCBG search portal opens up the garden's vast collection of plants to researchers and other interested parties.
    • The Hearst Museum of Anthropology. The PAHMA search portal opens up the museum's vast collection of artifacts to researchers and other interested parties.
  • A point-and-click interface to a variety of "non-CSpace" reports for the UC Botanical Garden. (login required)
  • The Bulk Media Uploader ("BMU"), a quick-and-easy interface for PAHMA to upload images and associate them with collection objects. (login required)

A few details about these webapps

The search portals query a Solr datastore that holds a nightly export of data from CollectionSpace. The speed of Solr and the denormalized nature of the datastore allow the portal to return thousands of records in under a second.
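As an illustration of how a portal might query the Solr datastore, here is a minimal sketch that builds a request URL for Solr's standard select handler. The host, the core name ("ucjeps-public") and the field name ("genus_s") are assumptions made for the example, not names taken from the Berkeley code; only the q, rows, start and wt parameters are standard Solr.

```python
from urllib.parse import urlencode


def escape_phrase(value):
    """Escape backslashes and double quotes so the value is safe inside a quoted Solr phrase."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def build_solr_query_url(base_url, core, field_terms, rows=50, start=0):
    """Build a Solr select URL from a {field: value} mapping.

    All terms are combined with AND; an empty mapping matches every record.
    """
    clauses = [
        f'{field}:"{escape_phrase(value)}"'
        for field, value in sorted(field_terms.items())
    ]
    params = {
        "q": " AND ".join(clauses) if clauses else "*:*",
        "rows": rows,    # page size
        "start": start,  # offset of the first record to return
        "wt": "json",    # ask Solr for a JSON response
    }
    return f"{base_url.rstrip('/')}/{core}/select?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical host, core and field names.
    print(build_solr_query_url(
        "http://localhost:8983/solr", "ucjeps-public", {"genus_s": "Phlox"}, rows=10
    ))
```

Because the datastore is denormalized, a single request of this kind can return everything needed to render a result list, with no joins at query time.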

Included within the search portal is a second webapp called imageserver. The imageserver makes authenticated requests to the UCJEPS CSpace instance to access specimen images. For each search result returned, i.e., for each specimen (cataloging) record that matches the query parameters, a REST Services call retrieves images from related media handling records.

The Botanical Garden's iReports webapp provides a means to access the iReports installed on a CSpace server that require parameters which CSpace proper cannot provide (i.e. non-CSID values such as text input).

The "Bulk Media Upload" webapp addresses a long-standing need to be able to upload batches of image files and connect them to collection objects. While the approach taken for the implementation supporting PAHMA requires adherence to specific conventions (e.g. the image files must be named using the exact value for the Museum Number of the object), the actual application is tiny and has been applied to (and customized for) other deployments.
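The naming convention can be sketched as follows. This is only an illustration under stated assumptions, not the BMU's actual code: it treats the file name minus its extension as the Museum Number, and the list of accepted extensions and the sample numbers are hypothetical.

```python
from pathlib import Path

# Assumed set of accepted image extensions; the real uploader's list may differ.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}


def museum_number_from_filename(filename):
    """Return the file name without its extension, or None for non-image files."""
    path = Path(filename)
    if path.suffix.lower() not in IMAGE_SUFFIXES:
        return None
    return path.stem


def group_by_museum_number(filenames):
    """Map each museum number to its image files; also return the skipped names."""
    matched, skipped = {}, []
    for name in filenames:
        number = museum_number_from_filename(name)
        if number is None:
            skipped.append(name)
        else:
            matched.setdefault(number, []).append(name)
    return matched, skipped


if __name__ == "__main__":
    # Hypothetical file names.
    print(group_by_museum_number(["1-12345.jpg", "1-12345.1.TIF", "notes.txt"]))
```

Each museum number found this way would then be looked up in CollectionSpace, and a media record created and related to the matching object. Those steps depend on the deployment and are not shown.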

Background

John Lowe developed a set of applications in 2012-2013 in Python using CGI for PAHMA in order to meet some rapidly evolving needs related to the major move that the museum is conducting. In about March 2013, the UCB team decided that it was time to select a more enabling framework for web applications and build an environment that would provide an excellent platform for web applications that connect to our CollectionSpace instances. The framework selected was Django. Richard Millet then built a project using Django's "authentication backend" to permit apps to authenticate with CollectionSpace servers. That project is called cspace_django_project.
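The core of such an authentication backend is a credential check against the CollectionSpace server. The sketch below shows one way this could work, assuming the server accepts HTTP Basic authentication on some services URL; it is not the code in cspace_django_project. The function names and the example URL are hypothetical, and the sketch uses only the Python standard library so that it runs without Django installed.

```python
import base64
import urllib.error
import urllib.request


def basic_auth_header(username, password):
    """Build the value of an HTTP Basic Authorization header."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    return "Basic " + token.decode("ascii")


def check_cspace_credentials(services_url, username, password,
                             opener=urllib.request.urlopen):
    """Return True if the server answers 200 to an authenticated GET.

    services_url is an assumption: any CollectionSpace services URL that
    requires a login would do. HTTP error responses (such as 401) count
    as a failed login; network failures are left to propagate.
    """
    request = urllib.request.Request(
        services_url,
        headers={"Authorization": basic_auth_header(username, password)},
    )
    try:
        with opener(request, timeout=10) as response:
            return response.status == 200
    except urllib.error.HTTPError:
        return False
```

In a Django project, a custom backend's authenticate() method could call a function like this and, on success, return the matching Django user object, creating it on first login.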

The cspace_django_project serves as the starter project for local, custom CSpace-Django projects. Using Git and GitHub, local CollectionSpace instances can fork the code to their own repository, clone it, and create a custom project – containing one or more web applications – by making modifications to the clone. The cspace_django_project, a fork of which will reside unchanged in each deployer's repository, can serve as the conduit for general bug fixes and enhancements.

A set of wiki pages, linked to in the next section, documents the procedures involved in extending a CollectionSpace instance using Django webapps.

Installation

Features

  • High performance: the webapps are lightweight, fast applications and have a small memory footprint. Having said that, it is the app author's responsibility to see that those characteristics are maintained within any particular app.
  • Security: Webapp authors are mostly responsible for security; for example, if SQL is used, the app author must prevent SQL injection.  
  • At UCB, webapps run under https, and this is generally a good practice. (This is configured in the web server, generally Apache.)
  • Some applications require authentication with CSpace credentials. Others (e.g. public portals) require no login, but can (and do) have contextual features that are available to authenticated users.
  • Searching:
    • Is supported using Solr, Postgres, or (in principle) NXQL queries.
    • Performance will certainly need to be considered in how queries are done: Postgres is generally too slow to support public access, and public access to the Postgres database, even if read-only and restricted, should be considered dangerous.
    • Some webapps implement hierarchical searching, a feature not available in CSpace proper (e.g., "find all specimens within the genus Phlox", "find all artifacts from Colombia").
    • Term completion or type-ahead is implemented for all Solr fields and a number of Postgres fields (e.g. authority terms, museum numbers, etc.).
  • Images: 
    • Images may or may not be made publicly accessible (depending on the webapp). The imageserver supports limited constraints on image access (e.g. different derivatives may be restricted to authenticated users or hidden altogether.)
    • The imageserver supports file-based caching, which can reduce the load on CSpace servers (which are slow to render images).
  • Save results as a data file: list results can be downloaded as .csv files for further processing with spreadsheets or other software.
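To illustrate the file-based caching mentioned for the imageserver, here is a minimal sketch of a read-through cache. It is not the imageserver's implementation: the function name, the key format and the fetch callable (standing in for the request to the CSpace server) are all assumptions.

```python
import hashlib
import tempfile
from pathlib import Path


def cached_image(cache_dir, image_key, fetch):
    """Return image bytes from the cache, calling fetch(image_key) only on a miss."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the key so that any string becomes a safe file name.
    path = cache_dir / hashlib.sha256(image_key.encode("utf-8")).hexdigest()
    if path.exists():
        return path.read_bytes()
    data = fetch(image_key)
    # Write to a temporary name first, then rename, so that a reader
    # never sees a partially written file.
    tmp = path.with_suffix(".tmp")
    tmp.write_bytes(data)
    tmp.replace(path)
    return data


if __name__ == "__main__":
    def slow_fetch(key):
        # Stand-in for an authenticated request to the CSpace server.
        print("fetching", key)
        return b"image-bytes"

    with tempfile.TemporaryDirectory() as directory:
        cached_image(directory, "blob-1/Thumbnail", slow_fetch)  # calls slow_fetch
        cached_image(directory, "blob-1/Thumbnail", slow_fetch)  # served from disk
```

A real cache would also need an expiry or invalidation policy so that replaced images do not linger. That is left out here.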

Design, Implementation, and Deployment Considerations

  • Should web apps run on the CollectionSpace application server or on separate VMs?
  • Should apps query the Nuxeo database directly or build out a snapshot elsewhere (i.e. pSQL, NXQL, or Solr)?
  • Django has an ORM. It has not been used as a means to access the CSpace Postgres database directly. So far, access is via raw SQL. There seems to be some significant discussion of the advantages and disadvantages (e.g. vs. SQLAlchemy, here, here).
  • If performing SQL queries directly, credentials need to be proxied and secured.  Postgres views can provide some isolation of the data.
  • In the Portals, large search results are paginated, and there is a maximum limit.  What do other sites do?
  • There are several "helper apps" that developers may wish to use – the imageserver, of course, to access images and other media; the suggest apps for obtaining term-matching / autosuggest values, and so on.
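Since access so far is via raw SQL, it may help to show the usual guard against SQL injection: passing user input as a bound parameter rather than pasting it into the query string. The sketch below uses Python's built-in sqlite3 module as a stand-in for Postgres so that it runs anywhere. The table and column names are hypothetical, and a Postgres driver such as psycopg2 uses %s placeholders instead of ?.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory database with a hypothetical objects table."""
    connection = sqlite3.connect(":memory:")
    connection.execute("CREATE TABLE objects (object_number TEXT, title TEXT)")
    connection.executemany(
        "INSERT INTO objects VALUES (?, ?)",
        [("1-100", "Basket"), ("1-200", "Bowl")],
    )
    return connection


def find_objects_by_number(connection, object_number):
    """Look up objects by number, passing the value as a bound parameter.

    The driver sends the value separately from the SQL text, so quotes
    or SQL keywords in user input are treated as data, not as code.
    """
    cursor = connection.execute(
        "SELECT object_number, title FROM objects WHERE object_number = ?",
        (object_number,),
    )
    return cursor.fetchall()


if __name__ == "__main__":
    db = make_demo_db()
    print(find_objects_by_number(db, "1-100"))
    # A classic injection attempt is compared as a literal string.
    print(find_objects_by_number(db, "1-100' OR '1'='1"))
```

Combined with a read-only database role and Postgres views, as suggested above, this limits what a malformed or malicious request can reach.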

The following links illustrate some of the efforts to implement Django-based webapps that support CSpace deployments:

  • Preliminary wireframes for a prototype UCJEPS portal, with notes about UCBG differences.
  • Design and implementation of the Bulk Media Upload facility, currently used by PAHMA.
  • "Generalized Web Portal" design and implementation.