Handling Service Interruptions

This document is a work in progress, based in part on experience supporting museums that use CollectionSpace at the University of California, Berkeley.

In its current form, it is offered as an "early access" set of notes that may help other institutions that wish to set up a program for handling incident responses and support requests. (This document will be generalized and expanded over time.)

First line of support

Prepare for service interruptions by publishing procedures and making sure that the appropriate team members are familiar with them and with their roles. Ongoing support procedures include:

  • monitoring email lists (at UCB, the two lists cspace-ucb-deployments@lists.berkeley.edu and cspace-support@lists.berkeley.edu).
  • ensuring that everyone who should be on these lists is subscribed, and perhaps also subscribed to the cspace-support list.
  • knowing how to restart Tomcat and PostgreSQL, and being ready to do so.
  • tracking down expert help as needed (at UCB, probably Ray or Aron).
  • keeping a clear chain of communication to management, the team, and users. These activities can probably be done remotely, which makes clear communication all the more important.

Incident Handling Process

The following notes assume that CollectionSpace is running on a Unix-like system (e.g. Linux or Mac OS X).

In addition, references to a runtc script for stopping and starting Tomcat, below, are local to UC Berkeley. For their generic equivalents, please see Starting Up CollectionSpace Servers and Shutting Down CollectionSpace Servers.

Here is the incident handling process we have been following at UCB. Note that this process is more about system freezes, failures, and other emergencies than bug reports.

  • Respond to customer saying we have received their message (or inform them that we have identified a problem). Try to give some indication of time or at least say you will get back in touch. Ask for more information as needed.
  • More than 95% of problems have followed a familiar pattern (one we have not actually seen for a while) in which a Tomcat restart is sufficient. Before restarting, though, some subset of the following checks can help establish the status of the system. Running all of them can take a long time, so assess the impact on the customer first to judge how urgent the problem is.
  • ssh in to the server and check the following:
    • Run top to see what processes are using memory
    • Run free -m to see memory usage
    • Run df -h to see how full disk space is
    • Run ps auxww | grep tomcat to see if tomcat is running
    • Run ps auxww | grep postgres to see what postgres processes are running (One of the processes will have a path that contains a Postgres log file)
    • Check logs such as catalina.out to see what's going on
    • The following psql statement will tell you if there are long postgres queries running:
      select * from pg_stat_activity where state <> 'idle';
      
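      If you identify a long-running or runaway query that you need to cancel, you might, with abundant caution, run the following, replacing pid with the pid value that pg_stat_activity reports for that query:
      select pg_cancel_backend(pid);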
      
  • To restart tomcat
    • cd $CATALINA_HOME/bin (is this necessary?)
    • sudo runtc stop
    • wait for that to complete
    • sudo runtc start (for dev systems this is generally sudo runtc jpda, a development mode)
    • wait for that to complete
    • try to log in to the CSpace system. Try a command or two.
  • To restart postgres
    • sudo /etc/init.d/postgresql-9.2 restart
  • Confirm that you can log in.
  • Communicate back to customer
  • If you have not gotten the system back up and running within 15 minutes, inform the customer that you are still working on the problem. Try to give an estimate, but do not leave the customer hanging. Check in frequently (every 15 minutes for a major service interruption)
  • Escalate as needed. Keep contact numbers for the responsible team members where you can find them during an incident; if you do not have a colleague's phone number, ask them politely for it ahead of time.
  • Capture log data and error messages for review. In particular, consider saving the relevant portions of the following logs:
    • $CATALINA_HOME/logs/catalina.out
    • $CATALINA_HOME/logs/cspace-services.log
    • ...
  • Study logs to see what caused the problem in the first place. Create Jira tickets if appropriate.
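
The status checks and the log capture step above could be gathered into a small helper so that the evidence is saved before a restart changes the picture. The following is only a minimal sketch, not a script used at UCB: the function name capture_incident, the default directory /tmp/cspace-incidents, and the 2000-line default are all hypothetical. It also assumes that psql can connect with its default settings (or PG* environment variables) without a password prompt; if it cannot, that step simply records the error. top is left out because it is interactive by default.

```shell
#!/bin/sh
# Sketch only: save system state and recent log output for later review.
# capture_incident, /tmp/cspace-incidents and the 2000-line default are
# hypothetical choices, not part of CollectionSpace or the UCB setup.

# Usage: capture_incident CATALINA_HOME_DIR [INCIDENT_ROOT] [TAIL_LINES]
# Prints the path of the directory it created.
capture_incident() {
    ci_home=$1
    ci_root=${2:-/tmp/cspace-incidents}
    ci_lines=${3:-2000}

    # One directory per capture, named by timestamp.
    ci_dir="$ci_root/$(date +%Y%m%d-%H%M%S)"
    mkdir -p "$ci_dir" || return 1

    # Same commands as the checklist. Each one is allowed to fail
    # (for example, free does not exist on Mac OS X); any error
    # message ends up in the corresponding output file.
    free -m   > "$ci_dir/free.txt" 2>&1 || true
    df -h     > "$ci_dir/df.txt"   2>&1 || true
    ps auxww  > "$ci_dir/ps.txt"   2>&1 || true

    # Non-idle PostgreSQL sessions, using the statement from the checklist.
    # -w stops psql from prompting for a password; connection settings
    # are assumed to come from psql defaults.
    psql -w -c "select * from pg_stat_activity where state <> 'idle';" \
        > "$ci_dir/pg_stat_activity.txt" 2>&1 || true

    # The last ci_lines lines of the two logs named above.
    for ci_log in catalina.out cspace-services.log; do
        if [ -r "$ci_home/logs/$ci_log" ]; then
            tail -n "$ci_lines" "$ci_home/logs/$ci_log" > "$ci_dir/$ci_log"
        else
            echo "not readable: $ci_home/logs/$ci_log" > "$ci_dir/$ci_log.missing"
        fi
    done

    echo "$ci_dir"
}
```

After sourcing the file, a call such as capture_incident "$CATALINA_HOME" prints the directory holding the snapshot. Depending on file permissions, the logs may only be readable with sudo, in which case the function records that they were not readable rather than failing.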