
A number of organizations have expressed an interest in providing CollectionSpace as a hosted service, either for a set of internal customers (e.g., across the University of California system, or under a governmental umbrella), or in a Software-as-a-Service (SaaS) model for paying customers. Project partners and early adopters have some experience running production instances; however, none have scaled up to a SaaS offering for many instances. To plan such an offering, providers need to know the technical and staffing requirements for a hosted service. This document gathers notes and experience to date, plus some projections of how experience with mid-sized to large deployments can be extrapolated to the needs of small to mid-sized museums.

It is assumed that during a transition/migration phase, new customers will need access to development servers that may not run as efficiently as a production instance (since they need to be isolated to provide the flexibility to reboot or even re-install, import data en masse, etc.). However, once any customization is complete and data migration has been worked through, the instance will transition to a production environment. Thus, we describe needs for production instances as well as development instances in the discussion below. An additional class of instance is also described for experimental use, either as a demo server or as a sandbox for initial investigation by prospective users.

Tech requirements

Assumptions

The figures below generally assume a current, mid-range compute node, but they apply equally to a mid-range VM. Assume about $1000-1500/yr either way (exclusive of extra storage, backups, etc.).

We generally want to separate the database server from the application (Tomcat) server. The application server generally does not have to scale with the size of the collections the way the database server does. The exception is that the app server requirements will vary with how much media processing is done by each tenant (image upload leads to generation of derivatives, which consumes memory and CPU).

To scale up, and for best performance and reliability, the database server should have fast disk (ideally RAID 10 spread across a few disks) for the DB storage, and separate storage for the logs. This storage should not be mixed with the media storage (that would be too expensive anyway). The application server needs potentially large NAS for media, but this generally need not be very fast (CollectionSpace does not serve up large volumes of media). This storage and the associated backup should be calculated separately and added to the figures below. Note that if a museum wants to use its live CollectionSpace instance to serve a public collections browser, the associated load on the application server and storage array will undermine the scaling assumptions below. The load from a popular web site might well shift a relatively small collection into the "Very Large" category if the web traffic were high.
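
For illustration, a database server laid out along these lines might put the data directory, write-ahead logs, and text logs on their own volumes. The mount points and paths below are assumptions for this sketch, not CollectionSpace requirements:

```bash
# Illustrative storage layout for the database server (mount points are
# assumptions for this sketch):
#   /srv/pgdata  - fast RAID 10 volume for the PostgreSQL data directory
#   /srv/pgwal   - separate volume for write-ahead logs
#   (media lives on a large, slower NAS mounted on the application server)

# Initialize the cluster on the fast volume, with WAL on its own volume
# (--waldir is the PostgreSQL 10+ option name).
sudo -u postgres initdb -D /srv/pgdata/main --waldir=/srv/pgwal/main

# Once the server is running, route text logs to their own location as well.
sudo -u postgres psql -c "ALTER SYSTEM SET logging_collector = on;"
sudo -u postgres psql -c "ALTER SYSTEM SET log_directory = '/var/log/postgresql';"
```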

Integration of backup and archival depends upon the SLA you are providing, and should be handled separately for media and for the database.
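
A minimal nightly backup sketch, assuming one PostgreSQL database and one media directory per tenant; the tenant names, database names, and paths below are placeholders:

```bash
#!/bin/bash
# Nightly backup sketch: database dumps and media are handled separately so
# that retention and SLA policies can differ. Tenant names, database names,
# and paths are placeholders for this example.
BACKUP_ROOT=/backups/cspace
DATE=$(date +%Y%m%d)

for TENANT in tenant1 tenant2; do
    # Compressed logical dump of the tenant's database.
    pg_dump -U postgres -Fc "${TENANT}_db" \
        > "${BACKUP_ROOT}/db/${TENANT}-${DATE}.dump"

    # Incremental sync of the tenant's media store; typically much larger
    # than the database, but slow-changing.
    rsync -a --delete "/srv/media/${TENANT}/" "${BACKUP_ROOT}/media/${TENANT}/"
done
```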

Using multiple containers (Tomcat instances running on the same server/VM) is less resource-intensive than one VM per CSpace instance, but somewhat more resource-intensive than multi-tenant (within a single container/VM). Using multiple containers or VMs simplifies audit, backup, restore, etc. of media storage. You must decide how to balance complexity of configuration and management against the technical resource costs.
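
For the multi-container approach, the usual Tomcat pattern is a single shared install (CATALINA_HOME) with a separate CATALINA_BASE directory per instance. A rough sketch follows; the directory names and ports are assumptions:

```bash
# Sketch of adding a second Tomcat container that shares one Tomcat install.
# CATALINA_HOME holds the shared binaries; each CATALINA_BASE holds one
# instance's conf/, logs/, temp/, webapps/, and work/ directories.
export CATALINA_HOME=/opt/tomcat
export CATALINA_BASE=/srv/cspace/instance2

mkdir -p "$CATALINA_BASE"/{conf,logs,temp,webapps,work}
cp -r "$CATALINA_HOME"/conf/* "$CATALINA_BASE/conf/"

# Give this instance its own ports: edit $CATALINA_BASE/conf/server.xml so the
# shutdown, HTTP, and AJP connector ports do not collide with other containers
# (e.g. 8105/8180/8109 here -- example values only).

# Start this instance against its own CATALINA_BASE.
"$CATALINA_HOME/bin/startup.sh"
```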

The table below gives current estimates of the hardware requirements for different sizes of collections. It will make sense to gather like-sized instances onto a single VM/server, just as is generally done with cloud VM services. The high end of each category is based upon actual experience, but the low end is an extrapolation.

Technical requirements by collection size and activity

| Museum Category | #records (all types) | #concurrent users | #requests/min | DB Cache Mem | #tenants/DB server | DB server | App Server Mem | #tenants/App Server | App Server | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Very Small | < 10K | 1 to 2 | < 5 | 1 GB | 8 to 16 | 4 CPU, 8 GB | 1 GB | 8 to 16 | 4 CPU, 4 GB | Assumes a multi-tenant app server. |
| Small | 10K to 100K | 2 to 4 | < 10 | 1 to 2 GB | 4 to 8 | 4 CPU, 16 GB | 1 GB | 4 to 8 | 4 CPU, 4-8 GB | Assumes a multi-container app server. |
| Medium | 100K to 500K | 3 to 5 | < 20 | 2 to 4 GB | 4 to 8 | 4 CPU, 16 to 32 GB | 1 GB | 4 to 8 | 4 CPU, 4-8 GB | Assumes a multi-container app server. |
| Large | 500K to 2M | 5 to 10 | < 30 | 8 to 16 GB | 2 | 4 CPU, 16 to 32 GB | 2-4 GB | 2 | 4 CPU, 4-8 GB | Assumes a multi-container app server. |
| Very Large | > 2M | 5 to 20 | > 30 | 16 to 32 GB | 1 | 4 CPU, 16 to 32 GB | 4 GB | 1 | 4 CPU, 4-8 GB | Should consider a separate VM for the database server. |

Staff

System administration (OS) skills are required for basic setup, configuration, tuning, and maintenance of the underlying OS environment, networking, security and hardening, etc. These are generic to the OS and environment; CollectionSpace should not add any unusual requirements above and beyond the generic for this role.

Database administration (DBA) skills are required for basic setup, configuration, tuning, and maintenance of the PostgreSQL platform. Common scripts can automate most of the DB configuration, backup, etc.; however, CollectionSpace can place a heavy load upon the DBMS for certain operations, reports, etc. Regular tuning, optimization, and management of key PostgreSQL performance settings require an experienced DBA to do well. The better the DB is tuned, the leaner the server can run (without good tuning, the common fallback is simply to add more compute and memory).
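
As an illustration of the kind of tuning involved, the "DB Cache Mem" column in the table above corresponds roughly to PostgreSQL's cache-related settings. The values below are starting points for a server hosting "Medium"-sized tenants, not recommendations, and must be adjusted by a DBA against the real workload and total server memory:

```bash
# Illustrative starting values for a DB server hosting "Medium"-sized tenants
# (2 to 4 GB of DB cache per the table above). Tune against the actual
# workload; these numbers are examples only.
sudo -u postgres psql <<'SQL'
ALTER SYSTEM SET shared_buffers = '4GB';          -- needs a restart to apply
ALTER SYSTEM SET effective_cache_size = '12GB';   -- planner hint: OS + PG cache
ALTER SYSTEM SET work_mem = '32MB';               -- per sort/hash operation
ALTER SYSTEM SET maintenance_work_mem = '512MB';  -- VACUUM, index builds
SQL

# Restart to pick up shared_buffers (service name varies by distribution).
sudo systemctl restart postgresql
```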

Tomcat administration skills are required for basic setup, configuration, and maintenance of the Tomcat application server. Common scripts can automate most of the configuration, etc.; however, an admin must have familiarity with the Tomcat server, the basics of multiple-container configuration, performance configuration for Java and the containers, etc. CollectionSpace should not add any unusual requirements above and beyond the generic for this role.
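
For example, the "App Server Mem" column above translates roughly into per-container JVM heap settings. One way to set them is Tomcat's optional bin/setenv.sh hook; the heap sizes here are illustrative only:

```bash
# $CATALINA_BASE/bin/setenv.sh -- optional hook sourced by Tomcat's startup
# scripts. The 1 GB heap matches the "App Server Mem" figure for the smaller
# categories in the table above; adjust per container as needed.
CATALINA_OPTS="$CATALINA_OPTS -Xms1024m -Xmx1024m"             # fixed 1 GB heap
CATALINA_OPTS="$CATALINA_OPTS -XX:+HeapDumpOnOutOfMemoryError" # aid post-mortem debugging
export CATALINA_OPTS
```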

CollectionSpace administration skills are required for basic setup, configuration, and maintenance of the CollectionSpace application. Common scripts can automate most of the configuration, etc.; however, an admin must be familiar with configuration of the CollectionSpace UI (HTML + JavaScript + JSON), application (Java, XML), and services (Java, XML) layers, in order to address common issues and routine configuration changes requested by users. In-depth knowledge of these layers may be deferred to dedicated customization and migration specialists, which reduces the amount of training needed for this role at a hosting service provider.

Notes

There should probably be a baseline media storage allowance for each level, above which additional storage (and backup) could be purchased.

Additional services might be developed such as workflows and connectors to push media and other document resources directly into an archival and preservation service, or closer integration of CollectionSpace with a Digital Asset Management system. The discussion above is limited to a fairly "vanilla" deployment of CollectionSpace, without allowance for either the technical or staffing/expertise needs of such additional integrations.

Note again that the estimates above assume that CollectionSpace is used for collections management only. If it is also used for public access, or as an information resource for some other demanding service, the requirements could change significantly. SLAs should be defined to make this explicit.