PAHMA data migration ETL work

This page contains information about data ETL work for PAHMA's deployment of CollectionSpace.

Current Plan

Please see our current list of questions and meeting notes.

Total fields in TMS and TMSThesaurus: 3242 (3111 + 131)
Fields to be mapped to CSpace from TMS and TMSThesaurus: 763 (699 + 64) or less

Plan is to do data migration in three phases:

  1. Earlier than main migration; can be done several weeks in advance of "go live" date
  2. Main migration; must be done in very limited time window (while TMS system is shut down?)
  3. Later migration; can be done anytime after "go live" date
    Michael T. Black has divided all TMS fields to be mapped into three groups, determining when they will be migrated:
  1. Types of data targeted for earlier migration:
    • Persons/Organizations
    • Contact information
    • Controlled lists
    • Bibliographic references
    • Storage locations
    • Place authority
    • Culture authority
    • Ethnographic use code authority
  2. Types of data targeted for main migration:
    • Cataloging info
    • Sub-objects (Components)
    • Relationships (object-object)
    • Groups
    • Dimensions
    • Containers
    • Media handling
    • Acquisitions (Accessions)
    • Object exit (Deaccessions)
    • Intake
    • Loans
    • NAGPRA information
    • all cross-references to data in earlier migration
  3. Types of data targeted for later migration:
    • Shipments
    • Exhibits
    • Versioning/audit info
    • Insurance and valuation
    • Object use
    • Conservation

Status update - 10-August-2011

  • Michael and Susan have already done a huge amount of work.  Susan's Kettle files include notes, but Michael is the person who understands the TMS data model. 
  • Most existing Kettle jobs are pretty good.  The exception is object cataloging data, which will need some significant updating.
  • All jobs need to be studied to check assumptions and look for limits that were put in place to create sample data sets.
  • Most input and output data files are in the files that Susan provided.  Some might be misplaced and need to be recreated.
  • Rather than rewrite Kettle jobs in Talend, we will probably take Kettle CSV outputs and parse them through Talend into the appropriate XML format.
  • Michael can create views in MSSQL as needed.
  • If we can figure out how to create CSIDs (UUIDs) in MSSQL, that might help data migration.
  • Some of the files from Susan's work are no longer needed.  E.g., there are jobs that are now becoming simple controlled lists.

ETL Jobs

Job

Description

Inputs

Outputs

Notes

tms_obj_v0.1_all.ktr

Object records first job

11 TMS queries

text

CSpace 1.3 schema.  For Object records run this job first.  See merge_object_data.ktr for notes on sequence for running object transformations, starting with tms_obj_v0.1_all (collection site is a job with 3 transformations)
reference and inscription are done as updates after objects loaded

tms_obj_collsite1.ktr

Object records -- collection site job#1

3 TMS queries (object site, provenance from object, textentries)

matching text file & non-mtching tex tfile

Extract sitename/sitenote/provenance and merge with textentry places

tms_obj_collsite2.ktr

Object records -- collection site job#2

3 Thesaurus files (tmsthesaurus..terms, tms..thesxrefs, tmsthesaurus..termmaster)

text file

Take records failed to match with textentry places (last step, tms_obj_collsite1.ktr) and merge them with thesaurus lookup (provenance & card header geog info)

tms_obj_collsite3.ktr

Object records -- collection site job#3

2 text file

text file

Merge the last two steps to create a single collection site  file with objectid, sitename & sitenote

coll_site_job.kjb

Sequences "collection site" job runs

3 Kettle jobs for "collection site"

resulting file from tms_obj_collsite3.ktr

Enforce the 3 jobs for object "collection site" to be run in sequential manner

tms_obj_collector.ktr

Object records -- collector

4 TMS queries (object collector, collection date, field collection number, date)

text file "tms_obj_collector"

Extract/merge to create file w/ objectid, collectorRefName, collectionDate, fieldnum, collectionnote

tms_obj_production.ktr

Create object records' production info

4 TMS queries (production person/org, date, reason-tourist, place-minting) and TSM Thesaurus of culture term

text file
"tms_obj_prod"

Merge object production person/org RefName/role with production date (made/photo/pub/mint/ruler), culture, reason and minting place

merge_object_data.ktr

Creates final Object records

4 text files (main object, collection site, collector, production)

text
"tms_obj"

run this after creating:
1. main obj data
2. collectors (using id output from 1 as input)
3. collection sites (using job that calls three transformations)
(usig ids output from 1 as input to tms_obj_collsite1)
4. tms_obj_production (using id output from 1 as input)
BE SURE TO CHANGE OBJECT INPUT AND OUTPUT EXTENSION
n=140K (4/1/2011)

reference.ktr

Creates object's reference string

3 TMS queries (referencemaster, refobjxref, object) and a CSID text file

text
"tms_obj_ref"

Creating "ref" by concatenating title/subtitle/placepublished/copyright... in referencemaster & refobjxref DBs, then merge/join w/ "ref" (created by normalizing pubreference & bibliography) from "object" DB.  At last step, merge in CSID.(link key objectid=tmsID).
n=3598 (4/1/2011)

tms_inscription.ktr

Creates objects inscription information

3 TMS queries (TextEntries, Constituents, object) and a CSID text file

text
"tms_inscription"

Merge object's (1st/single) TextTypeID/TextEntry (i.e.Inscription Content)/Inscriber/MTB_dated/MTB_Language/MTB_Interpretation/MTB_...) with CSID (link key objectid=tmsID).
n=1052 (4/1/2011)

media.ktr

Creates Media Handling records

 

text

CSpace 1.6 schema.

acq_xml.ktr

Creates Acquisition records (PAHMA accessions)

8 TMS queries, 1 txt file of accession methods from another job

XML file (/home/sstone/cspace_general/import/acq_import.xml), CSID output (/tmp/acq_csids)

Note: Collector refname should be collecting event here.
Note: This outputs 10,000 acquisition xml records per file without adding top-level element.

loan_in.ktr

Creates Loan In records

 

text

CSpace 1.3 schema

loan_out.ktr

Creates Loan Out records

 

text

CSpace 1.3 schema

loc_move.ktr

Creates Location and Movement Control records

 

text

CSpace 1.3 schema

object_exit.ktr

Creates Object Exit records

 

text

CSpace 1.3 schema

 

 

 

 

 

obsolete jobs

 

 

 

 

person_authority.ktr
N.B. please see updated info

Creates Person authority records

3 TMS queries (person, birth, death)
N.B. these are obsolete

person.txt (7012 records in file dated Feb 11, 2011)

Mapping is pretty good.  Legacy data has many duplicates with alternative spellings.  MTB has created a field that identifies the preferred name for each record.  We should take advantage of that to help create synonym relationships. Job can select distinct Salutation and NameTitle values too.
Note: Michael will move birth and death dates from condates into constituents (or will have these dates manually re-entered post-migration).
Note: Don't really want to remove these delete ones (filter out records marked delete). They should be reviewed.

org_authority.ktr
N.B. please see updated info

Creates Organization authority records

2 TMS queries (org, foundation date)
N.B. these are obsolete

org.txt (1577 records in file dated Feb 14, 2011)

Mapping is pretty good.  Legacy data has many duplicates with alternative spellings.  MTB has created a field that identifies the preferred name for each record.  We should take advantage of that to help create synonym relationships. Can produce Org Functions values too.
Note: No End Date Info for Organizations in ConDates. This info will be moved into constituents.
Note: need remarks field in org authority