2010 Leadership Grant

Deploying Collections

Skip to end of metadata
Go to start of metadata

With this iteration, we are targeting the data migration mappings for the migration of SAGE data into a CollectionSpace instance, based on the version 2.4 release of CollectionSpace. This launch is intended to support the full migration of the UC Botanical Garden SAGE system into a production CollectionSpace system.  Changes from the previous mapping are in blue.

  File Modified
Microsoft Excel Sheet cspace_sage_geography.xls SAGE geography (Place), v1 Aug 13, 2012 by Chris Hoffman
Microsoft Excel Sheet cspace_sage_gardenlocation.xls SAGE garden locations (storage locations), v1 Aug 13, 2012 by Chris Hoffman
Microsoft Excel Sheet cspace_sage_xdb_screens.xls List of screens from legacy XDB system Aug 13, 2012 by Chris Hoffman
Microsoft Excel Sheet cspace_sage_agent.xls SAGE persons and organizations, v1 Aug 13, 2012 by Chris Hoffman
Microsoft Excel Sheet cspace_sage_taxonomy.xls SAGE taxonomy map, v1 Aug 13, 2012 by Chris Hoffman
Microsoft Excel Sheet cspace_sage_accession.xls SAGE accessions map, v3 Oct 11, 2012 by Chris Hoffman

Authorities

Person (default person vocabulary) - Lam (SQL ready), Talend dev (Yuteh)

Tentatively, we will have one person vocabulary, mapping the agents from SAGE that are also in the persons table.

There might be some need to produce merge parts, for instance if we need to create, preferred and non-preferred terms for individual records.

Field mapping

SQL

Notes

  •  

Talend/ETL

bgimportperson1 0.2: Revising earlier talend job (based on 1.8) to change refname format, remove ampersand handling, and include collectionspace_core:uri element

New Talend job by Yuteh: BG_person 0.1

MERGE: Person Terms (repeating group) - Lam (SQL ready), Talend job (yuteh)

E.g., Different name forms (preferred, alias)

Field mapping

SQL

Notes

Talend/ETL

New Talend job by Yuteh: BG_personTerm 0.1

Organization (default Org vocabulary will be UCBG institutions) - Lam (SQL ready), Talend dev (yuteh)

We will have the following three organization vocabularies:

  • organizations (institutions from SAGE organizations plus the voucher institutions)
  • collectors (coming from the separate SAGE table)
  • groups (mostly KEW names for nomenclature)

Field mapping

SQL

Notes

Talend/ETL

New Talend job by Yuteh: BG_org_org 0.1

MERGE: Organization Terms (repeating group) - Lam (SQL ready), Talend dev (Yuteh)

Field mapping

SQL

Notes

Talend/ETL

New Talend job by Yuteh: BG_org_orgTerm 0.1

MERGE: Organization Associated Persons (repeating group) - Lam (SQL ready), Talend dev (Yuteh)

Field mapping

SQL

Notes

Talend/ETL

New Talend job by Yuteh: BG_org_orgAssocPsn 0.1

Collectors (Organization vocabulary) - Lam (SQL ready), Talend dev (Yuteh)

Note: Only one name in SAGE so no repeating groups

Field mapping

SQL

Notes

Talend/ETL

New Talend job by Yuteh: BG_org_collector 0.1

Groups (Organization Vocabulary) - Lam (SQL ready), Talend dev (Yuteh)

Field mapping

SQL

Notes

Talend/ETL

New Talend job by Yuteh: BG_org_group 0.1

MERGE: Groups Terms (repeating group) - Lam (SQL ready), Talend dev (Yuteh)

Field mapping

SQL

Notes

Talend/ETL

New Talend job by Yuteh: BG_org_groupTerm 0.1

MERGE: Groups Associated Persons (repeating group) - Lam (SQL ready), Talend dev (Yuteh)

Field mapping

SQL

Notes

Talend/ETL

New Talend job by Yuteh: BG_org_groupAssocPsn 0.1

Taxonomy (default taxonomy vocabulary for verified, and possibly unverified, names) - Lam (SQL ready), Yuteh (Talend)

Note: This query requires a table be refreshed on SAGE, cspace_lexicon_cv.

Field mapping

SQL

Notes

  • Not getting author shortIDs in view, just names. For now putting them in Taxon Notes. Adjust view or get them separately in SQL.
  • Do we need to adjust the format of scientific names?
  • Lots of other Sage data for Names to add
  • SQL query above is a union of names without scope notes and names with scope notes.  the query that added an outer join with the scope notes table took over 50 minutes to run, so it was easier to split it out into a union of two queries.
  • SQL query above excludes the following node_types: bioname, common, na, root.  It also excludes TRE, TRV, and TNA terms. TNA terms will be added as an alternate name of type 'no author name'.  TRE names are all bionames, so the bioname exclusion will also exclude TRE names.
  • This query requires a table be refreshed on SAGE, cspace_lexicon_cv.

Talend/ETL

bgtaxonimport2 0.2

MERGE: Taxon Terms (repeating group) - Lam (SQL ready), Yuteh (Talend)

Field mapping

SQL

Note: This query requires a table be refreshed on SAGE, cspace_lexicon_cv.

Notes

  • Note: This query requires a table be refreshed on SAGE, cspace_lexicon_cv.

Talend/ETL

MERGE: Taxon Authors (repeating group) - Lam (SQL ready)

Field mapping

SQL

Notes

Talend/ETL

MERGE: Taxon Common Names (repeating group) - Lam (SQL ready)

Field mapping

Icon

Need to get relationship type to and map to values in CSpace: Preferred, Secondary, Rejected

SQL

Notes

Talend/ETL

MERGE: Taxon Attributes (repeating group) - Lam (SQL ready)

E.g., dimensions, habitat, etc.

Field mapping

SQL

Notes

Talend/ETL

MERGE: Taxon Related Terms (repeating group) - Lam (SQL ready)

Might be needed for relationships to synonyms, unverified terms, verified terms (?), to hybrid parent names (?), etc.

Field mapping

SQL

Notes

Talend/ETL

Taxonomic Hierarchy (hierarchical relations) - Lam (SQL ready)

Based on parent taxa identified in taxon table

Field mapping

SQL

Notes

Talend/ETL

Common Taxonomic Names (common vocabulary in taxonomy authority) - Lam (SQL ready), Yuteh (Talend)

Icon

Should get Access Code in this SQL too, right?

Field mapping

SQL

Notes

exact duplicates (across term, access_code, valid_catalog_term_flag):

select count(*), term
from lexicon where node_type = 'common'
group by term, access_code, valid_catalog_term_flag
having count(*) > 1
order by term

2 Arnica
2 California-Lilac
2 crimson glory vine
2 Hakeke, Mountain Holly
2 Horoeka, Lancewood
2 Horopito, Pepper Tree
2 Lion's Spoor
2 Monkeyflower
2 Mountain Rimu, Pigmy pine
2 Mountain totara, Snow totara
2 Papapa, Snowberry
2 Tanekaha, Celery pine
2 Tarata, Lemonwood
2 Wi Kura, Narrow-leaved Snow Tussock

These two names are term duplicates (access_code or valid_catalog_term_flag are different)

2 SECTION CHONIASTRUM
2 ti kauka, Cabbage tree

Here are the duplicates that are referenced in lexcon_term_relation for common names:

select distinct l.lexicon_id, l.term
from lexicon l, lexicon_term_relation ltr
where l.lexicon_id = ltr.lexicon_id
and relation_type in ('TCN', 'TCNS', 'TCNR')
and l.term in (
        select term
        from lexicon
        where node_type = 'common'
        group by term
        having count(*) > 1)
order by term, lexicon_id;

40000115 Arnica
40002353 Arnica
40000552 California-Lilac
40000588 California-Lilac
40006165 Crimson Glory Vine
40001587 Hakeke, Mountain Holly
40001588 Hakeke, Mountain Holly
40001687 Horoeka, Lancewood
40001688 Horoeka, Lancewood
40001690 Horopito, Pepper Tree
40001691 Horopito, Pepper tree
40024379 Lion's Spoor
40024418 Lion's Spoor
40002311 Monkeyflower
40002314 Monkeyflower
40002384 Mountain Rimu, Pigmy pine
40002385 Mountain Rimu, Pigmy Pine
40002391 Mountain totara, Snow totara
40002392 Mountain totara, Snow totara
40002615 Papapa, Snowberry
40002616 Papapa, Snowberry
40003741 Tanekaha, Celery pine
40003742 Tanekaha, Celery pine
40003751 Tarata, Lemonwood
40003752 Tarata, Lemonwood
40003809 Ti kauka, Cabbage tree
40004151 Wi Kura, Narrow-leaved Snow Tussock
40004152 Wi Kura, Narrow-leaved snow tussock

Note: 40004151 and 40004152 are duplicates because of case insensitivity in Sybase db.

Talend/ETL

Plant Sales taxonomic vocabulary (plantsales vocabulary in taxonomy authority)

We will be constructing a taxonomy vocabulary (third instance) called Plant Sales that will be used to support the Pot Tag procedure.  This will be made up of the unique taxonomic names in SAGE's volunteer_label table (using the formatted name field).

Question: Will we also use family names from the volunteer_label table?

Field Mapping

...

SQL

Notes

...

Talend/ETL

...

Garden Locations (Storage locations vocabulary) - SQL ready, Talend (Chris)

Field mapping

SQL

Notes

  • Location Types: bed, section, collection
  • Location Type format: <locationType>urn:cspace:botgarden.cspace.berkeley.edu:vocabularies:name(locationtype):item:name(drawer)'Drawer'</locationType>
  • Vocabulary CSID to persist: c2833911-4bca-49aa-a3f3
  • Refname format: <refName>urn:cspace:botgarden.cspace.berkeley.edu:locationauthorities:name(location):item:name(crhbed11350962164634)'crhbed1'</refName>

Talend/ETL

bgimportgardenlocation 0.2 in Chris's Talend botgarden repository

Icon

Rename output file from ucbggardenlocation job to gardenlocout.txt!!

Garden Locations Hierarchy (hierarchical relationships) - SQL ready, Talend (Chris)

Field mapping

SQL

Notes

Talend/ETL

Icon

Rename output file from ucbggardenlocation job to gardenlocout.txt!!

ucbglocrelations v0.1in Chris's Talend botgarden repository

Icon

Job creates 571 relations, but one needs to be removed.  garden0/'none' has no broader record but has output in the file.  Remove by hand.  Looked like:

Place Authority - Lam (SQL), Yuteh (Talend)

Field mapping

SQL

Notes

Talend/ETL

MERGE: Place terms (repeating group)

Field mapping

SQL

Notes

Talend/ETL

Place hierarchy (hierarchical relationships)

Field mapping

SQL

Notes

Talend/ETL

Research Projects (Concept vocabulary) - Chris (SQL ready), Talend (Yuteh)

Field mapping

SQL

Notes

  • refname format: urn:cspace:botgarden.cspace.berkeley.edu:conceptauthorities:name(research_ca):item:name(research11350934474341)'research1'
  • vocabulary CSID to persist: deb53f97-9829-47b9-99cb

Talend/ETL

New Talend job by Yuteh: BG_concept_researchProj 0.1

Classes (Concept vocabulary) - Chris (SQL ready), Talend (Yuteh)

Field mapping

SQL

Notes

  • Institution (inst_id) is treated as the parent institution of the department (org_id).  However, this does not necessarily match what is recorded in the agent tables.  We will put the institution name in the scope notes field.
  • refname format: urn:cspace:botgarden.cspace.berkeley.edu:conceptauthorities:name(class_ca):item:name(crhtestclass11350934263503)'crhtestclass1'
  • vocabulary CSID to persist: 96141839-4bac-4e3d-a511

Talend/ETL

New Talend job by Yuteh: BG_concept_class 0.1

Propagation medium (Concept vocabulary) - Chris (SQL ready), Talend (Yuteh)

Field mapping

SQL

Notes

  • Construct the shortidentifier as 'medium' + an incrementing number, in Talend
  • Refname format: urn:cspace:botgarden.cspace.berkeley.edu:conceptauthorities:name(concept):item:name(medium11350937075031)'medium1'
  • Vocabulary CSID to persist: bbe39bb2-a622-4570-a946

Talend/ETL

New Talend job by Yuteh: BG_concept_propgMedium 0.1

Conservation Category (concept vocabulary) - Chris (SQL ready), Talend (Yuteh)

Field mapping

SQL

Notes

Conservation Categories are codes that are tied to Conservation Organizations.  Each organization can have its own list of conservation codes, and these can and do change over time.  Most have a code of 'rare'. 

Need to figure out what display name will be.  Should the display name in the vocabulary be like: "EN IUCN-1997", "IUCN-1997: Endangered - meets criteria for endangered and is facing a high risk of extinction in the wild".  Think about how term completion will work.  Might need to derive some organization abbreviations for these display names.  Consider these "endangered" categories from different organizations (org, code, description):

Talend/ETL

New Talend job by Yuteh: BG_concept_conservCatg 0.1

Collection Objects

Accessions - SQL ready (Chris), Talend dev (Chris), Data botgarden-dev loaded (Chris)

Field mapping

SQL

Notes

  • Exclude from requestedBy:
    • urn:cspace:botgarden.cspace.berkeley.edu:personauthorities:name(person):item:name(0)'Owner, Database'
  • Exclude from identBy: 
    • urn:cspace:botgarden.cspace.berkeley.edu:personauthorities:name(person):item:name(0)'Owner, Database'
    • urn:cspace:botgarden.cspace.berkeley.edu:null:item:name(null)'null'

Talend/ETL

ucbgaccessions 0.5 in Chris's Talend repository, botgarden

Run Steps

  • Merge into accessions the fragments for:
    • determinations
    • collection sites
    • accession use
  • Change "schema2" and "schema3" to "schema"
  • Import

MERGE: Taxonomic Determinations and Hybrid Parents (repeating group) - SQL ready to test (Chris), Talend dev-sample data ready (Chris)

It will be easier to extract Hybrid Parents at the same time as determinations in this case. Talend can handle a single repeating group. Data will be denormalized in SQL via unions and then grouped by Talend. This is complicated by the fact that some identifications come from the accession table; others come from plant_identity.

Field mapping

SQL

Notes

So this is now determinations plus hybrids in one big denormalized set.  Psuedo-SQL:

1 for all accessions (including the 418 that have no accepted record in plant_identity): get plant_identity records where FHP = -5 and MHP = -5 (one row per plant_identity record if no hybrid parent; multiple rows per accession, usually one with accepted_flag=1; order by accession_number, accepted_flag desc)
union
2 for all accessions (including the 418 that have no accepted record in plant_identity): get plant_identity records where FHP <> -5 or MHP <> -5 (usually two rows per plant_identity record if any hybrid parent; multiple rows/pairs per accession with one pair having accepted flag =1.  Order by accession_number, accepted_flag desc, parent_type female before male)
union
3. for the 418 accessions that have no accepted record in plant_identity, get the accession record where FHP = -5 and MHP = -5 (one row per accession record if no hybrid parent; set accepted flag=1? or not; maybe indicate this is a record from accession table; no accepted record in plant_identity)
union
4. for the 418 accessions that have no accepted record in plant_identity, get the accession records where FHP <> -5 or MHP <> -5 (usually two rows per accession record if any hybrid parent; order by accession_number, parent_type female before male; set accepted flag=1)

There are 12 non-preferred terms used throughout this set of 51K determinations. These need a different refname than is stored in the database. Rather than code that into the Talend job, I will just edit the merge file by hand to edit in the non-preferred term's refname.

Talend/ETL

Chris ucbgdeterminations 0.5 in Chris's Talend repository botgarden

MERGE: Collection Sites (not repeating but easier to extract separately) - SQL dev (Chris), Talend dev-sample data ready (Chris)

Icon

Including significant cleanup of geography data from SAGE.

Field mapping

These are the rules we are using for cleaning up the geographic data and collection sites in SAGE:

For accessions from years 2000 and later:
  • If Collection Site = 'Earth' or Hort, and Accession Notes has parentheses
    • Populate the new Taxonomic Range field (in Collection Object, Locality Info) with the Accession Notes info in parentheses (Taxonomic Range data come from accession table, parenthetical text in accession.accession_notes)
    • Field collection place is blank
    • Field collection county, Field collection state, Field collection country are blank
  • If Collection Site = some known non-Earth, non-Hort location
    • Taxonomic range information is blank
    • Field collection place gets the Collection Site value (which will be controlled by the Place Authority) (Field Collection Place data come from collection_site table, via geog_lexicon.geog_term)
    • Field collection county, state, and country will be blank in the cataloging record but will be available via the Place Authority by importing hierarchical Place Authority records.
For accessions from years 1999 and earlier:
  • If any of the county, state, country fields has parentheses, use data from
    • Taxonomic Range data come from Original Accession Geography, the accession_loc table (accession_loc.country_id, accession_loc.state_id, accession_loc.county_id). Strip parentheses, concatenate, and put into the new Taxonomic range field.
    • Field collection place is blank
    • Field collection county, Field collection state, Field collection country are blank
  • If all three do not have parentheses, and collection site is not Earth or Hort., use data from Collection Site
    • Taxonomic range field is blank
    • Field Collection Place data come from collection_site table (geog_lexicon.geog_term) Collection Site
    • Field collection county, state, and country will be blank in the cataloging record but will be available via the Place Authority by importing hierarchical Place Authority records.
  • If all three do not have parentheses; collection site = Earth or Hort.; and there are some legitimate non-range values in country, state, or county
    • Taxonomic range field is blank
    • Field collection place is blank
    • Field collection county, Field collection state, Field collection country get the values from Original Accession Geography (accession_loc)

Note: If collection site = 'hort.' or if country = 'HORT.' then the cultivated checkbox should be checked, and provenance drop down should be "not wild source".

SQL

Notes

Sites can have hort and range..

County, state, country only used in a small handful of cases where they provided useful information not in collection site or taxonomic range.  Concatenated onto verbatim locality field.

Talend/ETL

ucbgcollectionsite 0.6 in Chris's Talend repository botgarden

MERGE: Use of Accessions (repeating group) - SQL ready (Chris), Talend dev-sample data ready (Chris)

Research projects and classes

Field mapping

SQL

Notes

URN format for materialType: <materialType>urn:cspace:botgarden.cspace.berkeley.edu:vocabularies:name(materialtype):item:name(bulb)'bulb'</materialType>

Talend/ETL

ucbgaccessionuse 0.1 in Chris's Talend repository botgarden

Procedures

Plant Locations (LMI procedure, including current location and history/version) and Relationships to Objects - SQL ready (Chris), Talend (Yuteh)

Icon

See attached accessions map, current_location and accession_history tables

Field mapping

SQL

Initially, this SQL just gets the current location records, plant_location table.

Notes

Accession History records are being loaded as versioned LMI records.  We still need to know how to do that.

Talend/ETL

Icon

Uses output from ucbgaccessions Talend job that contains CSIDs and Object Numbers. Rename most recent output file to ucbg-accessions-ids.txt

Icon

Also uses a file sage_createupdateby_lookup.txt that contains a few user email addresses to decode the staff ID for the createdBy and updateBy info

ucbgplantlocations 0.1

Run steps

  • Movements: Change "schema2" to "schema"
  • Movements: Import
  • Move-to-object relationships: Import
  • Object-to-move relationships: Import

Vouchers and Voucher-Object relationships - SQL ready (Chris), Talend dev (Lam)

Field mapping

Using Loan Out for Voucher

SQL

Notes

Still need to get UCJEPS accession number for some number of vouchers

  • Uses ID output from job that creates garden locations vocabulary.  Make sure file is renamed to ucbg-gardenloc-ids.txt
  • Uses ID output from job that creates accessions.  Make sure that file is renamed to ucbg-accessions-ids.txt

Talend/ETL

ucbgvoucher 0.1

Run steps

  • Vouchers: Change "schema2" and "schema3" to "schema"
  • Vouchers: Import
  • Voucher-to-object relationships: Import
  • Object-to-voucher relationships: Import

Propagation  SQL ready (Chris), REST Import development (John)

Field mapping

SQL

Control lists:

  • Pot Type: select pot_type from pot_type_code order by pot_type;
  • Pot Size: select pot_size from pot_size_code order by pot_size;
  • Treatment Type: select treatment_type_name from treatment_type order by treatment_type_name;

Notes

See Botanical propagation data analysis

Prop Type:

  • lookup table: prop_type_code
  • propagation.prop_type = pro_type_code.prop_type
  • 21 values in prop_type_code

Reason for Prop: propagation.purpose

  • database has a rule that limits the values to: 'class use', 'research', 'garden collection',
    'distribution', 'conservation'
  • 5 distinct values in propagation.purpose

Extra Seeds: propagation.extra_seed_flag

  • boolean (0,1)

Activity Type:

  • lookup table: activity_type_code
  • activity_type_code.activity_type where growth_history.activity_type_id = activity_type_code.activity_type_id
  • 24 values in activity_type_code

Medium:

  • lookup table: medium
  • growth_history.medium_name = medium.medium_name
  • 1302 values in medium

Pot Type:d

  • lookup table: pot_type
  • growth_history.pot_type = pot_type_code.pot_type
  • 35 values in pot_type_code

Pot Size:

  • lookup table: pot_size_code
  • pot_size_code.pot_size where growth_history.pot_size_id = pot_size_code.pot_size_id
  • 69 values in pot_size_code

Treatment Type:

  • lookup table: treatment_type
  • treatment_type.treatment_type_name where treatment.treatment_type_id = treatment_type.treatment_type_id
  • 91 values in treatment_type

Concentration: treatment.concentration

  • varchar(10) field
  • no rules or constraints on this field
  • 80 distinct values in treatment.concentration

Talend/ETL

MERGE: Propagation Activities (repeating group) SQL ready (Chris), REST import (John)

Field mapping

SQL

Notes

See Botanical propagation data analysis

Relationships between Accessions and Propagations; REST import (John)

Field mapping

SQL

Notes

Talend/ETL

Talend/ETL

Distributions and Distribution-Accession relationships (using Object Exit procedure) - SQL dev (Chris), Talend dev (Chris), Data botgarden-dev loaded (Chris)

Field mapping

SQL

Notes

  • Uses ID output from job that creates accessions.  Make sure that file is renamed to ucbg-accessions-ids.txt

Talend/ETL

ucbgdistributions 0.1

Run steps

  • Distributions: Change "schema2" to "schema"
  • Distributions: Import
  • Distribution-to-object relationships: Import
  • Object-to-distribution relationships: Import

Pot Tags (Volunteer Labels) - Lam (SQL), Chris (Talend)

Field mapping

SQL

Notes

Talend/ETL

ucbgpottags 0.1 creates pot tag records and relationship records to collection objects

  • No labels