UCJEPS-CollectionSpace data mapping, v2.3

With this iteration, we are targeting the data migration mappings for the migration of SMaSCH data into a CollectionSpace instance, based on the version 2.3 release of CollectionSpace. Although there will be additional changes supported in the future, this is the major migration and launch release.

  File Modified

Microsoft Excel Sheet cspace_smasch_authorities_map.xls Authories, owned by Chris, with taxonomy

May 09, 2012 by Chris Hoffman

Microsoft Excel Sheet cspace_smasch_accession_map.xls Chris: Accessions, etc., with minor updates

May 10, 2012 by Chris Hoffman

Microsoft Excel Spreadsheet cspace_smasch_loanin_map.xlsx Add Major Group field

May 29, 2012 by Chris Hoffman

Microsoft Excel Spreadsheet cspace_smasch_loanout_map.xlsx Add Major Group field

May 29, 2012 by Chris Hoffman

Cataloging

Core Specimen data

See attachment for accession mapping. The grid below is not complete.

Field mapping

SMaSCH

CollectionSpace object schema

CS User Interface Title (** = different from UI)

Note

accession.accession_id

object_number

Accession Number **

Change field title

institution.inst_name

collection

Collection

Dropdown (Collection values)

accession.coll_num_prefix + coll_number + coll_num_suffix

other_number

Number

Moved to Field Collection Number.

"collector number"

other_number_type

Number Type

Not needed if we move this to Field Collection Number.

accession.notes

comments

Comments


objkind.kind

form (Object Description Information)

Form

Dropdown (Form values)

taxon_fullname.fullname

title

Taxon **

Change field title

accession.phenology (decoded)

phase (Object Description Information)

Phenology **

Change field title. Dropdown (Phenology values)

accession.early_jdate (calc)

field_collection_date_earliest

Field Collection Date Earliest

Custom 

accession.late_jdate (calc)

field_collection_date_latest

Field Collection Date Latest

Custom

accession.datestring

field_collection_date

Field Collection Date

We are converting this to the structured date format

committee.committee_abbr

field_collection_collector

Field Collection Collector


accession.loc_lat_decimal

field_loc_lat_decimal

Field Location Latitude Decimal

 

accession.loc_long_decimal

field_loc_long_decimal

Field Location Longitude Decimal


accession.loc_place (calc)

field_collection_place (?)

Field Location Verbatim**

Will map to verbatim field in schema extension for now.  When the Place Authority is available, we will work on creating authority records as well.

accession.loc_county

field_loc_county

Field Location County

Custom

accession.loc_state

field_loc_state

Field Location State

Custom

accession.loc_country

field_loc_country (temp custom)

Field Location Country

Custom

accession.loc_elevation

field_loc_elevation (temp custom)

Field Location Country

Custom

accession.coll_num_prefix + coll_number + coll_num_suffix

?

Field collection number


committee.committee_abbr

?

Field collection collector


"catalog date"

date_association

Date Association

Fixed text. Do not load into 1.0.  May end up in Associated Date information eventually, or create date/modify date.

accession.catalog_date

catalog_date

Date Text

Fixed text. Do not load into 1.0.  May end up in Associated Date information eventually, or create date/modify date.

SQL query

Using three views to get subsets of the records, identical in their selects, except for the beginning of their accession number:

  • cspace_collobj_uc_2_3b
  • cspace_collobj_jeps_2_3b
  • cspace_collobj_others_2_3b

These views make the minimal joins needed since refnames (collectors, collection number assignors, filed_in_cabinet genus) will be looked up using text files coming from earlier Talend jobs. 

ETL

  • Talend: GET_COLLOBJ_UC, GET_COLLOBJ_JEPS, GET_COLLOBJ_OTHERS
  • Inputs
    • cspace_collobj_uc/jeps/others views
    • ucjeps.cspace.berkeley.edu: finds accessions already in CSpace (so we don't add those a second time)
    • all org-collectors to get refnames for collectors (ucjepsorgs-all.txt -- this file created by combining two files by hand)
    • all org-collectors to get refnames for collection number assignors (ucjepsorgs-all.txt -- this file created by combining two files by hand)
    • all taxon-taxon records to get refnames for filed_in_cabinet values (smasch-taxonids.txt -- this file created by combining two files by hand)
    • locations_from_smasch.txt -- a text file provided by Dick Moe that combines several locality fields in SMaSCH into one location field for each accession.  Dick uses this to produce location data for CCH.
  • Outputs
    • objectsincspacealready-others.txt -- objects already in CSpace that will need to be updated separately
    • collobj-others-add.xml -- main XML output to be merged with other files
    • collobject-others-ids.txt -- text file with legacy_id, CSID, uri, and some refnames for bookkeeping

Notes

Controlled list values

See these links

Cataloging data sets for dropdowns and controlled lists
  1. Collection values:
    1. Jepson Herbarium
    2. University Herbarium, University of California, Berkeley (UC)
    3. University of California
    4. University of California- Santa Barbara
    5. Herbarium (UCSC)
    6. DHN
  2. Form values:
    1. Illustration
    2. Mounted on Paper
    3. Photocopy
    4. Photograph
    5. Stored in a Box or Bag
  3. Phase (Phenology) values:
    1. Cone
    2. Flowering
    3. Flowering/Fruiting
    4. Fruiting
    5. Spores/Sporangia
    6. Vegetative (non-reproductive)
  4. Loaned Object Status values:
    1. Active (Unknown)
    2. Active (All Out)
    3. Active (All In)
    4. Active (Partial)
    5. Active (Discrepancy)
    6. Cancelled (Unknown)
    7. Cancelled (All Out)
    8. Cancelled (All In)
    9. Cancelled (Partial)
    10. Cancelled (Discrepancy)
  5. Other Number Type values:
    1. Other herbaria accession number
    2. Copied from accession number
    3. Exsiccatae number
    4. Project number
    5. Genbank number
    6. Destructive sampling number
    7. Internal cross reference
    8. Miscellaneous

MERGE: Scientific Determination History (repeating within Cataloging schema)

Combines the name as filed and the annotation_history table from SMASCH.

Field mapping

See the determinations worksheet in the attached accessions mapping document

SQL

With some dates fixed, all determinations are now in cspace_determ_2_3.

Variant determinations: vname_id field recalculated to be "unverified"||vname_id in order to avoid collisions with taxon_id

ETL

Talend job MERGE_COLLOBJ_DETERMINATIONS (0.2)

Run job three times to modify SQL on view to get subsets: UC, JEPS, OTHER. These will map against Collection Object subsets.

JEPS:
where object_number like 'JEPS%'
order by object_number, legacy_id

UC:
where substring(object_number, 1, patindex('%[A-Z][0-9]%', object_number)) = 'UC'
order by object_number, legacy_id

OTHER:
where substring(object_number, 1, patindex('%[A-Z][0-9]%', object_number)) in ('UCSC', 'UCLA', 'UCSB', 'DHN')
order by object_number, legacy_id

Inputs

  • org_determiners (org-determiners vocabulary refnames)
  • all_taxon (text file with default taxonomy refnames) and taxon_unverified (text file with unverified/verified taxonomy refnames) are merged (tUnite)

Notes

View modified to remove time from aa.determined_date.

MERGE: Type Kind information (repeating within Cataloging schema)

Field mapping

See the type kind worksheet in the attached accessions mapping document

SQL

Using this query with view

SELECT
    legacy_id spectype_id,
    rtrim(object_number) accession_id,
    type_kind,
    REFERENCE,
    figure,
    determined_by_id,
    determined_by,
    notes,
    basionym_id,
    verified_basionym
FROM
    dbo.cspace_spectype_2_3
where object_number <> ''
order by object_number, legacy_id desc

create view cspace_spectype_2_3 as
select
    ti.typeinfo_id as legacy_id,
    rtrim(ti.accession_id) as object_number,
    tk.kind as type_kind,
    case
        when ti.place_pub is not null and ti.place_pub != ''
                and ti.year_pub is not null and ti.year_pub not in ('', '0', 'unkn')
            then ti.place_pub + ', ' + ti.year_pub
        when (ti.place_pub is null or ti.place_pub = '')
                and ti.year_pub is not null and ti.year_pub not in ('', '0', 'unkn')
            then year_pub
        else null
    end as reference,
    ti.figure as figure,
    ti.typ_committee_id as determined_by_id,
    c.committee_abbr as determined_by,
    case
        when ti.notes is null and v.descr is not null
            then 'Type Voucher: ' + v.descr
        when ti.notes is null and v.descr is null
            then null
        when ti.notes is not null and v.descr is not null
            then ti.notes + '; Type Voucher: ' + v.descr
        when ti.notes is not null and v.descr is null
            then ti.notes
    end as notes,
    ti.basionym_id as basionym_id,
    tf.fullname as verified_basionym
from
    types_info ti,
    typekind tk,
    committee c,
    voucher v,
    taxon_fullname tf
where ti.typekind_id *= tk.typekind_id
and ti.typ_committee_id *= c.committee_id
and ti.accession_id *= v.accession_id
and v.vouchkind_id = 56
and ti.basionym_id *= tf.taxon_id

ETL

Talend job MERGE_COLLOBJ_SPECTYPE

Notes

MERGE: Text annotations (repeating within Cataloging schema)

Note that in SMaSCH, a table named voucher was used to capture text notes of various kinds.  Many of these are being cleaned up and put in other places, but some will remain.

Field mapping

See vouchers worksheet in attached accession mapping document

SQL

SELECT
    object_number,
    annotation_type,
    annotation_note,
    annotation_date
FROM
    dbo.cspace_anno_2_3
ORDER by object_number, annotation_date
create view cspace_anno_2_3 as
select
    rtrim(v.accession_id) as object_number,
    vk.kind as annotation_type,
    v.descr as annotation_note,
    v.voucher_date as annotation_date
from
    voucher v,
    vouchkind vk
where v.vouchkind_id *= vk.vouchkind_id
and v.vouchkind_id in (24, 29, 41, 56, 73, 78, 79)
union
select
    rtrim(ti.accession_id) as object_number,
    'holotype location' as annotation_type,
    ti.inst_abbr as annotation_note,
    ti.mod_date as annotation_date
from types_info ti
where ti.inst_abbr is not null
and ti.inst_abbr != ''
union
select
    aa.accession_id as object_number,
    'additional taxa' as annotation_type,
    tf.fullname as annotation_note,
    aa.folder_name_date as annotation_date
from
    annotation_asfiled aa,
    taxon_fullname tf
where aa.addl_taxon_id *= tf.taxon_id
and addl_taxon_id is not null and addl_taxon_id != 0

ETL

Using the SQL above directly for now.  Rtrim the accession_id.

Talend job MERGE_COLLOBJ_ANNOTATIONS

Notes

MERGE: Hybrid Parents (repeating within Cataloging schema)

From annotation_asfiled table

Field mapping

See annotations tab in attached accessions mapping document

SQL

SELECT
annotation_asfiled.accession_id, annotation_asfiled.parent_name_id,
  (case when parent_name_modifier_kind.taxon_modifier=' ' then null else
 parent_name_modifier_kind.taxon_modifier end) as parent_modifier
FROM
    annotation_asfiled, parent_name_modifier_kind
where annotation_asfiled.parent_name_modifier_id = parent_name_modifier_kind.taxon_modifier_id and
parent_name_id <> 0
order by accession_id, parent_seqno

ETL

Talend job MERGE_COLLOBJ_HYBRIDPARENTS

Notes

MERGE: Other Numbers (repeating within Cataloging schema)

From accession and voucher tables

Field mapping

See voucher worksheet in attached accession mapping document.

SQL

SELECT
    legacy_id,
    other_number,
    other_number_type
FROM
    dbo.cspace_othernum_2_3
ORDER by legacy_id
create view cspace_othernum_2_3 as
select
    rtrim(v.accession_id) as legacy_id,
    v.descr as other_number,
    case
        when v.vouchkind_id = 45 then
            'GenBank Code'
        when v.vouchkind_id = 55 then
            'Other Label Number'
        when v.vouchkind_id = 71 then
            'U.C. Botanical Garden Accession Number'
    end as other_number_type
from
    voucher v
where v.vouchkind_id in (45, 55, 71)

ETL

Talend job MERGE_COLLOBJ_OTHERNUM

Notes

45 = 'genbank code', 55 = 'other label numbers', 71 = 'U.C. Botanical Garden'

Merge collection objects and load

UPDATE: Collection Object - URI (done)

Use SQL to update collectionspace_core:uri for existing collection objects.

Update collection object records already in ucjeps.cspace (Phase II)

There are approximately 99 collection objects in ucjeps.cspace already.  All are UC.  Check to see if any have been modified in CSpace and update as needed.

Procedures

Insert loan object status values into dynamic controlled list

UCJEPS needs approximately 12 new values in a dynamic controlled list that is used by both loans in and loans out.  These can be loaded via POST commands and should be ready to go.

Loans Out

Field mapping

See attachment for field mapping. This grid is outdated.

SMaSCH

CollectionSpace loans out schema

CS User Interface Title (** = different from UI)

Note

loan_event.uc_loan_num + loan_event.jeps_loan_num

Loan out number

 

smasch has two types of loan identifiers.
Loan Number relates accessions in a single collection that make up a loan.
Loan ID relates loans from each collection that make up a loan.
"Loan Number" = collection + loan_num
"Loan ID" = loan_id

loan_event.curr_inst

Borrower

 

Probably should keep this as borrowing institution (organization authority)

loan_event:         
  curr_for_fname,
  curr_for_lname,
  curr_for_position

Borrower's contact

 

Probably should keep this as borrowing agent (person authority)

loan_event.inhouse_notes + loan_event.noteworthy_inclusions

Loan out note

 

 

loan_event.loan_status +
loan_detail.return_status

Loaned object status

 

It looks like this field is not currently active.  smasch keeps track of two status types:
loan status ('Active' or 'Cancelled')
return status ('All in', 'All out', 'Partial', 'Discrepancy', 'Unknown')

Loan date?

 

Loan Out Date

Does SMaSCH have a loan date?

SQL query

view:

Notes

MERGE: Loan Out Items (repeating group within Loan Out procedure records)

From loan_accn_ids table

Field mapping

SQL

ETL

Notes

MERGE: Loan Taxonomy (repeating group within Loan Out procedure records)

From several fields in loan_event table

Field mapping

SQL

ETL

Notes

MERGE: Loan Out Returns (repeating group within Loan Out procedure records)

From return_detail table

Field mapping

SQL

ETL

Notes

MERGE: Loan Out Transfers (repeating group within Loan Out procedure records)

From transfer_event table and possibly from loan_event table (orig_inst, orig_for_lname, orig_for_fname, orig_for_position)

Field mapping

SQL

ETL

Notes

Merge loan out information and load

Object to Loan Out Relationships (Phase II)

These will be added as detail items.  Relationship records might not be needed.

Relationships between UC and JEPS loans that have the same loan_id

Relationship records need to be created between separate loan out records to handle UC and JEPS loans that have the same loan_id.

Field mapping

SQL

ETL

Notes

Borrows (Loans In)

Field mapping

See attachment for field mapping.

SQL

Notes

MERGE: Loan In Taxonomy (repeating group within Loan In procedure records)

From several fields in borrow table

Field mapping

SQL

ETL

Notes

MERGE: Loan In Returns (repeating group within Loan In procedure records)

From borrow_returns table

Field mapping

SQL

ETL

Notes

MERGE: Loan In Transfers (repeating group within Loan In procedure records)

From several fields in borrow_xfer_out table

Field mapping

SQL

ETL

Notes

Merge loans in information and load

Handle unmatched borrow_return records

There are 20 records in borrow_return whose borrow_id does not match with one in the main borrow table.  They look like they could match on lender_loan_num and institution. It will probably be easiest to update these by hand.

Handle unmatched borrow_xfer_out records

There are3 records in borrow_xfer_out whose borrow_id does not match with one in the main borrow table.  They look like they could match on lender_loan_num and institution. It will probably be easiest to update these by hand.

Location (Location and Movement procedure)

Use annotation_asfiled.collectionkind_id (Name of Collection, e.g., Main, Reference, Type) to indicate storage of specimen

Field mapping

SQL

ETL

Notes

Object-Location Relationship records

Field mapping

SQL

ETL

Notes

Vocabularies (load these first)

Person vocabulary (default person-person)

Field mapping

See agents worksheet in authorities mapping file for current field mapping.

SQL query

View: cspace_person_2_3.

Note: View modified to exclude the persons from loans and borrows.  Those are now handled separately.

ETL

Talend job GET_PERSON_ADDS adds new person records.

Updates to be found later.

Obsolete 5/18: Talend job diff_person3 compares SMaSCH view (cspace_persons_2_0) to ucjeps.cspace (persons_common) on legacy_id, displayname, bio_notes, birthdate, and deathdate.  May 12: Found 3 new records and 2 updates. This can also be confirmed by viewing person table in SMaSCH in table order, noting the records added or updated in reverse chronological order.

Talend job GET_PERSON_V1 (Hoffman) was used to load data initially:

  • Input: Query view
  • See  SMaSCH data mapping notes for some special data handling
  • Output: XML file for loading (batches of 3000)
  • Output: List of CSIDs (to facilitate deleting)
  • Output: CSV file of basic identifiers (for record keeping and as input to one of the Org loading jobs).
  • 8945 rows

Notes

Status: SMaSCH persons records already loaded, plus persons derived from borrow and loan tables. However, these might have been updated: Person records might be updated, and borrow and loan records might be added or changed.  Person records last queried from SMaSCH on 3/8/2012 at 9:32.

Creates URI

5/18/2012

New view persons_2_3 includes people from borrows, loans, transfers, etc, so new approach required.

Excluding records where forename and surname are both null.

Too many non-unique names picked up by the select distinct statements for the trans_person table.  Need to fix.

Those records are static from transaction tables, so they will have to be refreshed when it's time to switchover!

ADD records can be identifying by comparing legacy_id in SMaSCH view to person_common.shortidentifier in CSpace.

UPDATES can be found two ways: 1) using my person_diff approach AFTER new records have been added or 2) doing a table scan on smasch.person and looking in database order.

ucjepspersonids-all.txt is used by many other jobs.  Create it by concatenating the ID's output from original load (in CRH archive2 directory) with new adds from this batch.

UPDATE: Person - URI (done)

Existing person records need collectionspace_core:uri.  REST update does not work on the collectionspace_core schema, so this has to be done in SQL.

Delete loan persons loaded already into ucjeps.cspace

See if any are used already.  Can use short identifiers in a certain range.

Add person records parsed from various loans and borrows tables

In SMaSCH, there are several unstructured person fields.  These are being parsed and then loaded as unique person authority records.

Organization-Collectors vocabulary (default organization vocabulary)

From SMaSCH committee table where committee_func= "coll"

Field mapping

See attachment (agents worksheet in authority mapping file) for current field mapping

SQL query

Use view: cspace_org_2_0 but limit to func="coll"; link to committee table to get collection number assignor (links back into committee where func=coll) and data source and to limit the query to get collection committee records added since last load.

ETL

Talend job GET_COLLECTOR_ADDS, just finds adds. 

Talend job GET_COLLECTOR_UPDATES finds 10 records to update (by hand presumably, output format is for Import Service do not run).

Talend job GET_ORG_V2 (Hoffman) was used to load initially:

  • See SMaSCH data mapping notes for some special handling
  • Input: Query view
  • Input: ucjepspersonout-ids.txt from GET_PERSON_V1 (to get refnames of organization contacts from person authority)
  • Output: XML file for loading (batches of 4000)
  • Output: List of CSIDs (to facilitate deleting)
  • Output: CSV file of basic identifiers (for record keeping).
  • 49845 rows

Talend job (Cheng):

  • Input: Text file from Dick Moe, non-colliding-collectors.in, containing names of seaweed collectors
  • 1546 rows
  • Already loaded into ucjeps.cspace

Notes

Need to update existing collector committees also with collection number assignor

Creating URI in ADD and UPDATE

UPDATE: Organization - URI (done)

Existing organization records need collectionspace_core:uri.  REST update does not work on the collectionspace_core schema, so this has to be done in SQL.

UPDATE: Organization - collection number assignors

Update existing collectors with their collection number assignor from SMaSCH.  REST api.

Organization-Nomenclature Committee vocabulary (organization-nomenclature vocabulary)

From SMaSCH committee table where committee_func= "nomen".

Field mapping

See notes on Organization-Collectors above.

SQL query

ETL

Talend job GET_ORG_NOMEN, gets all nomenclature committees and adds to new org-nomenclature vocabulary

Notes

Creates URI

Need to remove existing nomenclature records from default organization vocabulary and update refnames as needed.

Organization-Determination Committee vocabulary (organization-determination vocabulary)

From SMaSCH committee table where committee_func= "annot".

Field mapping

See notes on Organization-Collectors above.

SQL query

select legacy_id, display_name, long_name, short_name,
foundation_place, contact_name, org_function,
substring(contact_name, (patindex('%SMASCH%', dbo.cspace_org_2_0.contact_name)+6),
((patindex('%)%', dbo.cspace_org_2_0.contact_name))-(patindex('%SMASCH%',
  dbo.cspace_org_2_0.contact_name)+6))) contact_id, coll_num_person_id, data_src_name
from dbo.cspace_org_2_0, dbo.committee, data_source
where legacy_id=convert(varchar,
  committee_id) and dbo.committee.data_src_id=data_source.data_src_id and
 committee_func = 'annot

ETL

Talend job GET_ORG_DETERMINATION gets all determination committees for adding to new org-determination vocabulary

Notes

Creating URI

Need to remove existing determination records from default organization vocabulary and update refnames as needed.

Organization-Type Assertion Committee vocabulary (organization-typeassertion vocabulary)

From SMaSCH committee table where committee_func= "typ".

Field mapping

See notes on Organization-Collectors above.

SQL query

select legacy_id, display_name, long_name, short_name,
foundation_place, contact_name, org_function,
substring(contact_name, (patindex('%SMASCH%', dbo.cspace_org_2_0.contact_name)+6),
((patindex('%)%', dbo.cspace_org_2_0.contact_name))-(patindex('%SMASCH%',
  dbo.cspace_org_2_0.contact_name)+6))) contact_id, coll_num_person_id, data_src_name
from dbo.cspace_org_2_0, dbo.committee, data_source
where legacy_id=convert(varchar,
  committee_id) and dbo.committee.data_src_id=data_source.data_src_id and
 committee_func = 'typ

ETL

Talend job GET_ORG_TYPEASSERTION gets all committees of function "typ" for loading into new org-typeassertion vocabulary

Notes

Creates URI

Need to remove existing type assertion records from default organization vocabulary and update refnames as needed.

Clean up organization-organization (collectors) vocabulary in CSpace

Remove nomen, determ, type, and inst values loaded previously.

Organization-Institution vocabulary (organization-institute vocabulary)

Field mapping

See the agents worksheet in the authorities mapping attachment

Use the institute table

SQL query

Access the institute table directly, including the contact schema fields.

Might need to transform the short identifiers and refnames for institutes whose abbreviation starts with an asterisk. No, they seem to act OK.

SELECT dbo.institute.inst_abbr,
		dbo.institute.inst_name,
		dbo.institute.inst_address1,
		dbo.institute.inst_address2,
		dbo.institute.inst_city,
		dbo.institute.inst_state,
		dbo.institute.inst_zipcode,
		dbo.institute.inst_country,
		dbo.institute.location,
		dbo.institute.exch_bal_total,
		dbo.institute.exch_bal_vasc,
		dbo.institute.exch_bal_myc,
		dbo.institute.exch_bal_phyc,
		dbo.institute.exch_bal_bry,
		dbo.institute.exch_bal_types,
		dbo.institute.exch_bal_correction,
		dbo.institute.areas_they_want,
		dbo.institute.taxa_they_want,
		dbo.institute.areas_they_have,
		dbo.institute.taxa_they_have,
		dbo.institute.swap_for_lit,
		dbo.institute.avg_desid,
		dbo.institute.avg_specs,
		dbo.institute.phone,
		dbo.institute.FAX,
		dbo.institute.URL,
		dbo.institute.email
FROM	dbo.institute

ETL

Talend job GET_ORG_INSTITUTION creates XML outputs for org-institution vocabulary and contacts schema.

Notes

Creating URI in Institution and Contact

Need to remove existing institution records from default organization vocabulary and update refnames as needed.

Add institutions parsed from loans and borrows tables

There are approximately 12 institutions in various loans and borrows tables whose abbreviation does not match any in the institute table.

Common Names vocabulary (taxon-common vocabulary)

Note: Load before taxon-taxon vocabulary

Use the common_name table.

Create collectionspace_core:URI

Field mapping

See the taxonomy worksheet in the attached authorities mapping attachment

SQL

SELECT
    common_name_id,
    name common_name,
    dbo.common_name.data_src_id,
    dbo.data_source.data_src_name
FROM
    dbo.common_name, dbo.data_source
where dbo.common_name.data_src_id = dbo.data_source.data_src_id
order by common_name_id

ETL

Talend job GET_TAXONOMY_COMMON

Notes

Controlled list values for the Name Source field on taxonomy vocabularies (taxon-taxon, taxon-common, and taxon-unverified) are documented at (they are identical to the display value):

Taxonomy vocabulary (default taxon-taxon vocabulary)

NOTE: Need to load the taxon-common vocabulary first!

The taxon_name table is the primary source for this information. Some information coming from taxon_fullname and taxon_noauth_name. Note the repeating information for Taxonomy Authors and Common Names described below.

Note: Check to see if inAuth value is identical on ucjeps2 and ucjeps.cspace.  They are: 87036424-e55f-4e39-bd12

Field mapping

See the taxonomy worksheet in the authorities mapping attachment

SQL

Using the view cspace_taxonomy_2_3.

ETL

Talend job GET_TAXONOMY_V1 (0.2)

Input: basionym_lookup.txt (replace with all_taxon?)

Notes

Basionym values handled (skipped if 0 or equal to taxon_id).

Controlled list values for the Name Source field on taxonomy vocabularies (taxon-taxon, taxon-common, and taxon-unverified) are documented in

UPDATE: Taxon - URI

Existing taxonomy records need collectionspace_core:uri.  REST update does not work on the collectionspace_core schema, so this has to be done in SQL.

9292 records have a valid basionym

MERGE: Common Names for Taxonomic Terms (repeating within Taxonomy schema for taxon-taxon vocabulary)

Field mapping

See the taxonomy worksheet in the authorities mapping attachment

Use the smasch_common_names table.

Get Common Name refnames from the taxon-common output

SQL

SELECT
    taxon_id,
    common_name_id
FROM
    dbo.smasch_common_names
order by taxon_id

ETL

Talend job MERGE_TAXON_COMMON

Notes

MERGE: Authors for Taxonomic Terms (repeating within Taxonomy schema for taxon-taxon vocabulary)

Field mapping

See the taxonomy worksheet in the authorities mapping attachment.

From taxon_name table (or a new view)

Get author refnames from the org-nomenclature output

SQL

SELECT
    taxon_id,
    author_id,
    'author' author_type
FROM
    dbo.taxon_name
where author_id > 0
union
SELECT
    taxon_id,
    ascr_auth_id,
    'ascribed author' author_type
FROM
    dbo.taxon_name
where ascr_auth_id > 0
union
SELECT
    taxon_id,
    p_author_id,
    'parenthetical author' author_type
FROM
    dbo.taxon_name
where p_author_id > 0
union
SELECT
    taxon_id,
    p_ascr_auth_id,
    'parenthetical ascribed author' author_type
FROM
    dbo.taxon_name
where p_ascr_auth_id > 0
order by taxon_id, author_type

ETL

Talend job MERGE_TAXON_AUTHORS

Notes

Taxonomy Hierarchical Relations

Field mapping

See taxonomy worksheet in authorities mapping file.

Use taxon_name fields (taxon_id and parent_id) to create hierarchical relationship records.  Get their refnames from the taxon-taxon output.

SQL

Notes

Merge default taxonomy information and load

Unverified Taxonomic Names vocabulary (taxon-unverified vocabulary)

Field mapping

See the taxonomy worksheet in the authorities mapping attachment

Use variant_taxon_name table.

SQL

SELECT
    vname_id,
    variant_name,
    dbo.variant_taxon_name.data_src_id,
    dbo.data_source.data_src_name
FROM
    dbo.variant_taxon_name, dbo.data_source
where dbo.variant_taxon_name.data_src_id = dbo.data_source.data_src_id
order by vname_id

ETL

Talend job GET_TAXONOMY_UNVERIFIED

Notes

Changing legacy_id to "unverified" + vname_id so that taxon-variant and taxon-taxon lookup files can be combined for MERGE_COLLOBJ_DETERMINATIONS