CineFiles Data Mapping

Notes, queries, and mapping documents related to the migration of CineFiles data into CollectionSpace will be documented here.

Data mapping: See the Data Mapping Google Doc for field-based mapping

BAMPFA-1: Jira task for ER diagrams, spreadsheets for field mapping, etc.

Initial observations based on team discussion on 20 June 2013

Authorities

Citations - Authority Source vocabulary - SQL: Chris, Talend: Chris

Create new citation vocabulary to hold authority source terms

Field mapping

SQL

Using a CSV file exported from dbVisualizer as the data source (14 records)

Notes

Talend/ETL

Person - default person vocabulary - SQL: Glen, Talend: Glen

We will have one person vocabulary, mapping the names from Sybase that are person records.

There might be some need to produce merge parts, for instance if we need to create, preferred and non-preferred terms for individual records.

Field mapping

SQL

SELECT
   name_id,
   ( convert( varchar(15), names.name_id), '') as shortIdentifier,
   ISNULL( fname, '' ) as foreName,
   ISNULL( mname, '' ) as middleName,
   lname as surName,
   ISNULL( gen_suffix,'' ) as nameAddition,

   fname + substring(' ', sign(datalength(fname)), 1) +
   mname + substring(' ', sign(datalength(mname)), 1) +
   lname + substring(' ', sign(datalength(gen_suffix)), 1) +
   gen_suffix as displayName,

   ISNULL(convert( char(4), datepart( year, dob )), '' ) as birthYear,
   dob as earliestScalarBirthDate,
   dateadd( second, -1, dateadd( year, 1, dob )) as latestScalarBirthDate,

   ISNULL(convert( char(4), datepart( year, dod )), '' ) as deathYear,
   dod as earliestScalarDeathDate,
   dateadd( second, -1, dateadd( year, 1, dod )) as latestScalarDeathhDate,

   ISNULL( city, '' ) as city,
   ISNULL( state, '' ) as state,
   cntry_id,
   ISNULL( biog, '' ) as bioNote,
   name_type,
   ISNULL( convert( char(1), code ), '' ) as accessCode,
   entered,
   modified,
   ISNULL( entered_by, '' ) as enteredBy,
   ISNULL( modified_by, '' ) as modifiedBy,
   ISNULL( verified_by, '' ) as verifiedBy,
   ISNULL( src_id, '' ) as authSource,
   ISNULL( note, '' ) as nameNote,
   ISNULL( url, '' ) as URL

FROM
   names

WHERE
   name_type = 1
   AND
   name_id > 1

Notes

Used ISNULL function everywhere that it made sense to avoid the endless checking for nulls in Talend.
Aliased files names to match the import tags they will be mapped to.
Name recort 1 is a dummy record so it is excluded.

Note: Access Code mapping.  These are the option values mapped in to CollectionSpace:

0 <option id="PFA Staff Only">PFA Staff Only</option>
1 <option id="In House Only">In House Only</option>
2 <option id="Campus (UCB)">Campus (UCB)</option>
3 <option id="Education (.edu)">Education (.edu)</option>
4 <option id="World">World</option>
5 World
Note that 5 exists in the database as a convenience but basically also maps to World. See http://issues.collectionspace.org/browse/BAMPFA-36 for more information.

Talend/ETL

MERGE: Person names - SQL: Glen, Talend: Glen

In order to incorporate primary and alternate names.

Field mapping

SQL

SELECT
   name_id,
   0 as altname_id,
   ISNULL( convert( varchar( 15 ), name_id ), '' ) as shortIdentifier,
   ISNULL( fname, '' ) as foreName,
   ISNULL( mname, '' ) as middleName,
   lname as surName,
   ISNULL( gen_suffix, '' ) as nameAddition,
   fname + substring( ' ', sign( datalength( fname ) ), 1 ) +
   mname + substring( ' ', sign( datalength( mname ) ), 1 ) +
   lname + substring( ' ', sign( datalength( gen_suffix ) ), 1 ) +
   gen_suffix as displayName,
   '' as srcID,
   '' as  nameNote
FROM
   names

WHERE
   name_type = 1
   AND
   name_id > 1

UNION

SELECT
   a.name_id,
   a.altname_id,
   ISNULL( convert( varchar( 15 ), a.name_id ), '' ) as shortIdentifier,
   ISNULL( a.fname, '' ) as foreName,
   ISNULL( a.mname, '' ) as middleName,
   a.lname as surName,
   ISNULL( a.gen_suffix, '' ) as nameAddition,
   a.fname + substring( ' ', sign( datalength( a.fname ) ), 1 ) +
   a.mname + substring( ' ', sign( datalength( a.mname ) ), 1 ) +
   a.lname + substring( ' ', sign( datalength( a.gen_suffix ) ), 1 ) +
   a.gen_suffix as displayName,
   ISNULL( a.src_id, '' ) as srcID,
   ISNULL( a.note, '' ) as nameNote

FROM altnames a, names n
     HAVING ( n.name_type = 1
        AND  n.name_id = a.name_id
        AND  n.name_id > 1  )

ORDER BY name_id, altname_id

Notes

Includes preferred and non-preferred names in one query. A '0' value for altname_id indicates a preferred name.

Talend/ETL

Organization - corporate names vocabulary - SQL: Glen, Talend: Glen

Use the default org vocabulary for Sybase corporate names

There might be some need to produce merge parts, for instance if we need to create, preferred and non-preferred terms for individual records.

Field mapping

SQL

SELECT
   name_id,
   ISNULL( convert( varchar(15), names.name_id), '') as shortIdentifier,
   lname as displayName,
   ISNULL( city, '' ) as foundingCity,
   ISNULL( state, '' ) as foundingState,
   cntry_id,

   ISNULL( convert( char(4), datepart( year, dob )), '' ) as foundingYear,

   ISNULL( dob, NULL ) as earliestScalarBirthDate,
   dateadd( second, -1, dateadd( year, 1, dob )) as latestScalarBirthDate,

   ISNULL( biog, '' ) as bioNote,
   ISNULL( note, '' ) as nameNote,
   ISNULL( src_id, '' ) as authSourceID,

   ISNULL( convert( char(1), code ), '' ) as accessCode,
   ISNULL( url, '' ) as URL,

   entered,
   modified,
   ISNULL( entered_by, '' ) as enteredBy,
   ISNULL( modified_by, '' ) as modifiedBy

FROM
   names

WHERE
   name_type = 2 AND name_id > 1

ORDER BY name_id

Notes

Talend/ETL

Organization - committees vocabulary - SQL: Chris, Talend: Chris

Create a new org vocabulary for Sybase committee names

Field mapping

SQL

SELECT name_id, lname, name_type, code, src_id, note,
entered, entered_by, modified, modified_by
FROM names
where name_type=3
order by lname

Notes

Talend/ETL

MERGE: Corporate names - SQL: Glen, Talend: Glen

In order to incorporate primary and alternate names.

Field mapping

SQL

SELECT
   name_id,
   0 as altname_id,
   ISNULL( convert( varchar( 15 ), name_id ), '' ) as shortIdentifier,
   ISNULL( fname, '' ) as foreName,
   ISNULL( mname, '' ) as middleName,
   lname as surName,
   ISNULL( gen_suffix, '' ) as nameAddition,
   fname + substring( ' ', sign( datalength( fname ) ), 1 ) +
   mname + substring( ' ', sign( datalength( mname ) ), 1 ) +
   lname + substring( ' ', sign( datalength( gen_suffix ) ), 1 ) +
   gen_suffix as displayName,
   '' as srcID,
   '' as nameNote

FROM
   names

WHERE
   name_type = 1 AND name_id > 1

UNION

SELECT
   a.name_id,
   a.altname_id,
   ISNULL( convert( varchar( 15 ), a.name_id ), '' ) as shortIdentifier,
   ISNULL( a.fname, '' ) as foreName,
   ISNULL( a.mname, '' ) as middleName,
   a.lname as surName,
   ISNULL( a.gen_suffix, '' ) as nameAddition,
   a.fname + substring( ' ', sign( datalength( a.fname ) ), 1 ) +
   a.mname + substring( ' ', sign( datalength( a.mname ) ), 1 ) +
   a.lname + substring( ' ', sign( datalength( a.gen_suffix ) ), 1 ) +
   a.gen_suffix as displayName,
   ISNULL( a.src_id, '' ) as srcID,
   ISNULL( a.note, '' ) as nameNote

FROM
   altnames a, names n

HAVING ( n.name_type = 1
   AND n.name_id = a.name_id
   AND n.name_id > 1 )

ORDER BY
   name_id, altname_id

Notes

Talend/ETL

MERGE: Committee Members for committees (org vocabulary) - SQL: Chris, Talend: Chris

Field mapping

SQL

SELECT cm.name_id committeeId, rtrim(cm.src_id) src_id,
   p.fname + substring(' ', sign(datalength(p.fname)), 1) +
   p.mname + substring(' ', sign(datalength(p.mname)), 1) +
   p.lname + substring(' ', sign(datalength(p.gen_suffix)), 1) +
   p.gen_suffix as member_displayName,
   ISNULL( convert( varchar(15), p.name_id), '') as member_shortIdentifier
FROM committeemembers cm, names c, names p
where cm.name_id=c.name_id and cm.member_id=p.name_id
and c.name_type=3 and cm.member_id<>1
order by c.name_id, member_displayName

Notes

Talend/ETL

Contact schema records for Organization - corporate names vocabulary - SQL: Glen, Talend: Glen

We do have URLs for approximately 200 corporate name records. Those are imported as a separate job with a pointer to the org record that the contact info should be associated with. See http://wiki.collectionspace.org/display/collectionspace/Organization+Service+REST+APIs#OrganizationServiceRESTAPIs-ContactCRUDLservices for more information.

Field mapping

SQL

...

Notes

Talend/ETL

Concept - genre vocabulary - SQL: Chris, Talend: Chris

Use default concept vocabulary to hold genres

Field mapping

SQL

SELECT g.genre_id,
   'genre'||convert(varchar, genre_id) shortid,
   g.genre,
   rtrim(src_id) authsource,
   rtrim(note) note
FROM genres g
where genre_id <> 1
order by genre

Notes

Talend/ETL

Concept - subject vocabulary - SQL: Chris, Talend: Chris

Create new concept vocabulary to hold subject terms

Field mapping

SQL

SELECT  subj_id,
        'subject'||convert(varchar, subj_id) shortid,
        subj,
        rtrim(src_id),
        note
FROM subjects
where subj_id <> 1
order by subje

Notes

Talend/ETL

MERGE: Alternate genre terms - SQL: Chris, Talend: Chris

Field mapping

SQL

SELECT g.genre_id,
   'genre'||convert(varchar, genre_id) shortid,
   g.genre,
   rtrim(src_id) authsource,
   rtrim(note) note,
   0 sortid
FROM genres g
where genre_id <> 1
union
select gl.genre_id,
      'genre'||convert(varchar, genre_id) shortid,
      gl.genre_term,
      null authsource,
      null note,
      gl.altgenre_id sortid
from genres_list gl
where altgenre_id <> 0
order by genre_id, sortid

Notes

Talend/ETL

Works - Film Vocabulary - SQL: Lam, Talend: Lam

Use default Works vocabulary to hold films

Merge groups will be needed

Field mapping

SQL

select
    f.film_id as legacy_film_id,
    'pfafilm' + convert(varchar(10), f.film_id) as short_identifier,
    substring(
        substring(
            f.prefix + ' ', 1,
            datalength( f.prefix + ' ' ) -
            sign( patindex( '%[''-]', f.prefix ))),
            sign( patindex( '%[^^I ]%', f.prefix )), 10 ) + f.title as display_name,
    'urn:cspace:cinefiles.cspace.berkeley.edu:vocabularies:name(worktype):item:name(film)''Film''' as work_type,
    f.note as history_notes,
    f.entered as created_at,
    case
        when f.entered_by = 'lt' then 'Linda Tadic'
        else
            eb.fname + substring( ' ', sign( datalength( eb.fname ) ), 1 )
            + eb.mname + substring( ' ', sign( datalength( eb.mname ) ), 1 )
            + eb.lname
    end as created_by,
    f.modified as updated_at,
    case
        when f.modified_by = 'lt' then 'Linda Tadic'
        else
            mb.fname + substring( ' ', sign( datalength( mb.fname ) ), 1 )
            + mb.mname + substring( ' ', sign( datalength( mb.mname ) ), 1 )
            + mb.lname
    end as updated_by,
    'urn:cspace:cinefiles.cspace.berkeley.edu:workauthorities:name(work):item:name(pfafilm'
        + convert(varchar(10), f.film_id) + ')'''
        + substring(
        substring(
            f.prefix + ' ', 1,
            datalength( f.prefix + ' ' ) -
            sign( patindex( '%[''-]', f.prefix ))),
            sign( patindex( '%[^^I ]%', f.prefix )), 10 ) + f.title
        + '''' as refname
from films f, pfastaff eb, pfastaff mb
where f.entered_by *= eb.initials
and f.modified_by *= mb.initials
and f.film_id > 1
order by legacy_film_id;

Notes

Talend/ETL

MERGE: Work Term Names (Film Titles) - SQL: Lam, Talend: Lam

Order: Original titles, English Titles, Alternate Titles

Encountered problem with $ in refname.  Four records that contain the $ character in the title failed to load.  Glen's investigation pinpointed the problem to the $ character in the refname.  The occurrence of the $ in other fields seem to be okay.  Glen substituted '$' for the $ character, and loading still failed.  Removing the $ from the refname, but leaving the $ in other fields resulted in a successful load.

Field mapping

SQL

-- film term names
-- from original titles from films table
-- union English titles from films table
-- union alternate titles from alttitles table
-- sort order: original, English, alternate
select
    f.film_id as legacy_film_id,
    'pfafilm' + convert(varchar(10), f.film_id) as short_identifier,
    case when f.prefix is null or f.prefix = '' then f.title
        else substring(
            substring(
                f.prefix + ' ', 1,
                datalength( f.prefix + ' ' ) -
                sign( patindex( '%[''-]', f.prefix ))),
                sign( patindex( '%[^^I ]%', f.prefix )), 10 ) + f.title
    end as term_display_name,
    ltrim(rtrim(f.title)) as term_name,
    ltrim(rtrim(f.prefix)) as term_qualifier,
    'Original Title' as term_type,
    'complete' as term_status,
    null as term_language,
    1 as preferred_flag,
    1 as pos,
    ltrim(rtrim(f.src_id)) as term_source_id,
    'urn:cspace:cinefiles.cspace.berkeley.edu:citationauthorities:name(citation):item:name('
        + f.src_id + ')''' + a.source + '''' as term_source_refname,
    null as term_source_note
from
    films f,
    authsources a
where f.src_id *= a.src_id
and f.film_id > 1
union
select
    f.film_id as legacy_film_id,
    'pfafilm' + convert(varchar(10), f.film_id) as short_identifier,
    case when f.eng_prefix is null or f.eng_prefix = '' then f.eng_title
        else substring(
            substring(
                f.eng_prefix + ' ', 1,
                datalength( f.eng_prefix + ' ' ) -
                sign( patindex( '%[''-]', f.eng_prefix ))),
                sign( patindex( '%[^^I ]%', f.eng_prefix )), 10 ) + f.eng_title
    end as term_display_name,
    ltrim(rtrim(f.eng_title)) as term_name,
    ltrim(rtrim(f.eng_prefix)) as term_qualifier,
    'English Title' as term_type,
    'complete' as term_status,
    'urn:cspace:cinefiles.cspace.berkeley.edu:vocabularies:name(languages):item:name(ENG)''English''' as term_language,
    0 as preferred_flag,
    2 as pos,
    ltrim(rtrim(f.src_id)) as term_source_id,
    'urn:cspace:cinefiles.cspace.berkeley.edu:citationauthorities:name(citation):item:name('
        + f.src_id + ')''' + a.source + '''' as term_source_refname,
    null as term_source_note
from
    films f,
    authsources a
where f.src_id *= a.src_id
and f.eng_title is not null
and f.eng_title != ''
and f.film_id > 1
union
select
    a.film_id as legacy_film_id,
    'pfafilm' + convert(varchar(10), a.film_id) as short_identifier,
    case when a.prefix is null or a.prefix = '' then a.title
        else substring(
            substring(
                a.prefix + ' ', 1,
                datalength( a.prefix + ' ' ) -
                sign( patindex( '%[''-]', a.prefix ))),
                sign( patindex( '%[^^I ]%', a.prefix )), 10 ) + a.title
    end as term_display_name,
    ltrim(rtrim(a.title)) as term_name,
    ltrim(rtrim(a.prefix)) as term_qualifier,
    'Alternate Title' as term_type,
    'complete' as term_status,
    null as term_language,
    0 as preferred_flag,
    3 as pos,
    ltrim(rtrim(a.src_id)) as term_source_id,
    'urn:cspace:cinefiles.cspace.berkeley.edu:citationauthorities:name(citation):item:name('
        + a.src_id + ')''' + auth.source + '''' as term_source_refname,
    a.note as term_source_note
from
    alttitles a,
    authsources auth
where a.src_id *= auth.src_id
and a.title is not null
and a.title != ''
and a.film_id > 1
and a.alttitle_id not in (
    select alt.alttitle_id from alttitles alt, films f
    where substring(
        substring(
            alt.prefix + ' ', 1,
            datalength( alt.prefix + ' ' ) -
            sign( patindex( '%[''-]', alt.prefix ))),
            sign( patindex( '%[^^I ]%', alt.prefix )), 10 ) + alt.title
    = substring(
        substring(
            f.prefix + ' ', 1,
            datalength( f.prefix + ' ' ) -
            sign( patindex( '%[''-]', f.prefix ))),
            sign( patindex( '%[^^I ]%', f.prefix )), 10 ) + f.title
    and alt.film_id = f.film_id)
and a.alttitle_id not in (
    select alt.alttitle_id from alttitles alt, films f
    where substring(
        substring(
            alt.prefix + ' ', 1,
            datalength( alt.prefix + ' ' ) -
            sign( patindex( '%[''-]', alt.prefix ))),
            sign( patindex( '%[^^I ]%', alt.prefix )), 10 ) + alt.title
    = substring(
        substring(
            f.eng_prefix + ' ', 1,
            datalength( f.eng_prefix + ' ' ) -
            sign( patindex( '%[''-]', f.eng_prefix ))),
            sign( patindex( '%[^^I ]%', f.eng_prefix )), 10 ) + f.eng_title
    and alt.film_id = f.film_id)
order by legacy_film_id, pos;

Notes

Talend/ETL

MERGE: Work Creators (Film Directors) - SQL: Lam, Talend: Lam

Creators = Directors

Field mapping

SQL

select
    f.film_id as legacy_film_id,
    'pfafilm' + convert(varchar(10), f.film_id) as short_identifier,
    f.name_id as legacy_name_id,
    ltrim(rtrim(case
        when n.name_type = 1 then
            n.fname + substring(' ', sign(datalength(n.fname)), 1) +
            n.mname + substring(' ', sign(datalength(n.mname)), 1) +
            n.lname + substring(' ', sign(datalength(n.gen_suffix)), 1) + n.gen_suffix
        when n.name_type = 2 then n.lname
        when n.name_type = 3 then n.lname
    end)) as creator_display_name,
    'urn:cspace:cinefiles.cspace.berkeley.edu:vocabularies:name(workcreatortype):item:name(director)''Director''' as creator_type,
    case
        when n.name_type = 1 then
            'urn:cspace:cinefiles.cspace.berkeley.edu:personauthorities:name(person):item:name(person'
            + convert(varchar(10), n.name_id) + ')'''
            + ltrim(rtrim(n.fname + substring(' ', sign(datalength(n.fname)), 1)
                + n.mname + substring(' ', sign(datalength(n.mname)), 1)
                + n.lname + substring(' ', sign(datalength(n.gen_suffix)), 1) + n.gen_suffix))
            + ''''
        when n.name_type = 2 then
            'urn:cspace:cinefiles.cspace.berkeley.edu:orgauthorities:name(committee):item:name(committee'
            + convert(varchar(10), n.name_id) + ')'''
            + ltrim(rtrim(n.lname)) + ''''
        when n.name_type = 3 then
            'urn:cspace:cinefiles.cspace.berkeley.edu:orgauthorities:name(corporation):item:name(organization'
            + convert(varchar(10), n.name_id) + ')'''
            + ltrim(rtrim(n.lname)) + ''''
    end as creator_refname
from
    films f,
    names n
where f.name_id *= n.name_id
and f.film_id > 1
and f.name_id > 1
union
select
    fd.film_id as legacy_film_id,
    'pfafilm' + convert(varchar(10), fd.film_id) as short_identifier,
    fd.name_id as legacy_name_id,
    ltrim(rtrim(case
        when n.name_type = 1 then
            n.fname + substring(' ', sign(datalength(n.fname)), 1) +
            n.mname + substring(' ', sign(datalength(n.mname)), 1) +
            n.lname + substring(' ', sign(datalength(n.gen_suffix)), 1) + n.gen_suffix
        when n.name_type = 2 then n.lname
        when n.name_type = 3 then n.lname
    end)) as creator_display_name,
    'urn:cspace:cinefiles.cspace.berkeley.edu:vocabularies:name(workcreatortype):item:name(director)''Director''' as creator_type,
    case
        when n.name_type = 1 then
            'urn:cspace:cinefiles.cspace.berkeley.edu:personauthorities:name(person):item:name(person'
            + convert(varchar(10), n.name_id) + ')'''
            + ltrim(rtrim(n.fname + substring(' ', sign(datalength(n.fname)), 1)
                + n.mname + substring(' ', sign(datalength(n.mname)), 1)
                + n.lname + substring(' ', sign(datalength(n.gen_suffix)), 1) + n.gen_suffix))
            + ''''
        when n.name_type = 2 then
            'urn:cspace:cinefiles.cspace.berkeley.edu:orgauthorities:name(committee):item:name(committee'
            + convert(varchar(10), n.name_id) + ')'''
            + ltrim(rtrim(n.lname)) + ''''
        when n.name_type = 3 then
            'urn:cspace:cinefiles.cspace.berkeley.edu:orgauthorities:name(corporation):item:name(organization'
            + convert(varchar(10), n.name_id) + ')'''
            + ltrim(rtrim(n.lname)) + ''''
    end as creator_refname
from
    filmdirectors fd,
    names n
where fd.name_id *= n.name_id
and fd.film_id > 1
and fd.name_id > 1
order by legacy_film_id, creator_display_name;

Notes

Talend/ETL

MERGE: Work Publishers (Film Production Companies) - SQL: Lam, Talend: Lam

Publishers = Production Companies

Field mapping

SQL

select distinct
    fp.film_id as legacy_film_id,
    'pfafilm' + convert(varchar(10), fp.film_id) as short_identifier,
    fp.name_id as legacy_name_id,
    ltrim(rtrim(case
        when n.name_type = 1 then
            n.fname + substring(' ', sign(datalength(n.fname)), 1) +
            n.mname + substring(' ', sign(datalength(n.mname)), 1) +
            n.lname + substring(' ', sign(datalength(n.gen_suffix)), 1) + n.gen_suffix
        when n.name_type = 2 then n.lname
        when n.name_type = 3 then n.lname
    end)) as publisher_display_name,
    'urn:cspace:cinefiles.cspace.berkeley.edu:vocabularies:name(workpublishertype):item:name(productioncompany)''Production Company''' as publisher_type,
    case
        when n.name_type = 1 then
            'urn:cspace:cinefiles.cspace.berkeley.edu:personauthorities:name(person):item:name(person'
            + convert(varchar(10), n.name_id) + ')'''
            + ltrim(rtrim(n.fname + substring(' ', sign(datalength(n.fname)), 1)
                + n.mname + substring(' ', sign(datalength(n.mname)), 1)
                + n.lname + substring(' ', sign(datalength(n.gen_suffix)), 1) + n.gen_suffix))
            + ''''
        when n.name_type = 2 then
            'urn:cspace:cinefiles.cspace.berkeley.edu:orgauthorities:name(committee):item:name(committee'
            + convert(varchar(10), n.name_id) + ')'''
            + ltrim(rtrim(n.lname)) + ''''
        when n.name_type = 3 then
            'urn:cspace:cinefiles.cspace.berkeley.edu:orgauthorities:name(corporation):item:name(organization'
            + convert(varchar(10), n.name_id) + ')'''
            + ltrim(rtrim(n.lname)) + ''''
    end as publisher_refname
from
    filmprodcos fp,
    names n
where fp.name_id *= n.name_id
and fp.film_id > 1
and fp.name_id > 1
order by legacy_film_id, publisher_display_name;

Notes

Talend/ETL

MERGE: Work Dates (Film Years) - SQL: Lam, Talend: Lam

Field mapping

SQL

select
    f.film_id as legacy_film_id,
    'pfafilm' + convert(varchar(10), f.film_id) as short_identifier,
    f.year as display_date,
    convert(varchar(4), f.year) + '-01-01T00:00:00Z' as earliest_scalar,
    f.year as earliest_year,
    1 as earliest_month,
    1 as earliest_day,
    convert(varchar(4), f.year)  + '-12-31T23:59:59Z' as latest_scalar,
    f.year as latest_year,
    12 as latest_month,
    31 as latest_day
from
    films f
where f.film_id > 1
and f.year is not null
union
select
    fy.film_id as legacy_film_id,
    'pfafilm' + convert(varchar(10), fy.film_id) as short_identifier,
    fy.year as display_date,
    convert(varchar(4), fy.year) + '-01-01T00:00:00Z' as earliest_scalar,
    fy.year as earliest_year,
    1 as earliest_month,
    1 as earliest_day,
    convert(varchar(4), fy.year)  + '-12-31T23:59:59Z' as latest_scalar,
    fy.year as latest_year,
    12 as latest_month,
    31 as latest_day
from
    filmyears fy
where fy.film_id > 1
and fy.year is not null
order by legacy_film_id, display_date;

Notes

Talend/ETL

MERGE: Work Languages (Film Languages) - SQL: Lam, Talend: Lam

Field mapping

SQL

select
    f.film_id as legacy_film_id,
    'pfafilm' + convert(varchar(10), f.film_id) as short_identifier,
    f.lang_id as legacy_lang_id,
    l.lang_tla as language_code,
    ltrim(rtrim(l.lang)) as language_name,
    'urn:cspace:cinefiles.cspace.berkeley.edu:vocabularies:name(languages):item:name(' +
        l.lang_tla + ')''' + ltrim(rtrim(l.lang)) + '''' as language_refname
from
    films f,
    languages l
where f.lang_id = l.lang_id
and f.film_id > 1
and l.lang_id > 1
union
select
    fl.film_id as legacy_film_id,
    'pfafilm' + convert(varchar(10), fl.film_id) as short_identifier,
    fl.lang_id as legacy_lang_id,
    l.lang_tla as language_code,
    ltrim(rtrim(l.lang)) as language_name,
    'urn:cspace:cinefiles.cspace.berkeley.edu:vocabularies:name(languages):item:name(' +
        l.lang_tla + ')''' + ltrim(rtrim(l.lang)) + '''' as language_refname
from
    filmlangs fl,
    languages l
where fl.lang_id = l.lang_id
and fl.film_id > 1
and l.lang_id > 1
order by legacy_film_id, language_name;

Notes

Talend/ETL

MERGE: Work Countries (Film Countries) - SQL: Lam, Talend: Lam

Field mapping

SQL

select
    f.film_id as legacy_film_id,
    'pfafilm' + convert(varchar(10), f.film_id) as short_identifier,
    f.cntry_id as legacy_country_id,
    c.cntry_code as country_code,
    ltrim(rtrim(c.country)) as country_name,
    'urn:cspace:cinefiles.cspace.berkeley.edu:vocabularies:name(country):item:name(' +
        ltrim(rtrim(c.cntry_code)) + ')''' + ltrim(rtrim(c.country)) + '''' as country_refname
from
    films f,
    countries c
where f.cntry_id = c.cntry_id
and f.film_id > 1
and c.cntry_id > 1
union
select
    fc.film_id as legacy_film_id,
    'pfafilm' + convert(varchar(10), fc.film_id) as short_identifier,
    fc.cntry_id as legacy_country_id,
    c.cntry_code as country_code,
     ltrim(rtrim(c.country)) as country_name,
    'urn:cspace:cinefiles.cspace.berkeley.edu:vocabularies:name(country):item:name(' +
        ltrim(rtrim(c.cntry_code)) + ')''' + ltrim(rtrim(c.country)) + '''' as country_refname
from
    filmcountries fc,
    countries c
where fc.cntry_id = c.cntry_id
and fc.film_id > 1
and c.cntry_id > 1
order by legacy_film_id, country_name;

Notes

Talend/ETL

MERGE: Work Genres (Film Genres) - SQL: Lam, Talend: Lam

Repeating: concept-genre vocabulary

Field mapping

SQL

select
    fg.film_id as legacy_film_id,
    'pfafilm' + convert(varchar(10), fg.film_id) as short_identifier,
    fg.genre_id as legacy_genre_id,
    ltrim(rtrim(g.genre)) as genre_name,
    'urn:cspace:cinefiles.cspace.berkeley.edu:conceptauthorities:name(genre):item:name(genre'
        + convert(varchar(4), fg.genre_id) + ')'''
        + ltrim(rtrim(g.genre))
        + '''' as genre_refname
from
    filmgenres fg,
    genres g
where fg.genre_id = g.genre_id
and fg.film_id > 1
and g.genre_id > 1
order by legacy_film_id, genre_name;

Notes

Talend/ETL

MERGE: Work Subjects (Film Subjects) - SQL: Lam, Talend: Lam

Repeating: concept-subject (default) vocabulary

Field mapping

SQL

select
    fs.film_id as legacy_film_id,
    'pfafilm' + convert(varchar(10), fs.film_id) as short_identifier,
    fs.subj_id as legacy_subj_id,
    ltrim(rtrim(s.subj)) as subject_name,
    'urn:cspace:cinefiles.cspace.berkeley.edu:conceptauthorities:name(subject):item:name(subject'
        + convert(varchar(6), fs.subj_id) + ')'''
        + ltrim(rtrim(s.subj))
        + '''' as subj_refname
from
    filmsubjs fs,
    subjects s
where fs.subj_id = s.subj_id
and fs.film_id > 1
and s.subj_id > 1
order by legacy_film_id, subject_name;

Notes

Talend/ETL

Cataloging - documents

Collection Objects - documents - SQL: Chris, Talend: Chris

Field mapping

SQL

SELECT
  d.doc_id,
  title,
  rtrim(prefix) prefix,
  case when (prefix like '%''' or prefix like '%-' or prefix is null) then prefix||title
       else prefix||' '||title end as docDisplayName,
  type_id,
  pages numberObjects,
  pg_info pagination,
  case when d.src_id <> 1 then d.src_id end as srcShortId,
  case when d.src_id <> 1 then n.name_type end as srcNameType,
  n.fname + substring( ' ', sign( datalength( n.fname ) ), 1 ) +
    n.mname + substring( ' ', sign( datalength( n.mname ) ), 1 ) +
    n.lname + substring( ' ', sign( datalength( n.gen_suffix ) ), 1 ) +
    n.gen_suffix as srcDisplayName,
  date_string,
  early_date,
  late_date,
  case when (early_date>2415021 and early_date<2456602) then dateadd(dd,(early_date - 2415021),
    convert(date, '1900-01-01', 103))
    end as begin_date_scalar,
  case when (early_date>2415021 and early_date<2456602) then datepart(year, dateadd(dd,
 (early_date - 2415021),
    convert(datetime, '1900-01-01', 103)))
    end as begin_date_year,
  case when (early_date>2415021 and early_date<2456602) then datepart(month, dateadd(dd,
 (early_date - 2415021),
    convert(datetime, '1900-01-01', 103)))
    end as begin_date_month,
  case when (early_date>2415021 and early_date<2456602) then datepart(day, dateadd(dd,
 (early_date - 2415021),
    convert(datetime, '1900-01-01', 103)))
    end as begin_date_day,
  case when (late_date>2415021 and late_date<2456602) then dateadd(dd,(late_date - 2415021),
    convert(date, '1900-01-01', 103))
    end as end_date_scalar,
  case when (late_date>2415021 and late_date<2456602 and early_date<>late_date) then datepart(year,
  dateadd(dd,(late_date - 2415021),
    convert(datetime, '1900-01-01', 103)))
    end as end_date_year,
  case when (late_date>2415021 and late_date<2456602 and early_date<>late_date) then
 datepart(month, dateadd(dd,(late_date - 2415021),
    convert(datetime, '1900-01-01', 103)))
    end as end_date_month,
  case when (late_date>2415021 and late_date<2456602 and early_date<>late_date) then datepart(day,
  dateadd(dd,(late_date - 2415021),
    convert(datetime, '1900-01-01', 103)))
    end as end_date_day,
  sort_date,
  d.code,
  d.note briefDescription,
  case when cast_cr=1 then 'true' else 'false' end as cast_cr,
  case when tech_cr=1 then 'true' else 'false' end as tech_cr,
  case when bx_info=1 then 'true' else 'false' end as bx_info,
  case when filmog=1 then 'true' else 'false' end as filmog,
  case when dist_co=1 then 'true' else 'false' end as dist_co,
  case when prod_co=1 then 'true' else 'false' end as prod_co,
  case when cost=1 then 'true' else 'false' end as cost,
  case when illust=1 then 'true' else 'false' end as illust,
  case when biblio=1 then 'true' else 'false' end as biblio,
  d.entered,
  d.entered_by,
  d.modified,
  d.modified_by,
  u.src_id urlsource,
  n2.name_type urlnametype,
  u.docurl,
  n2.lname urlsourcename
FROM docs d
left outer join names n on n.name_id=d.src_id
left outer join docurls u on (d.doc_id=u.doc_id and u.note is null and u.src_id<>486)
left outer join names n2 on n2.name_id=u.src_id
where d.doc_id<>1
order by d.doc_id

Notes

This job currently excludes records with bad dates. Michael is fixing these but need to confirm. Also, the Talend job should be extended to reuse the same CSIDs to facilitate Glen's work on Media Handling (i.e., it will be good if the document CSIDs persist though new records entered in CineFiles will need new CSIDs of course).

Talend/ETL

MERGE: Document Subjects - SQL: Chris, Talend: Chris

Repeating: Concept-subject vocabulary

Field mapping

SQL

SELECT doc_id, d.subj_id, s.subj
FROM docsubjs d
inner join subjects s on s.subj_id = d.subj_id
where d.subj_id<>1
order by doc_id

Notes

Talend/ETL

MERGE: Name Subjects - SQL: Chris, Talend: Chris

Repeating: person-person vocabulary and org-corporate vocabulary

Field mapping

SQL

SELECT nd.doc_id, nd.namesubj_id,
case when n.name_type=2 then 'organization'
    when n.name_type=1 then 'person'
    when n.name_type=3 then 'committee' end as subjectVocab,
    ltrim(n.fname + substring( ' ', sign( datalength( n.fname ) ), 1 ) +
    mname + substring( ' ', sign( datalength( n.mname ) ), 1 ) +
    lname + substring( ' ', sign( datalength( n.gen_suffix ) ), 1 ) +
    n.gen_suffix) as displayName,
    case when n.name_type=1 then 'personauthorities' else 'orgauthorities' end as authtype
FROM namedocs nd
left outer join names n on n.name_id=nd.namesubj_id
where nd.namesubj_id<>1 and nd.doc_id<>1
order by nd.doc_id

Notes

Talend/ETL

MERGE: Film Subjects - SQL: Chris, Talend: Chris

Repeating: works-film vocabulary

Field mapping

SQL

SELECT fd.doc_id, fd.film_id, f.title,
    'urn:cspace:cinefiles.cspace.berkeley.edu:workauthorities:name(work):item:name('
        + convert(varchar(10), f.film_id) + ')'''
        + substring(
        substring(
            f.prefix + ' ', 1,
            datalength( f.prefix + ' ' ) -
            sign( patindex( '%[''-]', f.prefix ))),
            sign( patindex( '%[^^I ]%', f.prefix )), 10 ) + f.title
        + '''' as refname
FROM filmdocs fd
left outer join films f on fd.film_id=f.film_id
where fd.film_id <> 1 and fd.doc_id <> 1
order by fd.doc_id

Notes

Talend/ETL

MERGE: Document Languages - SQL: Chris, Talend: Chris

Repeating: Language values (is this a URN-formatted name?)

Field mapping

SQL

SELECT dl.doc_id, dl.lang_id, l.lang_tla, lang
FROM doclangs dl
left outer join languages l on dl.lang_id=l.lang_id
where dl.lang_id <> 1 and l.lang_id <> 1
order by dl.doc_id

Notes

Talend/ETL

MERGE: Document authors - SQL: Chris, Talend: Chris

Repeating: From person though there are 16 docauthor records in Sybase that are Corporate Names.

Field mapping

SQL

SELECT doc_id,
a.name_id authorShortId,
case when n.name_type=2 then 'organization'
    when n.name_type=1 then 'person' end as authorVocab,
n.fname + substring( ' ', sign( datalength( n.fname ) ), 1 ) +
    mname + substring( ' ', sign( datalength( n.mname ) ), 1 ) +
    lname + substring( ' ', sign( datalength( n.gen_suffix ) ), 1 ) +
    n.gen_suffix as authDisplayName
FROM docauthors a
left outer join names n on n.name_id=a.name_id
where a.name_id<>1
order by doc_id

Notes

Talend/ETL

Procedures

Media Handling - document pages - SQL: TBD, Talend: TBD

Field mapping

SQL

SELECT
   doc_id,
   convert( varchar(15), doc_id ) as shortIdentifier,
   pages,
   page,
   convert( varchar(5), page ) as pageNumber,
   convert( char(5), datepart(year, entered)) +
      substring( convert( varchar(50), entered, 7), 1,
                 charindex( ',', convert( varchar(50), entered, 7))-1) as displayDate,
   datepart( year, entered ) as earliestYear,
   datepart( month, entered ) as earliestMonth,
   datepart( day, entered ) as earliestDay,
   entered as earliestScalarDate,
   dateadd(day, 1, entered) as latestScalarDate,
   entered,
   modified,
   ISNULL(entered_by, '') as enteredBy,
   ISNULL(modified_by, ISNULL(entered_by, '')) as modifiedBy,
   src_id,
   convert( varchar(15), src_id) as publisherIdentifier

FROM docs d, pages p

WHERE
   d.pages >= p.page
   AND doc_id > 1

Notes

The query includes a join on a synthetic table that only contains a list of page numbers. By joining on the 'pages' table I get a row for each page up to the number of "pages' indicated in the docs table.  As usual, doc_id 1 is a dummy value.

Talend/ETL

Media loading

Start by getting the CSID and short title (documentID + page number) from the media_common table.

(code}
select m.title,
h.name as CSID,
regexp_matches( m.title, '^\d{1,5}.p\d+$') as mediatitle
from hierarchy h, media_common m
where h.id = m.id
and (m.blobcsid is null or length(m.blobcsid) = 0)

Save the output into a file (e.g. "media.list"). The first field is used to sort numerically and can then be removed. The sorting step is useful for uploading image files in smallish blocks with contiguous document IDs.

Nowrun the upload script with the "media.list"

BASEURL="https//cinefiles.cspace.berkeley.edu/cspace-services/media"
TYPE="Content-type: application/xml"
USER="admin@cinefiles.cspace.berkeley.edu:xxxxxxx"
IMGDIR="/my/image/file/location"
OUTFILE="curl.out"

while read F
do
CSID=${F%% *}
IDENT=${F##* }

# Select the best version of the file
FILE="${IDENT}.600col.tif"
[ -f $IMGDIR/$FILE ] || FILE="${IDENT}.450col.tif"
[ -f $IMGDIR/$FILE ] || FILE="${IDENT}.400col.tif"
[ -f $IMGDIR/$FILE ] || FILE="${IDENT}.300col.tif"
[ -f $IMGDIR/$FILE ] || FILE="${IDENT}.150col.tif"
[ -f $IMGDIR/$FILE ] || FILE="${IDENT}.600gray.tif"
[ -f $IMGDIR/$FILE ] || FILE="${IDENT}.450gray.tif"
[ -f $IMGDIR/$FILE ] || FILE="${IDENT}.400gray.tif"
[ -f $IMGDIR/$FILE ] || FILE="${IDENT}.300gray.tif"
[ -f $IMGDIR/$FILE ] || FILE="${IDENT}.400dpi.tif"

# Did we find one?
[ -f $IMGDIR/$FILE ] || continue

URL="${BASEURL}/$CSID?blobUri=file://$IMGDIR/$FILE"

LOGFILE="$OUTFILE.$IDENT"

curl -X POST -i -u "$USER" -H "$TYPE" $URL -o $LOGFILE
mv $IMGDIR/$FILE $IMGDIR/done
done < media.list

The media record title should reflect the original file name of the attached blob, but when blobs are
attached to an existing media record in a batch, the title does not get updated. After importing blobs, the media records can be updated in the database to set the title correctly.

BEGIN;
update media_common set title = mbt.filename
from utils.media_blob_titles mbt
where media_common.blobcsid = mbt.blobcsid
and media_common.title similar to '0123456789.p0123456789'
and mbt.filename like media_common.title||'.%';
COMMIT;

Note that it is a good idea to wrap the update in a transaction. I included both the BEGIN and COMMIT statements here, but in practice the results should be reviewed before issuing the COMMIT.

Relationships: Media Handling to Collection Objects

Lam is running the Talend job originally developed by Yuteh for PAHMA and UCJEPS