Data Migration using Talend Open Studio - DRAFT

Overview

This page is a place-holder for collecting best practices and sharing examples related to data migration using Talend Open Studio.

It is still in DRAFT state, which means that the information contained here does not necessarily represent a recommended approach for data migration.

Creating and managing jobs

This page will not (unless it's requested) provide a step-by-step approach about how to create a job, insert Talend components and create connections between them (hint: right click on a module and select Row). Talend has many resources to help get started, including the User Guide , the Reference Guide and the Talend community.

Note: Because it's basically a Java code generator, there's as many ways to migrate data in Talend as there are ways to write the code by hand to achieve the same. Shown below is just one of these approaches.

Example Jobs

Approach A : Using CollectionSpace JAXB libraries (CS v1.3)

The main benefit of this approach is that the xml formatting is handled by the existing libraries, which saves manually duplicating and maintaining all the tag names. The main drawback is that some Java coding is required to map the fields.

1. Job Overview

The following job contains an simple example of creating two CollectionSpace Person records in the appropriate XML format. The job (which is attached here) is shown below.

2. Input from the existing system

The 'existing system' in this case is simplified to two csv input files containing the following data....

tFileInputDelimited_1 contains a reference to file containing records consisting of (person_id, surname)

1,Christensen
2,Hammershøi

tFileInputDelimited_2 contains a reference to file containing records consisting of (person_id, occupation)

1,painter
1,teacher
1,poet
2,painter

Note: tFileInputDelimited inputs are used just to create a simple example. In practice it's more likely another input component (ie. connectors to other systems) would be used which would interface directly to a database.

The file schemas are created by hand in the component properties. The schema for tFileInputDelimited_1 (id, surname) is shown below. The schema for tFileInputDelimited_2 contians 'occupation' instead of 'surname'.

3. Grouping for repeatable fields

As there is a 1:N relationship between person and occupations, it is represented by a repeatable field in CollectionSpace. In the CollectionSpace JAXB libraries this is represented by a List. The next step will aggregate a persons occupations into a list, grouped by the persons id.

The data schema for tAggregateRow_2 is setup as follows. A new occupation_list output is manually created with type 'List'.

The tAggregateRow_2 module should be set up as shown below. Note the "Group by" and "Operations" settings.

4. Mapping the data migration

The tMap component is used to visually map fields from input components to outputs. Below is mapping from the two input rows to the (manually created) 'persons' output row. A join has been made on the persons 'id' of the inputs to enable mapping to a single output row (ie. record). Note that this new output row will be automatically represented in Talend as a new 'personsStruct' type with the column names mapped to variables of the same names in this class (this is relevant to the code routine, shown later).

5. Convert to CollectionSpace format

This approach uses the CollectionSpace JAXB libraries to create valid XML from Persons information. The record variables are copied to the CollectionSpace JAXB object variables from which are generated the correct XML.

Create new Talend routine

A new Talend 'routine' (ie. Java class) is created and the following external Java libraries added (right click on the routine and select "Edit routine libraries"). The libraries can be downloaded from these locations:

commmons-lang-2.5.jar
org.collectionspace.services.person.jaxb-1.1-SNAPSHOT.jar (copied from Services code in release 1.3, path "/services/JaxRsServiceProvider/target/cspace-services/WEB-INF/lib/")
jaxb2-basics-runtime-0.5.2.jar

Below is the code contained in the routine (named 'personTest'). This class contains the 'get'PersonXml' method referenced in the call from tJavaRow component, above. Note: The type named "personsStruct" only exists if the output in the mapping is named "persons". The import path for that structure will depend on the workspace/job names. As can be seen below, adding additional single fields is trivial, where as adding lists requires finding the type of that list and populating it from the Talend list.

package routines;

import java.io.StringWriter;
import java.util.List;

import javax.xml.bind.JAXBContext;
import javax.xml.bind.JAXBException;
import javax.xml.bind.Marshaller;

import org.collectionspace.services.person.OccupationList;
import org.collectionspace.services.person.PersonsCommon;

import globus_3.persons_0_1.Persons.personsStruct;

/*
 * user specification: the function's comment should contain keys as follows: 1. write about the function's comment.but
 * it must be before the "{talendTypes}" key.
 *
 * 2. {talendTypes} 's value must be talend Type, it is required . its value should be one of: String, char | Character,
 * long | Long, int | Integer, boolean | Boolean, byte | Byte, Date, double | Double, float | Float, Object, short |
 * Short
 *
 * 3. {Category} define a category for the Function. it is required. its value is user-defined .
 *
 * 4. {param} 's format is: {param} <type>[(<default value or closed list values>)] <name>[ : <comment>]
 *
 * <type> 's value should be one of: string, int, list, double, object, boolean, long, char, date. <name>'s value is the
 * Function's parameter name. the {param} is optional. so if you the Function without the parameters. the {param} don't
 * added. you can have many parameters for the Function.
 *
 * 5. {example} gives a example for the Function. it is optional.
 */
public class personsTest {

    /**
     * getPersonXml: get Person Xml
     *
     *
     * {talendTypes} personsStruct
     *
     * {Category} User Defined
     *
     *
     */
    public static String getPersonXml(personsStruct myPerson) {

        PersonsCommon person = new PersonsCommon();

        //populate a repeatable field (occupations)
        if (myPerson.occupationList != null) {
            OccupationList occList = new OccupationList();
            List<String> occupations = occList.getOccupation();
            occupations.addAll(myPerson.occupationList);
            person.setOccupations(occList);
        }

        //populate a single field (surname)
        person.setSurName(myPerson.surName);

        //generate xml from person object
        StringWriter sw = new StringWriter();
        try {
            JAXBContext context = JAXBContext.newInstance(person.getClass());
            Marshaller marshaller = context.createMarshaller();
            marshaller.marshal(person, sw);
        } catch (JAXBException e) {
            e.printStackTrace();
        }

        //return xml
        return sw.toString();
    }
}

Import the routine

The routine needs adding to the Routine Dependencies of the job (close the job and right click on it to edit these).

Call the routine

To call the conversion routine a tJavaRow component is used. This component passes the row data to a routine, receives the xml and forwards it to the next component. The schema for TJavaRow is modified to add just one output row as shown here :

The "code" part of the component only contains the following line:

        //send input row to routine and retrieve CS xml
        output_row.xml = routines.personsTest.getPersonXml(input_row);

6. Import to CollectionSpace

For simplicity this job contains no actual import to CollectionSpace. Possibilities for import include:

importing and calling the Services Client APIs from within a Talend Routine
generating Nuxeo formatted xml for import with Nuxeo Shell
generating appropriately formatted files for import via CollectionSpaces "import" functionality (not yet implemented)

However, in this job we just create a single file for each Person record in the appropriate CollectionSpace XML format. This uses the tFileOutputDelimeted component. To configure this component provide a path and file name, set both delimiters to "", and under the "Advanced" settings, select "split into several files", set "rows in each output file" to 1 and "encoding" to UTF-8.

Below is the output of the job.

Output file 1:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
  <ns2:persons_common xmlns:ns2="http://collectionspace.org/services/person">
    <displayNameComputed>false</displayNameComputed>
    <shortDisplayNameComputed>false</shortDisplayNameComputed>
    <surName>Christensen</surName>
    <occupations>
      <occupation>painter</occupation>
      <occupation>teacher</occupation>
      <occupation>poet</occupation>
    </occupations>
  </ns2:persons_common>

Ouput file 2:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
  <ns2:persons_common xmlns:ns2="http://collectionspace.org/services/person">
    <displayNameComputed>false</displayNameComputed>
    <shortDisplayNameComputed>false</shortDisplayNameComputed>
    <surName>Hammershøi</surName>
    <occupations>
      <occupation>painter</occupation>
    </occupations>
  </ns2:persons_common>