Talend-ETL work for UCB deployments

This document pulls together information about ETL work for CollectionSpace deployments. We are taking a team-based approach to get more of us up to speed.

Topics

  • Installing Talend Open Studio (where to get software, version information, platform installation issues)
  • Starting up the first time
  • Getting around TOS
  • Connecting to a database
  • Retrieving database schema
  • Common workflows and best practices
  • Sample jobs (protect pages that attach jobs that have db passwords stored!)
  • TOS resources
  • Problems
    • Crash when retrieving db schema to your repository
    • Migrating repeating data to nested elements in XML
  • FAQ
  • Importing and exporting projects
  • When to propagate changes to connections and schema

Installing Talend Open Studio

  • Go to http://talendforge.org (or http://talend.com), and look for Talend Open Studio.  On August 8, 2011, there was a "Download Now" bubble with "Data Integration" as  one of the options.  You do need to create an account and register in order to download the software, and you might receive a phone call from a Talend company representative.  Alternatively, Chris has the zip archive of the software.  Note: Do not use the OEM version available from Jasper.
  • Version: Currently at UCB we are using version 4.2.1 though 4.2.2 is now available. 
  • Platforms: We have TOS users on Mac OSX, WinXP, and Linux.

Starting up the first time

  • Note: This is from memory, so don't believe a word of it!
  • The first time you launch TOS, you need to create a workspace or repository for your projects. Create a local repository. 
  • Create a local (Java) project using the Action dropdown or open an existing project in the Project dropdown or import a project that someone has sent you using the Import Existing Project(s) as Local command (in the Action dropdown).

Connecting to a database

A picture is worth a thousand words.

Use read-only accounts in the database

Common workflows, best practices, and tips

  • UTF-8 encoding: Make sure all your input files and outputs have UTF-8 encoding.  This is often in the Advanced Settings section of the input and output components.
  • Using metadata other projects: (Note that there might be a better way to do this!) You can export metadata objects (database connections, code like Guid and MakeDisplayName, and XML file schema) from one project.  On the Metadata folder, select Export from the context menu.  Select the metadata objects you want to export. Select the root directory where you want to put the exported objects (this  might be the parent directory for the project you want to import the objects).  Switch to the other project.  Click on one of the items in the metadata group (e.g, database connections, but it doesn't seem to matter). Select Import Items from the context menu.  Select the same root directory you specified before (hint: do not try to navigate down into the subdirectories that the export step creates).  Select the objects you want to import.
  • Loop element in XML map: Use the "import" element as the loop element in the tAdvancedFileOutputXML component.
  • User defined functions and code: To be available in a job, these must be added as Routine Dependencies to the job. With the job not open, click on the job, and select Setup Routine Dependencies from the context menu.  Click the Add button from the User Routines tab, and your code scripts should be available.
  • Use complete XML schema: Make sure you are using a complete XML schema as the target mapping on the Advanced XML output component. For example, if you use a schema from XML created by Nuxeo, you need to use a record that has data in repeating fields and field groups.  Otherwise the nested elements will be absent.  (See the corresponding problem below.)

Creating an XML schema to seed the XML tree in the tAdvancedFileOutputXML component

  • Create a record in the CollectionSpace instance.  Make sure to put data in all fields, especially repeating fields and repeating field groups.
  • Note: This creates a very large XML file, especially for cataloging. We might want to experiment with an XML file that does not have fields that are not represented in our data set.
  • Use the Nuxeo admin interface to navigate to the record type, find the appropriate record (might be the most recently updated record), and export it.  Do not worry that data displayed in the Nuxeo admin screen appears incomplete.  You need to export the data to a file to get the real record.
  • Edit the resulting file into the correct import shape.  (Attach an example here.)  See the Imports Service Home document for more information about the shape of the file.
<?xml version="1.0" encoding="UTF-8"?>
<imports>
  <import service="Persons" type="Person" CSID="4fbab950-8e36-4977-a46f-6491f799aa03">
    <schema xmlns:persons_common="http://collectionspace.org/services/person" name="persons_common">
      <persons_common:foreName>Your exported data from Nuxeo starts here</persons_common:foreName>
      ... more XML data elements ...
    </schema>
  </import>
</imports>
  • Do not worry too much about the values in the service, type, CSID, and name attributes in the "import" and "schema" elements or the namespace value in "schema". Talend will ignore those in import, so you will have to provide them within Talend. 

Character and data issues

Note: It seems to be very hard to represent the actual text for the XML versions of ampersand and other characters that need to be escaped in XML. 

  • There is a known bug in the import service where a set of characters that require different representation in XML (e.g, ampersand to "ampersand amp semicolon") need to be doubled (to "ampersand amp semicolon amp semicolon).  (document behavior)  This is a big pain!  Make sure you have fixed all ampersands and press for the bug in the Imports Service to be fixed (CSPACE-3911).
  • However there is a character combination found in PAHMA data that is "ampersand pound x0A semicolon" and which appears to be a single newline or return character in text, and not an ampersand at all. (document behavior).  Talend appears to be converting the ampersand in that string to the XML-escaped version (ampersand a m p semicolon).  However, I think when I imported the record, it was correctly treated as a return.  Need to check this though because I did delete the record. Confirmed, though Talend converts this to "ampersand amp semicolon pound x0a semicolon" it comes out as a newline at least in the CSpace UI.  Haven't checked to see what is stored in the database.
  • Talend records from PAHMA that include single-quote and double-quote (unescaped versions that is) end up looking OK upon import.
  • Talend messed up "Nuñez del Prado", replacing the "ñ" with "??". I now see an Advanced Setting on the input file to set the encoding.  For some reason, it was Custom - "US ASCII" instead of UTF8.  Will test.

Problems

TOS crashes when retrieving large schema

See Glen's comment below about success with the latest release.

Chris: Encounters this often when trying to import Sybase schema into TOS metadata.  Restarted machine and launched TOS only.  Created a new project and new database connection to smaschprod.  Tried to import all tables in smaschprod.  TOS crashed and pointed to /Applications/TOS-All-r60995-V4.2.1/workspace/.metadata/.log.  File includes the following:

!ENTRY org.talend.platform.logging 2 0 2011-08-08 13:20:39.754
!MESSAGE 2011-08-08 13:20:39,726 WARN  org.talend.core.model.metadata.DBConnectionFillerImpl  - S0022: Invalid column name 'TABLE_CATALOG'.

I can import schema on individual files.  So, I tried importing only a large set of table schema (A-G tables).  That did work successfully.  However, when I tried to add more table schema, TOS hung after I selected "Retrieve Schema" by right-clicking on the database connection.  Mac logging reports available upon request!

TOS crashes

Lam continues to have problems on WinXP with frequent but intermittent crashes.  E.g., just opening projects.

XML schema as metadata in TOS repository

Chris can create XML schema in the TOS repository, but he can not yet use those in a TOS job.  Instead, for each job, you can import the XML tree in the advanced XML output component.

XML schema coming from Nuxeo might not have all elements

Best practice might be to create a dummy record with values in all fields, including repeatable fields and field groups.  For instance, when exporting a minimal record from a test PAHMA instance, the organization schema lacked some of the nested structures. For example, output from Nuxeo might include:

<organizations_common:groups/>

instead of:

<organizations_common:groups>
     <organizations_common:group>Japanese</organizations_common:group>
</organizations_common:groups>

TOS tAdvancedFileOutputXML component escapes ampersands automatically

The tAdvancedFileOutputXML component wants to convert ampersands to the escaped XML version (ampersand amp semicolon).  This is good except that because of a bug in the Imports Service, ampersand needs to be represented differently (ampersand amp semicolon amp semicolon).  Treat accordingly.

TOS can not loop CSpace XML schema?

When trying to import denormalized data (using the loop element and group element on the tAdvancedFileOutputXML component), I (Chris) am not finding a way to get it to work.  Inevitably the "import" element is only printed out one time.  In other words, if I am expecting to create records, I would want each one to have its own "import" wrapper within the overall "imports" wrapper.  Within each "import" wrapper, I would expect one "schema" wrapper, in this simple case because I am only using a simple non-extended schema.  Instead, I have one "import" wrapper enclosing ten otherwise nicely formed "schema" wrappers with the correct format.  The looping to create repeating or nested entities such as organization names or functions seems to work correctly, but I am only getting one "import" wrapper.  I'm not sure what is going on and have tried many permutations. Tested in TOS 5; behaves identically.

When import XML tree with more than one "schema" wrapper, only one "schema" wrapper kept

If you have an XML file with more than one "schema" wrapper (e.g., including local or domain extensions), on importing into the tAdvancedFileOutputXML component, only one "schema" element is retained.  It looks like the fields from the other schema are kept though they are all combined under one "schema" wrapper.  Apparently, any elements with non-unique names are dropped.  Yuteh reports that the same thing happens with structured date sub-elements.  If you have multiple structured dates in your schema, only the first set of sub-elements are recognized. Tested in TOS 5; behaves identically.  Finding: The problem is that the core element name is not unique at the same level of the XML tree.  You can rename those (e.g., schema1, schema2, and so on) and then change the name back after the file has been created.  It looks like having multiple structured dates is OK though non-unique names are repeated in the schema, perhaps because the non-unique names do not occur at the same level of the XML tree within the same node.  Yuteh had reported some problems with actual data mapping however.

Empty records created mysteriously

September 16: While loading 9 PAHMA objects, somehow records were created with no data even though only minor changes had been done in the Talend job.  Yuteh and Chris were trying to handle ampersands, and checking whether the slightly dirty version of hex return-newline would work (where we had amp amp sharp x0A).  Nine object records were created with CSIDs only.  Still checking this.

Resources