ETL Validation of Securities in Informatica, PDI (Kettle) and Talend

I have used ETL tools for a variety of purposes over the years, loading data into systems ranging from traditional data warehouses to operational systems.

One of the things I have noticed over the years is how the designers of some mappings fail to take into consideration sources of error, which can manifest in different ways, ranging from the failure of the routine through to data loss.
In many ways these failures are inexcusable. If you are loading into a system which has not-null constraints on certain fields, check your incoming rows to see whether they will violate those constraints. If you have taken that step and avoided having a mapping fail through a violation, then please don’t simply filter the row out: use a router transformation to send the offending details to a file or table so that someone can investigate the issue.
The same logic applies to lookups. If you are looking up data in a table and the record you are attempting to match does not exist, the lookup returns a null; many developers then add a filter to check for the null and effectively discard the row. It is really not much extra effort to add a router instead, so that a business person can be asked to investigate and rectify the problem in the source system.

In addition to adding error handling logic to things such as lookups, in many circumstances it is possible to validate the data itself. Certain accounting systems may, for example, require customer account numbers to lie within a specified range (eg 100000 <= x <= 10000000). Where possible you should endeavour to find out about such rules from the business users and then create reusable validation routines which can be shared across your transformations.
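As a trivial illustration, a reusable range check of that sort might look like the Java sketch below; the class name and the bounds are just the hypothetical example above, not a routine from any of the tools:

    public final class AccountValidator {
        // Hypothetical bounds taken from the example range above.
        private static final long MIN_ACCOUNT = 100000L;
        private static final long MAX_ACCOUNT = 10000000L;

        /** Returns true when the account number parses and lies in range. */
        public static boolean isValidAccount(String account) {
            if (account == null) {
                return false;
            }
            try {
                long value = Long.parseLong(account.trim());
                return value >= MIN_ACCOUNT && value <= MAX_ACCOUNT;
            } catch (NumberFormatException e) {
                return false; // non-numeric input fails validation quietly
            }
        }
    }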

I have created some routines which will hopefully benefit anyone working in the finance industry and dealing with fixed income or equity securities. These securities carry identifiers which serve as the primary means of identification; the prominent standards are:

CUSIP – a North American standard. Nine characters long, the last of which is a check digit.
ISIN – the international standard. Twelve characters long, the last of which is a check digit.
SEDOL – a UK standard assigned by the London Stock Exchange. Seven characters long, the last of which is a check digit.

The check digit helps to identify invalid codes, which you may receive for any number of reasons, including mis-keying by an end user.
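To make the check-digit idea concrete, below is a sketch in Java of the standard CUSIP check-digit calculation (double every second character value, sum the digits of the results and take the tens complement); the class and method names are my own:

    public final class CusipCheck {
        /** Computes the check digit over the first 8 characters. */
        public static int checkDigit(String cusip) {
            int sum = 0;
            for (int i = 0; i < 8; i++) {
                char c = cusip.charAt(i);
                int v;
                if (c >= '0' && c <= '9')      v = c - '0';
                else if (c >= 'A' && c <= 'Z') v = c - 'A' + 10;
                else if (c == '*')             v = 36;
                else if (c == '@')             v = 37;
                else if (c == '#')             v = 38;
                else throw new IllegalArgumentException("bad character: " + c);
                if (i % 2 == 1) v *= 2;    // double every second character
                sum += v / 10 + v % 10;    // sum the digits of the value
            }
            return (10 - sum % 10) % 10;
        }

        /** A CUSIP is valid when its 9th character is the check digit. */
        public static boolean isValid(String cusip) {
            if (cusip == null || cusip.length() != 9) return false;
            try {
                return cusip.charAt(8) - '0' == checkDigit(cusip);
            } catch (IllegalArgumentException e) {
                return false;
            }
        }
    }

For example, isValid("037833100") returns true, since the check digit calculated over the first eight characters is 0.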

Originally I created three routines in Informatica to validate these identifier types. The first batch used a reusable expression and a few user-defined functions. I have to hand it to Informatica that this was possible entirely within their DSL and a single transformation, albeit one which called out to a user-defined function (UDF), itself written in the Informatica expression language.
Whilst these routines worked fine, my curiosity got the better of me and I decided to see whether the same routines could be built using the new Java transformation introduced in version 8, partly because I thought it might allow a more efficient routine.
I wrote a piece of code in Eclipse and then tried to port it to Informatica. Oh dear, the validator did not like the syntax. After cursing for a while I realised it was necessary to convert the code to Java 1.4 syntax, so I made a few changes to remove the for-each loops and other Java 1.5 features.
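For anyone facing the same port, the change is mechanical; a trivial, made-up illustration:

    public class ForEachPort {
        public static void main(String[] args) {
            String[] identifiers = { "US0378331005", "B0YBKJ7" };

            // Java 1.5 for-each: rejected by the transformation's validator
            for (String id : identifiers) {
                System.out.println(id);
            }

            // Java 1.4 equivalent that the transformation accepts
            for (int i = 0; i < identifiers.length; i++) {
                System.out.println(identifiers[i]);
            }
        }
    }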
I ran a test load to see which performed best, the Informatica expressions or the Java transformation. My initial results suggest that for small volumes the expressions are quicker, but for larger volumes the Java transformation wins.

The Informatica expressions take the security as a string input and output the security, an integer error code and an error string.

Having created the Informatica routines I decided to recreate them in the two main open source ETL tools, namely Pentaho Data Integration and Talend.

For Talend I implemented the logic as functions, or routines as Talend calls them, which can be called from components such as tMap. Although this works satisfactorily it is not ideal, since you may need to make two calls to the routine. In itself this isn’t really an issue, but it does mean someone might try to call the getErrorString method before calling the relevant validate-security method. The reason for this design is that I don’t know how to modify the schema in Talend to add extra ports.
Ideally this would have been implemented as a component which accepted a string (the security) and returned a boolean or int indicating whether the entry was valid, together with a string containing an error message where applicable. Either way the functions work satisfactorily and had the added benefit that everything could be implemented in Java; when I wrote a previous Talend component I found the JET templates to be a bit of a pain.
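To make the two-call pattern concrete, the shape of such a routine is sketched below; the class and method names (and the length-only check) are illustrative, not the actual downloadable routines:

    public final class SecurityValidation {
        // Message from the most recent validate call. This shared state is
        // exactly why getErrorString must be called after, never before.
        private static String lastError = "";

        /** Length check only, standing in for the full validation. */
        public static boolean validateIsin(String isin) {
            if (isin == null || isin.length() != 12) {
                lastError = "ISIN must be 12 characters";
                return false;
            }
            lastError = "";
            return true; // the real routine would also verify the check digit
        }

        /** The second call, made from tMap after validateIsin. */
        public static String getErrorString() {
            return lastError;
        }
    }

In a tMap expression you would then map SecurityValidation.validateIsin(row1.isin) to one output column and SecurityValidation.getErrorString() to another, relying on the first call happening before the second.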

In contrast, the PDI steps most closely resemble the Informatica transformations in that they add additional fields to the output row. This is of course more intuitive.

I have included a screenshot showing the transformations in use in PDI. I would have shown a Talend sample as well, however the tMap component has been broken on OS X since 2007, so I can’t show a real-world example where you route/map/switch the error to a different object.

Validating Securities In Pentaho Data Integration

Those of you who are interested in using these transformations can download them from the places listed below:

Link to download Informatica Routines
Link to download Talend Routines
PDI / Kettle Routines (until I figure out how to contribute these, please email me for a copy)


4 Responses to “ETL Validation of Securities in Informatica, PDI (Kettle) and Talend”


  1. Sean May 30, 2009 at 5:05 am

    Nice. Thanks for this post. I am going to try the Informatica routines. Though I do not have a need for a CUSIP check routine, it will be interesting to see how you have implemented these.

    It is a coincidence that just the other day someone demoed a profiling tool. This person had worked at Nomura and was showing how they could find incorrect CUSIPs (simply because they were shorter in length, not because of the check digit). I am in the CPG industry and rarely see CUSIPs these days, and now suddenly I have come across them twice in two days.

    Thx

  2. hugoworld May 30, 2009 at 7:17 am

    Hi Sean

    Regarding CUSIPs, you can actually convert a CUSIP to an ISIN by simply tacking the two-letter ISO country code onto the front, then calculating the check digit and appending it to the end. This might explain why you don’t come across CUSIPs much.
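    For illustration, that conversion might look like the Java sketch below; the ISIN check digit is the Luhn scheme applied to the letter-expanded body, and the names are illustrative only:

        public final class CusipToIsin {
            /** Prepends the country code to the full 9-character CUSIP
             *  (its check digit included) and appends the ISIN check digit,
             *  eg cusipToIsin("US", "037833100") gives "US0378331005". */
            public static String cusipToIsin(String country, String cusip) {
                String body = country + cusip; // 11 characters
                return body + luhnCheckDigit(body);
            }

            private static int luhnCheckDigit(String body) {
                // Expand letters to two-digit numbers (A=10 ... Z=35).
                StringBuilder digits = new StringBuilder();
                for (int i = 0; i < body.length(); i++) {
                    char c = body.charAt(i);
                    if (c >= 'A' && c <= 'Z') digits.append(c - 'A' + 10);
                    else digits.append(c);
                }
                // Luhn: double alternate digits, starting from the rightmost.
                int sum = 0;
                boolean dbl = true;
                for (int i = digits.length() - 1; i >= 0; i--) {
                    int d = digits.charAt(i) - '0';
                    if (dbl) d *= 2;
                    sum += d / 10 + d % 10;
                    dbl = !dbl;
                }
                return (10 - sum % 10) % 10;
            }
        }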

    Glad that the validations will be of assistance to others. The routines actually perform several checks:

    1) Is the length correct?
    2) Does it contain unexpected characters?
    3) Does the calculated check digit match the supplied check digit?

    In terms of implementation, the pure Informatica ones use an expression and UDFs (a rarely used feature of mappings, if you ask me), since there is commonality between the checks. The other thing is that it is often necessary to split the identifier into odd and even characters based on their position. In a programming language you could easily do this with a for loop; in Informatica it is not as simple. So I created a UDF which takes a value and its position in the string. By taking the modulus of the position it is possible to determine whether the value lies in an odd or even position, and the expression can then double the value if necessary.
    The other UDF sums the individual digits of each value (eg 12 + 6 would be 1 + 2 + 6 = 9, not 18).
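    Expressed in Java rather than the Informatica expression language, those two UDFs boil down to something like the sketch below (names mine, not those of the actual UDFs):

        final class PositionUdfs {
            /** Doubles the value when it sits in an even (1-based) position;
             *  which positions get doubled depends on the identifier type. */
            static int doubleIfEvenPosition(int value, int position) {
                return position % 2 == 0 ? value * 2 : value;
            }

            /** Sums the individual digits, so sumDigits(12) is 1 + 2 = 3. */
            static int sumDigits(int value) {
                int sum = 0;
                for (int v = value; v > 0; v /= 10) sum += v % 10;
                return sum;
            }
        }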

    Hopefully some of the custodians and outsourced back-office security-services firms can adopt these routines. I know for sure that some of them frequently supply data in which the identifiers are wrong.

    I guess the most common errors are probably due to characters being transposed, eg someone typing 12 instead of 21.

  3. Sean June 1, 2009 at 12:51 am

    This can be extended/changed to work with other sorts of validations. I have been using INFA for a while but do not recall UDFs :-(.

    Also your comments about writing to an error file are right on the money. I’ve told my developers many times that they should not lose any output and that everything should be accounted for: if a value is null or a row is in error, write it out to an error file so we can at least reconcile after the fact.

    I’ll post if I have any questions after I try running this in INFA.

    Regards,
    Sean

    • Shobhit November 2, 2009 at 11:30 pm

      Hi Sean,
      Hope all is well at your side. I got a chance to see this article written by you. I work as a Sr. Technical Resource Specialist and have an urgent job from a client for a QA tester position requiring INFA experience. I tried hard to find out what exactly that means but didn’t get any appropriate answer.
      I hope you can shed some light on this…

      Thanks & Regards
      Shobhit Vasudeva

