I have used ETL tools for a variety of purposes in the past including loading data into different systems, ranging from traditional data warehouses to operational systems.
One of the things I have noticed over the years is how the designers of some mappings fail to take into consideration sources of errors which can manifest themselves in different ways ranging from the failure of the routine through to data loss.
In many ways these things are inexcusable, if you are loading into a system which contains not null constraints on certain field then check your incoming row to see if it will violate these constraints. If you have taken this step and managed to avoid having a mapping fail though a violation then please don’t simply filter the row. Use a router transformation to send the offending details to a file / table so that someone can investigate the issue.
The same logic should apply to things such as lookups. If you are looking up data in a table and the record you are attempting to match does not exist and so returns a null many developers then include a filter to check for a null and effectively remove this row. It’s really not that much extra effort to add a router and you can then request a business person to investigate / rectify the problem in the source system.
In addition to adding error handling logic to things such as lookups, in many circumstances it is possible to validate the data. Certain accounting systems may for example have customer accounts which lie within a specified range (eg 100000 <= x <= 10000000). Where possible you should endeavor to find out about such things from the business users and then try to create reusable routines to validate the data which can be shared across your transformations.
I have created some routines which will hopefully benefit anyone working in the finance industry and dealing in fixed income or equity securities. Effectively these securities have identifiers which serve as the primary means of identification, the prominent standards are:
cusip – A North American Standard. This is 9 digits, the last digit of which is a check digit.
isin – This is the international standard and is 12 digits, the last of which is a check digit.
sedol – A European standard which is 7 characters long, the last of which is a check digit.
The check digit helps to identify invalid codes, which you may receive for any number of reasons including invalid keying by an end user.
Originally I created 3 routines in informatica to validate these types. The first batch used a reusable expression and a few user defined functions. I have to hand it to informatica that this was possible using entirely their DSL and a single transformation, all be it one which was calling out to a user defined function (UDF) again written using the informatica DSL (expression language).
Whilst these routines worked fine my curiosity got the better of me and I decided to see if it would be possible to create the same routines using the new java transformation which was introduced in version 8. Partly since I considered it might be possible to create a more efficient routine.
I wrote a piece of code in eclipse and then tried to port it to informatica. Oh dear the validator did not like the syntax – after cursing for a while I realised that it was necessary to convert the code to java 1.4 syntax so I had to make a few changed to remove some of the foreach loops and other java 1.5 features.
I ran a test load to see which performed best, the informatica expressions or the java transformation. My initial results seem to suggest that for a small volume the expression is quickest but for a larger volume the java expression is quicker.
The informatica expressions take as an input the security as a string and outputs the security, an integer indicating the error code and an error string.
Having created the informatica routines I decided I would also recreate them in the two main open source ETL tools, mainly Pentaho Data Integration and Talend.
For Talend I implemented the logic as functions or as Talend refer to them routines, which can be called from components such as tMap. Although this works satisfactorily, it is also not ideal, since you may need to make two calls to the routine. In itself this isn’t really an issue, but it does mean someone my try and call the getErrorString method before they called the relevant validate security method. The reason for this is that I don't know how it is possible to modify the schema in Talend to add extra ports.
Ideally this would have been implemented as a components which accepted a string (the security) and returned a boolean or int to indicate if the entry was valid and also a string which would contain an error message where applicable. Either way the functions work satisfactorily and had the added benefit of enabling everything to be implemented in java. When I wrote a previous talend component I found the JET templates to be a bit of a pain.
In contrast the PDI object most closely resembles the informatica transformations in that they add additional rows to the output. This is of course more intuitive.
I have show a screen shot showing the transformations in use in PDI. I would have shown a sample in Talend, however the tMap component is broken in OSX and has been since 2007, so I can't show a real world example whereby you route/map/switch the error to a different object.
Those of you who are interested in using these transformations can download them from the places listed below: