ETL Validation of Securities in Informatica, PDI (Kettle) and Talend

I have used ETL tools for a variety of purposes over the years, loading data into systems ranging from traditional data warehouses to operational systems.

One of the things I have noticed over the years is how the designers of some mappings fail to consider sources of error, which can manifest themselves in ways ranging from the outright failure of the routine through to silent data loss.
In many ways these things are inexcusable. If you are loading into a system which contains not-null constraints on certain fields, then check your incoming rows to see if they will violate those constraints. If you have taken this step and managed to avoid having a mapping fail through a violation, then please don't simply filter the offending row out. Use a router transformation to send the offending details to a file or table so that someone can investigate the issue.
The same logic should apply to lookups. If you are looking up data in a table and the record you are attempting to match does not exist, the lookup returns a null; many developers then include a filter to check for the null and effectively remove the row. It's really not that much extra effort to add a router instead, and you can then ask a business person to investigate and rectify the problem in the source system.
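As a rough sketch of the idea in plain Java (the class and field names here are mine, purely for illustration, not from any of the tools discussed), routing rather than filtering looks something like this:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch: rows whose lookup fails are routed to an error
// stream for investigation rather than being silently filtered away.
public class LookupRouter {

    public static class Result {
        public final List<String> valid = new ArrayList<String>();
        public final List<String> errors = new ArrayList<String>();
    }

    public static Result route(List<String> accountIds, Map<String, String> lookup) {
        Result r = new Result();
        for (String id : accountIds) {
            if (lookup.get(id) == null) {
                r.errors.add(id);   // send to an error file/table, don't drop
            } else {
                r.valid.add(id);
            }
        }
        return r;
    }
}
```

In a real mapping the error stream would be written to a table or file for a business user to review.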

In addition to adding error handling logic to things such as lookups, in many circumstances it is possible to validate the data itself. Certain accounting systems may, for example, have customer account numbers which lie within a specified range (eg 100000 <= x <= 10000000). Where possible you should endeavour to find out about such things from the business users and then try to create reusable routines to validate the data which can be shared across your transformations.

I have created some routines which will hopefully benefit anyone working in the finance industry and dealing in fixed income or equity securities. These securities carry identifiers which serve as the primary means of identification; the prominent standards are:

CUSIP – A North American standard. Codes are 9 characters long, the last of which is a check digit.
ISIN – The international standard. Codes are 12 characters long, the last of which is a check digit.
SEDOL – A UK standard. Codes are 7 characters long, the last of which is a check digit.

The check digit helps to identify invalid codes, which you may receive for any number of reasons including invalid keying by an end user.
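For the curious, the three check-digit algorithms can be sketched in plain Java as follows. This is my own illustrative version, not the downloadable routines themselves; it assumes upper-case input and omits the special CUSIP characters (*, @, #):

```java
// Sketch of check-digit validation for CUSIP, SEDOL and ISIN codes.
// Assumes upper-case alphanumeric input only.
public final class SecurityValidator {

    // Maps '0'-'9' to 0-9 and 'A'-'Z' to 10-35; returns -1 for anything else.
    private static int charValue(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'A' && c <= 'Z') return c - 'A' + 10;
        return -1;
    }

    // CUSIP: 9 characters; "double-add-double" over the first 8,
    // doubling the values in even (1-based) positions.
    public static boolean isValidCusip(String s) {
        if (s == null || s.length() != 9) return false;
        int sum = 0;
        for (int i = 0; i < 8; i++) {
            int v = charValue(s.charAt(i));
            if (v < 0) return false;
            if (i % 2 == 1) v *= 2;              // even 1-based position
            sum += v / 10 + v % 10;
        }
        return s.charAt(8) - '0' == (10 - sum % 10) % 10;
    }

    // SEDOL: 7 characters; the first 6 are weighted 1,3,1,7,3,9.
    public static boolean isValidSedol(String s) {
        if (s == null || s.length() != 7) return false;
        int[] weights = {1, 3, 1, 7, 3, 9};
        int sum = 0;
        for (int i = 0; i < 6; i++) {
            int v = charValue(s.charAt(i));
            if (v < 0) return false;
            sum += v * weights[i];
        }
        return s.charAt(6) - '0' == (10 - sum % 10) % 10;
    }

    // ISIN: 12 characters; expand letters to two digits, then apply
    // the Luhn algorithm to the resulting digit string.
    public static boolean isValidIsin(String s) {
        if (s == null || s.length() != 12) return false;
        StringBuilder digits = new StringBuilder();
        for (int i = 0; i < 12; i++) {
            int v = charValue(s.charAt(i));
            if (v < 0) return false;
            digits.append(v);
        }
        int sum = 0;
        boolean dbl = false;                     // double every second digit from the right
        for (int i = digits.length() - 1; i >= 0; i--) {
            int d = digits.charAt(i) - '0';
            if (dbl) { d *= 2; d = d / 10 + d % 10; }
            sum += d;
            dbl = !dbl;
        }
        return sum % 10 == 0;
    }
}
```

For example, Apple's CUSIP 037833100 and ISIN US0378331005 both pass, while changing the final check digit makes them fail.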

Originally I created three routines in Informatica to validate these types. The first batch used a reusable expression and a few user defined functions. I have to hand it to Informatica that this was possible entirely within their DSL and a single transformation, albeit one which called out to a user defined function (UDF), again written using the Informatica DSL (expression language).
Whilst these routines worked fine, my curiosity got the better of me and I decided to see if it would be possible to create the same routines using the new Java transformation introduced in version 8, partly because I suspected it might yield a more efficient routine.
I wrote a piece of code in Eclipse and then tried to port it to Informatica. Oh dear, the validator did not like the syntax. After cursing for a while I realised that it was necessary to convert the code to Java 1.4 syntax, so I had to make a few changes to remove the for-each loops and other Java 1.5 features.
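For anyone hitting the same wall, the conversion is largely mechanical; for example, a Java 1.5 for-each loop becomes an explicit Iterator with a cast (illustrative code, not the actual routine):

```java
import java.util.Iterator;
import java.util.List;

public class Java14Style {
    // Java 1.5+: for (String s : codes) { ... }
    // Java 1.4 equivalent, as the Informatica 8 Java transformation required:
    public static int countNonEmpty(List codes) {
        int n = 0;
        for (Iterator it = codes.iterator(); it.hasNext();) {
            String s = (String) it.next();   // explicit cast, no generics
            if (s != null && s.length() > 0) n++;
        }
        return n;
    }
}
```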
I ran a test load to see which performed best, the Informatica expressions or the Java transformation. My initial results suggest that for small volumes the expression is quickest, but for larger volumes the Java transformation is quicker.

The Informatica expressions take the security as an input string and output the security, an integer error code and an error string.

Having created the Informatica routines I decided I would also recreate them in the two main open source ETL tools, namely Pentaho Data Integration and Talend.

For Talend I implemented the logic as functions, or as Talend refers to them, routines, which can be called from components such as tMap. Although this works satisfactorily it is not ideal, since you may need to make two calls to the routine. In itself this isn't really an issue, but it does mean someone may try to call the getErrorString method before they have called the relevant validate method. The reason for this design is that I don't know how to modify the schema in Talend to add extra ports.
Ideally this would have been implemented as a component which accepted a string (the security) and returned a boolean or int to indicate whether the entry was valid, along with a string containing an error message where applicable. Either way the functions work satisfactorily and had the added benefit of enabling everything to be implemented in Java. When I wrote a previous Talend component I found the JET templates to be a bit of a pain.
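To make the ordering pitfall concrete, here is a minimal sketch of the two-call pattern; the method names and error text are illustrative rather than those of the actual downloadable routines:

```java
// Sketch of the two-call routine pattern: getErrorString() is only
// meaningful after the relevant validate method has been called.
public class ValidationRoutines {

    private static String lastError = "";

    public static boolean validateCusip(String cusip) {
        if (cusip == null || cusip.length() != 9) {
            lastError = "CUSIP must be 9 characters";
            return false;
        }
        lastError = "";
        return true;   // the real routine would also verify the check digit
    }

    public static String getErrorString() {
        return lastError;   // undefined if called before validateCusip()
    }
}
```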

In contrast the PDI objects most closely resemble the Informatica transformations, in that they add the additional fields to the output row. This is of course more intuitive.

I have included a screenshot showing the transformations in use in PDI. I would have shown a Talend sample too; however, the tMap component has been broken on OS X since 2007, so I can't show a real-world example whereby you route/map/switch the error to a different object.

Validating Securities In Pentaho Data Integration


Those of you who are interested in using these transformations can download them from the links below:

Link to download Informatica Routines
Link to download Talend Routines
PDI / Kettle Routines (until I figure out how to contribute these, please email me for a copy)

GPU Utilisation and Palo

I've just come across an article on how Jedox will utilise the Graphics Processing Unit (GPU) to accelerate performance by up to 40 times.

Interesting stuff. However, how many servers out there have the high-end graphics cards normally reserved for gaming machines and graphics workstations? Most servers I encounter don't even have windowing software installed; you simply SSH onto them. Although this might be the case with *nix servers, Microsoft have also announced Server Core, a slimmed-down version of Windows Server. This is a good thing: the lighter the core, the fewer programs installed which could offer an attack vector to malicious sources. However, it might also mean more servers shipping with simpler graphics cards.

In which case I wonder what would be the best kind of server for Palo? I think it could possibly be a Mac. Perhaps not an Xserve, but something like a Mac Pro, since they have a high-end graphics card by default. Regardless of machine, it's good to see Jedox focusing on harnessing the power of the GPU.

It's interesting to note how Apple have included technology to simplify the use of the GPU for general-purpose programming in their soon to be released version of OS X (Snow Leopard). One thing is for sure: if you want optimal performance from future versions of Palo you had best steer clear of a low-end machine with integrated graphics, such as your run-of-the-mill Dell. Something tells me there are going to be some happy accountants out there soon, with a shiny new desktop computer sporting a blazingly fast graphics card.

Taking the Pain out of Excel Reporting with a useful Talend Extension

Creating Excel Reports just got a whole lot easier.

Creating Excel Reports is Easy


You can read more about the tool in the embedded PDF titled Creating Excel Reports Made Easy or download it from here.

The source code is located on the talendForge site. (http://talendforge.org/exchange/tos/?cid=6). If you search by Author Hugo in the java category you can download the addin.

Love it or loathe it, the spreadsheet is a ubiquitous business application which empowers the end user. Whilst data can be conveyed in many formats, most business users are happiest with data in Excel.

I think every IT professional and business user alike will agree with the above statement. While spreadsheets can be a complete nightmare and are frequently abused, for many purposes, including reporting and analysis, the business finds them to be the ideal medium.

Having spent the majority of my IT career working in data related fields I have produced more than my fair share of ETL routines and reports. However, until recently something was missing: creating Excel reports was a pain.

Whilst you could create a report in a tool such as Business Objects, Crystal Reports, JasperReports, JFreeReports or any other dedicated reporting tool and then export the report to Excel, I have always found these tools to be inflexible when exporting to Excel. They lack features such as split panes, drop-down filters (eg autofilters) and multiple worksheets, and they lack control over the layout of the data.

Therefore I have created an open source extension for Talend, written in Java, which I hope can ease this pain. Traditionally you could create Excel reports in Java by using a library such as POI or JXL and writing plenty of code. Alternatively you could write the program in VB/A or .NET, but all of these approaches require large amounts of code and can be inflexible, with many lines dedicated to trivial issues such as formatting and data placement. If you wanted to change the report layout or format you needed to change the code and then undergo regression testing to ensure that you hadn't inadvertently broken something else.

By using this extension and a template based approach the whole process is greatly simplified, but it certainly doesn't lack power or flexibility. Indeed, if you discover there are features which you can't accomplish with the tool and templates, you can include a piece of VBA in the template which will run when the workbook is opened to complete those tasks!

I believe that the tool is straightforward, and it should be possible for a business user or someone without extensive IT training to quickly get up to speed and create great looking reports without a single line of bespoke code. More adventurous users will appreciate the power offered by the tags and the ability to loop over collections and access underlying features, such as renaming worksheets based on object field values. You can read more about the tool in the attached PDF titled Creating Excel Reports Made Easy; in fact you should use that as your guide to get up and running, since it contains numerous useful examples.

Whilst I have endeavoured to test the extension and make it as flexible as possible, I am of course limited by my own imagination and tests. I would therefore be grateful if others would download the tool and provide feedback that will help to improve it and benefit us all.

In a nutshell, the report is created by using Talend to map the incoming data to objects. The image below shows a template where all data has been mapped to a single employee object, hence tags such as ${employee.name}.

See how easy it can be


By using the expressions $[b2 + c2] and $[sum(e2)] in the above report we retain one of the many advantages of Excel, namely the ability to change fields and see the effects, since at run time these values become formulas such as =sum(e2:e10). The business might therefore choose to flex the created spreadsheet to see the effect of changing some variables. In a more traditional report generated by a tool like Crystal these fields would be pre-calculated, and we lose one of the advantages which Excel offers.

It is possible to nest these objects as well. For example, you might have an incoming row which joins two relational tables together, eg Emp and Dept. In this situation you would map some fields to the Employee object and some fields to the Department object; in addition you would tell the tool on which fields it can join the Employee and Department objects.

By allowing for the nesting of objects it is possible to create nested/banded reports, as below. Note: if the pictures look odd, please trust me, this is Excel. I use a Mac, so these spreadsheets have been produced on a Unix machine, although being Java based the reports can equally be created on Windows and Linux.

A Report with Grouping And Outlines


The actual Excel template looks like the one below. Note this is one of the more complex spreadsheets you could hope to create, since it has a variety of features. The tags use JEXL syntax and should therefore be familiar to anyone with a knowledge of JSP, and are simple for business users to pick up. If the tags scare you then believe me, you would need to write a lot more code in VBA or a similar language to import the data from the source system and create a report such as this.
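To give a feel for what a tag engine does under the hood, here is a toy resolver for a tag such as ${employee.name}. It is purely illustrative (the class name and behaviour are mine, not the JEXL implementation): each dotted segment becomes either a map lookup or a JavaBean getter call.

```java
import java.lang.reflect.Method;
import java.util.Map;

// Toy illustration of tag resolution: walks "${a.b.c}" through a
// context of maps and beans, one dotted segment at a time.
public class TagResolver {
    public static Object resolve(String tag, Map<String, Object> context) {
        String expr = tag.substring(2, tag.length() - 1);        // strip "${" and "}"
        String[] parts = expr.split("\\.");
        Object target = context.get(parts[0]);
        try {
            for (int i = 1; i < parts.length && target != null; i++) {
                if (target instanceof Map) {
                    target = ((Map<?, ?>) target).get(parts[i]);
                } else {                                          // fall back to a getter
                    String getter = "get" + Character.toUpperCase(parts[i].charAt(0))
                            + parts[i].substring(1);
                    Method m = target.getClass().getMethod(getter);
                    target = m.invoke(target);
                }
            }
        } catch (Exception e) {
            throw new RuntimeException("Cannot resolve tag " + tag, e);
        }
        return target;
    }
}
```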

The Template used to create the Banded Report


The main piece of the work is done by the Talend extension; however, fear not, this is very simple to use and comprises very few fields.

Mapping the Elements


Please don't let the images showing tags scare you; they really are quite simple once you get the hang of them.

In addition it is possible to create charts, filter the data, add outlines to the data, group the data and use a whole host of formulas.


Enjoy

Hugo

Talend Extensions

On and off I've been writing an extension to Talend for the best part of a few months. I say on and off because so much has got in the way that I haven't had a great deal of time to dedicate to the task.

Some of you may be wondering why I am writing the extension for Talend and not PDI. I guess it's because when I first embarked on this project I decided that the program would need to dynamically create Java classes: the program wouldn't know in advance what the classes were or what they looked like. When I first embarked on this endeavour, the only way I could think of tackling the problem was to write a piece of code which would create a .java file, compile it and add it to a jar, before invoking the Java classloader and loading the class using reflection. I decided that since PDI is a metadata-driven tool, and I had effectively resorted to generating code, this didn't fit with the PDI model.
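For the curious, that round trip can be sketched with the JDK's own tooling (javax.tools arrived with Java 6; the class and method names below are illustrative, not my extension's actual code):

```java
import java.io.File;
import java.io.FileWriter;
import java.net.URL;
import java.net.URLClassLoader;
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;

// Minimal sketch of the generate/compile/load round trip: write a
// .java file, compile it in-process, load the class with a
// URLClassLoader and invoke it via reflection.
public class DynamicClassDemo {
    public static String helloFromGenerated() {
        try {
            File dir = new File(System.getProperty("java.io.tmpdir"), "gen" + System.nanoTime());
            dir.mkdirs();
            File src = new File(dir, "Generated.java");
            FileWriter w = new FileWriter(src);
            w.write("public class Generated { public String hello() { return \"hi\"; } }");
            w.close();

            // Requires a JDK at runtime; on a bare JRE this returns null.
            JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
            compiler.run(null, null, null, src.getPath());

            URLClassLoader loader = new URLClassLoader(new URL[] { dir.toURI().toURL() });
            Class<?> cls = loader.loadClass("Generated");
            Object obj = cls.getDeclaredConstructor().newInstance();
            return (String) cls.getMethod("hello").invoke(obj);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```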

Ironically, after I had written the code to achieve the above, I discovered the DynaBeans project within Apache Commons, which achieved everything I had been attempting and was also a lot simpler. I wouldn't have the same objection to using DynaBeans with PDI, since it isn't generating and compiling code.

Anyhow, I would like to express a few opinions on Talend extensions. By logging this information here, if I decide to create any further extensions for Talend I should at least know which files to check to see why the code isn't compiling.

1) The GUI elements – These are defined in an XML file which offers the ability to quickly add common fields such as text fields. Whilst this is perhaps fine to get you going, I also found it limiting: I can't see how you can create dynamic GUIs. How can you have fields which are loaded based on the data in a table or some other source? With PDI you are writing SWT, so although it might take more lines of code to create the GUI, anything would be possible. What I would really like to do is some sort of validation on the user input; if anyone can tell me how to do this I would be grateful.

2) Code Generation – There are effectively three main Java Emitter Templates: a begin file, which is executed once in the pipeline; a main file, which is executed for every row in the pipeline; and an end file, which is executed at the end of the pipeline.

The generation model can be a real pain. For a start you don't have any of the advanced features of an IDE, such as code completion; however, the most annoying part is when you have an error in the source code. You can essentially have two different kinds of errors.

  • Startup Generation Errors – These are primarily due to syntax errors or similar in your JET scripts. I’ve been writing my code in a text editor, which doesn’t give me syntax highlighting or anything else which would help to spot such errors.
  • Compile time Errors – These occur when you have a problem in your injected code which prevents the generated code from compiling or even generating.

I have found the best way to address such issues is to open the begin, main and end files located in workspace/.JETEmitters/src/org/talend/designer/codegen/translators/{YOUR FOLDER HERE}

So if your XML file defined that your component would appear in the Misc section, then you would find the files that Talend uses in:

workspace/.JETEmitters/src/org/talend/designer/codegen/translators/misc

You can then look at the Java stack trace, find the offending lines in the above files, and then move back to your original JET templates and rectify the problem.

The templates can be messy. I couldn't find any way to add methods within the templates to reuse elements of the code, so I have duplicated code in both the begin and main templates.

Because I couldn't see how to use methods within the templates, some of the code is quite heavily nested and therefore not so easy to read. In fact, since I was writing the code on and off, I found it difficult to come back to the code and amend it.

My other concern is testing the code. I can't see any way to effectively unit test a component, which is of course a concern. Could fixing bugs end up like a game of whack-a-mole, where you whack one mole and two more pop up? How can you ensure that in fixing one bug you don't create two more, particularly if you can't have automated test suites to regression test after any change?

However, on a more positive note, I actually found that the generation model conferred some real advantages. It was possible to move the vast majority of the logic into the template and ensure that the generated code would be simple. Basically, since you are effectively writing two pieces of code, you can ensure that the generated code doesn't have to do a great deal of work and should therefore be fast and use fewer resources.

PALO ETL Client – Web GUI

Great news. The Palo ETL Client now has a GUI. Developed by Tensegrity, it uses GWT and runs in your browser.

Things are looking up on the Palo front. I don't know if they have updated the old web client to allow for the creation of subsets in the browser, but either way this is good news.

Check it out

Update – I've just started the demo and realised that this isn't complete and is at the moment a demo. Still, it will be cool to be able to create ETL routines in a browser and have them executed on a backend server.
In that respect it's a bit like SnapLogic, except SnapLogic uses Flash for its GUI and Python as the ETL engine.

Hurray – The tool is complete

I have basically finished the OLAP prototyping tool. I need to upload the code somewhere; I will probably do this when I'm back in the UK, so most likely at the weekend.

The tool allows you not only to create the model but also to add rules (which are done by attaching notes to the nodes) and to set weights. By default the weight of an element will be one; however, by using the 1 icon it can be given a weight of minus one.

I have uploaded a few more presentations to Screencast. Unfortunately the screen recording tool only allows for short five-minute recordings. The first recording basically talks about BI and sets the scene. The last two recordings cover creating cubes and slicing and dicing the data with JPalo.

mindmap1

mindmap2

mindmap3

JPalo OSX

I've noticed that this blog has been getting numerous hits from Palo and Mac related searches; indeed, many of these are searches such as "JPalo Mac".

I can only assume that there are people out there eager to use JPalo on the Mac. The good news is that this is possible, and you get all the same functionality as anyone else. Unfortunately Tensegrity haven't got a download on their website for an Eclipse RCP. I can only assume this is because they don't have a Mac in their office, and since SWT uses native GUI controls they couldn't test a package, and so weren't willing to publish something which might not work.

However, fear not, it's no big deal to get the Eclipse addin working. All you need to do is visit eclipse.org and download a copy of Eclipse. This will give you a full-blown Eclipse development environment, but more importantly the application will contain the native SWT jars for the Mac. Next, go to the Tensegrity site and download either the Windows or Linux version of the Eclipse client. From there you need to copy the folders within its plugins section to the plugins folder of the version of Eclipse you downloaded.

When you start Eclipse you need to change the perspective to Palo Client or Palo Admin and you're good to go. Slicing and dicing cubes, adding users and groups and many other such tasks are now easily accomplished.

In fact I hope to produce a screencast in the next few days which shows a Palo server running on the Mac, a model created using a mindmap, and then an iterative, agile development cycle using the JPalo RCP and the mindmap to rapidly revise the model.

If anyone has questions on eclipse and jpalo give me a shout.


