More Kettle tips and tricks

This note serves to document things I have learnt using Kettle (3.2). 

 


What Java? Kettle does not appear to function correctly with a 64-bit JVM.  It does happily run with Java 8.  If you do want to run Kettle (with a 32-bit JVM) and Tomcat (with a 64-bit JVM) on the same system, this is perfectly possible. My experience is on a windows system so I configured Tomcat using C:\Program Files\Apache Software Foundation\Tomcat 7.0\bin\Tomcat7w.exe to explicitly tell it where to get its JVM (in my case C:\Program Files\Java\jre1.8.0_45\bin\server\jvm.dll) and I set the path environment variable to have C:\Program Files (x86)\Java\jre1.8.0_45\bin at the beginning.
 


Installation  To expand on the recipe given elsewhere:

Download and unzip Kettle.  From version 1.4 we are compatible with version 3 of the PDI so you can download version 3.2 - BUT not yet (at version 1.6) compatible with version 4 of the PDI. 

Unzip the OpenVPMSLoader.zip file found in the import/plugin folder in the openvpms distribution.  This should be unzipped into plugins/steps folder in Kettle.
NOTE: Kettle 3.2 ships with version 2.0.6 of the Spring framework, whereas the OpenVPMS Loader plugin requires Spring 3.x.  Hence you need to:
1. backup your Kettle installation directory
2. remove libext/spring/spring-core.jar
3. copy the OpenVPMS jars to libext/spring: 
         cp plugins/steps/OpenVPMSLoader/*.jar libext/spring/
4. Start Kettle

Failure to do the above will result in a "No application context specified" error when the OpenVPMS Loader plugin starts.
 


Kettle is a really neat powerful tool.  You may find the following will save you some time.

Log size  The internal log is not of infinite size. If something goes wrong and you have an error on every row, the earlier records will be discarded.  Also after an overflow situation like this, Kettle seems to be unable to fully clear the log.  That is, you can clear it, and it appears to empty, but when you start a new transform/job the old stuff re-appears.  To fully clear the log after an overflow, stop and restart Kettle.

External log  You can specify the log file for a job or transform.  If you specify it for a job, then the settings apply to all the transforms in the job (unless you provide log settings for them).  Note that the records are batched out into the log - the internal log runs ahead of the external one, and in the event of a crash (most commonly a java heap overflow), you will not get all records in the external log.

Minimum logging = no errors  If you set the logging level to Minimum, then if your transform has errors, these will not show in the log.

Can detach and park  If you have a job that consists of a number of jobs and/or transforms done in sequence, you may wish to re-run the transform omitting some of the transforms/jobs.  The easiest way to do this is to delete the hop at each end of the string you don't to run, then move this out of the way, and rejoin things so as to omit the transform/jobs you don't want to run. See below.

Always turn off Lazy Conversion  Some of the input steps have a 'Lazy Conversion' checkbox.  Do not tick it - using this leads to peculiar error messages.

Trim memo fields  Some databases support a long text field called 'memo'.  If you are processing this data, and you want to log it via the Text File Output step, make sure that you trim back its lenght.  Left unaltered the field lenght is over a million characters and this will cause a heap overflow.

No registry start  It is possible to have Kettle crash leaving its repository containing a screwed up transform.  This may lead to Kettle crashing on start.  The fix is to tick the 'No Repository' box on the startup screen.  Then you can can connect to the repository once Kettle has started. 

Performance Herewith tips to get the maximum performance out of Kettle:

  1. Give MySQL lots of buffer space (I run innodb_buffer_pool_size=3G).
  2. If possible put the database on an SSD - perhaps not wise for production, but for the conversion phase, the faster the disk, the better.
  3. Where possible run transforms/jobs in parallel.  Since the basic pattern of work is 'compute in Kettle, write to MySQL', it turn out that running two jobs in parallel has little effect on the run-time of each. Three or four is possible given suitable hardware.  Running multiple jobs is just a matter of creating a job containing multiple sub-jobs - like the picture below. Right-click on the Start element and select "Launch next entries in parallel".  Note that the complete job does not finnish until all the sub-jobs have finished. You will need to think about what things can and cannot be run in parallel.  You can also split the load of one type of item (eg invoices) into multiple parts with each part loading a separate date range.
  4. Don't think that the OpenVPMS Loader 'Batch Size' parameter should be large. Experimentation shows that things go faster as it gets smaller: eg 500~112 minutes, 200~107 minutes, 50~96 minutes.
  5. If at all possible run kettle and MySQL on the same machine.  For production you will want MySQL (and Tomcat) running on a dedicated server, but for conversion its better to run all on one system.  [Currently I am running two invoice loads in parallel and this is resulting in over 3.3MB/sec of network I/O between Kettle and MySQL.]

The following points all apply to the 'Modified Java Script Value' transform step

Var's not required  You only need to use "var" to define a variable if you need it in the next step.  If its just an internal work variable, then don't use var.

Always turn off compatibility mode  Always disable the compatibility mode. Compatibility should only be set if you have transforms from the earlier versions of Kettle.

Replace may need escape  If you use the replace function, be aware that the second argument (string to look for) is in fact a regular expression, and thus that any regex meta characters will need escaping.  Futhermore, since in javascript \ needs escaping, you need to double the backslash. Hence, to replace periods (.) by spaces you need  replace(str,"\\."," ") and to replace backslashes themselves you need to quadruple them, ie replace(str,"\\\\","x") will replace \ by x.  However, since the 'look for' string is a regex expression, you can use replace() to replace multiple characters at the same time. 
Hence replace(str,"[\\-\\* ]","x") will replace every minus (-), asterisk (*), and space character by 'x'.

Regex is peculiar  If you want to use the str2RegExp function, be aware that the required format of the regex string is not obvious.  Firstly you MUST specify string start (^) and string end ($), and you MUST double every backslash. For example the string to look for any-chars9999 9999any-chars  OR any-chars99999999any-chars (ie phone numbers) is:  "^.*(\\d{4} \\d{4}).*|.*(\\d{8}).*$".

Be careful of plus  Javascript uses the + operator to both add numbers and concatenate strings. It also has no specific variable typing facility, and can treat things that you think are numbers as strings. The expression (v1 + v2) may concatenate the two variables.  If you need these to be treated as numbers either force the second to be converted to a number by using the Number() function [ie write the expression as (v1 + Number(v2))], or use the fact that the minus operator forces the variable to be treated as a number [ie write the expression as (-(-v1 - v2))].

Syndicate content