Adding more breeds to lookup

Submitted by pyevet on Sat, 07/06/2014 - 17:08

Implementation discussion

G'day all,

I have been asked to add a handful of breeds into the test system and thought this would be a good oportunity to practice using kettle. Creating a csv with an id, CODE and Name and mapping these to the archetype loaded fine using the plugin. However no link to species was created. Further hunting lead to the lookupRelationship.speciesBreed archetype which needs the lookup_id of the breed as the target.

Looking at http://www.openvpms.org/forum/adding-patients-kettle#comment-4395 indicates that multiple sequential transforms will be needed. How can I fetch the breed lookup_id to map to the target as part of the transform?

Regards

Re: Adding more breeds to lookup

Submitted by tanderson on Sat, 07/06/2014 - 22:08.

You need an step which can provide each of the species and their corresponding breeds.

This could be a spreadsheet or CSV that has the species code in one column and the breed in the other.

This then feeds in to an OpenVPMS Loader step like:

Field Name	Map To	Value
species	<lookupRelationship.speciesBreed>source	<lookup.species>$value
breed	<lookupRelationship.speciesBreed>target	<lookup.breed>$value

This says:

1. Create a new lookupRelationship.speciesBreed

2. Assign the 'source' node the lookup.species reference determined by $value, which comes from the 'species' field

3. Assign the 'target' node of the lookup.breed reference determined by $value, which comes from the 'breed' field

This relies on both species and breed having previous being loaded via an OpenVPMS Loader step. If this is not the case, then you can refer to existing rows using:

Field Name	Map To	Value
species	<lookupRelationship.speciesBreed>source	<lookup.species>code=$value
breed	<lookupRelationship.speciesBreed>target	<lookup.breed>code=$value

This says:

1. Create a new lookupRelationship.speciesBreed

2. Assign the 'source' node the lookup.species with code = the species field.

3. Assign the 'target' node the lookup.breed with code = the breed field.

Or something like that. Hopefully the ETL gurus will correct me if I'm wrong.

Re: Adding more breeds to lookup

Submitted by tgething (not verified) on Sun, 08/06/2014 - 05:13.

Hi - I have the stuff that you need but I am currently travelling in Sicily, but will be home in Oz next Thursday.

What I did was to build a csv file of the data I needed to load and the used a kettle transform to generate the xml file to be loaded by the dataload script. I can give you the various files, but not until I get home next week.

Regards, Tim G

Re: Adding more breeds to lookup

Submitted by tanderson on Sun, 08/06/2014 - 11:47.

Any reason you did it that way? I would have thought a kettle-only approach would be easier.

Re: Adding more breeds to lookup

Submitted by tgething (not verified) on Sun, 08/06/2014 - 16:15.

Tim A - you are probably correct. This was one of the first things I did for the RxWorks conversion and I was trying to get as much as possible loaded via dataload. If I had to do it again, I would go the kettle route. The trick you need is to understand that you can refer back to something loaded earlier if you know its key in the etlog file. Sound like a good project to do and document.

Regards, Tim G

Re: Adding more breeds to lookup

Submitted by pyevet on Sun, 08/06/2014 - 14:43.

Thanks Tims,

I tried a one-step, but it needed the breed first. So next, a 2-step using the same source csv file with code, name, description and species code. The second step failed until I gave it a different id, then all worked well.

After lunch I'll make 2 csv's, each with just what each step needs, then make a job out of it. Alternately, maybe 1 source file with 2 id fields, pointing one transform to one and the next transform to the other, like "br001" then "spbr001".

Thanks again,

Simon

Simon Slater

Registered Linux User #463789 @ http://linuxcounter.net

Re: Adding more breeds to lookup

Submitted by pyevet on Sun, 08/06/2014 - 16:31.

The alternate worked nicely, vis 1 file with 2 distinct id fields, 2 transforms called by 1 job.

Can an id be reused in a different context? i.e. can a separate csv file for a lookup of colours with a separate transform have, say id=001 if there has already been a breed with id=001? Or do I need breed id=br001 and colour id=c001 for example?

Simon

Simon Slater

Registered Linux User #463789 @ http://linuxcounter.net

Re: Adding more breeds to lookup

Submitted by tanderson on Sun, 08/06/2014 - 23:13.

The "Id Field Name" field in the "Map values" dialog serves two purposes:

to uniquely identify input rows, when coupled with the Loader name. This is used to skip loading rows that have aleady been loaded
establish a link between the input row and the generated objects. E.g. the row with Id "CANINE" might generate a lookup.species object. This can be subsequently referenced using <lookup.species>CANINE.

So yes, an id can be re-used in a different context, except in the situation where two generated objects have the same reference. E.g. if both transform A and B generate objects with reference "<party.customerperson>JOEBLOGS" then you'll get an error when you try and use the reference:

Invalid reference: <party.customerperson>JOEBLOGS. Reference resolves to > 1 objects

Re: Adding more breeds to lookup

Submitted by pyevet on Tue, 10/06/2014 - 12:56.

To play with this, I made a transform to load up some colours from a csv file and deliberately re-used an id already used for a breed. The transform loaded all but that one record, but without error, so this would be covered by your first point. I thought that the second may have applied though, because the archetypes are different ( lookup.breed and lookup.colour). I guess the failure is because the entity referenced for each archetype is the same, that is, code, even though the value of the code is different.

Regards

Simon.

Simon Slater

Registered Linux User #463789 @ http://linuxcounter.net

Re: Adding more breeds to lookup

Submitted by tgething (not verified) on Tue, 10/06/2014 - 15:35.

Simon - if you look at the structure of the etl_log file (which is where the OpenVPMS Loader facility in Kettle remembers what has been done) you will see that one of the keys is the row_id plus the archetype. This enables the loader to have the option of skipping a row if it has been loaded previously, and it also allows you to reference something loaded in a prior step.

This also means that there are times when it is necessary to clear the etl_log file so that prior kettle runs are forgotten - however, this can also cause problems because it may lead to the creation of 'duplicate' data. ie, if I load up some breeds, then clear the log, and run the Kettle transform again the breeds will be duplicated - EVEN though you have 'skip duplicates' set in the OpenVPMS Loader step - because you have emptied the loader's memory of prior actions by clearing the log. [Remember than unlike a traditional database where you might have a breed file where species/breed must be unique, OpenVPMS uses and entity/relationship structure where everything has an id (which is unique) but there is no built-in limit to having two products or breeds or ... with the same name.

Regards, Tim G

Re: Adding more breeds to lookup

Submitted by pyevet on Tue, 10/06/2014 - 17:25.

Thanks Tim,

Given what you said I would have thought the skipped records should have loaded. But that is not important at this stage, I'll do more reading on E/R models and ETL. When preparing the data, I'll just prepend a string like CUST or PET etc to the unique id for that table. That should eliminate any trouble later if the numeric ids are the same accross tables.

Regards, Simon

Simon Slater

Registered Linux User #463789 @ http://linuxcounter.net

Re: Adding more breeds to lookup

Submitted by tgething (not verified) on Tue, 10/06/2014 - 22:31.

Simon: the etl loader record identification is totally separate from the ID number that gets generated when the entity is created. I have always used (for the loader id) a character string that is meaningful to me "PN" for patient number, followed by the RxWork patient number [I was converting an RxWorks database]. You do have to be a little cautious (because of the way java works) when you form up the string.

I will provide more info and examples after I get home (Thurs evening AEST).

Regards. Tim G

Re: Adding more breeds to lookup

Submitted by pyevet on Mon, 16/06/2014 - 11:10.

G'day Tim, how's the jet-lag going?

I think what I am missing is in the correct use of the Id Field Name in the loader plugin and what field to actually use there. I have had a play with adding data and all is fine, but when attempting to update the same data the transforms fail silently, even after a Tomcat re-start to clear the cache.

My data structure is heirarchical (cust -> pet -> tx etc), each table having a unique numerical key. It is this unique key with prepended text ( 001 -> id001) that I have used in the Id Field Name. However, whenever the same id value is found the record is skipped rather than updated, even with Skip Processed unchecked. I have things like multiple phone numbers, email addresses etc in separate tables, so these will need to be added after the customers to set the <contact.phoneNumber><lookup.contactPurpose>.

Looking at http://www.openvpms.org/forum/kettle-openvpms-loader-how-load-customer-n... I would have thought that the CUSTID be used in the Id Field Name rather than CUSTOMER_NOTE_ID. Could you explain this?

On this same example, CUSTOMER_NOTE_ID appears in 2 rows, 4 & 6. Does the <act.customerNote>$value in the value column of these rows refer to the CUSTOMER_NOTE_ID value or to the note itself in row 1?

Regards, Simon

Simon Slater

Registered Linux User #463789 @ http://linuxcounter.net

Re: Adding more breeds to lookup

Submitted by tgething (not verified) on Mon, 16/06/2014 - 15:31.

Simon:

a) jetlag - have not yet managed to re-arrrange my sleeping pattern, at 7pm I am knackered and go to bed, but wake up at 1am ready to start the day !!

b) my example for loading species and breeds is attached - after downloading, rename to lsb.zip

c) now to your question above, its probably easiest to work through the lsb (Load Species & Breeds) example.

The job runs 3 transforms, to load the species, then the breeds then to link them.

Species load is as follows:

Very simple - what is not obvious is the above (because I have not not expanded the Process step) is that oIID contains the same data as oCode. Hence in the loader step we are loading the species data and the etl log will remember that there are items of archetype lookup.species - below is a dump of etl_log. You can see log_id's 1-3 were generated by the 'species' loader step and you can see that the row_id's record the 'ID Field Name' contents. The id and linkId fields allow the species record to be found. [Note that the code field of any lookup is by convention a) uppercase; b) has spaces replaced by underlines - so if you use Administration|Lookups to create the species named say 'Dairy Cow' then the system will generate the code as DAIRY_COW - the spreadsheet SpeciesBreeds.xlsx does this for the me.]

Now we load the breeds as follows:

As you can see this is the same sort of thing as the species load. The subtle difference is that the oIID record id (set from the Code field) is not simply the upper cased breed, but this prefixed by the species, eg FROG_GREEN_SPOTTED. This is done to handle the case where there are two breeds of the same name but different species, eg a Green Frog and a Green Newt.

Now we need to link the breeds to their species. This is done as follows:

Here you can see that to set the link to the breed we set the target field to the value "<lookup.breed>$value" - ie we are telling the loader to go and look in etl_log for a record whose archetype is lookup.breed and whose row_id is what is in the Code field - when it finds this record it uses the id and linkId information to fill in the target field. Similarly for the source field.

It is obvious that the above generates the appropriate data in the lookupRelationship record. What is not obvious is how the other end of the link is set, ie the source field in the species record [actually I should say 'node called source'], and the target field in the breed record. We certainly did not set these fields (sorry 'nodes') when we loaded the species and breed data.

The answer is that the OpenVPMS loader automatically sets these for us - ie it says to itself 'have set the source node in the relationship record - what does it point at? - ahah the species - so I will also go and set the source node in the species'.

I hope this clarifies things. Note that the row_id, ie the contents of the Id Field Name field can be anthing. In this case it made sence to use the contents of the code fields since I knew these were unique. In the case of loading patient and customer data from a system being converted, my practice has been to use the patient and customer numbers from the system being converted and prefix the patient numbers with 'PN' and the customer numbers with 'CN'. All you need is something that can be used to identify previously loaded data.

Regards, Tim G

PS I am quite happy to give you all the kettle stuff that I built for the conversion from RxWorks - if you want to (to dig through and see how everything was loaded) email me at tim dot gething at bigpond dot com

Attachment	Size
lsb-zip.txt	20.04 KB

Adding more breeds to lookup

Comment viewing options