PDA

View Full Version : File join, stream lookup



kettle_anonymous
02-02-2006, 08:16 AM
Hi.
What is the best way to do a 1 to 0,n join from 2 files ?
For example:
File1: FamilyID, FirstName
FamilyID is unique and is the 'master' file. I must find in the result file all the FirstName


File2: FamilyID, LastName, BirthDate
Here, I have no key, I can have multiple rows in the file2 with the same FamilyId
Not all FamilyID of the File1 are present, nethertheless.



I need a result file like :File3 : FirstName, LastName, BirthDate
LastName, BirthDate can be 'null' is they don't existe in File2



When I use the stream lookup because of its internal structure (hastable), the result file is wrong because I don't have unique Key in File 2.



Is there a solution for my problem ?



Thanks in advance.
ppa

MattCasters
02-02-2006, 08:25 AM
As much as I would like to minimize I/O, you probably need to store the files in temporary tables and do a database lookup or database join with that. Yesterday, I added a tracker for a feature request that will allow you to do complex comparissons in StreamLookup. (If the comparator is "equal to", it will use a hashtable, otherwise a sorted list.)

Euh, hold on, wait a minute, you can use a "Join Rows" step and set the condition there.
That should give you the result you like. It will give you a cartesian product, but you could use a "Unique Rows" step before joining.
Make sure to change the names of the input fields of "Join Rows" in such a way that they are unique between the 2 streams. Otherwise it's difficult to use the values in the condition.

Hope this helps,

Matt

kettle_anonymous
02-08-2006, 03:01 AM
I think i may be having the same kind of problem :/

I am collecting data from a relational database in MySQL

I have the 2 tables that relate to each other

----------------------------------------------

Table 1: (products)

product_id
product_name
product_description

----------------------------------------------

Table 2: (products_info)

product_id
product_price
product_quantity

----------------------------------------------

In my web application these 2 table are joined using the following simple query:

SELECT *
FROM products AS p, products_info AS pi
WHERE p.product_id = pi.product_id

This will output this result:

product_id
product_name
product_description
product_id <------- (duplicate)
product_price
product_quantity

What i need to do with Kettle/Spoon is the same idea. I need to join the fields from these 2 tables and output the result to a txt/csv file. I have been trying to do this using various steps yet cant quite get the result im looking for :(

Any help would be really appreciated.

Kind regards, Harris

MattCasters
02-08-2006, 12:27 PM
Well, first of all, you can put the exact SQL from above into TableInput and it will work.
If you don&#39;t specify the fields in text-file output it will work.
As long as you know that referencing product_id (2 occurences in the row) will mean manipulating ONLY the first occurence!
Please know that having duplicate field names in rows is bad practice in Kettle as much as it is bad practice in SQL... (and some of databases even don&#39;t allow this)

If you want to source from products and add product info, use a simple "Database Lookup" step with product_id as the key.
You can also use a "Database Join" step where you can specify any complex SQL you would like as a joiner.

Hope this helps,

Matt