View Full Version : Starting with Big Data: lookups

07-09-2013, 09:02 AM
Hi all,

I have been using Kettle for some months and have often run into problems with big data file inputs (most of them involving joins).

Now I have to compare two big structured text files (a key followed by some lines), each of which may contain millions of rows.

I have considered composing key-value pairs from both files, so I could search by key and compare values. I wanted to store all the data of the first file in a NoSQL database (structured as I said), and then, as I read the second file, query it for each key I find on the fly.

Obviously I can't use the Stream Lookup step because it runs out of memory, so I would like to test whether I could use some kind of NoSQL database to store the first file's data and make queries by key.
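Just to make the idea concrete, here is a minimal sketch of that approach in Python using an on-disk SQLite table as the key-value store (any disk-backed store would do). The `(key, value)` row layout and the compare logic are assumptions for illustration, not your actual file format:

```python
# Sketch: load the first file's key/value pairs into a disk-backed SQLite
# table, then stream the second file and look each key up one at a time,
# instead of holding everything in memory like a stream lookup would.
import sqlite3

def compare_files(file_a_rows, file_b_rows, db_path=":memory:"):
    """file_*_rows: iterables of (key, value) pairs.
    Returns the keys whose values differ (or are missing) in file A."""
    con = sqlite3.connect(db_path)  # use a real file path for big data
    con.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v TEXT)")
    con.executemany("INSERT INTO kv VALUES (?, ?)", file_a_rows)
    con.commit()

    diffs = []
    for key, value in file_b_rows:  # stream the second file row by row
        row = con.execute("SELECT v FROM kv WHERE k = ?", (key,)).fetchone()
        if row is None or row[0] != value:
            diffs.append(key)
    con.close()
    return diffs
```

With an index on the key (the primary key gives you one for free), each lookup is a cheap B-tree probe, so memory stays flat no matter how many rows the first file has.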

At this point, I wanted to ask you whether Hadoop could help me, or what alternatives Pentaho 4.3 could offer for my purposes (MongoDB maybe?). I would like this done as efficiently as possible.

Thank you very much. Regards!

07-12-2013, 04:05 PM
This is similar to the following post:


which has a link to Hemal's blog describing how to do MapReduce-based joins/lookups with Kettle.
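For anyone landing here without the link, the general technique is a reduce-side join: the map phase tags each record with its source file, the shuffle groups records by key, and the reduce phase pairs the two sides up. This plain-Python sketch stands in for Hadoop to show the idea; it is not the code from the blog:

```python
# Minimal reduce-side join sketch: map tags records, shuffle groups by
# key, reduce pairs left-side values with right-side values per key.
from collections import defaultdict

def map_phase(records, tag):
    # Emit (key, (tag, value)) pairs, as a mapper would.
    return [(k, (tag, v)) for k, v in records]

def reduce_join(mapped):
    # Shuffle: group the tagged values by key.
    groups = defaultdict(list)
    for k, tagged in mapped:
        groups[k].append(tagged)
    # Reduce: pair every left value with every right value for each key.
    joined = {}
    for k, tagged in groups.items():
        left = [v for t, v in tagged if t == "L"]
        right = [v for t, v in tagged if t == "R"]
        joined[k] = [(l, r) for l in left for r in right]
    return joined
```

In a real Hadoop/Pentaho MapReduce job the shuffle is done by the framework, so neither side of the join has to fit in one machine's memory.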