I am building an ETL process that includes a MongoDB repository.

When the repository loads, each document gets a hash (generated in Oracle - I am using the combination of a unique key and a hash to minimize the chance of collisions). For the periodic updates, I pull the hash and account number (the unique key) out of MongoDB and send them through a Modified Javascript Value step to convert the documents into rows. I also pull the hash and account number out of Oracle, send both streams to a Merge Join, do an outer join, and then filter the output to figure out which records have changed hashes and therefore need updates.
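
For reference, the JavaScript step is essentially doing something like this (a simplified sketch - the incoming field name "json" and the document keys are placeholders for my actual schema, and it assumes the MongoDB Input step is configured to emit each document as a single JSON string):

    // Modified Javascript Value step: flatten the raw Mongo document
    // into plain row fields. "json" is the incoming field holding the document.
    var doc = eval('(' + json + ')'); // older Rhino builds may lack JSON.parse
    var account_number = doc.account_number; // the unique key
    var mongo_hash = doc.row_hash;           // the hash stored at load time
    // account_number and mongo_hash are then declared as output fields
    // in the step's Fields grid so downstream steps can see them.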

Right now this all happens inside one transformation, and with only 10 rows (for testing) I am getting 6 out of 10 matches when I should be getting 10/10. Both streams are sorted on the join keys prior to the Merge Join step. I am wondering if this is a timing issue, and whether I should pull the data in two separate transformations (one for Mongo, one for Oracle) and then feed those into a third transformation that does the Merge Join.
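
One thing I plan to rule out before blaming timing is key normalization - if the Oracle side is a padded CHAR column, or the two streams carry different types or casing, the Merge Join will quietly miss rows. Something along these lines (an untested sketch, field names are mine) on both streams before the sorts:

    // Defensive cleanup before the Sort rows / Merge Join steps:
    // force both streams to byte-identical string keys.
    var account_key = String(account).replace(/^\s+|\s+$/g, '');  // strip CHAR padding
    var hash_key = String(row_hash).replace(/^\s+|\s+$/g, '').toUpperCase(); // unify case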

Also - I am wondering if there is a better way to do this. With Mongo I am kind of out on the cutting edge: I have to pull the data in, use the Javascript step to convert the documents to rows, do the comparison, and then filter that output to find the rows that either need to be updated or are new and need to be inserted.
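
To be concrete, the post-join logic I have in mind is basically this (a sketch - oracle_hash and mongo_hash are just what I call the two hash columns coming out of the outer join):

    // After the outer Merge Join: decide what to do with each record.
    // mongo_hash comes back null when no matching document was found.
    var action;
    if (mongo_hash == null) {
      action = 'insert';      // new record, not in Mongo yet
    } else if (oracle_hash != mongo_hash) {
      action = 'update';      // hash changed, so the source row changed
    } else {
      action = 'unchanged';   // nothing to do
    }
    // A Filter rows step downstream then routes rows on the action field.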

I'm new to Pentaho, but I do have some ETL experience. I would appreciate any advice, direction, or feedback.

Thanks,

Matt