Hitachi Vantara Pentaho Community Forums

Thread: Efficient way of doing joins

  1. #1

    Efficient way of doing joins


    I have to join data from different sources: SQL Server databases, text files, Excel files, etc. My problem is that I very often run out of memory when trying to join these data streams. The files are normally big in terms of number of rows (above 1 million rows) but lean in terms of number of columns (approx. 20 columns). I would like to know how you have solved this type of problem before and how you merge streams of data efficiently.

    Thanks for your help.

  2. #2


    The most efficient way to join large tables is the Merge Join step. It requires both input streams to be sorted on the join keys, so that it can read them side by side instead of loading a whole stream into memory.
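To show why sorted input matters, here is a minimal Python sketch of the general merge-join idea. It is not Pentaho's code: the function name `merge_join` and the sample rows are made up for illustration, and it assumes both inputs are already sorted ascending on the join key.

```python
from itertools import groupby
from operator import itemgetter


def merge_join(left, right, key=itemgetter(0)):
    """Inner-join two iterables that are ALREADY sorted ascending by `key`.

    Only the right-hand rows of the current key are buffered in memory.
    """
    left_groups = groupby(left, key)
    right_groups = groupby(right, key)
    done = (None, None)  # sentinel returned when a side is exhausted
    lk, lrows = next(left_groups, done)
    rk, rrows = next(right_groups, done)
    while lrows is not None and rrows is not None:
        if lk < rk:
            # Left key has no partner yet: advance the left side.
            lk, lrows = next(left_groups, done)
        elif lk > rk:
            # Right key has no partner yet: advance the right side.
            rk, rrows = next(right_groups, done)
        else:
            # Keys match: pair every left row with every right row.
            right_block = list(rrows)
            for left_row in lrows:
                for right_row in right_block:
                    yield left_row, right_row
            lk, lrows = next(left_groups, done)
            rk, rrows = next(right_groups, done)


# Hypothetical sample data, both sorted on the first column.
orders = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
customers = [(2, "x"), (3, "y"), (4, "z")]
joined = list(merge_join(orders, customers))
```

Each input is read once, front to back, which is why memory use stays flat no matter how many rows the files have.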
    If the files are not sorted, you have two options. You can sort them in Pentaho: you can limit the number of rows held in memory, anything above that limit is written to a temp directory, and you can choose to compress the temp files to save some space. Or you can load them into a staging area in a database and read them back with ORDER BY. I prefer the second way because then you don't have to worry about temp space on the machine, which can be an issue in production environments.
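The first option (sort with a row limit, spill to compressed temp files) can be sketched roughly like this in Python. This only illustrates the general external-sort technique and is not how Pentaho implements it; `external_sort`, `max_rows_in_memory` and `tmp_dir` are hypothetical names chosen for the example.

```python
import gzip
import heapq
import os
import pickle
import tempfile
from itertools import islice
from operator import itemgetter


def _read_run(path):
    """Stream rows back from one gzip-compressed temp file."""
    with gzip.open(path, "rb") as gz:
        while True:
            try:
                yield pickle.load(gz)
            except EOFError:
                return


def external_sort(rows, key=itemgetter(0), max_rows_in_memory=100_000, tmp_dir=None):
    """Yield `rows` sorted by `key`, keeping at most one chunk in memory."""
    rows = iter(rows)
    paths = []
    with tempfile.TemporaryDirectory(dir=tmp_dir) as run_dir:
        while True:
            # Sort one memory-sized chunk and spill it to disk.
            chunk = sorted(islice(rows, max_rows_in_memory), key=key)
            if not chunk:
                break
            path = os.path.join(run_dir, f"run{len(paths)}.pkl.gz")
            with gzip.open(path, "wb") as gz:
                for row in chunk:
                    pickle.dump(row, gz)
            paths.append(path)
        # Lazily merge the sorted temp files into one sorted stream.
        yield from heapq.merge(*(_read_run(p) for p in paths), key=key)


# Hypothetical sample data; the tiny limit forces three temp files.
data = [(3, "c"), (1, "a"), (2, "b"), (5, "e"), (4, "d")]
sorted_rows = list(external_sort(data, max_rows_in_memory=2))
```

The temp directory still has to hold a compressed copy of the whole input, which is exactly the disk-space concern that makes the database staging option attractive.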
