Hitachi Vantara Pentaho Community Forums

Thread: Losing data during Merge Join

  1. #1
    Join Date
    Jul 2017
    Posts
    5

    Default Losing data during Merge Join

    Hi y'all. I've been having a peculiar problem with Pentaho.

    I have this huge CSV file that I run through several transformations. Part of those transformations is adding extra information from other CSV files. The problem is that when I merge this huge file (about 6.5 million lines) with a much smaller file (about 2200 lines) using a Left Outer join, to add a particular column, I end up missing several values from the smaller file, and I don't know exactly why. This merge is the first step of the transformation, so no earlier step can be altering the data.

    As some extra info, the merge is done using a single column. Also, both files are sorted before the merge, so unsorted input shouldn't be the cause.
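
    For anyone reading along: a merge join walks both inputs with forward-only cursors, so it depends on both sides being sorted on the join key, and sorted by the same rules. The Python below is only a rough sketch of that idea, not Pentaho's actual implementation; the sample rows and the helper names (`merge_left_outer`, `matched`) are made up for illustration. It shows one possible way matches can go missing even though both inputs were "sorted": one side sorted case-sensitively, the other case-insensitively.

```python
def merge_left_outer(left, right):
    """Left outer merge join on the first field of each row.

    Assumes both lists are sorted ascending by key under Python's own
    string comparison, and that keys in `right` are unique.
    """
    joined = []
    j = 0  # cursor into `right`; it only ever moves forward
    for row in left:
        key = row[0]
        # Skip right-hand rows whose key is smaller than the current key.
        while j < len(right) and right[j][0] < key:
            j += 1
        if j < len(right) and right[j][0] == key:
            joined.append(row + right[j][1:])
        else:
            joined.append(row + (None,))  # no match: pad with None
    return joined


def matched(rows):
    """Count joined rows that picked up a value from the right side."""
    return sum(1 for row in rows if row[-1] is not None)


# Hypothetical data. Case-sensitive sort order is: B, a, c
parents = sorted([("a", "P1"), ("B", "P2"), ("c", "P3")])
big_same_order = sorted([("a", 1), ("B", 2), ("c", 3)])
# Same rows, but sorted case-insensitively: a, B, c
big_other_order = sorted([("a", 1), ("B", 2), ("c", 3)], key=lambda r: r[0].lower())

print(matched(merge_left_outer(big_same_order, parents)))   # 3
print(matched(merge_left_outer(big_other_order, parents)))  # 2
```

    In the second call the row with key "B" comes out with no parent value, because the right-hand cursor has already moved past "B" by the time that row arrives. Whether something like this applies here depends on how the two sort steps are configured.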

    I've made a sub-transformation of the whole thing to confirm that the error really is there. You can see it here:

    [Image: pentah.jpg]

    Here's a link to the transformation shown: https://drive.google.com/file/d/0B9q...ew?usp=sharing . The full transformation is bigger, but the error manifests right after the select, or the merge; I still don't know exactly why data is being lost.

    The filter removes the lines that have no data from the right-hand file (Parent), and the result should have the same number of lines as the original Parent file, but instead it has far fewer. Any guesses, please?
    Last edited by manuati; 11-02-2017 at 03:36 PM. Reason: Added some extra information

  2. #2
    Join Date
    Jun 2012
    Posts
    5,534

    Default

    Try to limit images to VGA size, otherwise they are subject to compression by the forum software.
    Or zip and attach images that must not be compressed, for better readability.
    There's a tool button for adding attachments in the advanced editor.
    So long, and thanks for all the fish.

  3. #3
    Join Date
    Apr 2008
    Posts
    4,696

    Default

    Maybe because an Inner join only shows rows that have data from both sides? All other data is dropped. That's the definition of an Inner join.
    Try Left Outer, Right Outer, or Full Outer to see what it does to your output.

  4. #4
    Join Date
    Jul 2017
    Posts
    5

    Default

    Dang, forgot to mention: It's a Left Outer join actually. All the joins done to the big file are done that way, since it's the one with the information that matters. I tried a Full Outer but didn't have much luck. Haven't tried a Right Outer yet.

    Thanks for commenting!

  5. #5
    Join Date
    Jun 2012
    Posts
    5,534

    Default

    Can you attach your transformation without the files, so we can inspect the settings?

  6. #6
    Join Date
    Jul 2017
    Posts
    5

    Default

    Quote Originally Posted by marabu View Post
    Can you attach your transformation without the files, so we can inspect the settings?
    Here you go: https://drive.google.com/file/d/0B9q...ew?usp=sharing

    I also modified the original post with the link.

  7. #7
    Join Date
    Apr 2008
    Posts
    4,696

    Default

    Right Outer should be what you're looking for.

    You defined "Parent" as "Two" which is the same as "Right", so... Left (1) Outer will give all rows from Left (1), regardless of whether a match exists in Right (2)
    You want the opposite.

    Alternatively, swap the "First Step" and "Second Step" in the configuration.
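
    To make the difference concrete, here is a small Python sketch of the two join types. It is not how Pentaho implements them, and the sample rows and function names (`left_outer`, `right_outer`) are invented for illustration. `events` plays the role of the first (left) step and `parents` the second (right) step.

```python
def left_outer(first, second):
    """Keep every row of `first`; attach matching values from `second` or None."""
    index = {}
    for key, value in second:
        index.setdefault(key, []).append(value)
    rows = []
    for key, value in first:
        for other in index.get(key, [None]):
            rows.append((key, value, other))
    return rows


def right_outer(first, second):
    """Keep every row of `second`: a left outer join with the inputs swapped,
    with the columns put back in (key, first value, second value) order."""
    return [(key, a, b) for key, b, a in left_outer(second, first)]


# Hypothetical sample data
events = [("p1", "e1"), ("p1", "e2"), ("p2", "e3"), ("p9", "e4")]
parents = [("p1", "Alpha"), ("p2", "Beta"), ("p3", "Gamma")]

for row in left_outer(events, parents):
    print("left ", row)   # p9 is kept with None; p3 never appears
for row in right_outer(events, parents):
    print("right", row)   # p3 is kept with None; p9 never appears
```

    Note that with a Left Outer join followed by a filter on the Parent column, the row count you get is the number of left-hand rows that found a match, which is not necessarily the number of rows in Parent.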

  8. #8
    Join Date
    Jul 2017
    Posts
    5

    Default

    Quote Originally Posted by gutlez View Post
    Right Outer should be what you're looking for.

    You defined "Parent" as "Two" which is the same as "Right", so... Left (1) Outer will give all rows from Left (1), regardless of whether a match exists in Right (2)
    You want the opposite.

    Alternatively, swap the "First Step" and "Second Step" in the configuration.
    Didn't work. I'm still losing data along the way.

  9. #9
    Join Date
    Apr 2008
    Posts
    4,696

    Default

    Can you post simplified example data that shows us what is happening so we can explain it to you?

  10. #10
    Join Date
    Jul 2017
    Posts
    5

    Default

    Quote Originally Posted by gutlez View Post
    Can you post simplified example data that shows us what is happening so we can explain it to you?
    I don't think I can. I was building a simplified file that has only the columns involved in the merge, since the original file has a lot of sensitive data that I can't share. But when I processed this file, I didn't lose any data after the merge. I tried the transformation again with the original file, and it did lose data, so I'm guessing that to make an example file I would need to include more columns, and that's info I can't really share.
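
    When the data can't be shared, one option is to check the key column locally. The Python below is only a sketch under assumptions: the column name `id`, the inline CSV text, and the helper names are all hypothetical stand-ins for the real files. It lists Parent keys that have no exact match in the big file, and flags the ones that would match if stray whitespace and letter case were ignored.

```python
import csv
import io


def key_set(csv_text, column):
    """Collect the distinct values of one column from CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[column] for row in reader}


def normalize(key):
    """Ignore surrounding whitespace and letter case."""
    return key.strip().casefold()


def near_misses(parent_keys, big_keys):
    """Parent keys with no exact match that do match after normalizing."""
    big_normalized = {normalize(k) for k in big_keys}
    return sorted(
        k for k in parent_keys
        if k not in big_keys and normalize(k) in big_normalized
    )


# Hypothetical stand-ins for the two files
parent_csv = "id,parent\nA1,Alpha\nB2 ,Beta\nc3,Gamma\n"
big_csv = "id,value\nA1,10\nB2,20\nC3,30\n"

parent_keys = key_set(parent_csv, "id")
big_keys = key_set(big_csv, "id")

print(sorted(parent_keys - big_keys))      # keys with no exact match
print(near_misses(parent_keys, big_keys))  # same keys, differing only in space/case
```

    Only the key values are printed, so a check like this doesn't expose the other columns.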

    Either way, I'm being told that the information I'm trying to get from the merge can and will be added in a different way, so there's no need to solve this anymore. Thank you guys for your help and patience, I really appreciate it.


Copyright © 2005 - 2019 Hitachi Vantara Corporation. All Rights Reserved.