Hitachi Vantara Pentaho Community Forums

Thread: Rows duplicating on "change number of copies to start" in table output step

  1. #1
    Join Date
    Jan 2014
    Posts
    25

    I have the below transformation.

    [Attached image: sshot.jpg]

    There are 60 million rows to be loaded. The "change number of copies to start" property of the table output step is set to 2 in an attempt to increase throughput.
    All 60 million rows need to go into both tables; the two tables have different columns.
    A test run of the transformation with 100 rows loaded 100 rows into each table, but the rows included duplicates.
    How can I ensure that only unique rows are inserted into the tables?
    Last edited by nivinjacob; 01-24-2014 at 02:26 AM. Reason: wrong image

  2. #2
    Join Date
    Apr 2009
    Posts
    337

    Please do not use the dummy step; just launch two copies of the table input step.
    Regards,
    Madhu

  3. #3
    Join Date
    Jun 2012
    Posts
    5,534

    Not really sure if your throughput will improve that way, but watch your data movement setting: It will affect each copy of a step, so a bit of decoupling is called for.

    [Attached image: Copies.png]
    So long, and thanks for all the fish.

  4. #4
    Join Date
    Jan 2014
    Posts
    25

    Marabu's decoupling worked fine. The tables are loaded correctly.
    However, if the table input step is configured to launch 2 entries (to speed up the read), twice the number of records are loaded into the target tables.
    Is there a way to solve this?

    When the Dummy step is removed and 2 copies of the table input step are launched, there are 2 scenarios:

    1. When the data movement from the table input step is round robin, duplicates are loaded into the target tables, though the row count matches the source.
    2. When the data movement from the table input step is copy, twice the number of source rows are loaded into the target tables.
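
The two scenarios above can be reproduced with a small simulation. This is only a simplified model, not Pentaho's actual row-distribution code: it assumes every copy of the input step runs the same SELECT and therefore sees all source rows, and that round robin alternates each copy's rows between the two targets. The function and table names (`run`, `table_a`, `table_b`) are made up for illustration.

```python
def run(source_rows, input_copies, targets, movement):
    """Simplified model of N identical input copies feeding several target tables.

    Assumption: each input copy executes the same query and reads ALL rows.
    """
    tables = {name: [] for name in targets}
    for _copy in range(input_copies):
        for i, row in enumerate(source_rows):
            if movement == "copy":
                # Copy mode: every row goes to every target.
                for name in targets:
                    tables[name].append(row)
            elif movement == "round_robin":
                # Round robin: rows alternate between the targets.
                tables[targets[i % len(targets)]].append(row)
            else:
                raise ValueError(f"unknown movement: {movement}")
    return tables


rows = list(range(100))  # stand-in for the 100-row test run
targets = ["table_a", "table_b"]

rr = run(rows, 2, targets, "round_robin")
cp = run(rows, 2, targets, "copy")

for label, result in (("round robin", rr), ("copy", cp)):
    for name, loaded in result.items():
        print(f"{label:11s} {name}: {len(loaded)} rows, {len(set(loaded))} distinct")
```

Under these assumptions, round robin gives each table 100 rows of which only 50 are distinct (scenario 1), and copy gives each table 200 rows (scenario 2). In both cases the root cause is the same: two input copies read the same data.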

  5. #5
    Join Date
    Jun 2012
    Posts
    5,534

    Originally posted by nivinjacob:
    However if the table input step is configured to launch 2 entries
    Think twice: You get identical step copies, so each copy runs the same SELECT statement and reads the same rows.
    If you really think a second input step will do you good, you will have to add it explicitly.
    Come up with a data partitioning scheme, so your SELECT statements don't read the same rows.
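
One possible partitioning scheme, sketched here outside of Pentaho with Python's built-in `sqlite3` module, is to split the rows by a numeric key modulo the number of readers. The table `src` and the column `id` are hypothetical; the point is only that the two SELECT statements return disjoint sets of rows that together cover the whole source.

```python
import sqlite3


def read_partition(conn, part, n_parts):
    """Return the ids whose remainder modulo n_parts equals part."""
    cur = conn.execute(
        "SELECT id FROM src WHERE id % ? = ? ORDER BY id",
        (n_parts, part),
    )
    return [row[0] for row in cur]


# Hypothetical source table with 100 rows, ids 1..100.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO src (id) VALUES (?)", [(i,) for i in range(1, 101)])

# Each explicit input step would use one of these two queries.
part0 = read_partition(conn, 0, 2)  # even ids
part1 = read_partition(conn, 1, 2)  # odd ids

print(len(part0), len(part1), set(part0) & set(part1))
```

In the transformation, each of the two explicitly added table input steps would carry one of the two WHERE clauses, so no row is read twice. A range split on the key would work equally well if the key is not evenly distributed modulo 2.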
    So long, and thanks for all the fish.
