Hitachi Vantara Pentaho Community Forums
Results 1 to 9 of 9

Thread: Whats the best way to remove ALMOST duplicate rows (one field is different)

  1. #1
    Join Date
    Feb 2011
    Posts
    3

    Default Whats the best way to remove ALMOST duplicate rows (one field is different)

    Hi,

    I am working on my first "real" transformation, meaning I've done a bunch of practicing with dummy date and now I am working on my first business need.

    I have extracted data from a text file - each row is a 'snapshot' of a problem ticket at a given time. I need to toss the older snapshots and retain the newest.

    for example; 'pm_num' is my problem ticket field; and 'page' is the snapshot. pm_num is a text field, page is a number. 'page' does NOT contain duplicates, pm_num CAN contain duplicates. so if I have two rows, they might look like this:

    pm_num || page
    ----------------------
    pm760162 || 001
    pm760162 || 002

    I need to discard the top row, but I can't really figure out how. Any assistance is appreciated!

    Mike

  2. #2
    Join Date
    Jun 2010
    Posts
    114

    Default

    I am not sure but...
    May be setting pm_num column as unique, ordering the page column in ascending order and then using update step to update the rows.
    Would this help?
    Thanks,
    Om
    ----------------------
    Spoon version -4.4.0

  3. #3
    Join Date
    Apr 2008
    Posts
    1,771

    Default

    Hi Mike.
    I would use the Sort step, sort by pm_num (acending) and page (descending), then use the step Unique Rows.
    Note that in Unique Rows you have to use pm_num to identify a unique row.

    Mick.

  4. #4
    Join Date
    Sep 2010
    Posts
    135

    Default

    Hi,

    I'd do the same as Mick. First, order you rows by pm_num (ascending) and page (descending). As you want to keep the pm_num with the biggest page_num, use "Unique rows" step with pm_num field. It will only keep the first row of the repeated "pm_num" rows (the one with the highst page)

  5. #5
    Join Date
    Feb 2011
    Posts
    3

    Default

    Thanks everyone, the sort and unique did the trick!

  6. #6
    Join Date
    Nov 2008
    Posts
    143

    Default

    You don't actually need the Unique rows step.
    Sort rows has a check box that does the same trick.

  7. #7
    Join Date
    Sep 2010
    Posts
    135

    Default

    Quote Originally Posted by renatopb View Post
    You don't actually need the Unique rows step.
    Sort rows has a check box that does the same trick.
    Hi,

    It has an option "Only pass unique rows (verifies keys only)". I supose it compares the values of a column (the one that is the key) and delete the rows that are repeated. In my case, I need to compare if two rows are equal, but equal means that the values of 4 columns are the same so I guess there is only one way to achieve that, using "Unique rows". By the way, If you use this option in "Sort rows". How do you specify which one is the key column?

    Thanks.
    Kind regards,
    Aitor

  8. #8
    Join Date
    Nov 2008
    Posts
    143

    Default

    Simply tell which columns you want to do the sorting. Those will be your keys.

    Based on your fisrt post, if you don't need "page", then just put "pm_num" as you key and the step will sort it out for you based on "pm_num", discarding all but one value of "page".

    On the other hand, if you do need "page", say the first or the last record, then you better stick with Sort -> Unique approach, because Sort rows will only keep the first value passed on by the stream.
    Last edited by renatopb; 02-10-2011 at 09:40 AM. Reason: typo

  9. #9
    Join Date
    Sep 2010
    Posts
    135

    Default

    Ok, so I understand that if you want to compare and delete the repeated rows by more than one column, just sort by all the columns ant then check "Only pass unique rows (verifies keys only)" option.
    Thank you very much!
    Regards,
    Aitor

Tags for this Thread

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  
Privacy Policy | Legal Notices | Safe Harbor Privacy Policy

Copyright © 2005 - 2019 Hitachi Vantara Corporation. All Rights Reserved.