Hitachi Vantara Pentaho Community Forums
Results 1 to 3 of 3

Thread: How to de-dupe a large file and consolidate data?

  1. #1
    Join Date
    Oct 2011
    Posts
    2

    Red face How to de-dupe a large file and consolidate data?

    HI all

    I am very new to Kettle and fairly new to the world of DI.

    I have a large .csv file that I have compiled and deduped from over 100 .csv files, using Kettle.

    However, I'm now stuck on what to do next.

    There are instances where the same Company has the same Phone number, but no address, or a different address, and different Category. How do I roll all the data up so that there is one instance of the Company, with a Street and Mailing address, and all categories concatenated in the one cell (; delimited)?

    Any pointers to examples will be greatly appreciated.

    I am resorting, in desperation, to SQL scripts using MySQL...

    Thanks.

    Regards

    Tweets

  2. #2
    Join Date
    Apr 2008
    Posts
    1,771

    Default

    Hi Tweets..
    welcome to the wonderful world of deduplication.

    I would be very careful in merging data from 2 different records, unless you are 100% sure that they are referring to the same address.
    For example, it could be that you identify the same person, but with street name in one row and street number in another row.
    It could be that those two elements together do NOT improve your data, but create some inconsistencies - ie that number was from an old address!

    But, if you are happy with consolidating your data, I would use the "Group by" step.
    You can specify to group information from the FIRST row with data.
    That should help you to create one row for each company.

    To concatenate your fields you can use a Modified Java Script step or a Calculator.
    I would use JS.
    var newlongfield = address + ";" + postcode + ";" + telephone;

    Mick

  3. #3
    Join Date
    Oct 2011
    Posts
    2

    Default Thanks!

    Quote Originally Posted by Mick_data View Post
    Hi Tweets..
    welcome to the wonderful world of deduplication.

    I would be very careful in merging data from 2 different records, unless you are 100% sure that they are referring to the same address.
    For example, it could be that you identify the same person, but with street name in one row and street number in another row.
    It could be that those two elements together do NOT improve your data, but create some inconsistencies - ie that number was from an old address!

    But, if you are happy with consolidating your data, I would use the "Group by" step.
    You can specify to group information from the FIRST row with data.
    That should help you to create one row for each company.

    To concatenate your fields you can use a Modified Java Script step or a Calculator.
    I would use JS.
    var newlongfield = address + ";" + postcode + ";" + telephone;

    Mick

    Thanks Mick - I will give it a go.


Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  
Privacy Policy | Legal Notices | Safe Harbor Privacy Policy

Copyright © 2005 - 2019 Hitachi Vantara Corporation. All Rights Reserved.