Hitachi Vantara Pentaho Community Forums
Results 1 to 4 of 4

Thread: Pentaho De-Duplicate Step

  1. #1
    Join Date
    Oct 2014
    Posts
    29

    Default Pentaho De-Duplicate Step

    Hello, I am working on a project where we are getting customer data from two different databases. In most of the cases the same customer exists in both the databases. My goal is to de-duplicate these customers and create a master customer and associate all the related aka slaves customers to the master. A customer is identified by name, addressline1, city, state, zip, phone attributes. So I need to match the records using all these attributes. Pentaho website mentions about Data Quality and Profiling capabilities using which we could Standardize, validate, de-duplicate and cleanse inconsistent or redundant data. We are currently cleansing the addresses using a API which is being invoked from PDI. What are the steps or patterns I could use for deduplicating customers? Please share your ideas on how you would approach this use case. Thanks.
    Pentaho DI 5.4 / Win / RHEL

  2. #2
    Join Date
    Dec 2014
    Posts
    10

    Default

    AK801,

    Well, there is a "Merge Join" step that has the functionality that you're looking for in terms of redundancy. I'm not exactly sure what you mean by "standardize". It basically works like any join would. You give it the keys to join on and it outputs the result for processing, which you then can output to a table. As long as you have the same attributes that make a unique customer in both databases, it will work for cleansing redundant data.

    In terms of inconsistent data, if there's inconsistency in the key attributes, there's not much you can do about it (obviously). Any secondary attributes can be checked inside of the "Merge Rows (diff)" step. You can compare the values of any non primary key fields that you pass through it and handle the data that has the inconsistency.

    That's how I would handle it, though I'm admittedly less experienced than most people here.

  3. #3
    Join Date
    Oct 2014
    Posts
    29

    Default

    I would like to use the fuzzy match algorithms to de-dup the customer records using name, addressline1, city, state, zip and phone number attributes. As part of the de-dup I would like to group all similar records (masters and slaves) together. The Merge Join step works for exact match condition in the join so this will not give the similar records. The Fuzzy Match step outputs only the first match and doesn't output all the matches so this also will not give the similar records. Is there any other approach I can try to accomplish this use case? Thanks.
    Pentaho DI 5.4 / Win / RHEL

  4. #4
    Join Date
    Apr 2008
    Posts
    146

    Default

    That step works like the stream lookup step and does not like nulls in any of the attributes you are comparing. You'll get a null pointer exception if you let one slip through.
    This is art and science. One technique we have used for names, is the concatenate the three common attributes together FirstMiddleLast on both streams before you hit the fuzzy step. Then you can use a difference algo on it to tell you how many edits are required to make the strings match. The first two algos in the dropdown list do that. The art part is picking the 'maximum' number acceptable. A number two low... nulls (no join) or match. Number too high, after a certain number of edits anything can be made to match.

    You'll have to stair step your datasources and try to prioritize them by reliability. Remember you can test several streams at once and over and over again.

    You'll probably introduce value mapper and regex steps here and there as well as "If when null" steps to set default values when something is blank. We used a single character for example, so that the odds play in our favor when a value is populated that the distance to join is very high and won't give us false positives.

    Those are my two cents for zero cents. Would be interesting to know how it plays out for you.

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  
Privacy Policy | Legal Notices | Safe Harbor Privacy Policy

Copyright © 2005 - 2019 Hitachi Vantara Corporation. All Rights Reserved.