Hitachi Vantara Pentaho Community Forums

Thread: Fuzzy Match design pattern?

  1. #1
    Join Date
    Jun 2014
    Posts
    27

    Fuzzy Match design pattern?

    All,

    I have a strange case. For each fleet (parent record) I have several contacts (child records). I have the same fleets, with some of the same contacts, in an existing data store. I want to combine the two streams and use a Fuzzy match step to see whether the names are similar enough to update the existing record or so different that they need to be entered as new contact records.

    The issue I have is that, for a particular fleet, I want to search all the existing contacts, but the Fuzzy match step does not allow me to specify a match on fleet_id before trying to match on the names. Can anyone think of a design pattern that will match the fleets, but allow me to Fuzzy match names only within the same fleet?

    For instance, I have Fleet 1 and Fleet 2. Fleet 1 has an existing contact, Bob Smith. Fleet 2 has a DIFFERENT contact who is also named Bob Smith. My new data set contains the same fleets, 1 and 2. The new Fleet 1 has a contact named Robert Smith, while the new Fleet 2 has a contact named Bob Smith. Currently, the Fuzzy match step takes only a single field from each stream as a comparator, so the Bob Smith from the new Fleet 2 would match the Bob Smith records in both Fleet 1 and Fleet 2! I want the new Fleet 1 contact to match ONLY the existing Fleet 1 contact and update that record. I want to leave the existing Fleet 2 contact as Bob Smith.

    Thanks in advance for any advice.

  2. #2
    Join Date
    Jun 2012
    Posts
    5,534

    You'll have to process your data fleet by fleet.
    Use a job and two transformations, the first one to find all fleets, the second one to process a subset for each fleet.
    Don't forget to enable the advanced transformation option "Execute for every input row" for the second transformation.
    If there's just a very small number of fleets, you might even be able to partition and process your data in a single transformation using "Switch/Case".
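
    The idea of processing the data fleet by fleet can be sketched outside of Kettle in plain JavaScript. This is only an illustration of the partitioning, not a job or transformation definition: the row shape (`fleet_id`, `name`) and the helper names `groupByFleet` and `candidatesPerFleet` are assumptions made for the example.

```javascript
// Hypothetical row shape: { fleet_id, name }. Field names are assumptions.

// Partition rows into buckets keyed by fleet_id.
function groupByFleet(rows) {
  var groups = {};
  for (var i = 0; i < rows.length; i++) {
    var key = String(rows[i].fleet_id);
    if (!Object.prototype.hasOwnProperty.call(groups, key)) {
      groups[key] = [];
    }
    groups[key].push(rows[i]);
  }
  return groups;
}

// For every incoming contact, expose only the existing contacts of the
// same fleet as match candidates. Any name comparison would then run on
// `candidates` alone, never across fleets.
function candidatesPerFleet(existingRows, incomingRows) {
  var existingByFleet = groupByFleet(existingRows);
  var result = [];
  for (var i = 0; i < incomingRows.length; i++) {
    var incoming = incomingRows[i];
    var key = String(incoming.fleet_id);
    var sameFleet = Object.prototype.hasOwnProperty.call(existingByFleet, key)
      ? existingByFleet[key]
      : [];
    result.push({ incoming: incoming, candidates: sameFleet });
  }
  return result;
}

// Sample data taken from the example in the first post.
var existing = [
  { fleet_id: 1, name: "Bob Smith" },
  { fleet_id: 2, name: "Bob Smith" }
];
var incoming = [
  { fleet_id: 1, name: "Robert Smith" },
  { fleet_id: 2, name: "Bob Smith" }
];
var scoped = candidatesPerFleet(existing, incoming);
```

    With the sample data, each incoming contact gets exactly one candidate, the existing contact of its own fleet, so the Bob Smith of Fleet 2 is never compared with the Bob Smith of Fleet 1.
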
    Last edited by marabu; 11-21-2014 at 01:57 PM. Reason: typo
    So long, and thanks for all the fish.

  3. #3
    Join Date
    Aug 2013
    Posts
    25

    Hmmm, same use case as mine (different thread, sorry)... Maybe the Pentaho Business Rule Engine ETL step from Uwe Geercken is useful. I've asked about it in a comment on his YouTube video: https://www.youtube.com/watch?v=EzTNo_V1QJI (and via Twitter).
    Last edited by jaapandre; 11-22-2014 at 12:31 PM.

  4. #4
    Join Date
    Aug 2013
    Posts
    25

    Tweets from Uwe:

    @jaapandre right now it can use regular expressions and soundex. but it's open to extend it.
    @jaapandre I had a quick look. it's not so complicated . so I could implement Levenshtein algorithm.
    @jaapandre it's interesting - something new to learn. and I found a library https://github.com/rrice/java-string-similarity I will do some tests
    Last edited by jaapandre; 11-22-2014 at 03:18 PM.

  5. #5
    Join Date
    Aug 2013
    Posts
    25

    Possible solution: JavaScript value step

    The rule engine step does not seem to be the solution (it is a filter step, not a join step).

    I've used another solution:
    * Merge join the two data streams.
    * Use a JavaScript value step to calculate the distance.
    * Process the rows further (this needs some investigation; perhaps aggregate by lowest distance?).

    The JavaScript function to calculate the distance is from:
    http://en.wikibooks.org/wiki/Algorit...nce#JavaScript
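
    The three steps above can be sketched in plain JavaScript as follows. This is a minimal sketch, not the attached transformation and not the Wikibooks code verbatim: the Levenshtein function is written from the standard dynamic-programming definition, a nested loop stands in for the merge join, and the row shape (`fleet_id`, `name`) and the helper name `bestMatches` are assumptions made for the example.

```javascript
// Levenshtein edit distance: the minimum number of single-character
// insertions, deletions and substitutions needed to turn a into b.
function levenshtein(a, b) {
  var prev = [];
  var j;
  for (j = 0; j <= b.length; j++) {
    prev[j] = j; // distance from the empty prefix of a
  }
  for (var i = 1; i <= a.length; i++) {
    var curr = [i];
    for (j = 1; j <= b.length; j++) {
      var cost = a.charAt(i - 1) === b.charAt(j - 1) ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,       // deletion
        curr[j - 1] + 1,   // insertion
        prev[j - 1] + cost // substitution or match
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// Hypothetical row shape: { fleet_id, name }.
// 1. join existing and incoming rows on fleet_id (nested loop here,
//    standing in for the merge join),
// 2. compute the distance for each joined pair,
// 3. keep the pair with the lowest distance per incoming row.
function bestMatches(existingRows, incomingRows) {
  var results = [];
  for (var i = 0; i < incomingRows.length; i++) {
    var inc = incomingRows[i];
    var best = null;
    for (var k = 0; k < existingRows.length; k++) {
      var ex = existingRows[k];
      if (ex.fleet_id !== inc.fleet_id) {
        continue; // join condition: same fleet only
      }
      var d = levenshtein(inc.name.toLowerCase(), ex.name.toLowerCase());
      if (best === null || d < best.distance) {
        best = { existing: ex, distance: d };
      }
    }
    results.push({ incoming: inc, match: best });
  }
  return results;
}

// Sample data taken from the example in the first post.
var matches = bestMatches(
  [{ fleet_id: 1, name: "Bob Smith" }, { fleet_id: 2, name: "Bob Smith" }],
  [{ fleet_id: 1, name: "Robert Smith" }, { fleet_id: 2, name: "Bob Smith" }]
);
```

    A cutoff on `distance` (a value you would have to choose for your own data) could then decide whether a row updates the matched contact or is inserted as a new one; a `match` of `null` means the fleet has no existing contacts at all.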

    See the attached Kettle transformation: fuzzyMatchJavaScript.zip

  6. #6
    Join Date
    Aug 2013
    Posts
    25

    Instead of the JavaScript value step, you can also use the Calculator step. Various distance measures are available.

  7. #7
    Join Date
    Aug 2013
    Posts
    25

    If you want the fuzzy match step extended, please vote here (http://jira.pentaho.com/browse/PDI-13265)

  8. #8
    Join Date
    Jan 2015
    Posts
    3

    The Fuzzy match step needs to be extended; otherwise we are not able to build a process like the Match Designer that is available in IBM DataStage.
