Hitachi Vantara Pentaho Community Forums

Thread: Fuzzy Match and special characters

  1. #1
    Join Date
    Jun 2013
    Posts
    6

    Default Fuzzy Match and special characters

    Hello all,

    I'm trying to use the Fuzzy Match step with French words (more specifically, French first names).
    Is there a way to use this step with special characters? My tests show that Therese (my test input) is closer to Terese than to Thérèse (the proper spelling). An automatic replacement would therefore write this first name improperly instead of merely leaving out the accents. I would not be cleaning my data, just making it worse.

    I have the same problem with other accents, cedillas, diaereses, or Œ (among others).

    Thanks in advance for your answers

    Mathieu

  2. #2
    Join Date
    Jun 2012
    Posts
    5,534

    I'm not surprised by your findings if matching was done via smallest edit distance (Levenshtein).
    Better results in your case should be obtained with a phonetic algorithm like Double Metaphone.
    So long, and thanks for all the fish.
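
The finding from post #1 can be reproduced with a plain edit-distance calculation. The following is a minimal Python sketch of the textbook Levenshtein distance, not the code PDI runs internally; it assumes accented letters are stored as single precomposed characters, so each accent counts as one substitution.

```python
def levenshtein(a, b):
    """Textbook dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb (free if equal)
            ))
        previous = current
    return previous[-1]


for candidate in ("Terese", "Thérèse"):
    print(candidate, levenshtein("Therese", candidate))
# Terese 1   (one deleted "h")
# Thérèse 2  (two substituted letters)
```

Because every accented letter costs a full edit, the misspelling with one missing letter scores better than the correct spelling with two accents.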

  3. #3
    Join Date
    Jun 2013
    Posts
    6

    Hello Marabu, thanks for your help. I just tried Double Metaphone, which gives odd results.
    For example, Julien will match Ghislaine (in the comparison stream), but Ghislaine will match Céciliane.

  4. #4
    Join Date
    Apr 2008
    Posts
    1,771

    Hi.
    My 2 cents.
    I've done various jobs involving matching or deduplicating lists of businesses and/or people.
    When the names are not English or American, it's always very tricky and a lot of manual work is involved, even with specialised software.

    One way of dealing with those names was to strip the accents, do the matching, and then put the accents back.

    Tedious job!

    Mick
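
The strip, match, restore round trip described above can be sketched in Python as follows. The `REFERENCE` list and the helper names are hypothetical, and the matching here is a simple exact lookup on the accent-free key; a fuzzy comparison could be substituted for the lookup.

```python
import unicodedata
from typing import Optional


def strip_accents(text):
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Hypothetical reference list of correctly spelled first names.
REFERENCE = ["Thérèse", "Sébastien", "François", "Joël"]

# Accent-free, lower-cased key -> proper spelling, so accents can be restored.
RESTORE = {strip_accents(name).lower(): name for name in REFERENCE}


def restore(name) -> Optional[str]:
    """Return the proper spelling for a name, or None if the key is unknown."""
    return RESTORE.get(strip_accents(name).lower())


print(restore("Therese"))   # Thérèse
print(restore("francois"))  # François
print(restore("Zoe"))       # None
```

One caveat with this design: if two reference names collapse to the same accent-free key, the dictionary keeps only the last one, so such collisions would need manual handling.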

  5. #5
    Join Date
    Jun 2013
    Posts
    6

    Hi,

    thanks a lot for your answer. Indeed, it seems to be a very tricky process. I'll try to use a different approach.

    Maxx

  6. #6
    Join Date
    Jun 2013
    Posts
    6

    News flash: Needleman-Wunsch gives pretty good results, and the measure field gives me a way of knowing whether the output value differs from the input one (which in turn gives me a good starting point for manual treatment).

    Maxx
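
For readers unfamiliar with the algorithm, here is a minimal Python sketch of a Needleman-Wunsch global alignment score and of using that score to flag rows for manual treatment. The thread does not say how PDI scores or normalizes its measure, so the `match`, `mismatch`, and `gap` values and the `differs` helper are assumptions for illustration only.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score; the scoring values are illustrative assumptions."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap  # a[:i] aligned against gaps only
    for j in range(1, cols):
        score[0][j] = j * gap  # b[:j] aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
    return score[-1][-1]


def differs(input_name, output_name):
    """With match=1, only identical strings reach the maximum possible score."""
    best_possible = max(len(input_name), len(output_name))
    return needleman_wunsch(input_name, output_name) < best_possible


print(differs("Julien", "Julien"))  # False: nothing to review
print(differs("Julien", "Julie"))   # True: flag for manual treatment
```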

  7. #7
    Join Date
    Jun 2012
    Posts
    5,534

    Metaphone does not seem to be very francophile.
    On the other hand, diacritical characters have a disastrous effect on the edit distance.
    I agree with Mick that it's best to keep accents out of the way, but it's not necessarily tedious.
    Unicode to the rescue: now even Levenshtein can help again.
    Attached Files
    So long, and thanks for all the fish.
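
The attachment is not reproduced here, so the following Python sketch only illustrates what a Unicode-based approach could look like, not what the attached file contained. It relies on canonical decomposition (NFD), which splits é into "e" plus a combining accent, and covers cedillas and diaereses the same way. Ligatures such as Œ and Æ have no Unicode decomposition, so the `LIGATURES` table (a hypothetical name) maps them explicitly.

```python
import unicodedata

# Ligatures are not decomposed by normalization, so map them by hand.
LIGATURES = str.maketrans({"Œ": "OE", "œ": "oe", "Æ": "AE", "æ": "ae"})


def fold(text):
    """Return text without diacritics and with ligatures spelled out."""
    decomposed = unicodedata.normalize("NFD", text.translate(LIGATURES))
    # Category "Mn" (nonspacing mark) covers the combining accents left by NFD.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


for name in ("Thérèse", "François", "Joël", "Lætitia"):
    print(name, "->", fold(name))
# Thérèse -> Therese
# François -> Francois
# Joël -> Joel
# Lætitia -> Laetitia
```

Once both sides are folded this way, Therese and Thérèse become identical strings, so an edit-distance comparison no longer penalizes the accents.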

  8. #8
    Join Date
    Jun 2013
    Posts
    6

    Hi, thanks a lot to both of you for your help!

    I think I need to clarify my needs. I want to be able to:
    1/ find out whether a first name is spelled correctly
    2/ correct first names that are close to a proper spelling (for example, turn Sébastin into Sébastien)
    3/ be informed about names that don't match any proper spelling (for example, I'm not expecting PDI to turn J Paul into Jean-Paul automatically; I think I can do that manually)

    The problem is that, if I remove accents, cedillas and so on, steps 1 and 3 will work like a charm but step 2 might not. In my example, the output would be Sebastien (not that far from my expected result, but not perfect either).

  9. #9
    Join Date
    Jun 2012
    Posts
    5,534

    Step 2 should work all right.
    Don't forget to allow for a reasonable edit distance: Sébastien would be matched with a measure of 2.
    So long, and thanks for all the fish.
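
The measure of 2 is consistent with comparing an accent-free input against the accented reference spelling, which is an assumption about the setup rather than something the thread states. A small Python sketch of that arithmetic, using a textbook Levenshtein distance and a hypothetical `strip_accents` helper:

```python
import unicodedata


def strip_accents(text):
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def levenshtein(a, b):
    """Textbook dynamic-programming edit distance."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


typo, reference = "Sébastin", "Sébastien"

# Accent-free input against the accented reference: one accent + one missing "e".
print(levenshtein(strip_accents(typo), reference))                 # 2
# Both sides accent-free: only the missing "e" remains.
print(levenshtein(strip_accents(typo), strip_accents(reference)))  # 1
```

Either way, a maximum distance of 2 is enough for the reference spelling Sébastien to be considered.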

  10. #10
    Join Date
    Jun 2013
    Posts
    6

    Oh, yeah, I just got it!
    I turn my input Sébastin into Sebastin. It is then compared to Sébastien and, with the accent problem out of the way, it should give me a good result.
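
Putting the thread together, the three needs from post #8 could be handled along these lines. This is a Python sketch of the logic only, not a PDI transformation; the `REFERENCE` list, the `MAX_DISTANCE` threshold, and the `check` function are hypothetical. The input is stripped of accents, compared with the properly spelled reference names, and the reference spelling (accents included) is what gets returned.

```python
import unicodedata


def strip_accents(text):
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def levenshtein(a, b):
    """Textbook dynamic-programming edit distance."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


REFERENCE = ["Sébastien", "Thérèse", "Jean-Paul", "Julien"]  # hypothetical list
MAX_DISTANCE = 2  # assumed threshold, following the "measure of 2" hint


def check(name):
    """Classify a name as correct, corrected, or needing manual review."""
    if name in REFERENCE:
        return ("correct", name)                  # need 1: spelling is fine
    key = strip_accents(name)
    best = min(REFERENCE, key=lambda ref: levenshtein(key, ref))
    if levenshtein(key, best) <= MAX_DISTANCE:
        return ("corrected", best)                # need 2: close enough to fix
    return ("review", name)                       # need 3: leave to a human


for name in ("Julien", "Sébastin", "J Paul"):
    print(name, "->", check(name))
# Julien -> ('correct', 'Julien')
# Sébastin -> ('corrected', 'Sébastien')
# J Paul -> ('review', 'J Paul')
```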


Copyright © 2005 - 2019 Hitachi Vantara Corporation. All Rights Reserved.