Hitachi Vantara Pentaho Community Forums
Results 1 to 5 of 5

Thread: Longest match

  1. #1
    Join Date
    Aug 2016
    Posts
    267

    Default Longest match

    Trying to process a stream of data by comparing a field to a info-list, then returning an ID from this info-list (if a match is found). So far so good, a fairly simply task. However, this lookup must be done using longest match! This is similar to longest prefix match used with IP-addresses.

    Example:

    Incoming row with value "10000123"

    Info-list:
    1000
    100001
    1000012
    100002
    1200

    The longest match here would be 1000012. The result should be fetching a corresponding ID in this info list.

    I tried using recursive sub-transformation, iterating through the various lengths, to try and find the longest match first. Kettle/spoon didn't like that very much, and it stopped/froze. I think the sub-transformation executes even if it doesn't receive any input rows, if that's the case it's not able to break the recursion and will crash into infinity. I don't want to hardcode each iteration (could be very many different lengths in the info-list). So I wonder if there is any other solution than doing it manually in java?
    Last edited by Sparkles; 02-11-2019 at 10:28 AM.

  2. #2
    Join Date
    Apr 2008
    Posts
    4,663

    Default

    Perhaps:
    Break Strings to Character Lines, and do a comparison between them.
    Then do a running total of matching and total characters
    Filter for max value where Ematch = Etotal

    But that would be SLOW.

    Value| Match | Line | Char | Matches | Total
    10000123 | 1000 | 1 | 1 | 1 | 1
    10000123 | 1000 | 2 | 0 | 1 | 2
    10000123 | 1000 | 3 | 0 | 1 | 3
    10000123 | 1000 | 4 | 0 | 1 | 4
    10000123 | 1200 | 1 | 1 | 1 | 1
    10000123 | 1200 | 2 | 2 | 0 | 1
    10000123 | 1200 | 3 | 0 | 1 | 2
    10000123 | 1200 | 4 | 0 | 1 | 3
    10000123 | 1000012 | 1 | 1 | 1 | 1
    10000123 | 1000012 | 2 | 0 | 1 | 2
    10000123 | 1000012 | 3 | 0 | 1 | 3
    10000123 | 1000012 | 4 | 0 | 1 | 4
    10000123 | 1000012 | 5 | 0 | 1 | 5
    10000123 | 1000012 | 6 | 1 | 1 | 6
    10000123 | 1000012 | 7 | 2 | 1 | 7

    Filtering where Line=Total:
    Value| Match | Line | Char | Matches | Total
    10000123
    | 1000 | 1 | 1 | 1 | 1
    10000123 | 1000 | 2 | 0 | 1 | 2
    10000123 | 1000 | 3 | 0 | 1 | 3
    10000123 | 1000 | 4 | 0 | 1 | 4
    10000123 | 1200 | 1 | 1 | 1 | 1
    10000123 | 1000012 | 1 | 1 | 1 | 1
    10000123 | 1000012 | 2 | 0 | 1 | 2
    10000123 | 1000012 | 3 | 0 | 1 | 3
    10000123 | 1000012 | 4 | 0 | 1 | 4
    10000123 | 1000012 | 5 | 0 | 1 | 5
    10000123 | 1000012 | 6 | 1 | 1 | 6
    10000123 | 1000012 | 7 | 2 | 1 | 7

    Select Max Total group on Value:
    Value| Match | Line | Char | Matches | Total
    10000123 | 1000012 | 7 | 2 | 1 | 7

    It's another solution, and would be easy to read to understand for ongoing maintenance, but on your dataset sizes, might be a problem.

  3. #3
    Join Date
    Apr 2008
    Posts
    4,663

    Default

    So... I'm a dork.

    Two streams:
    Incoming Row
    LookupStream

    Add a field (Key Length) to the lookup row.

    Do a cartesian Join on Incoming Row.KeyField Starts With LookupStream.KeyValue
    Sort on KeyField asc, KeyLength desc
    Add RowChanging Sequence on KeyField
    Filter Rows for Sequence = 1

    That Sort will slow things down, but I'm not sure how to solve that problem... Maybe use Group By with "Return All Rows" instead of the sort and RowChanging?

  4. #4
    Join Date
    Aug 2016
    Posts
    267

    Default

    I like how you think! I had one attempt where I compared all possibilities (in this case, using a max string length of 15 characters). So first compare any incoming data with any info-list of length 15. Filter "true" matches to success, then continue down to 14 and so forth. A total of 60 steps hehe. It worked, but still not the fastest. By sorting info-list like strings (not integer!), it is possible to go the other way, start with the smallest and go up. Once you hit a match, continue to try until you got a non-match, then keep the last match. I ended up doing this with about 100 lines of java code. There is a data-structure called Trie which could be useful in this case also which could be the most effective solution.

    You got some interesting alternatives here, which shows that it could be possible with steps, but perhaps it would be a problem on large amount of rows.
    Last edited by Sparkles; 02-12-2019 at 06:56 AM.

  5. #5
    Join Date
    Apr 2008
    Posts
    4,663

    Default

    I feel like tree string splitting this with string structures would be more CPU efficient (and also time efficient), and potentially a way to hard-code it, but I really hate hard-coding.

    EG. you could bin on first character (0-9), then on second character (0-9), etc, and if a matching doesn't exist, it drops out to the "end step" ... Hard Coded. Which makes it harder to maintain.

    UDJC isn't really a problem (it's actually very performant), as long as you plan on having someone who speaks Java around to maintain it. MJSV on the other hand, I would recommend against.

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  
Privacy Policy | Legal Notices | Safe Harbor Privacy Policy

Copyright © 2005 - 2017 Pentaho Corporation. All Rights Reserved.