Hitachi Vantara Pentaho Community Forums

Thread: Regex filtering

  1. #1

    Regex filtering

    Hi guys,
    This is a question about using regex to filter data, and also a bit about efficiency.
    Which approach makes more logical sense, or is faster?

    Say you've got 10 million records (bank card numbers stored as strings) coming through. Sometimes those numbers contain a letter or two that you don't want.
    So, which would be better?

    Option 1:
    Filter step. Filter using a regex that matches valid numbers (99.9999% of the records are valid), or filter for invalid numbers?

    Option 2:
    Data validation step. Use a regex for valid numbers, or vice versa?

    And also, between the two steps -- filter and data validation step -- which is more efficient?
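
To make the two regex choices concrete, here is a minimal Java sketch. It is not Pentaho code: the class name, the method names, and the 13 to 19 digit length rule are assumptions made up for illustration, so adjust the patterns to whatever "valid" means for your data.

```java
import java.util.regex.Pattern;

final class CardNumberCheck {
    // Assumption: a valid card number is 13 to 19 ASCII digits and nothing else.
    private static final Pattern VALID = Pattern.compile("[0-9]{13,19}");

    // Any single character that is not an ASCII digit.
    private static final Pattern INVALID_CHAR = Pattern.compile("[^0-9]");

    // "Match valid" approach: the whole string has to match the pattern.
    static boolean isValid(String cardNumber) {
        return VALID.matcher(cardNumber).matches();
    }

    // "Match invalid" approach: search for one offending character.
    static boolean hasInvalidChar(String cardNumber) {
        return INVALID_CHAR.matcher(cardNumber).find();
    }

    public static void main(String[] args) {
        String[] samples = {"4111111111111111", "41111111A1111111", "4111"};
        for (String s : samples) {
            System.out.println(s + " valid=" + isValid(s)
                    + " invalidChar=" + hasInvalidChar(s));
        }
    }
}
```

Note that the two checks are not exact opposites: a value such as "4111" contains no invalid character but still fails the "valid" pattern because it is too short. Which one you want depends on whether you only care about stray letters or about the full format.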

    Please comment. Your views are important. Thanks!

    Regards.

  2. #2
    Join Date
    Feb 2008
    Posts
    107

    Why don't you test both and see which is faster?

  3. #3
    Join Date
    Nov 1999
    Posts
    9,729

    For a mere 10m rows I'm sure it doesn't matter that much.

  4. #4
    Join Date
    Apr 2007
    Posts
    2,010

    I can't prove it, but I've had a couple of scenarios in the last few days which have led me to believe the data validator step can be incredibly slow. I'm still trying to pin it down... Has no one else seen that?

    Update: I've just done some testing. If you have error handling, and every record fails, it is 10x slower than if you have everything pass. But in a simple case it is certainly not slow:

    1M rows, everything validates: it processes 600k rows/s
    1M rows, everything fails: it processes 60k rows/s

    That's still a lot higher than what I've been seeing in my slow transforms though, so I'll investigate further. But Matt, is the 10x drop in performance just from going down the error stream expected?
    Last edited by codek; 12-01-2011 at 12:42 PM.

  5. #5

    "That's still a lot higher than what I've been seeing in my slow transforms though, so I'll investigate further. But Matt, is the 10x drop in performance just from going down the error stream expected?"
    What's your finding on this, Codek?

  6. #6
    Join Date
    Nov 1999
    Posts
    9,729

    The performance drop is probably caused by the fact that it needs to output a bunch of extra fields, construct new rows and so on.
    If you find a case where things get a lot slower, let me know and I'll investigate further.

  7. #7
    Join Date
    Sep 2009
    Posts
    810

    Exception processing is very slow in Java (internal stuff like creating stack traces, unwinding the stack etc. takes time). Most steps implement error handling by catching exceptions as unforeseen things happen. Should that be the case for the validator (using exception handling for managing control flow) it would explain the performance drop.

    ... just checking the source ... yep, that seems to be the case ...

    Expected and predictable failures (a regex not matching, for example) should not be handled with exceptions, as that results in inferior performance. Worth logging a Jira case, I think.
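
The difference between the two styles can be sketched roughly like this. This is not the validator step's actual source: `ValidationStyles`, `ValidationException`, and all method names are hypothetical, and the digits-only rule is just a stand-in for a real check.

```java
final class ValidationStyles {
    // Hypothetical exception type, used only for this illustration.
    static final class ValidationException extends Exception {
        ValidationException(String message) {
            super(message);
        }
    }

    // Stand-in validation rule: non-empty and ASCII digits only.
    static boolean isAllDigits(String value) {
        if (value.isEmpty()) {
            return false;
        }
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (c < '0' || c > '9') {
                return false;
            }
        }
        return true;
    }

    // Style A: report an expected failure by throwing.
    static void validateOrThrow(String value) throws ValidationException {
        if (!isAllDigits(value)) {
            throw new ValidationException("not numeric: " + value);
        }
    }

    // Style A caller: each bad row constructs, throws, and catches an exception.
    static int countFailuresWithExceptions(String[] values) {
        int failures = 0;
        for (String value : values) {
            try {
                validateOrThrow(value);
            } catch (ValidationException e) {
                failures++; // the row would be sent to the error stream here
            }
        }
        return failures;
    }

    // Style B caller: a bad row is just a false return value, no exception object.
    static int countFailuresWithReturnValue(String[] values) {
        int failures = 0;
        for (String value : values) {
            if (!isAllDigits(value)) {
                failures++; // the row would be sent to the error stream here
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        String[] rows = {"12345", "12a45", "", "999"};
        System.out.println(countFailuresWithExceptions(rows));
        System.out.println(countFailuresWithReturnValue(rows));
    }
}
```

Both callers classify rows identically; the second one simply avoids creating an exception (and its stack trace) for every failing row. How much that matters in practice is something to measure on your own data.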

    PS: http://www.mortench.net/blog/2006/08...tion-handling/

    Cheers
    Slawo

  8. #8
    Join Date
    Nov 1999
    Posts
    9,729

    Yeah, you're right, Slawo. We can simply get rid of the throw/catch of the exception itself, since pretty much everyone uses error handling on this step.
    So file away guys, I'll improve performance.

  9. #9
    Join Date
    Apr 2007
    Posts
    2,010


Copyright © 2005 - 2019 Hitachi Vantara Corporation. All Rights Reserved.