Hitachi Vantara Pentaho Community Forums
Results 1 to 4 of 4

Thread: Faster Stream Lookup

  1. #1
    Join Date
    Jun 2007

    Question Faster Stream Lookup

    Hi everyone,

    I am working on integrating some data from one of our customers. They provide the data to us weekly in the form of a .txt file. In this txt file is a column with a date field. I compare this date field with a table in a database to retrieve a matching dateid (whihc I use again later for other steps).

    The date field in the txt file is dd/MM/yyyy, and the date field in the database (PostGres 8.3) is a timestamp with the date value.

    To compare the two successfully I am using a javascript step with the date2str() function on each input, and matching the string values in a stream lookup step. This is working successfully for me right now, but I am wondering if there is a better way to do this? It does seem to slow the whole process down somewhat.

    Any thoughts would be appreciated.


    The Frog
    Everything should be made as simple as possible, but not simpler - Albert Einstein

  2. #2
    DEinspanjer Guest


    date2str uses Java's SimpleDateFormat class which is rather overweight (due to locale sensitivity and such) for just being able to compare a dd/MM/yyyy string to a date field. The Javascript step itself will slow things down as well.

    You could either try using the pure javascript date parsing class, or you could try comparing the individual date components to date components extracted from the table via the SQL date part functions. I think the latter approach would be the fastest.

  3. #3
    DEinspanjer Guest


    I created a quick test case just to look at the performance of the different types of lookup.

    One important thing, I was very wrong about the performance of the pure yyyy-MM-dd lookup. At least with 3.1RC1, it is very fast. I need to go back to my old tests that use timezone offsets and see if something has changed there.

    The results of the tests in the attached zip for me:
    DB lookup of ymd ~= 40k/rps
    Stream lookup of ymd ~= 44k/rps
    Stream lookup of date ~= 112k/rps
    Attached Files Attached Files

  4. #4
    Join Date
    Jun 2007

    Talking The need for speed...........

    Hi DEinspanjer,

    Thanks for the test cases. I have managed to replicate the performance improvement on my end too. It runs nearly three times faster than with the JavaScript step. I did have some issue with the data types though, and have ended up with what seems to be a rather odd set of data types to get the job done.

    The txt (csv) file I am having the date field (yyyymmdd) returned as an integer, and the table_input from Postgres I am collecting like this:

    dateid, to_char(modayweekstart, 'yyyymmdd') as Monday
    FROM "time"

    This returns me a text string (with 8 characters) but the metadata shows it as a string of length 2147483647.

    When trying to compare the table_input string with a string from the txt file I get one of these java data type errors saying the object isnt the right type or some such rubbish. When the txt file is 'read' as an integer instead, the problem goes away and the whole thing runs smoothly. Quick, clean, fast, and accurate.

    Same issue comes when trying to do a direct compare on the date fields from both as dates. It just doesnt work, hence the to_char in Postgres and the integer as a data type for the txt file.

    On my laptop (Pentium M 1.6GHz) I am getting around 23600+ rows per second through the lookup step instead of around 8k with the javascript. It does seem nice and fast indeed.

    Just thought I'd share what I have learned. Thanks for the help. I appreciate it greatly.


    The Frog
    Everything should be made as simple as possible, but not simpler - Albert Einstein

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
Privacy Policy | Legal Notices | Safe Harbor Privacy Policy

Copyright © 2005 - 2019 Hitachi Vantara Corporation. All Rights Reserved.