Hitachi Vantara Pentaho Community Forums

Thread: Status & Lazy conversions

  1. #1
    Matt Casters Guest

    Status & Lazy conversions

    Dear Kettle-dev,

    Things have been seriously heating up in the 3.0 development cycle.
    A lot of configuration methods for Steps were added, parts of the GUI moved
    to XUL, ... all to make it easier to customize and embed Kettle.

    I also managed to implement the new RowSet/BaseStep tandem, effectively
    removing the hated-by-all sleep() statements. The performance gains there
    are very "interesting".

    More cleanup was done and everything is coming along nicely.

    High time to rip it all apart by adding lazy conversions to the mix...

    Tonight I'm committing a new Step called "CSV Input" that is a very simple CSV
    file reader.

    ----> http://www.kettle.be/images/lazy-csv-input.png

    Although it is simple, it's also wickedly fast...
    This step reads the 300,001 rows in this file:
    http://s3.amazonaws.com/kettle/inputfile.txt.gz in around 0.6 - 1.4 seconds
    on my laptop (28 MB/s on average). Granted, the large amount of RAM in my
    machine is helping to cache this tiny little file. However, it's several
    times faster than the "old" Text File Input step (5.4 - 6.0 seconds).

    The secret sauce is the "Lazy conversion" algorithm that I've been working on
    lately. The idea behind it is to keep data in binary (byte[]) form as much
    as possible.
    I've created a new CSV reading algorithm that reads via Java NIO in large
    blocks as well. It doesn't compare Strings anywhere, only binary data.
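
    To make that concrete, here is a minimal sketch (my own illustration, not
    the actual CSV Input code) of reading a file through NIO in large blocks
    and finding field boundaries on the raw bytes; the 1 MB block size and the
    ';' delimiter are assumptions:

        import java.io.FileInputStream;
        import java.io.IOException;
        import java.nio.ByteBuffer;
        import java.nio.channels.FileChannel;

        public class BinaryCsvScan {
            public static void main(String[] args) throws IOException {
                FileChannel in = new FileInputStream(args[0]).getChannel();
                ByteBuffer block = ByteBuffer.allocate(1024 * 1024); // assumed block size
                long rows = 0;
                while (in.read(block) > 0) {
                    block.flip();
                    byte[] bytes = block.array();
                    int fieldStart = 0;
                    for (int i = 0; i < block.limit(); i++) {
                        if (bytes[i] == ';') {
                            // field boundary: the field is bytes[fieldStart..i-1],
                            // kept as byte[] -- no new String(...) happens here
                            fieldStart = i + 1;
                        } else if (bytes[i] == '\n') {
                            rows++;
                            fieldStart = i + 1;
                        }
                    }
                    // a real reader would carry an unfinished trailing field
                    // over into the next block; omitted for brevity
                    block.clear();
                }
                in.close();
                System.out.println("rows: " + rows);
            }
        }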

    The reason for that is that it takes as much time to read the data in from
    disk as it does to do the byte[] conversion to UTF-8, a.k.a. java.lang.String.
    (These conversions slash performance at least in half, and then we only have
    a String, not the target data type.)
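
    A crude micro-benchmark sketch shows the effect (illustrative only, not a
    rigorous measurement; the field value and iteration count are arbitrary):

        import java.io.UnsupportedEncodingException;

        public class ConversionCost {
            public static void main(String[] args) throws UnsupportedEncodingException {
                byte[] field = "1234567.89".getBytes("UTF-8");
                int n = 10000000; // ten million field values
                long checksum = 0;

                // pass 1: just touch the raw bytes, as a lazy step would
                long start = System.currentTimeMillis();
                for (int i = 0; i < n; i++) {
                    for (int j = 0; j < field.length; j++) checksum += field[j];
                }
                long rawMs = System.currentTimeMillis() - start;

                // pass 2: decode every field to a String first
                start = System.currentTimeMillis();
                for (int i = 0; i < n; i++) {
                    checksum += new String(field, "UTF-8").length();
                }
                long decodeMs = System.currentTimeMillis() - start;

                // printing the checksum keeps the JIT from discarding the loops
                System.out.println("raw: " + rawMs + " ms, with decode: "
                        + decodeMs + " ms (checksum " + checksum + ")");
            }
        }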

    With lazy conversions we keep the data in binary form as long as possible
    as it travels from step to step, and this leads to very nice results for
    the Text File Output step as well.
    If (and only if) the formatting of the numeric (Number, Integer & BigNumber)
    and Date output fields is the same as specified on input, there is no
    further action required and the data is simply dumped to disk again in its
    binary form. (I'm going to add the encoding to the list of requirements as
    well.)
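
    As an illustration of the principle (a sketch around a hypothetical
    LazyValue holder, not Kettle's actual value/metadata API): the raw bytes
    travel with the row, conversion happens only when a step asks for it, and
    a matching output format reduces writing to a plain byte copy.

        import java.io.IOException;
        import java.io.OutputStream;
        import java.io.UnsupportedEncodingException;

        // hypothetical holder, for illustration only
        public class LazyValue {
            private final byte[] raw; // the bytes exactly as read from the file
            private String decoded;   // filled in only on demand

            public LazyValue(byte[] raw) {
                this.raw = raw;
            }

            // a step that needs the String pays the conversion cost here, once
            public String getString() throws UnsupportedEncodingException {
                if (decoded == null) {
                    decoded = new String(raw, "UTF-8");
                }
                return decoded;
            }

            // when input and output format (and encoding) match,
            // output is a straight byte copy
            public void writeTo(OutputStream out) throws IOException {
                out.write(raw);
            }
        }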

    The end result is that the simple "Reading - Writing" transformation is
    munching said text file in 1.0 - 1.6 seconds on my laptop. It also pleases
    me to note that a certain other ETL tool <cough>Talend</cough> takes
    4.8 - 5.1 seconds to do the same.

    A lot of testing and tweaking still remains to be done, but I'm sure that
    we can make this work now, and again the results are encouraging.

    To keep things simple, I'm proposing to write a new "Fixed Length Input" step
    that reads fixed-length files along the same principles.
    (Or if anyone else is volunteering?)
    I have found that "tweaking" the existing Text File Input step is a hazardous
    undertaking because of the serious amount of code in there. It was after all
    designed to be versatile, not fast. It's better to keep that stuff backward
    compatible as much as possible.
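
    The same trick is even simpler for fixed-length files, because the field
    boundaries are known offsets; a sketch, with made-up field widths:

        // slicing fixed-width fields out of a record without any String
        // conversion; the widths are made-up example values
        public class FixedLengthSlicer {
            private static final int[] WIDTHS = { 10, 8, 25 };

            public static byte[][] slice(byte[] record) {
                byte[][] fields = new byte[WIDTHS.length][];
                int offset = 0;
                for (int i = 0; i < WIDTHS.length; i++) {
                    fields[i] = new byte[WIDTHS[i]];
                    System.arraycopy(record, offset, fields[i], 0, WIDTHS[i]);
                    offset += WIDTHS[i];
                }
                return fields;
            }
        }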

    Until next time,

    Matt
    ____________________________________________
    Matt Casters, Chief Data Integration
    Pentaho, Open Source Business Intelligence
    http://www.pentaho.org -- mcasters (AT) pentaho (DOT) org
    Tel. +32 (0) 486 97 29 37


  2. #2
    Jay Goldman Guest

    RE: Status & Lazy conversions

    I'm not surprised to find that byte(UTF-8) => String conversion
    significantly degrades performance. At a previous company, working on
    apps for J2ME devices, I had to write our own UTF-8 to String conversion
    to get data throughput into the device (from the internet) up to a
    reasonable level. The implementation of this processing in the JVM was
    very inefficient in terms of buffer/object management.

    Btw, Java Strings might be better described as UTF-16, i.e., each
    character of the string is represented as one or more 16-bit values.
    This handles most uses of Unicode character coding easily. UTF-8 is a
    scheme for encoding Unicode character codes into a byte stream. It
    can require up to 4 bytes for a single character but is optimized
    for ASCII (ASCII character codes and their associated Unicode codes are
    identical and fit into 7 bits, thus requiring only a single UTF-8 byte).
    It's in the handling of character codes >127 that the fun begins and
    performance drops, as varying numbers of input bytes are mapped to (Java)
    characters.
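
    A quick way to see both encodings from Java (a small sketch; the sample
    characters are arbitrary):

        import java.io.UnsupportedEncodingException;

        public class Utf8Widths {
            public static void main(String[] args) throws UnsupportedEncodingException {
                String[] samples = { "A", "\u00e9", "\u20ac" }; // 'A', e-acute, euro sign
                for (String s : samples) {
                    System.out.println("'" + s + "': " + s.length()
                            + " UTF-16 char(s), " + s.getBytes("UTF-8").length
                            + " UTF-8 byte(s)");
                }
            }
        }

    This prints 1 UTF-16 char for each sample, but 1, 2 and 3 UTF-8 bytes
    respectively.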

    cheers,
    jay

    -----Original Message-----
    From: kettle-developers (AT) googlegroups (DOT) com
    [mailto:kettle-developers (AT) googlegroups (DOT) com] On Behalf Of Matt Casters
    Sent: Thursday, July 05, 2007 7:06 PM
    To: kettle-developers
    Subject: Status & Lazy conversions
