Hitachi Vantara Pentaho Community Forums

Thread: 3.0 : Cool first results

  1. #1
    Matt Casters Guest

    Default 3.0 : Cool first results

    Dear friends,

    The subversion junkies among you probably saw some new code appear yesterday
    in the experimental/ source tree.
    Here are the first results of the code-rewrite.

    The first thing I'm attacking at the moment is the separation of Metadata
    and Data in rows.
    I'm also driving the data handling away from a value-based to a row-based
    system.
    The reason is that those little Value objects put a strain on both the
    JVM's object allocation and the garbage collector: we've seen up to 20% of
    a CPU pegged on garbage collection, so clearly someone had to look at this
    sooner or later.

    So I've been thinking long and hard about this, and the simplest RowData
    class I could think of is not a class at all but simply an object array:
    Object[].
    The types allowed in this Object array are the same as in the "old-style"
    values:

    String (String)
    Double (Number)
    Long (Integer)
    Date (Date)
    BigDecimal (BigNumber)
    Boolean (Boolean)
    byte[] (Binary)

    Our aim for the 2.5-style engine was for empty Strings and values to be
    equal to null. This is now enforced simply by making such elements in the
    Object array null.
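
    To make that concrete, here is a minimal sketch of a row under the new
    model (the class and field names are illustrative stand-ins, not the
    actual code in experimental/):

        import java.math.BigDecimal;
        import java.util.Date;

        // Sketch of the data/metadata split: the data is a bare Object[],
        // while field names and types live in a separate metadata object
        // that is created once and shared by every row in the stream.
        public class RowSketch {
            public static void main(String[] args) {
                // Metadata: one instance per step output, not per row.
                String[] fieldNames = { "name", "amount", "id", "birthDate",
                                        "balance", "active", "photo" };

                // Data: one Object[] per row; the element types match the
                // "old-style" value types one for one.
                Object[] row = new Object[] {
                    "Matt",                    // String     (String)
                    Double.valueOf(123.45),    // Double     (Number)
                    Long.valueOf(42L),         // Long       (Integer)
                    new Date(),                // Date       (Date)
                    new BigDecimal("9.99"),    // BigDecimal (BigNumber)
                    Boolean.TRUE,              // Boolean    (Boolean)
                    new byte[] { 1, 2, 3 },    // byte[]     (Binary)
                };

                // An empty/missing value is simply a null element: no wrapper
                // object to allocate, nothing for the garbage collector to chase.
                row[0] = null;

                System.out.println(fieldNames[1] + " = " + row[1]);
            }
        }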

    After some core coding (conversion routines, base classes, interfaces) I
    did a first test: the copying/cloning of rows with 10/100/1000 Strings and
    50/500/5000 mixed values.
    You can run this test yourself; it's called
    org.pentaho.pdi.core.row.SpeedTest. The staggering 1.88M r/s generated on
    my machine is a far cry from the 0.45M r/s that I can squeeze out of a Row
    Generator, and that is with empty rows, while these test rows carry 10
    Strings each.

    Time to run 'String10' test 1000000 times : 531 ms (1883239 r/s)
    Time to run 'Mixed10' test 1000000 times : 3016 ms (331564 r/s)

    Time to run 'String100' test 1000000 times : 4921 ms (203210 r/s)
    Time to run 'Mixed100' test 1000000 times : 30782 ms (32486 r/s)

    Time to run 'String1000' test 1000000 times : 50422 ms (19832 r/s)
    Time to run 'Mixed1000' test 1000000 times : 339687 ms (2943 r/s)
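
    For reference, the clone operation being timed boils down to a flat array
    copy. A stripped-down sketch of that kind of loop (a stand-in, not the
    actual SpeedTest source):

        // Clone a 10-String Object[] row a million times and report the rate.
        // Only byte[] elements need a deep copy; everything else is immutable
        // and can safely share the reference.
        public class CloneSpeedSketch {
            static Object[] cloneRow(Object[] row) {
                Object[] copy = new Object[row.length];
                for (int i = 0; i < row.length; i++) {
                    if (row[i] instanceof byte[]) {
                        copy[i] = ((byte[]) row[i]).clone();
                    } else {
                        copy[i] = row[i];
                    }
                }
                return copy;
            }

            public static void main(String[] args) {
                Object[] row = new Object[10];
                for (int i = 0; i < row.length; i++) row[i] = "field" + i;

                int iterations = 1000000;
                long start = System.currentTimeMillis();
                for (int n = 0; n < iterations; n++) cloneRow(row);
                long ms = System.currentTimeMillis() - start;
                System.out.println("Time to clone " + iterations + " rows: "
                        + ms + " ms (" + (iterations * 1000L / Math.max(ms, 1))
                        + " r/s)");
            }
        }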

    Encouraged by these initial results, I converted the core Database classes
    (Database.java and the linked DatabaseMeta, the interfaces, etc.).
    This was actually easier than I initially thought, and around 2 AM last
    night I was able to run the first test.

    The test program, org.pentaho.pdi.core.row.DBSpeedTest, reads all rows
    from a database table and measures how long that takes using the new and
    the old engine.
    It runs both tests 5 times, to cancel out any stray processes I might have
    left running on my laptop as well as DB caching effects. After all, we're
    not testing the (MySQL) database here.
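
    The harness looks roughly like this (a hypothetical sketch, not the
    actual DBSpeedTest source; the connection details and table name are
    placeholders):

        import java.sql.*;

        // Time a full table scan, building a bare Object[] per row the way
        // the new engine does, and repeat to smooth out background load.
        public class DbReadTimingSketch {
            static long timeFullScan(Connection conn, String table)
                    throws SQLException {
                long start = System.currentTimeMillis();
                try (Statement stmt = conn.createStatement();
                     ResultSet rs = stmt.executeQuery("SELECT * FROM " + table)) {
                    int cols = rs.getMetaData().getColumnCount();
                    while (rs.next()) {
                        Object[] row = new Object[cols];
                        for (int i = 0; i < cols; i++) {
                            row[i] = rs.getObject(i + 1);
                        }
                    }
                }
                return System.currentTimeMillis() - start;
            }

            public static void main(String[] args) throws Exception {
                Connection conn = DriverManager.getConnection(
                        "jdbc:mysql://localhost/test", "user", "password");
                for (int run = 1; run <= 10; run++) {
                    System.out.println("Run " + run + ": "
                            + timeFullScan(conn, "customer") + " ms");
                }
                conn.close();
            }
        }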
    Nr of rows: 1.110.110
    (Numbers use '.' as thousands separator and ',' as decimal comma.)

    Run       Old engine (ms)   New engine (ms)   Diff (ms)   New % of OLD   New r/s
      1           52.281            25.625          26.656       50,99%       43.321
      2           52.953            25.562          27.391       51,73%       43.428
      3           51.406            25.469          25.937       50,46%       43.587
      4           51.500            25.297          26.203       50,88%       43.883
      5           51.297            25.375          25.922       50,53%       43.748
      6           50.188            24.828          25.360       50,53%       44.712
      7           50.328            25.156          25.172       50,02%       44.129
      8           50.609            25.109          25.500       50,39%       44.212
      9           51.000            25.375          25.625       50,25%       43.748
     10           51.250            25.125          26.125       50,98%       44.183

    Average       51.281            25.292          25.989       49,32%       43.892


    Again we see a serious speed boost, slashing the time it takes to read the
    rows in half. Interestingly, CPU usage was only slightly higher for the
    new engine; both ran at 80-100%.

    Another advantage of the data/metadata split is that we already have
    support for comments in the value metadata, and we can add plenty of other
    information at no cost to performance.
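
    That is because metadata exists once per row stream rather than once per
    value. A minimal sketch of the idea (names are illustrative, not the
    actual experimental/ API):

        // Per-field metadata, created once per stream. Because rows are bare
        // Object[] arrays, adding fields here (comments, origin step,
        // formatting masks, ...) costs nothing per row.
        public class ValueMetaSketch {
            public final String name;
            public final int type;   // e.g. TYPE_STRING, TYPE_NUMBER (illustrative)
            public String comment;   // documentation travels with the metadata
            public String origin;    // which step produced this field

            public ValueMetaSketch(String name, int type) {
                this.name = name;
                this.type = type;
            }

            public static void main(String[] args) {
                ValueMetaSketch meta = new ValueMetaSketch("amount", 1);
                meta.comment = "Invoice total in EUR";
                System.out.println(meta.name + ": " + meta.comment);
            }
        }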

    The big catch, as discussed earlier, is obviously the required API change
    in the various steps. However, in most of the situations I've come across
    so far, it's actually easier to ship data around when you don't have to
    worry about wrapping it in Value and Row objects.
    I'll continue to "port" code from src/ to experimental/ in the coming days.
    Feel free to help out with that or run these tests yourself to verify my
    findings.

    All the best,

    Matt
    ____________________________________________
    Matt Casters, Chief Data Integration
    Pentaho, Open Source Business Intelligence
    http://www.pentaho.org -- mcasters (AT) pentaho (DOT) org
    Tel. +32 (0) 486 97 29 37




  2. #2
    Matt Casters Guest

    Default RE: 3.0 : Cool first results

    Status update: we have the first steps / transformations running.

    A few quick results:

    Generate Rows --> Dummy, generate 10M empty rows:
    OLD: 669.478 rows/sec
    NEW: 1.257.387 rows/sec (x2)

    Table Input --> Dummy, read 1.110.110 rows of customer data from remote
    system
    OLD: 14.835 rows/sec (100% CPU)
    NEW: 23.286 rows/sec (100% CPU) (x1.5)

    Generate Rows (10 fields, 6 data types) --> Select Values (random select,
    re-order, first tab)
    OLD: 28.775 rows/sec
    NEW: 161.603 rows/sec (x5)

    Generate Rows (10 fields, 6 data types) --> Select Values (delete field #5)
    OLD: 56.838 rows/sec
    NEW: 233.580 rows/sec (x4)

    Generate Rows (10 fields, 6 data types) --> Select Values (Metadata, rename
    all fields, change data type of 2 fields)
    OLD: 52.239 rows/sec
    NEW: 259.067 rows/sec (x5)

    All 4 of these transformations have been put into unit tests. I'll try to
    upload the test data for the Table Input case to an H2 db too.
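
    To illustrate where the Select Values speed-up comes from: in the new
    engine, dropping a field from a row is just an array copy. A hedged
    sketch (removeField is a hypothetical helper, not the actual step code):

        // Drop one field from a row: two flat copies and no per-value wrapper
        // objects, where the old engine shuffled Value objects inside a Row.
        public class SelectValuesSketch {
            static Object[] removeField(Object[] row, int index) {
                Object[] result = new Object[row.length - 1];
                System.arraycopy(row, 0, result, 0, index);
                System.arraycopy(row, index + 1, result, index,
                                 row.length - index - 1);
                return result;
            }

            public static void main(String[] args) {
                Object[] row = { "a", 1L, 2.0, "d", "e", "drop-me", "g" };
                Object[] out = removeField(row, 5);
                System.out.println(out.length + " fields remain");
            }
        }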

    I will be attempting to convert Text File Input next, and I'm expecting
    similar results for the steps that follow.
    This is good news: a 15-20% increase in performance would have been a
    disappointment compared to the work we have to put into this, but a
    50-500% increase in speed is excellent news. I'm sure that if we tune the
    transformation engine later we can squeeze a few more percent out of it.

    All this code is located in the source trees experimental and experimental_test.

    All the best,

    Matt




  3. #3
    Biswapesh Chattopadhyay Guest

    Default Re: 3.0 : Cool first results

    Matt,

    This is exciting indeed! How can I check out and build this code? Can we
    get the multi-node transformations running on this sometime soon?

    Biswa.


  4. #4
    Roland Bouman Guest

    Default Re: 3.0 : Cool first results

    Matt,

    This is incredible! Wow!

    --
    Roland Bouman


  5. #5
    nicholas guzaldo Guest

    Default Re: 3.0 : Cool first results

    Matt,

    I gotta agree with everyone else. WOW!
    Just when you think that Kettle can't get any better, you make it faster.
    I'm going to have to check it out.

    Keep up the excellent work guys.

    Nic


  6. #6
    Matt Casters Guest

    Default RE: 3.0 : Cool first results

    Hi Biswa,

    The code is simply in SVN trunk. I created 2 separate source-code trees to
    which I'm porting the code.
    If you load the project in Eclipse you'll be able to run the unit tests in
    experimental_test yourself, except for the Table Input unit test.
    I'll try to modify the Ant build file next week.

    Be careful though: we have only converted 6 steps so far, which leaves
    60-something steps to go.
    All in all, the conversion is not a lot of work, but since I'm writing
    test code to go with each step, I probably won't be able to do more than 1
    or 2 a day.
    Then we also need to convert Spoon etc.
    I also want to split up the codebase into separate jars:

    kettle-core.jar (Row, Const, etc)
    kettle-database.jar (Database connections, etc)
    kettle-gui.jar (Core dialog & widgets collection)
    kettle-runtime.jar (Trans, Jobs, steps, Pan, Kitchen)
    kettle-spoon.jar (Spoon GUI)

    So it's still going to take a few months before we see the complete
    codebase conversion finished.
    Then again, these first results are encouraging because we can really push
    ahead with the fancy stuff when it's done: data lineage, impact analysis
    and other "complex" metadata structures can then be passed along without
    performance impact.

    All the best,

    Matt



  7. #7
    Matt Casters Guest

    Default RE: 3.0 : Cool first results

    > Can we get the multi-node transformations running on this sometime soon?

    Yes. TransSplitter, SocketReader and SocketWriter have been ported.
    However, the new serialisation code, although written, remains largely
    untested. Also, Carte has not yet been ported to 3.0.
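
    For a feel of why the row-based model suits socket transport: the
    metadata only needs to cross the wire once per connection, after which
    each row is just its bare values. A hedged, String-only sketch (not the
    actual SocketWriter/serialisation code):

        import java.io.ByteArrayOutputStream;
        import java.io.DataOutputStream;
        import java.io.IOException;

        // Write the row layout once, then stream rows as raw values with a
        // one-byte null marker per field.
        public class RowWriterSketch {
            private final DataOutputStream out;

            public RowWriterSketch(DataOutputStream out) { this.out = out; }

            // Once per connection: describe the row layout.
            public void writeMetadata(String[] names) throws IOException {
                out.writeInt(names.length);
                for (String name : names) out.writeUTF(name);
            }

            // Once per row: only the data travels, no per-value wrappers.
            public void writeStringRow(Object[] row) throws IOException {
                for (Object value : row) {
                    out.writeBoolean(value == null);
                    if (value != null) out.writeUTF((String) value);
                }
            }

            public static void main(String[] args) throws IOException {
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                RowWriterSketch writer =
                        new RowWriterSketch(new DataOutputStream(buf));
                writer.writeMetadata(new String[] { "name", "city" });
                writer.writeStringRow(new Object[] { "Matt", null });
                System.out.println(buf.size() + " bytes on the wire");
            }
        }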

    All the best,

    Matt


