Hitachi Vantara Pentaho Community Forums

Thread: RE: Object serialization

  1. #1
    Matt Casters Guest

    RE: Object serialization

    Hi Sven,

    We are not debating or questioning the data passing algorithm between steps
    at this time.

    What was being discussed was value and row serialization. At this time we
    indeed just use DataOutputStream.writeLong() to serialize a value.
    All I'm saying is that in a LOT of cases, this is pretty wasteful. Using
    strict data types is indeed something that nobody wants.
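    To make the waste concrete, here is a minimal sketch (illustrative only,
    not the serialization code Kettle actually uses) of a varint-style
    encoding that stores small values in one or two bytes, where
    DataOutputStream.writeLong() always spends eight:

        import java.io.ByteArrayOutputStream;
        import java.io.DataOutputStream;
        import java.io.IOException;

        public class VarLongSketch {
            // Zig-zag encode so small negative values also become small
            // unsigned values, then write 7 bits per byte with a
            // continuation flag in the high bit.
            static void writeVarLong(ByteArrayOutputStream out, long value) {
                long v = (value << 1) ^ (value >> 63);   // zig-zag
                while ((v & ~0x7FL) != 0) {
                    out.write((int) ((v & 0x7F) | 0x80));
                    v >>>= 7;
                }
                out.write((int) v);
            }

            public static void main(String[] args) throws IOException {
                long[] sample = { 0, 1, -1, 42, 100000, 1000000000L };

                ByteArrayOutputStream var = new ByteArrayOutputStream();
                ByteArrayOutputStream fixed = new ByteArrayOutputStream();
                DataOutputStream fixedOut = new DataOutputStream(fixed);

                for (long v : sample) {
                    writeVarLong(var, v);
                    fixedOut.writeLong(v);       // always 8 bytes per value
                }
                System.out.println("varint: " + var.size() + " bytes");   // 12
                System.out.println("fixed : " + fixed.size() + " bytes"); // 48
            }
        }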
    Just to give you an idea, I once tried to create a slowly changing
    dimension in Microsoft SSIS. It was a customer dimension with a customer
    ID and a name field, that's all.
    Together with a colleague I spent two hours on it because it kept
    complaining that the customer_id field from the source table didn't
    match the natural key customer_id field in the dimension.
    You see, it was a signed short integer versus an unsigned short integer.
    Can't have that, right? Who knows, the ID in the source system might be
    negative!
    Microsoft obviously considered it safer to catch these things before
    runtime, leading to huge frustration on our part.

    It especially bothers me because there is no need for this. In Java, at
    least, there is no memory benefit to using a byte/short/int over a long
    in an object, because memory usage is rounded up to 8 bytes anyway.
    Reference:
    https://www.sdn.sap.com/irj/sdn/webl...=/pub/wlg/5163 (**)

    That problem is something we needed to address in the Stream Lookup
    step: the standard Java Hashtable/HashMap implementations use massive
    amounts of memory (which is useless overhead in our case). We slashed
    memory consumption by a factor of 4 or 5 in certain cases (we now
    consume around 30 bytes per long/long pair, including overhead).

    If people have ideas on how to limit this even further in the next major
    release, please let us know.
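    For illustration only (a sketch, not the Drools-derived code we actually
    use), the general idea is an open-addressing map over primitive long[]
    arrays, which avoids the per-entry objects and boxed Longs that make
    java.util.HashMap so expensive:

        /**
         * Minimal open-addressing long->long map sketch: no resizing, no
         * deletion, and key 0 is reserved as the "empty slot" marker.
         * Two long[] arrays cost 16 bytes per slot, so at roughly 50% load
         * this is about 30 bytes per stored pair.
         */
        public class LongLongMap {
            private final long[] keys;
            private final long[] values;
            private final int mask;

            public LongLongMap(int capacityPowerOfTwo) {
                keys = new long[capacityPowerOfTwo];
                values = new long[capacityPowerOfTwo];
                mask = capacityPowerOfTwo - 1;
            }

            public void put(long key, long value) {
                int idx = mix(key) & mask;
                while (keys[idx] != 0 && keys[idx] != key) {
                    idx = (idx + 1) & mask;          // linear probing
                }
                keys[idx] = key;
                values[idx] = value;
            }

            public long get(long key, long missing) {
                int idx = mix(key) & mask;
                while (keys[idx] != 0) {
                    if (keys[idx] == key) {
                        return values[idx];
                    }
                    idx = (idx + 1) & mask;
                }
                return missing;
            }

            private static int mix(long key) {
                long h = key * 0x9E3779B97F4A7C15L;  // multiplicative hash
                return (int) (h ^ (h >>> 32));
            }
        }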

    To get back on topic: object serialization is only needed when you need
    to store data on disk for a while or when you want to send data to a
    remote host (for our clustering solution).
    Object serialization is expensive because I/O is slow, so you want to
    limit it as much as possible. Kettle in general was designed to limit
    I/O as much as possible. For example, if you don't need to do a sort in
    Kettle, you don't do it, because it generates I/O.
    However, there are cases where you simply have to do it, and those we
    want to solve as well as possible.

    We can also use object serialization to save memory, because of (**):
    instead of storing 20 Value objects in a Row, each carrying a lot of
    overhead, we just keep the serialized bytes (byte[]).
    The obvious disadvantage is that you need to de-serialize the row during
    processing, so it's only interesting if you want to go for very large
    in-memory buffers and the like.
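    As a rough sketch of that trade-off (a hypothetical class, not the
    actual Row/Value API), a buffer can keep each row as a byte[] and only
    turn it back into values when the row is actually processed:

        import java.io.ByteArrayInputStream;
        import java.io.ByteArrayOutputStream;
        import java.io.DataInputStream;
        import java.io.DataOutputStream;
        import java.io.IOException;
        import java.util.ArrayList;
        import java.util.List;

        public class SerializedRowBuffer {
            private final List<byte[]> rows = new ArrayList<byte[]>();

            /** Store a row of long values as raw bytes instead of objects. */
            public void add(long[] row) throws IOException {
                ByteArrayOutputStream bos = new ByteArrayOutputStream();
                DataOutputStream out = new DataOutputStream(bos);
                out.writeInt(row.length);
                for (long v : row) {
                    out.writeLong(v);
                }
                rows.add(bos.toByteArray());
            }

            /** De-serialize on demand: the CPU cost paid for the smaller
             *  memory footprint. */
            public long[] get(int index) throws IOException {
                DataInputStream in = new DataInputStream(
                        new ByteArrayInputStream(rows.get(index)));
                long[] row = new long[in.readInt()];
                for (int i = 0; i < row.length; i++) {
                    row[i] = in.readLong();
                }
                return row;
            }
        }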

    > Did you ever think about an "Expert" and a "Basic" modus for the steps?
    > The "Expert" modus in this case would allow choosing between different
    > integer types, while the "Basic" modus always uses long types.

    I think about it constantly, but I hate these "expert" modes even more
    than the problem they claim to solve. Take for example the Text File
    Input step. What parts would you consider "expert"?
    A beginner working in an office in Paris who receives a paged text file
    from an affiliate in Thailand would need almost all the options in that
    step.
    I see a brighter future for nice wizards that ask the right questions
    (CSV/fixed, etc.) and intelligently set defaults.
    There is also still a lot of room for clearer layouts in the dialogs.
    We should work on that in the future.

    All the best,

    Matt



    -----Original Message-----
    From: kettle-developers (AT) googlegroups (DOT) com
    [mailto:kettle-developers (AT) googlegroups (DOT) com] On Behalf Of sven.thiergen
    Sent: Monday, December 11, 2006 9:05 AM
    To: kettle-developers
    Subject: Re: Object serialization


    One basic question about Java serialization first: when serializing an
    integer or a byte, I assume the data is written to the OutputStream just
    as it is. Is that correct, or is it packed/scrambled in some way?
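    (One way to find out would be a small test like this, which would show
    writeInt() emitting 4 big-endian bytes and writeLong() 8, whatever the
    value:)

        import java.io.ByteArrayOutputStream;
        import java.io.DataOutputStream;
        import java.io.IOException;

        public class FixedWidthDemo {
            public static void main(String[] args) throws IOException {
                ByteArrayOutputStream bos = new ByteArrayOutputStream();
                DataOutputStream out = new DataOutputStream(bos);
                out.writeInt(1);     // 4 bytes: 00 00 00 01 (big-endian)
                out.writeLong(1L);   // 8 bytes, regardless of the value
                System.out.println(bos.size());   // prints 12
            }
        }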

    Next, a basic question about the I/O: we are talking about the data
    passed from step to step, right? Isn't that all happening in RAM anyway?
    If so, do you think that when handling and passing data purely in RAM,
    the I/O is affected so seriously that it's worth distinguishing between
    the integer types?

    If it's worth it, I think one can offer the user a mechanism to select
    the correct integer type. With bit operations one can validate the type
    (at least I think so) and give a warning in case of problems. Some
    problems do occur, as said, when doing calculations with different
    integer types, but the user may again select what kind of integer he
    thinks is appropriate as the operation's result. Indeed, in many cases a
    4-byte integer is completely sufficient.
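    For example, a validation like the one I mean could be as simple as
    checking whether the upper 32 bits of a long are only sign extension
    (a hypothetical helper, not existing Kettle code):

        public class IntRangeCheck {
            /** True if the value survives a round trip through int,
             *  i.e. fits in 4 bytes. */
            static boolean fitsInInt(long value) {
                return (long) (int) value == value;
            }

            public static void main(String[] args) {
                System.out.println(fitsInInt(100000L));      // true
                System.out.println(fitsInInt(-1L));          // true
                System.out.println(fitsInInt(3000000000L));  // false: overflow
            }
        }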

    I compare this with databases: the (experienced!) user is allowed to
    choose between different integer types, and so he is allowed to mess
    things up. But that's always the problem - it's a question of what kind
    of users use Kettle.

    Did you ever think about an "Expert" and a "Basic" modus for the steps?
    The "Expert" modus in this case would allow choosing between different
    integer types, while the "Basic" modus always uses long types.






  2. #2
    Matt Casters Guest

    RE: Object serialization

    > Regarding I/O - in which way is data accessed? Do you read/write single
    > values or large blocks of data? When doing clustering I think the
    > answer is "large blocks of data"; how about the "Sort" step? If you
    > access small chunks of data it doesn't matter whether it's 8 or 4
    > bytes; the bottleneck is not the data size but the I/O operation as a
    > whole. I/O tuning will only be effective when transferring large
    > chunks of data (I am no expert, just what I think).

    I don't think that is true, since we make extensive use of buffering and
    also compress the output. Also, like I said, it's not just I/O; memory
    use and network bandwidth can be important too.
    Lately I've been testing Kettle clustering on the Amazon Elastic Compute
    Cloud. A single one of these servers can output around 4,000 rows/sec
    using MySQL.
    If you send the data over sockets (using serialization) from a master to
    5 slave servers, you get 5 x 4,000 row inserts/sec. That scales nicely.
    At a certain point, though, the serialization becomes a new bottleneck
    for scalability. (Processing 144M rows/hour is not "shabby", but there
    are use cases that demand even higher performance.)
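    Just to sketch what "buffering and compressing" the socket output means
    in practice (hypothetical host and port, not the actual Kettle
    clustering code):

        import java.io.BufferedOutputStream;
        import java.io.DataOutputStream;
        import java.io.IOException;
        import java.net.Socket;
        import java.util.zip.GZIPOutputStream;

        public class RowSender {
            public static void main(String[] args) throws IOException {
                // "slave-host" and 40000 are made-up values for this sketch.
                Socket socket = new Socket("slave-host", 40000);
                DataOutputStream out = new DataOutputStream(
                        new GZIPOutputStream(
                                new BufferedOutputStream(
                                        socket.getOutputStream(), 64 * 1024)));
                try {
                    for (long i = 0; i < 1000; i++) {
                        out.writeLong(i);   // one value per "row" in this toy
                    }
                } finally {
                    out.close();    // finishes the GZIP stream, flushes buffer
                    socket.close();
                }
            }
        }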

    > A final word: you may be right with those "expert" modes. One could
    > talk for hours about their usefulness and there is definitely no easy
    > answer. Just the same as with this problem: you'll have to make a
    > decision and keep your fingers crossed.

    I would rather use profilers and test cases to make that decision ;-)
    We have learned a lot over the past year; we know the weaknesses in the
    Kettle transformation engine and are investigating ways to improve on
    them. We just have to make sure that:
    *) we remain backward compatible with existing transformations
    *) we don't introduce new bottlenecks

    Being open source, we also have a couple of big advantages over closed
    source:
    *) we can admit to our mistakes; it wouldn't actually make sense not to,
    as everything is out in the open anyway
    *) we can look at other open source projects and take the best solutions.
    Special thanks go out to Mark Proctor & Peter Van Weert from the Drools
    project (JBoss Rules) for the memory-efficient hashtable they wrote. We
    are using it in Kettle with success.
    *) there is little to no market/marketing/financial pressure: we can
    focus on the things that matter.

    Matt


    -----Original Message-----
    From: kettle-developers (AT) googlegroups (DOT) com
    [mailto:kettle-developers (AT) googlegroups (DOT) com] On Behalf Of sven.thiergen
    Sent: Monday, December 11, 2006 10:25 AM
    To: kettle-developers
    Subject: Re: Object serialization


    Okay, I understand a lot better now. Still no easy solution for this.
    You suggest some efficiency algorithms which are hard to implement in
    most cases and need some serious testing, but maybe they speed the
    system up by a factor of 2 or 3.

    It seems as if the data type is only of importance when real I/O is
    involved. So it might be a solution to offer the experienced user an
    interface to set the data type, but only when I/O is involved (sorting
    and clustering). After the I/O is done, it's all converted back to type
    "long". That's rather easy and may give us some hints about the
    effectiveness.

    Regarding I/O - in which way is data accessed? Do you read/write single
    values or large blocks of data? When doing clustering I think the answer
    is "large blocks of data"; how about the "Sort" step? If you access
    small chunks of data it doesn't matter whether it's 8 or 4 bytes; the
    bottleneck is not the data size but the I/O operation as a whole. I/O
    tuning will only be effective when transferring large chunks of data
    (I am no expert, just what I think).

    Can you typically access all your data before actually processing it?
    In other words, can you examine the 1,000,000 fields of a certain column
    - which are of type long (but may fit in type integer or even short) -
    before actually doing anything with them?

    If so, it would be possible to implement some kind of ideal packing
    algorithm: count the number of *different* values (over 1,000,000
    rows!), assigning each a number, starting at 1 and incrementing on each
    new value. In many cases 2 bytes will be sufficient (covering 65,536
    different values) for 1,000,000 actual values. On both sides (saving
    side / loading side) the values appear as type long to the user; the
    magic happens inside.
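    A sketch of that packing idea (plain dictionary encoding, nothing that
    exists in Kettle today): every distinct long gets a small code, only the
    2-byte codes are stored per row, and the dictionary maps them back to
    longs when needed.

        import java.util.ArrayList;
        import java.util.HashMap;
        import java.util.List;
        import java.util.Map;

        public class DictionaryEncoder {
            private final Map<Long, Short> codes = new HashMap<Long, Short>();
            private final List<Long> dictionary = new ArrayList<Long>();

            /** Replace a long by a 2-byte code; valid while there are at
             *  most 65,536 distinct values in the column. */
            public short encode(long value) {
                Short code = codes.get(value);
                if (code == null) {
                    code = (short) dictionary.size();
                    codes.put(value, code);
                    dictionary.add(value);
                }
                return code;
            }

            /** The caller still sees a long; the code is internal. */
            public long decode(short code) {
                return dictionary.get(code & 0xFFFF);
            }
        }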

    A final word: you may be right with those "expert" modes. One could talk
    for hours about their usefulness and there is definitely no easy answer.
    Just the same as with this problem: you'll have to make a decision and
    keep your fingers crossed.






  3. #3
    Matt Casters Guest

    RE: Object serialization

    Jens, that is exactly what we did ;-)

    -----Original Message-----
    From: kettle-developers (AT) googlegroups (DOT) com
    [mailto:kettle-developers (AT) googlegroups (DOT) com] On Behalf Of Jens Bleuel
    Sent: Thursday, December 14, 2006 12:11 PM
    To: kettle-developers (AT) googlegroups (DOT) com
    Subject: AW: Object serialization


    Hi Matt & all,

    > That problem is something we needed to address in the Stream Lookup
    > step: the standard Java Hashtable/HashMap implementations use massive
    > amounts of memory (which is useless overhead in our case). We slashed
    > memory consumption by a factor of 4 or 5 in certain cases (we now
    > consume around 30 bytes per long/long pair, including overhead).
    >
    > If people have ideas on how to limit this even further in the next
    > major release, please let us know.


    I guess the overhead mentioned above is Hashtable/HashMap related, right?

    Another overhead (also mentioned here before, I think by Biswa) is that
    Kettle stores both the Value-Data and the Value-Meta-Data. Since a lot
    of steps need the Value-Meta-Data, we should not change this basic
    architecture at this time.

    What about changing this within a single step like the Stream Lookup?
    Save the Value-Meta-Data once per Value and store only the Value-Data
    for sorting etc. After processing, the step merges the "single stored"
    Value-Meta-Data with the Value-Data and puts it back onto the stream.

    Any thoughts?

    Have a nice pre-christmas-time,

    Jens





  4. #4
    Matt Casters Guest

    RE: Object serialization

    The state of affairs is that all steps that serialize data compress it
    and store only the data (sometimes plus one row of metadata).
    For the memory problem it was especially Stream Lookup that was
    affected, so we did that one first.
    Once we have a nice (fast, memory-efficient) way of storing rows in
    memory, we will extend it to the other steps that need it (database
    lookup, etc.).

    As far as the metadata inconsistencies are concerned: that is now
    unrelated to serialization. No step can handle mixed rows correctly, and
    I don't actually think there is a way to handle that. I already told you
    guys that we are back to storing 8 bytes per integer.
    I would propose adding more metadata checks in the GUI to cover those
    issues later.

    Matt


    -----Original Message-----
    From: kettle-developers (AT) googlegroups (DOT) com
    [mailto:kettle-developers (AT) googlegroups (DOT) com] On Behalf Of Jens Bleuel
    Sent: Thursday, December 14, 2006 12:37 PM
    To: kettle-developers (AT) googlegroups (DOT) com
    Subject: AW: Object serialization


    Good work ;-)

    Did you do this for all "sort-/memory-related" steps?

    Is there a check for Meta-Data inconsistencies and/or could this issue also
    be checked with the "safe mode"?

    Keep on hacking,
    Jens
