Hitachi Vantara Pentaho Community Forums

Thread: RE: AW: Object serialization

  1. #1
    Matt Casters Guest

    Default RE: AW: Object serialization

    Thanks Sven.

    Well, there is only so much I want to sacrifice for performance anyway.

    Maybe we should look into our own algorithm for storing numbers. Oracle seems
    to have a really cool system where they store only the minimal amount of
    data.
    Actually, I think you can spend a considerable amount of CPU calculating
    certain things just to avoid I/O.
    It's OK for small amounts of data, but when you need to serialize millions of
    rows, a small gain per row can mean minutes of processing time.

    Also, I think we could do more with bit-masks.
    Now, a DataOutputStream.writeBoolean() results in a full byte being written
    out.

    Take the example of isNull(): each field has this bit. Suppose you store the
    data at the row level and you have 10 field values; you could do it in 2
    bytes, not 10:
    11111111 11000000
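
    A minimal sketch of that idea in Java (a hypothetical helper, not Kettle
    code): pack the per-field null flags into a bitmask instead of writing one
    full byte per flag.

    ```java
    // Hypothetical sketch: pack per-field null flags into a bitmask
    // instead of one byte per boolean.
    public class NullMask {

        // Pack n flags into ceil(n/8) bytes, most significant bit first.
        public static byte[] pack(boolean[] flags) {
            byte[] mask = new byte[(flags.length + 7) / 8];
            for (int i = 0; i < flags.length; i++) {
                if (flags[i]) {
                    mask[i / 8] |= (byte) (0x80 >> (i % 8));
                }
            }
            return mask;
        }

        // Read flag i back out of the packed mask.
        public static boolean get(byte[] mask, int i) {
            return (mask[i / 8] & (0x80 >> (i % 8))) != 0;
        }
    }
    ```

    Ten fields that are all null come out as the two bytes 11111111 11000000,
    exactly the example above.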
    The same goes for numbers. Suppose you store 3 bits per value to set an
    integer type:

    000 : 1-byte integer
    001 : 2-byte integer
    010 : 3-byte integer
    011 : 4-byte integer
    100 : 5-byte integer
    101 : 6-byte integer
    110 : 7-byte integer
    111 : 8-byte integer

    You would be storing meta-data for safety at a small overhead, but I think
    you can win it back in efficiency.
    I think Oracle does something similar in their database. If you have a
    Number(38) data type there and you store values between 1 and 100, for
    example, you only use about 1.2 bytes.
    I'm sure they use a similar system to reduce I/O as much as possible. The
    difference is that they also store minimal amounts of data for floats,
    doubles, BigDecimals, etc.
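
    For illustration, here is one way such a length-tagged encoding could look
    in Java (a hypothetical sketch, not Kettle's actual serializer; for
    simplicity it spends a whole byte on the length tag instead of packing it
    into 3 bits):

    ```java
    import java.io.DataInputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;

    // Hypothetical sketch of a length-tagged long encoding: a tag byte
    // (0 => 1 payload byte ... 7 => 8 payload bytes) followed by only the
    // minimal number of bytes needed to hold the value.
    public class VarLengthLong {

        // Smallest number of bytes that can hold v as a signed value.
        static int byteLength(long v) {
            for (int n = 1; n < 8; n++) {
                long min = -(1L << (8 * n - 1));
                long max = (1L << (8 * n - 1)) - 1;
                if (v >= min && v <= max) return n;
            }
            return 8;
        }

        public static void write(DataOutputStream out, long v) throws IOException {
            int n = byteLength(v);
            out.writeByte(n - 1);               // the length tag
            for (int i = n - 1; i >= 0; i--) {  // big-endian payload
                out.writeByte((int) (v >> (8 * i)));
            }
        }

        public static long read(DataInputStream in) throws IOException {
            int n = in.readByte() + 1;
            long v = in.readByte();              // sign-extend the top byte
            for (int i = 1; i < n; i++) {
                v = (v << 8) | (in.readByte() & 0xFFL);
            }
            return v;
        }
    }
    ```

    A value like 100 then costs 2 bytes on disk instead of 8, at the price of
    the extra length computation per value.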

    It's more complex, though, with all the bit-manipulation going on. However,
    I'm sure it will be worth it.
    A few clever methods could take care of the bit-shifting involved.
    Something like

    public void addBits(BitBuffer buffer, int bits)
    public int getBits(BitBuffer buffer)
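
    One possible shape for such a class (a hypothetical sketch; it stores the
    bits in a List<Boolean> for clarity, where a real implementation would pack
    them into a long[]):

    ```java
    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical BitBuffer sketch: append a fixed number of bits at a
    // time and read them back in the same order.
    public class BitBuffer {
        private final List<Boolean> bits = new ArrayList<>();
        private int readPos = 0;

        // Append the lowest 'count' bits of 'value', most significant first.
        public void addBits(int value, int count) {
            for (int i = count - 1; i >= 0; i--) {
                bits.add(((value >> i) & 1) != 0);
            }
        }

        // Read the next 'count' bits back as an int.
        public int getBits(int count) {
            int value = 0;
            for (int i = 0; i < count; i++) {
                value = (value << 1) | (bits.get(readPos++) ? 1 : 0);
            }
            return value;
        }

        public int size() {
            return bits.size();
        }
    }
    ```

    With this, two 3-bit type tags occupy 6 bits instead of the 2 bytes that
    two writeByte() calls would cost.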

    In the meantime, since I value Sven's and others' opinions too much to ignore
    them, we've gone back to always storing 8 bytes.

    Take care,

    Matt

    -----Original Message-----
    From: kettle-developers (AT) googlegroups (DOT) com
    [mailto:kettle-developers (AT) googlegroups (DOT) com] On Behalf Of list123 (AT) pandora (DOT) be
    Sent: Wednesday, December 06, 2006 5:48 PM
    To: kettle-developers
    Subject: Re: AW: Object serialization



    Maybe I'm the odd duck, but I would keep it at 8 bytes for all. Saving them
    at different sizes feels like cutting corners to make things possibly a
    little bit speedier and smaller, like creating our own "Y2K" problem.

    In another project I worked on, we did something similar and had to switch
    back:
    - If you do calculations between these numbers, what size do you save the
    result as? If you only allow one kind of result for all "rows", you can't
    decide that from a single occurrence: one calculation may fit in 2 bytes,
    while the next may need 8.
    - With different sizes, logic needs to be in place to determine how to read
    and save the numbers, which also takes time.

    Regards,
    Sven





    --~--~---------~--~----~------------~-------~--~----~
    You received this message because you are subscribed to the Google Groups "kettle-developers" group.
    To post to this group, send email to kettle-developers (AT) googlegroups (DOT) com
    To unsubscribe from this group, send email to kettle-developers-unsubscribe (AT) g...oups (DOT) com
    For more options, visit this group at http://groups.google.com/group/kettle-developers?hl=en
    -~----------~----~----~----~------~----~------~--~---

  2. #2
    sven.thiergen Guest

    Default Re: Object serialization

    One basic question about Java serialization first: when serializing an
    integer or a byte, I assume the data is written as-is to the OutputStream.
    Is that correct, or is it packed/scrambled in some way?
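
    For what it's worth, DataOutputStream does write primitives as-is: writeInt
    always emits 4 bytes big-endian, writeLong 8 bytes, and writeBoolean a full
    byte, which is easy to verify:

    ```java
    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;

    // Quick check of how DataOutputStream writes primitives: unpacked,
    // big-endian, at their full fixed width regardless of the value.
    public class SerializationSize {
        public static int bytesWritten() throws IOException {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bos);
            out.writeInt(1);        // always 4 bytes
            out.writeLong(1L);      // always 8 bytes
            out.writeBoolean(true); // a full byte per boolean
            return bos.size();      // 4 + 8 + 1 = 13
        }
    }
    ```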

    Next, a basic question about the I/O: we are talking about the data passed
    from step to step, right? Isn't that all happening in RAM anyway? If so, do
    you think that when handling and passing data purely in RAM, the I/O is
    affected so seriously that it's worth differentiating between the integer
    types?

    If it's worth it, I think one can offer the user a mechanism to select the
    correct integer type. With bit operations one can validate the type (at
    least I think so) and give a warning in case of problems. Some problems do
    occur, as said, when doing calculations with different integer types. But
    the user could again select what kind of integer he thinks is appropriate
    for the operation's result. Indeed, in many cases 4-byte integers completely
    suffice.

    I compare this with databases: the (experienced!) user is allowed to choose
    between different integer types, and so he is allowed to mess things up. But
    that's always the problem; it's a question of what kind of users use Kettle.

    Did you ever think about an "Expert" and a "Basic" mode for the steps? The
    "Expert" mode in this case would allow choosing between different integer
    types, while the "Basic" mode always uses longs.



  3. #3
    sven.thiergen Guest

    Default Re: Object serialization

    Okay, I understand a lot better now. Still no easy solution for this. You
    suggest some efficiency algorithms which are hard to implement in most cases
    and need some serious testing, but maybe they speed up the system by a
    factor of 2 or 3.

    It seems the data type only matters when real I/O is involved. So it might
    be a solution to offer the experienced user an interface to set the data
    type, but only when I/O is involved (sorting and clustering). After the I/O
    is done, everything is converted back to type "long". That's rather easy and
    may give us some hints about the effectiveness.

    Regarding I/O: how is the data accessed? Do you read and write single values
    or large blocks of data? When doing clustering I think the answer is "large
    blocks of data"; how about the "Sort" step? If you access small chunks of
    data it doesn't matter whether it's 8 or 4 bytes; the bottleneck is not the
    data size but the I/O operation as a whole. I/O tuning will only be
    effective when transferring large chunks of data (I am no expert, this is
    just what I think).

    Can you typically access all your data before actually processing it? In
    other words, can you examine the 1,000,000 fields of a certain column, which
    are of type long (but may fit in type integer or even short), before
    actually doing anything with them?

    If so, it would be possible to implement some kind of ideal packing
    algorithm: count the number of *different* values (over 1,000,000 rows!),
    assigning each a number by starting at 1 and incrementing on each new value.
    In many cases 2 bytes (covering 65,536 different values) will be sufficient
    for 1,000,000 actual values. On both sides (saving side / loading side) the
    values appear as type long to the user; the magic happens inside.
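
    That packing idea can be sketched as a small dictionary encoder
    (hypothetical code; the class and method names are made up for
    illustration):

    ```java
    import java.util.LinkedHashMap;
    import java.util.Map;

    // Hypothetical sketch: dictionary-encode a column of longs by numbering
    // each distinct value. With at most 65,536 distinct values each row needs
    // only a 2-byte code (a Java char here); the caller still sees longs.
    public class DictionaryColumn {
        private final long[] dictionary;  // code -> original value
        private final char[] codes;       // one 2-byte code per row

        public DictionaryColumn(long[] column) {
            Map<Long, Character> lookup = new LinkedHashMap<>();
            codes = new char[column.length];
            for (int i = 0; i < column.length; i++) {
                Character code = lookup.get(column[i]);
                if (code == null) {
                    if (lookup.size() == 65536) {
                        throw new IllegalArgumentException("more than 65,536 distinct values");
                    }
                    code = (char) lookup.size();
                    lookup.put(column[i], code);
                }
                codes[i] = code;
            }
            dictionary = new long[lookup.size()];
            int i = 0;
            for (long v : lookup.keySet()) {
                dictionary[i++] = v;
            }
        }

        // The user still reads longs; the 2-byte codes stay inside.
        public long get(int row) {
            return dictionary[codes[row]];
        }

        public int distinctValues() {
            return dictionary.length;
        }
    }
    ```

    Only the codes array (2 bytes per row) plus the small dictionary would need
    to hit the disk, instead of 8 bytes per row.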

    A final word: you may be right about those "expert" modes. One could talk
    for hours about their usefulness, and there is definitely no easy answer.
    Just the same as with this problem: you'll have to make a decision and keep
    your fingers crossed.


