Hitachi Vantara Pentaho Community Forums

Thread: Suggestions on increasing throughput of clustered transformation

  1. #1
    DEinspanjer Guest

    I'm trying to push 2 TB of data from an ETL into two slave nodes that are landing the data.

    My data is averaging about 256 bytes per row.
    The main part of the transformation can process as much as 30k rows per second (rps), and the landing step is a Text File Output, which is certainly capable of at least that much. However, I'm finding that the communication between the master and slaves is throttled at about 8k rps. That puts my throughput at about 0.02 Gbps per node, which is way too slow.
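    For reference, the arithmetic behind those figures can be sketched as below. This is only a back-of-the-envelope check: it assumes 2 TB means 2e12 bytes, treats 8k rps as the rate of a single stream, and ignores protocol overhead. The function names are made up for illustration.

```python
def throughput_gbps(rows_per_second: float, bytes_per_row: float) -> float:
    """Convert a row rate into payload throughput in gigabits per second."""
    bits_per_second = rows_per_second * bytes_per_row * 8
    return bits_per_second / 1e9


def transfer_hours(total_bytes: float, rows_per_second: float, bytes_per_row: float) -> float:
    """Hours needed to move total_bytes at the given row rate."""
    seconds = total_bytes / (rows_per_second * bytes_per_row)
    return seconds / 3600


# Rate observed on the master-to-slave link (8k rps, 256 bytes per row).
observed = throughput_gbps(8_000, 256)

# Rate the main part of the transformation can sustain (30k rps).
potential = throughput_gbps(30_000, 256)

# Assumption: 2 TB = 2e12 bytes, moved by one stream at the observed rate.
hours_at_observed = transfer_hours(2e12, 8_000, 256)

print(f"observed:  {observed:.4f} Gbps")
print(f"potential: {potential:.4f} Gbps")
print(f"2 TB at the observed rate: {hours_at_observed:.0f} hours")
```

    Even the 30k rps figure is a small fraction of a gigabit link, so under these assumptions the raw network bandwidth is unlikely to be the limit.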

    I tried tweaking the settings on the cluster dialog. Increasing the buffer size makes sense to me because it should keep the step from ever having to resize the storage array. I am not sure whether a large, medium, or small number would be best for the flush though.

    Also, I'm dealing with about 30 fields of data, all of them created in the transformation (hence no laziness). I was wondering if I might be able to increase the throughput by turning those fields back into lazy byte fields, so that the step in the slave doesn't have to reconvert the data from bytes to objects again. Does this make sense?

    Would there be some way that I could throw partitioning into the mix to maybe open up more master<=>slave pipes and then merge the rows together down into the text file output step?

  2. #2
    Join Date
    Nov 1999
    Posts
    9,729

    Daniel, it makes a lot of sense. Object serialization and de-serialization can be a bottleneck depending on the type of data you are carrying around.
    Keeping the data serialized (lazy conversion) can make a difference.
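    To illustrate the idea (this is not how Kettle implements it, just a minimal sketch with a hypothetical `LazyField` class): a field keeps its raw bytes and only decodes them when some step actually asks for the value. A pass-through output step can then write the bytes without ever building the objects.

```python
class LazyField:
    """Holds the raw bytes of a field and decodes them only when asked."""

    decode_count = 0  # class-wide counter, only here to make the cost visible

    def __init__(self, raw: bytes, encoding: str = "utf-8"):
        self.raw = raw
        self.encoding = encoding
        self._value = None
        self._decoded = False

    def value(self) -> str:
        """Decode on first access and cache the result."""
        if not self._decoded:
            self._value = self.raw.decode(self.encoding)
            self._decoded = True
            LazyField.decode_count += 1
        return self._value


def write_row(fields, sink: bytearray, separator: bytes = b";") -> None:
    """Pass-through output: joins the raw bytes without decoding any field."""
    sink.extend(separator.join(f.raw for f in fields) + b"\n")


# Sample row with made-up values.
row = [LazyField(b"2008-06-01"), LazyField(b"GET"), LazyField(b"/index.html")]
out = bytearray()
write_row(row, out)

print(out)                     # the row was written as bytes
print(LazyField.decode_count)  # no field had to be decoded to write it
```

    The saving only materializes if no step between the source and the output needs the decoded value.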

    I also found that NOT compressing data can help speed up things on the sockets.
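    A rough way to reason about this, as a simplified model rather than a measurement: the pipe moves at the speed of its slowest stage, so compression only pays off when the link is slower than the compressor. The numbers below are assumptions (a gigabit link at roughly 125e6 bytes per second, 256-byte rows, and a purely hypothetical compressor rate of 50k rows per second).

```python
def effective_rows_per_second(
    link_bytes_per_s: float,
    row_bytes: float,
    compression_ratio: float,
    compressor_rows_per_s: float,
) -> float:
    """Rows per second through a compression stage followed by a network link.

    compression_ratio is compressed size divided by original size
    (1.0 means no compression). The slower stage sets the overall rate.
    """
    link_rows_per_s = link_bytes_per_s / (row_bytes * compression_ratio)
    return min(compressor_rows_per_s, link_rows_per_s)


GIGE_BYTES_PER_S = 125e6  # assumption: ideal gigabit link, no protocol overhead

# No compression: the link is the only limit.
uncompressed = effective_rows_per_second(GIGE_BYTES_PER_S, 256, 1.0, float("inf"))

# Hypothetical compressor: 4x smaller rows, but only 50k rows per second.
compressed = effective_rows_per_second(GIGE_BYTES_PER_S, 256, 0.25, 50_000)

print(f"uncompressed: {uncompressed:,.0f} rows/s")
print(f"compressed:   {compressed:,.0f} rows/s")
```

    On a slow WAN link the same model would favor compression, so it is worth measuring both settings on the actual setup.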

    "Would there be some way that I could throw partitioning into the mix to maybe open up more master<=>slave pipes and then merge the rows together down into the text file output step?"
    I'm not sure. More sockets == more overhead. Obviously, if you have CPU power to spare, the story is different. It depends on your setup really.

    Matt

  3. #3
    DEinspanjer Guest

    I certainly have plenty of CPU to allocate to this task.
    I also have plenty of pipe to allocate to them. The nodes are all on a GigE switch.

    Do you think a high, medium, or low flush number would be good? Should it be an absolute number or based on the number of rows I'm pushing?

  4. #4
    DEinspanjer Guest

    Is there any way to take fields created by steps in the master transformation and make them lazy before they are landed by the slave Text File Output steps?

    I've tried using the Meta-data tab of a Select Values step to change all the strings to Binary, and made sure that the TFO steps have fast output enabled. That didn't buy me a noticeable increase in throughput.

  5. #5
    Join Date
    Nov 1999
    Posts
    9,729

    Daniel, I don't think you can change fields TO binary storage type, just FROM binary to normal.
    There is a method these days in ValueMeta to do it, so perhaps we should allow this to happen as well.

    Flushing: early and often might be better. What you actually get in longer clustered chains of steps is a sort of wave effect because of all the buffering.
    Suppose you have a compressed buffer somewhere that only sends blocks of data over the wire. In the meantime, a slave is waiting. Then it gets the block of data, which creates a spike of work, then it waits again, and so on. Flushing early might help to get a more constant flow of data.
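    The wave effect can be sketched with a toy model (all names and numbers here are illustrative assumptions, and it ignores network latency and the per-flush overhead that makes very small flushes expensive): rows are produced at a steady rate, but the receiver only sees them when the sender's buffer is flushed.

```python
def delivery_schedule(total_rows: int, rows_per_second: float, flush_every: int):
    """Return (arrival_time_seconds, batch_size) for each flush of a buffered sender.

    Rows are produced at a steady rate; nothing leaves the sender until
    flush_every rows have accumulated (or the input ends).
    """
    schedule = []
    sent = 0
    while sent < total_rows:
        batch = min(flush_every, total_rows - sent)
        sent += batch
        schedule.append((sent / rows_per_second, batch))
    return schedule


def longest_wait(schedule) -> float:
    """Longest time the receiver sits idle between deliveries (or before the first)."""
    previous = 0.0
    worst = 0.0
    for arrival, _batch in schedule:
        worst = max(worst, arrival - previous)
        previous = arrival
    return worst


# 40k rows produced at 8k rps, flushed rarely versus often.
rare = delivery_schedule(40_000, 8_000, 20_000)
often = delivery_schedule(40_000, 8_000, 1_000)

print(f"flush every 20k rows: {len(rare)} bursts, longest idle {longest_wait(rare)} s")
print(f"flush every 1k rows:  {len(often)} bursts, longest idle {longest_wait(often)} s")
```

    With the large flush the receiver alternates between seconds of idling and a 20k-row spike; with the small flush the same rows arrive as a nearly constant trickle.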

    It all depends very much on the configuration. I once wrote to a professor at a local Belgian university whom I knew from years ago. He specializes in parallel computing and I wanted him to look at the issue. Unfortunately I never got a reply, and then I lacked the time to ask again.

    Anyway, perhaps later on we can create some sort of messaging channel between remote steps to signal a need for new data. For now, tinker with the flush :-)

    Cheers,
    Matt


Copyright © 2005 - 2019 Hitachi Vantara Corporation. All Rights Reserved.