Hitachi Vantara Pentaho Community Forums

Thread: filter

  1. #1

    I use the filter step frequently in my jobs.
    "Change number of copies to start..." is available on this step. Is there a recommended number of copies for performance?

  2. #2
    DEinspanjer Guest

    It can be done. Here is what I consider when dealing with problems or questions like this:

    1. Always start from scratch (default rowset size, every step running just one copy) and measure where your first bottleneck is. I have made some other posts where I detail a guideline for optimizing a transformation; a forum search should turn them up.
    2. Whenever possible, keep the number of threads in your transformation to a functional size. It is easy to end up with more than a hundred threads if you have a fairly large transformation and you try to set up several steps to run in multiple copies. Notice I don't mention a number for functional size. That depends on your hardware, OS, JVM, and your data.
    3. When running some steps in multiple copies, keep in mind that data can be "pipelined" and there is a small penalty to merging pipelines.
    This means that if you have ten files to read, and you use a Text File Input step running in two copies to read them, you will have two pipelines coming out of that step. It is typically more efficient to have your next step run in two copies as well, so that each copy has its own dedicated pipeline from the Text File Input step. If your next step runs in just one copy, there will be some synchronization overhead from having the two pipelines both feed rows into that step.
    4. A filter step can run in multiple copies (and if you have multiple pipelines upstream, rule #3 applies). However, if the filter step has an explicit true and false step instead of just an implicit true step, its target steps cannot run in multiple copies.
    4a. If you expect a small number of false rows and you want to keep your multi-copy pipelines for the true case, use a Data Validator step instead, with error handling configured to send the false rows to their destination. Data Validator can run in multiple copies and have a main target with multiple copies.
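
To make rules #2 and #3 concrete, here is a rough sketch for sanity-checking a planned layout before changing any copy counts. It is not Kettle code and does not use any Pentaho API. The `Step` class, the function names, and the example layout are all hypothetical, and it assumes each step copy runs in its own thread, so the thread total is simply the sum of the copy counts.

```python
from dataclasses import dataclass


@dataclass
class Step:
    # Hypothetical stand-in for a transformation step; not a Pentaho class.
    name: str
    copies: int = 1


def total_step_threads(steps):
    # Assumption: one thread per step copy, so the total is the sum of copies.
    return sum(step.copies for step in steps)


def mismatched_hops(steps, hops):
    # A hop whose source and target run a different number of copies is a
    # place where pipelines merge or split (rule #3), so it is worth a look.
    copies = {step.name: step.copies for step in steps}
    return [(src, dst) for src, dst in hops if copies[src] != copies[dst]]


# Example layout: two readers, two filter copies, one writer.
steps = [
    Step("Text File Input", copies=2),
    Step("Filter", copies=2),
    Step("Table Output", copies=1),
]
hops = [("Text File Input", "Filter"), ("Filter", "Table Output")]

print(total_step_threads(steps))    # 5
print(mismatched_hops(steps, hops)) # [('Filter', 'Table Output')]
```

Here the reader and the filter keep their two dedicated pipelines, and the only merge point is the hop into the single-copy output step. Whether that merge matters in practice still has to be measured, as in rule #1.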
