Hitachi Vantara Pentaho Community Forums

Thread: Disabling a branch in a transformation

  1. #1
    Join Date
    Aug 2016
    Posts
    289

    Default Disabling a branch in a transformation

    After experimenting with some "improved" stream/branch disable functionality, my big data transformation now suffers from congestion and comes to a halt! The transformation reads a single file and writes statistics to fact tables. There are more than 22,000,000 rows in total in this file, which means everything has to run smoothly and fast, or rows start to pile up.

    Some sub-streams should be disabled depending on arguments given at start. The straightforward way to do this in Spoon is to:

    1) Add a "Get Variables" step, and add the variable that decides whether the branch should be disabled or not.
    2) Add a "Filter Rows" step that filters on the stream field set above. True: continue the stream. False: disable the stream.

    However, this means the same constant field is added 22,000,000 times in step 1) above, and the logical comparison is then done 22,000,000 times in step 2). That's 21,999,999 more times than necessary!

    So I tried to make my own Java code that tests only once:

    Code:
    public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException
    {
        if (first)
        {
            first = false;
            // Disable the stream/branch if the ENABLE_BRANCH variable is not 'Y' (yes)
            String enable = getVariable("ENABLE_BRANCH", "NULL");
            if (!enable.equals("Y"))
            {
                setOutputDone();
                return false;
            }
        }

        Object[] r = getRow();

        if (r == null)
        {
            setOutputDone();
            return false;
        }

        r = createOutputRow(r, data.outputRowMeta.size());
        putRow(data.outputRowMeta, r);
        return true;
    }
    This works excellently with small data. The sub-branch to be disabled is immediately green/finished, even before receiving the first row! But the transformation freezes with big data. Why? What am I missing? It seems like the steps upstream are still trying to send rows somehow. Why would they try to send rows when this step has already executed setOutputDone() and returned false?

    I wish the filter step would accept variables!
    Last edited by Sparkles; 04-12-2019 at 12:41 PM.

  2. #2
    Join Date
    Apr 2008
    Posts
    4,689

    Default

    The upstream steps will always send the data. Your branch has to keep reading rows (it can simply discard them), or the row buffers between the steps fill up and the transformation stalls.
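    In Kettle, each hop is backed by a bounded row buffer (10,000 rows by default, the "Nr of rows in rowset" transformation setting), and the upstream step's putRow() blocks once that buffer is full; setOutputDone() only signals your step's output, it does not tell the producer to stop. A minimal sketch of that effect, using a plain BlockingQueue as a stand-in for a Kettle row set (RowSetDemo and its names are illustrative, not Kettle API):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

public class RowSetDemo {
    // Push rows into the buffer until it stops accepting them; returns the
    // number of rows accepted. A real putRow() blocks forever instead of
    // timing out, which is the freeze described in the thread.
    static int fillUntilBlocked(BlockingQueue<Integer> rowSet) {
        int produced = 0;
        try {
            while (rowSet.offer(produced, 100, TimeUnit.MILLISECONDS)) {
                produced++;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return produced;
    }

    public static void main(String[] args) {
        // Stand-in for one hop's bounded row buffer (PDI default: 10,000 rows)
        BlockingQueue<Integer> rowSet = new ArrayBlockingQueue<>(100);
        // The downstream step has "finished" and never calls getRow(), so
        // nothing drains the buffer and the upstream step gets stuck.
        System.out.println("Upstream blocked after " + fillUntilBlocked(rowSet) + " rows");
        // prints "Upstream blocked after 100 rows"
    }
}
```

    With small data the buffer never fills, which is why the branch appears to shut off cleanly; with 22,000,000 rows the producer blocks as soon as the first buffer-full of rows piles up.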

    What if you refactor a little bit?

    Code:
    private String enable;

    public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException
    {
        if (first)
        {
            first = false;
            // Read the ENABLE_BRANCH variable once; keep it in a field so it
            // is visible on every subsequent call to processRow()
            enable = getVariable("ENABLE_BRANCH", "NULL");
        }

        Object[] r = getRow();

        if (r == null)
        {
            setOutputDone();
            return false;
        }

        // Always consume the incoming row, but only pass it on when enabled
        if (enable.equals("Y"))
        {
            putRow(data.outputRowMeta, r);
        }
        return true;
    }
    WARNING! Untested code above.
    I am not a coder, and I don't know whether the changes above will work, let alone improve throughput.

  3. #3
    Join Date
    Aug 2016
    Posts
    289

    Default

    I think you're right. That should most probably work without affecting performance. It still bothers me that a sub-stream can't be shut off immediately, though. Your solution looks like a middle ground when it comes to performance.

    The step prior to this one splits into multiple sub-streams. It should of course continue sending data, but I think the problem is that it keeps sending data to the sub-stream that is disabled, and this UDJC step, having finished, no longer consumes the incoming rows.
    Last edited by Sparkles; 04-15-2019 at 06:34 AM.

  4. #4
    Join Date
    Jan 2015
    Posts
    107

    Default

    If you know at start time which stream(s) you want, why not make separate transformations for the common cases and launch the correct one either by name or from a parent job?

    Performance-wise, starting a transformation with only the steps you need would be optimal. The downside is increased maintenance if you have many combinations of sub-streams or make a lot of changes to the logic.
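    The dispatch itself is just a mapping from the start-time argument to a transformation filename, which the parent job can then launch. A minimal sketch in plain Java; BranchPicker and the .ktr filenames are invented for illustration, and in PDI itself you would pass the chosen name to a "Transformation" job entry via a variable:

```java
import java.util.Map;

public class BranchPicker {
    // Map each start-time argument to a purpose-built transformation file.
    // Filenames are illustrative only.
    static final Map<String, String> TRANSFORMATIONS = Map.of(
            "Y", "load_with_stats_branch.ktr",
            "N", "load_without_stats_branch.ktr");

    // Unknown arguments fall back to the transformation without the branch.
    static String pick(String enableBranch) {
        return TRANSFORMATIONS.getOrDefault(enableBranch, TRANSFORMATIONS.get("N"));
    }

    public static void main(String[] args) {
        String enable = args.length > 0 ? args[0] : "N";
        System.out.println("Launching: " + pick(enable));
    }
}
```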

  5. #5
    Join Date
    Aug 2016
    Posts
    289

    Default

    Thanks for sharing your thoughts. It did cross my mind, but as you point out, the problem is code duplication, and it could lead to a large number of transformation files for multiple combinations.


Copyright © 2005 - 2019 Hitachi Vantara Corporation. All Rights Reserved.