Hitachi Vantara Pentaho Community Forums

Thread: Loops in jobs

  1. #1
    Matt Casters Guest

    Loops in jobs

    Hi Kettle devs,

    It occurred to me earlier, and more recently to others, that creating
    loops in jobs is a somewhat cumbersome process.
    So perhaps we can line up the top 5 most common use-cases and find
    ease-of-use solutions for those?

    One use-case is where we loop over a DB result set (query): copy the rows to
    result, set variables and use those for each row in the result set.
    In that specific case I imagine we could wrap a job entry around the "Table
    Input" step, execute that and copy the rows to result, all in one job entry.
    Setting the variables is something we could cram into the "Transformation"
    or "Job" job entries without too much of a problem.
    That would mean we could eliminate 2 transformations: one to get the result
    set and one to set the variables inside the loop. All that remains are 2
    job entries: "Table Input" and "Transformation/Job".
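
    A minimal Python sketch of this use-case, not Kettle code: it runs a
    query, turns each row into a set of variables and calls a job once per
    row. The function name loop_over_query, the customers table and its
    columns are all hypothetical and only serve the illustration.

```python
import sqlite3
from typing import Callable, Dict, List


def loop_over_query(conn: sqlite3.Connection, sql: str,
                    job: Callable[[Dict[str, str]], None]) -> int:
    """Run `job` once per result row, exposing each column as a variable."""
    cursor = conn.execute(sql)
    columns = [d[0] for d in cursor.description]
    count = 0
    for row in cursor:
        # "Set variables": one variable per column, named after the column.
        variables = {name: str(value) for name, value in zip(columns, row)}
        job(variables)
        count += 1
    return count


# Hypothetical data: one row per customer to process.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, region TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "EU"), (2, "US")])

seen: List[Dict[str, str]] = []  # the "job" here just records its variables
processed = loop_over_query(
    conn, "SELECT id, region FROM customers ORDER BY id", seen.append)
```

    The point of the sketch is that the query, the row loop and the variable
    setting live in one place, which is what the proposed job entries would
    do in place of the two helper transformations.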

    So do me (and our users:-) a favor and let us know your most common use-case
    for loops in a job.
    If there is a pattern, we could perhaps come up with a smarter way of
    doing this than writing N new job entries for "Table Input", "Text
    File Input" and so on.

    Thanks in advance!

    Matt
    --
    Matt Casters <mcasters (AT) pentaho (DOT) org>
    Chief Data Integration, Kettle founder, Author of Pentaho Kettle
    Solutions<http://www.amazon.com/Pentaho-Kettle-Solutions-Building-Integration/dp/0470635177>
    (Wiley <http://eu.wiley.com/WileyCDA/WileyTitle/productCd-0470635177.html>)
    Pentaho : The Commercial Open Source Alternative for Business Intelligence

    --
    You received this message because you are subscribed to the Google Groups "kettle-developers" group.
    To post to this group, send email to kettle-developers (AT) googlegroups (DOT) com.
    To unsubscribe from this group, send email to kettle-developers+unsubscribe (AT) g...oups (DOT) com.
    For more options, visit this group at http://groups.google.com/group/kettle-developers?hl=en.

  2. #2
    Roland Bouman Guest

    Re: Loops in jobs

    There is one particular case I use in production, which is best
    described as a "crawler". I'm not sure it could be cleanly abstracted
    but I'll just throw it in the pot.

    I have a website and I want to take it "offline" - get all pages, and
    the pages they link to, and so on - either until I reach a predefined
    depth, or until I have found all links.

    To control the loop, I have a configuration (.properties) file
    containing some specifics such as the root URL (and credentials) where
    I can get the pages from, as well as the max_depth.
    In addition to that I have a separate "initial-links" file that lists
    all URLs that serve as starting points (so I can lift several pages in
    the same job run).

    The looping is controlled by two things:
    1) I keep track of the current depth in a root-level variable.

    2) There are three additional "scratchpad" files: a "current-links" file to
    keep track of the links I need to examine during the current
    iteration, an "all-links" file to keep track of the links I tracked
    down already in any previous iterations, and a "new-links" file to
    store any links I discover during the current iteration.

    The loop itself is then implemented as such:

    * before the loop, I set the depth variable to 0, overwrite the
    "current-links" file with the "initial-links" file, and empty the
    "all-links" file.
    * for each iteration, I do:
    1) check if the depth variable is less than or equal to the max_depth
    configuration value. If not, we're done looping.
    2) process all pages pointed to by the links in the "current-links" file.
    If I discover any links in those pages, I store them in the
    "new-links" file.
    3) do a diff between the "all-links" and the "new-links" files to
    discard any "new-links" that were already processed. The really new
    links are then dumped to the "current-links" file.
    4) check if the last iteration yielded any new links (i.e. check if
    "current-links" is not empty). If it is empty, we're done looping. If not, I
    increase the depth variable by 1 and re-enter at step 1).
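
    The steps above can be sketched in Python as follows. This is an
    illustration only, not the actual job: in-memory sets stand in for the
    three scratchpad files, extract_links is a hypothetical callback for the
    page processing, and the moment at which processed links are added to
    "all-links" is an assumption, since the description does not spell it out.

```python
from typing import Callable, Dict, Iterable, List, Set


def crawl(initial_links: Iterable[str],
          extract_links: Callable[[str], Iterable[str]],
          max_depth: int) -> Set[str]:
    """Breadth-first crawl that stops at max_depth or when no new links appear."""
    depth = 0
    current_links = set(initial_links)   # stands in for the "current-links" file
    all_links: Set[str] = set()          # stands in for the "all-links" file

    while depth <= max_depth:            # step 1: depth check
        new_links: Set[str] = set()      # stands in for the "new-links" file
        for link in current_links:       # step 2: process the current pages
            new_links.update(extract_links(link))
        # Assumption: processed links are recorded in all-links at this point.
        all_links |= current_links
        current_links = new_links - all_links   # step 3: keep only unseen links
        if not current_links:            # step 4: nothing new, so stop
            break
        depth += 1
    return all_links


# Hypothetical site: each key links to the pages in its list.
site: Dict[str, List[str]] = {
    "a": ["b", "c"], "b": ["c", "d"], "c": [], "d": ["a", "e"], "e": [],
}
shallow = crawl(["a"], lambda link: site.get(link, []), max_depth=1)
full = crawl(["a"], lambda link: site.get(link, []), max_depth=10)
```

    With max_depth=1 the crawl processes "a" and the pages it links to; with
    a larger limit it ends on its own once an iteration yields no unseen links.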

    Maybe there are simpler ways to do it but this is what I have now and
    it works quite well.

    I don't know if anyone thinks this pattern is useful, but if so, the
    challenge will be to generalize it. I mean, I can see how this would
    already be very useful for the typical html website case with <a>,
    <img>, <script> and <link> tags ( and friends). But in my use case,
    the pages are not really html pages but xml, and the links are not
    really links but identifying numbers from which my transformation
    logic can derive urls for new xml files.

    Anyway - just an example where I rely on loops.

    kind regards,

    Roland


  3. #3
    Matt Casters Guest

    Re: Loops in jobs

    Thanks Roland!
    When I read the description I was actually expecting a job that looked way
    more complex, so it's not all that bad.

    Also thanks to the other folks that sent in use-cases in private, sometimes
    with complete jobs attached. It's highly appreciated.

    From all the use-cases I think I saw a few things that always came back with
    respect to loops:
    1) Get a bunch of rows (from a table or a file) and copy those rows to
    result
    2) Loop over the rows with a job. Setting a bunch of variables is the first
    thing you do in the job

    For 1) I'm thinking of something dramatic to help out, in the form of a
    dynamic transformation builder.
    For 2) it would be much easier if you could set the variables in the
    "Transformation" or "Job" job entries themselves. I think it makes a lot of
    sense to do it there anyway.

    I'll wrap these up in 2 JIRA cases and I'll try to get something going for
    them as soon as possible so you can all give more feedback.

    Best regards,

    Matt


  4. #4
    Roland Bouman Guest

    Re: Loops in jobs

    Hi Matt, all!

    On Fri, May 6, 2011 at 1:44 PM, Matt Casters <mcasters (AT) pentaho (DOT) org> wrote:
    > When I read the description I was actually expecting a job that looked way
    > more complex, so it's not all that bad.


    True. I got the entire thing done in a day (including the logic inside the
    transformations); I had expected it to take longer.

    > From all the use-cases I think I saw a few things that always came back with
    > respect to loops:
    > 1) Get a bunch of rows (from a table or a file) and copy those rows to
    > result
    > 2) Loop over the rows with a job. Setting a bunch of variables is the first
    > thing you do in the job


    Not sure if that was clear from my example, but I designed mine
    explicitly to avoid a "copy rows to result" step. I do everything with
    my scratchpad files inside transformations.
    I did this on purpose: I wasn't sure how large the result sets could
    be. The number of links has a tendency to explode quite rapidly, and I
    assumed it would give me better performance and scalability to keep
    all the row processing inside the transformations (probably at the
    expense of more I/O, but in my case the crawling is the limiting factor,
    not reading links from the files).

    kind regards,

    Roland.

  5. #5
    Brandon Jackson Guest

    Re: Loops in jobs

    We have two different scenarios where we get 'layered' ETL.

    1. We perfect an idea and it becomes a module, then it is wrapped up in a
    bigger process which loops it.
    2. We gather some inputs, and loop jobs containing transforms on the initial
    input. (The most mentioned case here).

    We could chalk this up to the development lifecycle, where we should revisit
    our design and try to make it more sequential than layered.

    One area where I struggle is when something has been in production for a
    while and we try moving to a later version of PDI and encounter problems.
    Due to the layering, it takes time and effort to prepare the transformation
    to be retested as an individual unit again. I have to add variables and set
    them, and I must make sure I delete them again when reinserting the piece
    into the production ETL. I also have to gather up all the inputs, which may
    be determined in the layers above, to work towards diagnosing, reproducing
    and correcting the issue.

    Example: we have PDI 4.0.1-CE. It works fantastically for a transform working
    with text files and DB lookups to process therapy transactions from a sister
    company. It completes in 1.5 minutes. I try it in any version after 4.0.1
    and it takes so long that I've never seen it complete. The logs don't show
    errors, because there are no errors. From Spoon, I cannot dive down into
    the running job because an iteration at a lower level would already have
    finished. I've enabled logs in the transform and nothing gets
    generated. So unless I either do not understand logs or tear the transform
    completely apart and build it up again, there is no hope of quickly zeroing
    in on the issue.

    Possible solutions come to mind:

    Basically a way to switch into and out of testing mode. This implies more
    than 'row output' or 'debug'. I want variable setting, activation of steps
    that feed dummy data into the transform and special notes. When one exits
    'testing mode', then those dummy data steps, variables and notes all are
    disabled and disappear neatly away.

    Just an idea. I have little pride. Please feel free to point me to
    specific education resources if my suggestions reveal a lack of
    understanding of some functionality that exists in PDI currently to help me
    over the hurdle.

    PS: Although it has improved greatly, visually and functionally, I still
    find logging and performance management one of the most confusing aspects of
    PDI, especially when ETL layers up and is hard to execute on a transform by
    transform basis.

    Thanks for the great product and the commitment to constant and never ending
    improvement.

    Brandon





  6. #6
    Jens Bleuel Guest

    Re: Loops in jobs

    A little bit unusual, but it happened:

    A) Have a looping option in a transformation.
    Yes, I know this is normally out of scope for transformations, but the
    following use case exists for continuous data loading / real-time
    processing:
    - Query JMS or other queues with some sort of Input step
    - This will stop when all the available data is processed (or, for JMS,
    when a timeout is reached, or it just continues indefinitely - this is the
    only step I know where this "infinite continue" is implemented)
    - Now imagine that after a specific amount of time new data arrives and
    needs to be processed asap.

    Currently you may need to restart the transformation, which costs
    performance, and you need looping logic outside of the transformation; all
    this leads to a small delay.

    We have for example the JMS consumer step that can read continuously. A
    problem here is: When you want to stop the transformation in a
    controlled way, this is not possible at this time since all steps get
    the signal to stop and rows may silently disappear. A feature request
    for this: react to a signal that is sent only to this step and stop
    processing in a controlled way. A JIRA needs to be created but in the
    project where this was found, a loss of some rows (up to the buffer
    size) is not critical, really...

    From my point of view this request leads to an interesting loop design
    proposal for transformations:
    1) Have an option for some input steps to just restart after they are
    finished.
    2) The restart may be delayed for a specific amount of time.
    3) This step needs to listen to a specific signal to stop. This is
    different from stopping the transformation.
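
    The three points of this proposal can be sketched in Python as below.
    This only models the behaviour and is not Kettle code: the function
    run_restartable_input, the read_available callback and the simulated
    queue contents are hypothetical, and a threading.Event plays the role of
    the step-specific stop signal.

```python
import threading
from typing import Callable, List


def run_restartable_input(read_available: Callable[[], List[str]],
                          handle_row: Callable[[str], None],
                          stop_signal: threading.Event,
                          restart_delay: float) -> int:
    """Re-run an input step until stop_signal is set; returns the number of runs."""
    runs = 0
    while not stop_signal.is_set():
        # One "run" of the step: drain whatever is available right now.
        for row in read_available():
            handle_row(row)          # a started batch is always finished
        runs += 1
        # Delay the restart, but wake up immediately if the stop signal arrives.
        stop_signal.wait(restart_delay)
    return runs


# Simulated queue: each list is what one poll returns.
pending = [["m1", "m2"], [], ["m3"]]
stop = threading.Event()
received: List[str] = []


def read_available() -> List[str]:
    if not pending:
        stop.set()                   # demo only: request a stop when the data runs out
        return []
    return pending.pop(0)


runs = run_restartable_input(read_available, received.append, stop,
                             restart_delay=0.01)
```

    The stop signal is only checked between runs, so rows that were already
    read are handed on before the step ends, which is the controlled stop
    asked for above.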

    B) This type of looping option is actually possible within jobs with the
    Start job entry (repeat functionality), although product management marked
    this feature as deprecated a while ago. The recommendation was to
    restart and loop by the scheduler or an external process. But for the
    reasons given above (overhead, delays and even avoiding overlapping job
    runs), I think the features of the Start job entry are still
    valid, especially since the link between the scheduler and monitoring is
    not there yet.
    Adding a listener for a specific signal to stop a start job in a
    controlled way, while keeping the repeat option, would be very nice to have.

    If real looping logic were implemented within a job, the delay or a
    restart of a transformation might be acceptable in the above scenario.

    For looping we may think of a "for/next" job entry implementation with
    some options like:
    - maximum number of iterations
    - idle time before a next cycle
    - some conditions to check if it should continue or not (I know the
    phrase "some conditions" may be a wide range, e.g. variables to check or
    checking a date/time range)
    - a break option to end the "for/next" loop prematurely
    - nested "for/next" should be allowed thus we may need an ID to reference
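
    As an illustration of how these options could fit together, here is a
    Python sketch. It is not a proposed Kettle implementation; the names
    for_next and BreakLoop and all parameter names are hypothetical. A break
    carries an optional loop ID so that a nested loop can end an outer one.

```python
import time
from typing import Callable, List, Optional


class BreakLoop(Exception):
    """Raised by a loop body to end a for/next loop early."""

    def __init__(self, loop_id: Optional[str] = None) -> None:
        super().__init__(loop_id)
        self.loop_id = loop_id       # None means "the innermost loop"


def for_next(loop_id: str,
             body: Callable[[int], None],
             max_iterations: int,
             idle_seconds: float = 0.0,
             should_continue: Callable[[int], bool] = lambda i: True) -> int:
    """Run body(i) up to max_iterations times; returns the iterations completed."""
    completed = 0
    for i in range(max_iterations):
        if not should_continue(i):   # the "some conditions" check
            break
        try:
            body(i)
        except BreakLoop as signal:
            if signal.loop_id not in (None, loop_id):
                raise                # the break targets an outer loop
            break
        completed += 1
        if idle_seconds and i < max_iterations - 1:
            time.sleep(idle_seconds) # idle time before the next cycle
    return completed


log: List[int] = []


def inner_body(j: int) -> None:
    if j == 2:
        raise BreakLoop("outer")     # leave both loops via the outer loop's ID
    log.append(j)


def outer_body(i: int) -> None:
    for_next("inner", inner_body, max_iterations=5)


outer_done = for_next("outer", outer_body, max_iterations=3)
limited = for_next("plain", lambda i: None, max_iterations=10,
                   should_continue=lambda i: i < 4)
```

    In the nested example the inner loop records 0 and 1, then breaks out of
    the outer loop, so the outer loop completes no iteration.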

    Those are my thoughts for now...

    Cheers,
    Jens


  7. #7
    Matt Casters Guest

    Re: Loops in jobs

    Namaste Jens,

    I didn't know that anyone "deprecated" looping under the "Start" job entry.
    There must have been a good reason to do so. However, I hereby officially
    "un-deprecate" it for 4.2.0 after I did extensive testing to make sure no
    memory leaks remain.

    http://jira.pentaho.com/browse/PDI-5502

    The first job entry to use the logic you described is the "HL7 MLLP Input"
    job entry. That one gets a single record from an HL7 queue and passes the
    message to the other job entries. It's very fast too ;-)

    The "re-start without stop" logic of steps will actually be possible with
    the "Single Threaded Transformation" execution engine. My plan was to drop
    that one behind a mapping-like step but it will become a separate step.
    The way that it will work is that you have, say, 1000 rows entering the new
    engine, which will process the rows in batches of 1-N.
    Certain steps will be restarted every time, like file reading steps or steps
    that read a single element off a queue somewhere (your JMS sample). Most
    steps will simply keep running. In a single call of the
    SingleThreadedTransExecutor.oneIteration(), all N rows in the batch are
    pushed through the transformation.
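
    The batching idea can be modelled with a short Python sketch. This is a
    toy model and not the Kettle engine: the SingleThreadedPipeline class,
    its one_iteration method and the two example steps are hypothetical and
    only echo the behaviour described above, where one call pushes a whole
    batch through every step.

```python
from typing import Callable, Iterable, Iterator, List


def batches(rows: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Split an incoming row stream into batches of at most batch_size rows."""
    batch: List[dict] = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


class SingleThreadedPipeline:
    """Toy model: every step handles the whole batch before the next step runs."""

    def __init__(self, steps: List[Callable[[dict], dict]]) -> None:
        self.steps = steps
        self.output: List[dict] = []
        self.iterations = 0

    def one_iteration(self, batch: List[dict]) -> None:
        for step in self.steps:
            batch = [step(row) for row in batch]
        self.output.extend(batch)
        self.iterations += 1


pipeline = SingleThreadedPipeline([
    lambda row: {**row, "doubled": row["n"] * 2},
    lambda row: {**row, "label": f"row-{row['n']}"},
])
# 1000 incoming rows, processed in batches of up to 300 rows each.
for batch in batches(({"n": n} for n in range(1000)), batch_size=300):
    pipeline.one_iteration(batch)
```

    The steps stay alive between calls, so from the outside each call looks
    like one pass of a loop over the incoming batches.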

    Up until now I hadn't considered this to be a loop, but you are right, this
    would work for your situation.

    In the meantime I created http://jira.pentaho.com/browse/PDI-6157 and
    implemented the work. Actually, Sven Boden already did so years ago when he
    implemented parameter support in the Job and Transformation entries.

    Cheers,
    Matt


    2011/5/6 Jens Bleuel <jbleuel (AT) pentaho (DOT) com>

    > A little bit unusual, but it happened:
    >
    > A) Have a looping option in a transformation.
    > Yes, I know this is normally out of scope for transformations, but the
    > following use case is there in regards to a continues data load / real time
    > processing:
    > - Query JMS or other queues with some sort of Input step
    > - This will stop when all the available data is processed (or for JMS until
    > a timeout is reached or just continue infinite - this is the only step I
    > know where this "infinite continue" is implemented)
    > - Now imagine, after a specific amount of time new data arrive and need to
    > be processed asap.
    >
    > Actually you may need to restart the transformation what costs performance,
    > need a looping logic outside of the transformation and all this leads to a
    > small delay.
    >
    > We have for example the JMS consumer step that can read continuously. A
    > problem here is: When you want to stop the transformation in a controlled
    > way, this is not possible at this time since all steps get the signal to
    > stop and rows may silently disappear. A feature request for this: react to a
    > signal that is sent only to this step and stop processing in a controlled
    > way. A JIRA needs to be created but in the project where this was found, a
    > loss of some rows (up to the buffer size) is not critical, really...
    >
    > From my point of view this request leads to an interesting loop design
    > proposal for transformations:
    > 1) Have an option for some input steps to just restart after they are
    > finished.
    > 2) The restart may be delayed for a specific amount of time.
    > 3) This step needs to listen to a specific signal to stop. This is
    > different from stopping the transformation.
    >
    > B) This type of looping option is actually possible within jobs with the
    > Start job entry (repeat functionality) whereas product management set this
    > feature to deprecated since a while. The recommendation was to restart and
    > loop by the scheduler or external process. But out of the above given
    > reasons (overhead, delays and even avoid overlapping job runs), I still
    > think the features of the start job entry are still valid. Especially since
    > the link between the scheduler and monitoring is not given, yet.
    > Adding the listener for a specific signal to stop a start job in a
    > controlled way and keep the repeat option, would be very nice to have.
    >
    > When a real looping logic within a job would be realized the delay or a
    > restart of a transformation may be acceptable in the above scenario.
    >
    > For looping we may think of a "for/next" job entry implementation with some
    > options like:
    > - maximum number of iterations
    > - idle time before a next cycle
    > - some conditions to check if it should continue or not (I know the phrase
    > "some conditions" may be a wide range, e.g. variables to check or checking a
    > date/time range)
    > - a break option to end the "for/next" loop premature
    > - nested "for/next" should be allowed thus we may need an ID to reference
    >
    > That are my thoughts for now...
    >
    > Cheers,
    > Jens
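
The "for/next" options listed above could be sketched roughly as follows. This is plain Java, not Kettle code: the names ForNextSketch, LoopOptions and runLoop are made up for illustration, and no such job entry is confirmed to exist in this thread. The sketch only shows how the iteration limit, idle time, continue condition and break option might interact.

```java
import java.util.function.IntPredicate;

class ForNextSketch {

    /** Hypothetical options, one field per bullet in the proposal. */
    static final class LoopOptions {
        final String id;                      // lets nested loops be told apart
        final int maxIterations;              // maximum number of iterations
        final long idleMillis;                // idle time before the next cycle
        final IntPredicate continueCondition; // e.g. a variable or date/time check

        LoopOptions(String id, int maxIterations, long idleMillis,
                    IntPredicate continueCondition) {
            this.id = id;
            this.maxIterations = maxIterations;
            this.idleMillis = idleMillis;
            this.continueCondition = continueCondition;
        }
    }

    /**
     * Runs the body until the iteration limit, the continue condition or a
     * break ends the loop. The body returns false to break out early.
     * Returns the number of cycles that were started.
     */
    static int runLoop(LoopOptions options, IntPredicate body) {
        int cycles = 0;
        for (int i = 0; i < options.maxIterations; i++) {
            if (!options.continueCondition.test(i)) {
                break; // condition says: do not continue
            }
            if (i > 0 && options.idleMillis > 0) {
                try {
                    Thread.sleep(options.idleMillis); // idle before the next cycle
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    break; // treat an interrupt as a controlled stop
                }
            }
            cycles++;
            if (!body.test(i)) {
                break; // the "break option"
            }
        }
        return cycles;
    }

    public static void main(String[] args) {
        LoopOptions options = new LoopOptions("outer", 5, 0L, i -> true);
        // The body asks to break during its third cycle (i == 2).
        int cycles = runLoop(options, i -> i < 2);
        System.out.println(options.id + " ran " + cycles + " cycles");
    }
}
```

Treating an interrupt as a stop request is one possible reading of the "specific signal to stop" idea; a real job entry would need its own signalling mechanism.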




  8. #8
    Jens Bleuel Guest

    Default Re: Loops in jobs

    Wow, amazing good news!

    On 06.05.2011 22:29, Matt Casters wrote:
    > Namaste Jens,
    >
    > I didn't know that anyone "deprecated" looping under the "Start" job
    > entry. There must have been a good reason to do so. However, I hereby
    > officially "un-deprecate" it for 4.2.0 after I did extensive testing to
    > make sure no memory leaks remain.
    >
    > http://jira.pentaho.com/browse/PDI-5502
    >
    > The first job entry to use the logic you described is the "HL7 MLLP
    > Input" job entry. That one gets a single record from a HL7 queue and
    > passes the message to the other job entries. It's very fast too ;-)
    >
    > The "re-start without stop" logic of steps will actually be possible
    > with the "Single Threaded Transformation" execution engine. My plan was
    > to drop that one behind a mapping-like step but it will become a
    > separate step.
    > The way that it will work is that you have say 1000 rows entering the
    > new engine. That will process the rows in batches of 1-N.
    > Certain steps will be restarted every time, like file reading steps or
    > steps that read a single element off a queue somewhere (your JMS
    > sample). Most steps will simply keep running. In a single call of the
    > SingleThreadedTransExecutor.oneIteration(), all N rows in the batch are
    > pushed through the transformation.
    >
    > Up until now I hadn't considered this to be a loop, but you are right,
    > this would work for your situation.
    >
    > In the meantime I created http://jira.pentaho.com/browse/PDI-6157 and
    > implemented the work. Actually, Sven Boden already did it years ago when
    > he implemented parameter support in the Job and Transformation entries.
    >
    > Cheers,
    > Matt

  9. #9
    Matt Casters Guest

    Default Re: Loops in jobs

    I finished the single threading engine last night with the encouragement of
    the gang on ##pentaho (thanks for that):

    http://www.ibridge.be/?p=200

    Take care,
    Matt


  10. #10
    Matt Casters Guest

    Default Re: Loops in jobs

    To come back to Jens' loop question inside a transformation...

    A step can choose to implement the "batchComplete()" method.
    For example, the Sort rows step implements this.
    If you send rows in batches of 500 into the "single threader" step, they
    will be sorted in blocks of 500.

    /**
     * Calling this method will alert the step that we finished passing a batch of records to the step.
     * Specifically for steps like "Sort Rows" it means that the buffered rows can be sorted and passed on.
     * @throws KettleException In case an error occurs during the processing of the batch of rows.
     */
    public void batchComplete() throws KettleException;

    That is, the step needs to know that no more rows are coming in the
    current batch; otherwise the transformation would block. In this example
    only 500 rows arrive per batch, and the transformation is never considered
    Finished until the parent transformation is finished.
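
To make the batch behaviour concrete, here is a minimal sketch in plain Java. It does not use the Kettle API: BufferingSorter and runInBatches are hypothetical stand-ins, and only the name batchComplete() is taken from the interface method quoted above. It shows a buffering step that holds rows until it is told the batch is complete, then sorts and passes them on, so the output is sorted per block rather than globally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class BatchSortSketch {

    /** Stand-in for a buffering step such as "Sort rows"; not the Kettle class. */
    static final class BufferingSorter {
        private final List<Integer> buffer = new ArrayList<>();
        private final List<Integer> output = new ArrayList<>();

        /** Rows are only buffered; nothing is emitted yet. */
        void processRow(int row) {
            buffer.add(row);
        }

        /** Same idea as batchComplete(): sort the buffered rows and pass them on. */
        void batchComplete() {
            Collections.sort(buffer);
            output.addAll(buffer);
            buffer.clear();
        }

        List<Integer> getOutput() {
            return output;
        }
    }

    /** Feeds rows in blocks of batchSize and signals the end of every block. */
    static List<Integer> runInBatches(List<Integer> rows, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        BufferingSorter sorter = new BufferingSorter();
        for (int start = 0; start < rows.size(); start += batchSize) {
            int end = Math.min(start + batchSize, rows.size());
            for (int row : rows.subList(start, end)) {
                sorter.processRow(row);
            }
            // Without this call the buffered rows would never leave the step.
            sorter.batchComplete();
        }
        return sorter.getOutput();
    }

    public static void main(String[] args) {
        List<Integer> rows = Arrays.asList(5, 3, 4, 2, 1, 0);
        System.out.println(runInBatches(rows, 3)); // [3, 4, 5, 0, 1, 2]
    }
}
```

With a batch size of 3, each block of three rows comes out sorted, but the blocks are not merged with each other, which matches the "sorted in blocks of 500" behaviour described above.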

    Anyway, have fun with it!

    Matt



    2011/5/7 Matt Casters <mcasters (AT) pentaho (DOT) org>

    > I finished the single threading engine last night with the encouragements
    > of the gang on ##pentaho (thanks for that):
    >
    > http://www.ibridge.be/?p=200
    >
    > Take care,
    > Matt
    >
    >
    > 2011/5/7 Jens Bleuel <jbleuel (AT) pentaho (DOT) com>
    >
    >> Wow, amazing good news!
    >>
    >> Am 06.05.2011 22:29, schrieb Matt Casters:
    >>
    >>> Namaste Jens,
    >>>
    >>> I didn't know that anyone "deprecated" looping under the "Start" job
    >>> entry. There must have been a good reason to do so. However, I hereby
    >>> officially "un-deprecate" it for 4.2.0 after I did extensive testing to
    >>> make sure no memory leaks remain.
    >>>
    >>> http://jira.pentaho.com/browse/PDI-5502
    >>>
    >>> The first job entry to use the logic you descibed is the "HL7 MLLP
    >>> Input" job entry. That one gets a single record from a HL7 queue and
    >>> passes the message to the other job entries. It's very fast too ;-)
    >>>
    >>> The "re-start without stop" logic of steps will actually be possible
    >>> with the "Single Threaded Transformation" execution engine. My plan was
    >>> to drop that one behind a mapping-like step but it will become a
    >>> separate step.
    >>> The way that it will work is that you have say 1000 rows entering the
    >>> new engine. That will process the rows in batches of 1-N.
    >>> Certain steps will be restarted every time, like file reading steps or
    >>> steps that read a single element off a queue somewhere (your JMS
    >>> sample). Most steps will simply keep running. In a single call of the
    >>> SingleThreadedTransExecutor.oneIteration(), all N rows in the batch are
    >>> pushed through the transformation.
    >>>
    >>> Up until now I hadn't considered this to be a loop, but you are right,
    >>> this would work for your situation.
    >>>
    >>> In the mean time I created http://jira.pentaho.com/browse/PDI-6157 and
    >>> implemented the work. Actually Sven Boden did years ago when he
    >>> implemented parameter support in the Job and Transformation entries.
    >>>
    >>> Cheers,
    >>> Matt
    >>>
    >>>
    >>> 2011/5/6 Jens Bleuel <jbleuel (AT) pentaho (DOT) com <mailto:jbleuel (AT) pentaho (DOT) com>>
    >>>
    >>>
    >>> A little bit unusual, but it happened:
    >>>
    >>> A) Have a looping option in a transformation.
    >>> Yes, I know this is normally out of scope for transformations, but
    >>> the following use case is there in regards to a continues data load
    >>> / real time processing:
    >>> - Query JMS or other queues with some sort of Input step
    >>> - This will stop when all the available data is processed (or for
    >>> JMS until a timeout is reached or just continue infinite - this is
    >>> the only step I know where this "infinite continue" is implemented)
    >>> - Now imagine, after a specific amount of time new data arrive and
    >>> need to be processed asap.
    >>>
    >>> Actually you may need to restart the transformation what costs
    >>> performance, need a looping logic outside of the transformation and
    >>> all this leads to a small delay.
    >>>
    >>> We have for example the JMS consumer step that can read
    >>> continuously. A problem here is: When you want to stop the
    >>> transformation in a controlled way, this is not possible at this
    >>> time since all steps get the signal to stop and rows may silently
    >>> disappear. A feature request for this: react to a signal that is
    >>> sent only to this step and stop processing in a controlled way. A
    >>> JIRA needs to be created but in the project where this was found, a
    >>> loss of some rows (up to the buffer size) is not critical, really...
    >>>
    >>> From my point of view this request leads to an interesting loop
    >>> design proposal for transformations:
    >>> 1) Have an option for some input steps to just restart after they
    >>> are finished.
    >>> 2) The restart may be delayed for a specific amount of time.
    >>> 3) This step needs to listen to a specific signal to stop. This is
    >>> different from stopping the transformation.
    >>>
    >>> B) This type of looping option is actually possible within jobs with
    >>> the Start job entry (repeat functionality) whereas product
    >>> management set this feature to deprecated since a while. The
    >>> recommendation was to restart and loop by the scheduler or external
    >>> process. But out of the above given reasons (overhead, delays and
    >>> even avoid overlapping job runs), I still think the features of the
    >>> start job entry are still valid. Especially since the link between
    >>> the scheduler and monitoring is not given, yet.
    >>> Adding the listener for a specific signal to stop a start job in a
    >>> controlled way and keep the repeat option, would be very nice to have.
    >>>
    >>> When a real looping logic within a job would be realized the delay
    >>> or a restart of a transformation may be acceptable in the above
    >>> scenario.
    >>>
    >>> For looping we may think of a "for/next" job entry implementation
    >>> with some options like:
    >>> - maximum number of iterations
    >>> - idle time before a next cycle
    >>> - some conditions to check if it should continue or not (I know the
    >>> phrase "some conditions" may be a wide range, e.g. variables to
    >>> check or checking a date/time range)
    >>> - a break option to end the "for/next" loop premature
    >>> - nested "for/next" should be allowed thus we may need an ID to
    >>> reference
    >>>
    >>> That are my thoughts for now...
    >>>
    >>> Cheers,
    >>> Jens
    >>>
    >>> Am 03.05.2011 16:41, schrieb Matt Casters:
    >>>
    >>> Hi Kettle devs,
    >>>
    >>> It has occurred to me earlier and more recently to others that
    >>> creating loops in jobs is somewhat a cumbersome process.
    >>> So perhaps we can line up the top 5 of most common use-cases and
    >>> find ease-of-use solutions to those?
    >>>
    >>> One use-case is where we loop over a DB result set (query), copy
    >>> the rows to result, set variables and use those for each row in the
    >>> result set.
    >>> In that specific case I imagine we could wrap the "Table Input"
    >>> step around a job entry, execute that, copy the rows to result, all
    >>> in one job entry.
    >>> Setting the variables is something we could cram into the
    >>> "Transformation" or "Job" job entries without too much of a
    >>> problem.
    >>> That would mean we could eliminate 2 transformations: one to get
    >>> the result set and one to set the variables inside the loop. All
    >>> that remains are 2 job entries: "Table Input" and
    >>> "Transformation/Job".
    >>>
    >>> So do me (and our users:-) a favor and let us know your most
    >>> common use-case for loops in a job.
    >>> If there is a pattern we could perhaps come up with a more
    >>> clever way of doing this compared to writing N new job entries for
    >>> "Table Input", "Text File Input" and so on.
    >>>
    >>> Thanks in advance!
    >>>
    >>> Matt



