-
Loops in jobs
Hi Kettle devs,
It has occurred to me earlier and more recently to others that creating
loops in jobs is somewhat a cumbersome process.
So perhaps we can line up the top 5 of most common use-cases and find
ease-of-use solutions to those?
One use-case is where we loop over a DB result set (query), copy the rows to
result, set variables and use those for each row in the result set.
In that specific case I imagine we could wrap the "Table Input" step around
a job entry, execute that, copy the rows to result, all in one job entry.
Setting the variables is something we could cram into the "Transformation"
or "Job" job entries without too much of a problem.
That would mean we could eliminate 2 transformations: one to get the result
set and one to set the variables inside the loop. All that remains are 2
job entries: "Table Input" and "Transformation/Job".
So do me (and our users:-) a favor and let us know your most common use-case
for loops in a job.
If there is a pattern we could perhaps come up with a more clever way of
doing this compared to writing N new job entries for "Table Input", "Text
File Input" and so on.
Thanks in advance!
Matt
--
Matt Casters <mcasters (AT) pentaho (DOT) org>
Chief Data Integration, Kettle founder, Author of Pentaho Kettle
Solutions<http://www.amazon.com/Pentaho-Kettle-Solutions-Building-Integration/dp/0470635177>
(Wiley <http://eu.wiley.com/WileyCDA/WileyTitle/productCd-0470635177.html>)
Pentaho : The Commercial Open Source Alternative for Business Intelligence
--
You received this message because you are subscribed to the Google Groups "kettle-developers" group.
To post to this group, send email to kettle-developers (AT) googlegroups (DOT) com.
To unsubscribe from this group, send email to kettle-developers+unsubscribe (AT) g...oups (DOT) com.
For more options, visit this group at http://groups.google.com/group/kettle-developers?hl=en.
-
Re: Loops in jobs
There is one particular case I use in production, which is best
described as a "crawler". I'm not sure it could be cleanly abstracted
but I'll just throw it in the pot.
I have a website and I want to take it "offline" - get all pages, and
the pages they link to, and so on - either until I reach a predefined
depth, or until I found all links.
To control the loop, i have a configuration (.properties) file
containing some specifics such as the root url (and credentials) where
I can get the pages from, as well as the max_depth.
In addition to that I have a separate "initial-links" file that lists
all urls that serve as starting point (so I can lift several pages in
the same job run)
The looping is controlled by two things
1) I keep track of the current depth in a root-level variable
2) There are three additional "scratchpad" files, a "current-links" to
keep track of the links I need to examine during the current
iteration, an "all-links" file to keep track of the links I tracked
down already in any previous iterations, and a "new-links" file to
store any links I discover during the current iteration.
The loop itself is then implemented as such:
* before the loop, i set the depth variable to 0, and I overwrite the
"current-links" file with the "initial-links" file, and the
"all-links" file is emptied.
* for each iteration, I do:
1) check if the depth variable is less than or equal to the max_depth
configuration value. if not, we're done looping.
2) process all pages pointed to the links in the "current-links" file.
If I discover any links in those pages, I store them in the
"new-links" file.
3) do a diff between the "all-links" and the "new-links" files to
discard any "new-links" that were already processed. The real new
links are then dumped to the "current-links" file
4) check if the last iteration yielded any new links (iow check if
"current-links" is not empty). If it is, we're done looping. If not, I
increase the depth variable by 1 and re-enter at step 1)
Maybe there are simpler ways to do it but this is what I have now and
it works quite well.
I don't know if anyone thinks this pattern is useful, but if so, the
challenge will be to generalize it. I mean, I can see how this would
already be very useful for the typical html website case with <a>,
<img>, <script> and <link> tags ( and friends). But in my use case,
the pages are not really html pages but xml, and the links are not
really links but identifying numbers from which my transformation
logic can derive urls for new xml files.
Anyway - just an example where I rely on loops.
kind regards,
Roland
On Tue, May 3, 2011 at 4:41 PM, Matt Casters <mcasters (AT) pentaho (DOT) org> wrote:
> Hi Kettle devs,
>
> It has occurred to me earlier and more recently to others that creating
> loops in jobs is somewhat a cumbersome process.
> So perhaps we can line up the top 5 of most common use-cases and find
> ease-of-use solutions to those?
>
> One use-case is where we loop over a DB result set (query), copy the rows to
> result, set variables and use those for each row in the result set.
> In that specific case I imagine we could wrap the "Table Input" step around
> a job entry, execute that, copy the rows to result, all in one job entry.
> Setting the variables is something we could cram into the "Transformation"
> or "Job" job entries without too much of a problem.
> That would mean we could eliminate 2 transformations: one to get the result
> set and one to set the variables inside the loop. All that remains are 2
> job entries: "Table Input" and "Transformation/Job".
>
> So do me (and our users:-) a favor and let us know your most common use-case
> for loops in a job.
> If there is a pattern we could perhaps come up with a more clever way of
> doing this compared to writing N new job entries for "Table Input", "Text
> File Input" and so on.
>
> Thanks in advance!
>
> Matt
> --
> Matt Casters <mcasters (AT) pentaho (DOT) org>
> Chief Data Integration, Kettle founder,Author ofPentaho Kettle
> Solutions(Wiley)
> Pentaho : The Commercial Open Source Alternative for Business Intelligence
>
> --
> You received this message because you are subscribed to the Google Groups
> "kettle-developers" group.
> To post to this group, send email to kettle-developers (AT) googlegroups (DOT) com.
> To unsubscribe from this group, send email to
> kettle-developers+unsubscribe (AT) g...oups (DOT) com.
> For more options, visit this group at
> http://groups.google.com/group/kettle-developers?hl=en.
>
--
Roland Bouman
blog: http://rpbouman.blogspot.com/
twitter: @rolandbouman
Author of "Pentaho Solutions: Business Intelligence and Data
Warehousing with Pentaho and MySQL",
http://tinyurl.com/lvxa88 (Wiley, ISBN: 978-0-470-48432-6)
Author of "Pentaho Kettle Solutions: Building Open Source ETL
Solutions with Pentaho Data Integration",
http://tinyurl.com/33r7a8m (Wiley, ISBN: 978-0-470-63517-9)
--
You received this message because you are subscribed to the Google Groups "kettle-developers" group.
To post to this group, send email to kettle-developers (AT) googlegroups (DOT) com.
To unsubscribe from this group, send email to kettle-developers+unsubscribe (AT) g...oups (DOT) com.
For more options, visit this group at http://groups.google.com/group/kettle-developers?hl=en.
-
Re: Loops in jobs
Thanks Roland!
When I read the description I was actually expecting a job that looked way
more complex, so it's not all that bad.
Also thanks to the other folks that sent in use-cases in private, sometimes
with complete jobs attached. It's highly appreciated.
From all the use-cases I think I saw a few things that always came back with
respect to loops:
1) Get a bunch of rows (from a table or a file) and copy those rows to
result
2) Loop over the rows with a job. Setting a bunch of variables is the first
thing you do in the job
for 1) I'm thinking of something dramatic to help out in the form of a
dynamic xform builder.
for 2) it would be much easier if you could set the variables in the
"Transformation" or "Job" job entries themselves. I think it makes a lot of
sense to do it there anyway.
I'll wrap these up in 2 JIRA cases and I'll try to get something going for
them as soon as possible so you can all give more feedback.
Best regards,
Matt
2011/5/6 Roland Bouman <roland.bouman (AT) gmail (DOT) com>
> There is one particular case I use in production, which is best
> described as a "crawler". I'm not sure it could be cleanly abstracted
> but I'll just throw it in the pot.
>
> I have a website and I want to take it "offline" - get all pages, and
> the pages they link to, and so on - either until I reach a predefined
> depth, or until I found all links.
>
> To control the loop, i have a configuration (.properties) file
> containing some specifics such as the root url (and credentials) where
> I can get the pages from, as well as the max_depth.
> In addition to that I have a separate "initial-links" file that lists
> all urls that serve as starting point (so I can lift several pages in
> the same job run)
>
> The looping is controlled by two things
> 1) I keep track of the current depth in a root-level variable
>
> 2) There are three additional "scratchpad" files, a "current-links" to
> keep track of the links I need to examine during the current
> iteration, an "all-links" file to keep track of the links I tracked
> down already in any previous iterations, and a "new-links" file to
> store any links I discover during the current iteration.
>
> The loop itself is then implemented as such:
>
> * before the loop, i set the depth variable to 0, and I overwrite the
> "current-links" file with the "initial-links" file, and the
> "all-links" file is emptied.
> * for each iteration, I do:
> 1) check if the depth variable is less than or equal to the max_depth
> configuration value. if not, we're done looping.
> 2) process all pages pointed to the links in the "current-links" file.
> If I discover any links in those pages, I store them in the
> "new-links" file.
> 3) do a diff between the "all-links" and the "new-links" files to
> discard any "new-links" that were already processed. The real new
> links are then dumped to the "current-links" file
> 4) check if the last iteration yielded any new links (iow check if
> "current-links" is not empty). If it is, we're done looping. If not, I
> increase the depth variable by 1 and re-enter at step 1)
>
> Maybe there are simpler ways to do it but this is what I have now and
> it works quite well.
>
> I don't know if anyone thinks this pattern is useful, but if so, the
> challenge will be to generalize it. I mean, I can see how this would
> already be very useful for the typical html website case with <a>,
> <img>, <script> and <link> tags ( and friends). But in my use case,
> the pages are not really html pages but xml, and the links are not
> really links but identifying numbers from which my transformation
> logic can derive urls for new xml files.
>
> Anyway - just an example where I rely on loops.
>
> kind regards,
>
> Roland
>
> On Tue, May 3, 2011 at 4:41 PM, Matt Casters <mcasters (AT) pentaho (DOT) org> wrote:
> > Hi Kettle devs,
> >
> > It has occurred to me earlier and more recently to others that creating
> > loops in jobs is somewhat a cumbersome process.
> > So perhaps we can line up the top 5 of most common use-cases and find
> > ease-of-use solutions to those?
> >
> > One use-case is where we loop over a DB result set (query), copy the rows
> to
> > result, set variables and use those for each row in the result set.
> > In that specific case I imagine we could wrap the "Table Input" step
> around
> > a job entry, execute that, copy the rows to result, all in one job entry.
> > Setting the variables is something we could cram into the
> "Transformation"
> > or "Job" job entries without too much of a problem.
> > That would mean we could eliminate 2 transformations: one to get the
> result
> > set and one to set the variables inside the loop. All that remains are 2
> > job entries: "Table Input" and "Transformation/Job".
> >
> > So do me (and our users:-) a favor and let us know your most common
> use-case
> > for loops in a job.
> > If there is a pattern we could perhaps come up with a more clever way of
> > doing this compared to writing N new job entries for "Table Input", "Text
> > File Input" and so on.
> >
> > Thanks in advance!
> >
> > Matt
> > --
> > Matt Casters <mcasters (AT) pentaho (DOT) org>
> > Chief Data Integration, Kettle founder, Author of Pentaho Kettle
> > Solutions (Wiley)
> > Pentaho : The Commercial Open Source Alternative for Business
> Intelligence
> >
> > --
> > You received this message because you are subscribed to the Google Groups
> > "kettle-developers" group.
> > To post to this group, send email to kettle-developers (AT) googlegroups (DOT) com.
> > To unsubscribe from this group, send email to
> > kettle-developers+unsubscribe (AT) g...oups (DOT) com.
> > For more options, visit this group at
> > http://groups.google.com/group/kettle-developers?hl=en.
> >
>
>
>
> --
> Roland Bouman
> blog: http://rpbouman.blogspot.com/
> twitter: @rolandbouman
>
> Author of "Pentaho Solutions: Business Intelligence and Data
> Warehousing with Pentaho and MySQL",
> http://tinyurl.com/lvxa88 (Wiley, ISBN: 978-0-470-48432-6)
>
> Author of "Pentaho Kettle Solutions: Building Open Source ETL
> Solutions with Pentaho Data Integration",
> http://tinyurl.com/33r7a8m (Wiley, ISBN: 978-0-470-63517-9)
>
> --
> You received this message because you are subscribed to the Google Groups
> "kettle-developers" group.
> To post to this group, send email to kettle-developers (AT) googlegroups (DOT) com.
> To unsubscribe from this group, send email to
> kettle-developers+unsubscribe (AT) g...oups (DOT) com.
> For more options, visit this group at
> http://groups.google.com/group/kettle-developers?hl=en.
>
>
--
Matt Casters <mcasters (AT) pentaho (DOT) org>
Chief Data Integration, Kettle founder, Author of Pentaho Kettle
Solutions<http://www.amazon.com/Pentaho-Kettle-Solutions-Building-Integration/dp/0470635177>
(Wiley <http://eu.wiley.com/WileyCDA/WileyTitle/productCd-0470635177.html>)
Fonteinstraat 70, 9400 OKEGEM - Belgium - Cell : +32 486 97 29 37
Pentaho : The Commercial Open Source Alternative for Business Intelligence
--
You received this message because you are subscribed to the Google Groups "kettle-developers" group.
To post to this group, send email to kettle-developers (AT) googlegroups (DOT) com.
To unsubscribe from this group, send email to kettle-developers+unsubscribe (AT) g...oups (DOT) com.
For more options, visit this group at http://groups.google.com/group/kettle-developers?hl=en.
-
Re: Loops in jobs
Hi Matt, all!
On Fri, May 6, 2011 at 1:44 PM, Matt Casters <mcasters (AT) pentaho (DOT) org> wrote:
> When I read the description I was actually expecting a job that looked way
> more complex, so it's not all that bad.
True. I got the entire thing done in a day (including logic inside the
transformations), I expected it would take me longer.
> From all the use-cases I think I saw a few things that always came back with
> respect to loops:
> 1) Get a bunch of rows (from a table or a file) and copy those rows to
> result
> 2) Loop over the rows with a job. Setting a bunch of variables is the first
> thing you do in the job
Not sure if that was clear from my example, but I designed mine
explicitly to avoid a "copy rows to result" step. I do everything with
my scratchpad files inside transformations.
I did this on purpose - I wasn't sure how large the resultsets could
be - the number of links has a tendency to explode quite rapidly and I
assumed it would give me better performance and scalability to keep
all the row processing inside the transformations (probably at the
expense of more io, but in my case the crawling is the limiting bit,
not reading links from the files)
kind regards,
Roland.
> for 1) I'm thinking of something dramatic to help out in the form of a
> dynamic xform builder.
> for 2) it would be much easier if you could set the variables in the
> "Transformation" or "Job" job entries themselves. I think it makes a lot of
> sense to do it there anyway.
>
> I'll wrap these up in 2 JIRA cases and I'll try to get something going for
> them as soon as possible so you can all give more feedback.
>
> Best regards,
>
> Matt
>
> 2011/5/6 Roland Bouman <roland.bouman (AT) gmail (DOT) com>
>>
>> There is one particular case I use in production, which is best
>> described as a "crawler". I'm not sure it could be cleanly abstracted
>> but I'll just throw it in the pot.
>>
>> I have a website and I want to take it "offline" - get all pages, and
>> the pages they link to, and so on - either until I reach a predefined
>> depth, or until I found all links.
>>
>> To control the loop, i have a configuration (.properties) file
>> containing some specifics such as the root url (and credentials) where
>> I can get the pages from, as well as the max_depth.
>> In addition to that I have a separate "initial-links" file that lists
>> all urls that serve as starting point (so I can lift several pages in
>> the same job run)
>>
>> The looping is controlled by two things
>> 1) I keep track of the current depth in a root-level variable
>>
>> 2) There are three additional "scratchpad" files, a "current-links" to
>> keep track of the links I need to examine during the current
>> iteration, an "all-links" file to keep track of the links I tracked
>> down already in any previous iterations, and a "new-links" file to
>> store any links I discover during the current iteration.
>>
>> The loop itself is then implemented as such:
>>
>> * before the loop, i set the depth variable to 0, and I overwrite the
>> "current-links" file with the "initial-links" file, and the
>> "all-links" file is emptied.
>> * for each iteration, I do:
>> 1) check if the depth variable is less than or equal to the max_depth
>> configuration value. if not, we're done looping.
>> 2) process all pages pointed to the links in the "current-links" file.
>> If I discover any links in those pages, I store them in the
>> "new-links" file.
>> 3) do a diff between the "all-links" and the "new-links" files to
>> discard any "new-links" that were already processed. The real new
>> links are then dumped to the "current-links" file
>> 4) check if the last iteration yielded any new links (iow check if
>> "current-links" is not empty). If it is, we're done looping. If not, I
>> increase the depth variable by 1 and re-enter at step 1)
>>
>> Maybe there are simpler ways to do it but this is what I have now and
>> it works quite well.
>>
>> I don't know if anyone thinks this pattern is useful, but if so, the
>> challenge will be to generalize it. I mean, I can see how this would
>> already be very useful for the typical html website case with <a>,
>> <img>, <script> and <link> tags ( and friends). But in my use case,
>> the pages are not really html pages but xml, and the links are not
>> really links but identifying numbers from which my transformation
>> logic can derive urls for new xml files.
>>
>> Anyway - just an example where I rely on loops.
>>
>> kind regards,
>>
>> Roland
>>
>> On Tue, May 3, 2011 at 4:41 PM, Matt Casters <mcasters (AT) pentaho (DOT) org> wrote:
>> > Hi Kettle devs,
>> >
>> > It has occurred to me earlier and more recently to others that creating
>> > loops in jobs is somewhat a cumbersome process.
>> > So perhaps we can line up the top 5 of most common use-cases and find
>> > ease-of-use solutions to those?
>> >
>> > One use-case is where we loop over a DB result set (query), copy the
>> > rows to
>> > result, set variables and use those for each row in the result set.
>> > In that specific case I imagine we could wrap the "Table Input" step
>> > around
>> > a job entry, execute that, copy the rows to result, all in one job
>> > entry.
>> > Setting the variables is something we could cram into the
>> > "Transformation"
>> > or "Job" job entries without too much of a problem.
>> > That would mean we could eliminate 2 transformations: one to get the
>> > result
>> > set and one to set the variables inside the loop. All that remains are
>> > 2
>> > job entries: "Table Input" and "Transformation/Job".
>> >
>> > So do me (and our users:-) a favor and let us know your most common
>> > use-case
>> > for loops in a job.
>> > If there is a pattern we could perhaps come up with a more clever way of
>> > doing this compared to writing N new job entries for "Table Input",
>> > "Text
>> > File Input" and so on.
>> >
>> > Thanks in advance!
>> >
>> > Matt
>> > --
>> > Matt Casters <mcasters (AT) pentaho (DOT) org>
>> > Chief Data Integration, Kettle founder,Author ofPentaho Kettle
>> > Solutions(Wiley)
>> > Pentaho : The Commercial Open Source Alternative for Business
>> > Intelligence
>> >
>> > --
>> > You received this message because you are subscribed to the Google
>> > Groups
>> > "kettle-developers" group.
>> > To post to this group, send email to kettle-developers (AT) googlegroups (DOT) com.
>> > To unsubscribe from this group, send email to
>> > kettle-developers+unsubscribe (AT) g...oups (DOT) com.
>> > For more options, visit this group at
>> > http://groups.google.com/group/kettle-developers?hl=en.
>> >
>>
>>
>>
>> --
>> Roland Bouman
>> blog: http://rpbouman.blogspot.com/
>> twitter: @rolandbouman
>>
>> Author of "Pentaho Solutions: Business Intelligence and Data
>> Warehousing with Pentaho and MySQL",
>> http://tinyurl.com/lvxa88 (Wiley, ISBN: 978-0-470-48432-6)
>>
>> Author of "Pentaho Kettle Solutions: Building Open Source ETL
>> Solutions with Pentaho Data Integration",
>> http://tinyurl.com/33r7a8m (Wiley, ISBN: 978-0-470-63517-9)
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "kettle-developers" group.
>> To post to this group, send email to kettle-developers (AT) googlegroups (DOT) com.
>> To unsubscribe from this group, send email to
>> kettle-developers+unsubscribe (AT) g...oups (DOT) com.
>> For more options, visit this group at
>> http://groups.google.com/group/kettle-developers?hl=en.
>>
>
>
>
> --
> Matt Casters <mcasters (AT) pentaho (DOT) org>
> Chief Data Integration, Kettle founder,Author ofPentaho Kettle
> Solutions(Wiley)
> Fonteinstraat 70, 9400 OKEGEM - Belgium - Cell : +32 486 97 29 37
> Pentaho : The Commercial Open Source Alternative for Business Intelligence
>
> --
> You received this message because you are subscribed to the Google Groups
> "kettle-developers" group.
> To post to this group, send email to kettle-developers (AT) googlegroups (DOT) com.
> To unsubscribe from this group, send email to
> kettle-developers+unsubscribe (AT) g...oups (DOT) com.
> For more options, visit this group at
> http://groups.google.com/group/kettle-developers?hl=en.
>
--
Roland Bouman
blog: http://rpbouman.blogspot.com/
twitter: @rolandbouman
Author of "Pentaho Solutions: Business Intelligence and Data
Warehousing with Pentaho and MySQL",
http://tinyurl.com/lvxa88 (Wiley, ISBN: 978-0-470-48432-6)
Author of "Pentaho Kettle Solutions: Building Open Source ETL
Solutions with Pentaho Data Integration",
http://tinyurl.com/33r7a8m (Wiley, ISBN: 978-0-470-63517-9)
--
You received this message because you are subscribed to the Google Groups "kettle-developers" group.
To post to this group, send email to kettle-developers (AT) googlegroups (DOT) com.
To unsubscribe from this group, send email to kettle-developers+unsubscribe (AT) g...oups (DOT) com.
For more options, visit this group at http://groups.google.com/group/kettle-developers?hl=en.
-
Re: Loops in jobs
We have two difference scenarios where we get 'layered' ETL.
1. We perfect an idea and it becomes a module, then it is wrapped up in a
bigger process which loops it.
2. We gather some inputs, and loop jobs containing transforms on the initial
input. (The most mentioned case here).
We could chalk this up to development lifecycle where we should revisit our
design and try to make it more sequential, than layered.
One area where I struggle is when something has been in production for a
while and we try moving to a later version of PDI and encounter problems.
Due to the layering, it takes time and effort to prepare the transformation
to be retested as an individual unit again. I have to add variables and set
them, which i must make sure I delete when reinserting the piece into the
production ETL. I also have to gather up all the inputs, which could be
determined in layers above to work towards diagnosing, repeating and
correcting the issue.
Example. We have PDI 4.0.1-CE. Works fantastically for a transform working
with text files and DB lookups to process therapy transactions from a sister
company. It completes in 1.5 minutes. I try it in any version after 4.0.1
and it takes so long, I've never seen completion. Logs don't create
errors, because there are no errors. From spoon, I cannot dive down into
the running job because an iteration a lower level would have been
finished. I've enabled logs in the transform and nothing get's
generated. So unless I either do not understand logs OR tear the tranform
completely apart and build up again, there is no hope of quickly zeroing in
on the issue.
Possible solutions come to mind:
Basically a way to switch into and out of testing mode. This implies more
than 'row output' or 'debug'. I want variable setting, activation of steps
that feed dummy data into the transform and special notes. When one exits
'testing mode', then those dummy data steps, variables and notes all are
disabled and disappear neatly away.
Just an idea. I have little pride. Please feel free to point me to
specific education resources if my suggestions reveal a lack of
understanding of some functionality that exists in PDI currently to help me
over the hurdle.
PS: Although it has improved greatly, visually and functionally, I still
find logging and performance management one of the most confusing aspects of
PDI, especially when ETL layers up and is hard to execute on a transform by
transform basis.
Thanks for the great product and the commitment to constant and never ending
improvement.
Brandon
On Tue, May 3, 2011 at 9:41 AM, Matt Casters <mcasters (AT) pentaho (DOT) org> wrote:
> Hi Kettle devs,
>
> It has occurred to me earlier and more recently to others that creating
> loops in jobs is somewhat a cumbersome process.
> So perhaps we can line up the top 5 of most common use-cases and find
> ease-of-use solutions to those?
>
> One use-case is where we loop over a DB result set (query), copy the rows
> to result, set variables and use those for each row in the result set.
> In that specific case I imagine we could wrap the "Table Input" step around
> a job entry, execute that, copy the rows to result, all in one job entry.
> Setting the variables is something we could cram into the "Transformation"
> or "Job" job entries without too much of a problem.
> That would mean we could eliminate 2 transformations: one to get the result
> set and one to set the variables inside the loop. All that remains are 2
> job entries: "Table Input" and "Transformation/Job".
>
> So do me (and our users:-) a favor and let us know your most common
> use-case for loops in a job.
> If there is a pattern we could perhaps come up with a more clever way of
> doing this compared to writing N new job entries for "Table Input", "Text
> File Input" and so on.
>
> Thanks in advance!
>
> Matt
> --
> Matt Casters <mcasters (AT) pentaho (DOT) org>
> Chief Data Integration, Kettle founder, Author of Pentaho Kettle Solutions<http://www.amazon.com/Pentaho-Kettle-Solutions-Building-Integration/dp/0470635177>
> (Wiley<http://eu.wiley.com/WileyCDA/WileyTitle/productCd-0470635177.html>
> )
> Pentaho : The Commercial Open Source Alternative for Business Intelligence
>
>
> --
> You received this message because you are subscribed to the Google Groups
> "kettle-developers" group.
> To post to this group, send email to kettle-developers (AT) googlegroups (DOT) com.
> To unsubscribe from this group, send email to
> kettle-developers+unsubscribe (AT) g...oups (DOT) com.
> For more options, visit this group at
> http://groups.google.com/group/kettle-developers?hl=en.
>
--
You received this message because you are subscribed to the Google Groups "kettle-developers" group.
To post to this group, send email to kettle-developers (AT) googlegroups (DOT) com.
To unsubscribe from this group, send email to kettle-developers+unsubscribe (AT) g...oups (DOT) com.
For more options, visit this group at http://groups.google.com/group/kettle-developers?hl=en.
-
Re: Loops in jobs
A little bit unusual, but it happened:
A) Have a looping option in a transformation.
Yes, I know this is normally out of scope for transformations, but the
following use case is there in regards to a continues data load / real
time processing:
- Query JMS or other queues with some sort of Input step
- This will stop when all the available data is processed (or for JMS
until a timeout is reached or just continue infinite - this is the only
step I know where this "infinite continue" is implemented)
- Now imagine, after a specific amount of time new data arrive and need
to be processed asap.
Actually you may need to restart the transformation what costs
performance, need a looping logic outside of the transformation and all
this leads to a small delay.
We have for example the JMS consumer step that can read continuously. A
problem here is: When you want to stop the transformation in a
controlled way, this is not possible at this time since all steps get
the signal to stop and rows may silently disappear. A feature request
for this: react to a signal that is sent only to this step and stop
processing in a controlled way. A JIRA needs to be created but in the
project where this was found, a loss of some rows (up to the buffer
size) is not critical, really...
From my point of view this request leads to an interesting loop design
proposal for transformations:
1) Have an option for some input steps to just restart after they are
finished.
2) The restart may be delayed for a specific amount of time.
3) This step needs to listen to a specific signal to stop. This is
different from stopping the transformation.
B) This type of looping option is actually possible within jobs with the
Start job entry (repeat functionality) whereas product management set
this feature to deprecated since a while. The recommendation was to
restart and loop by the scheduler or external process. But out of the
above given reasons (overhead, delays and even avoid overlapping job
runs), I still think the features of the start job entry are still
valid. Especially since the link between the scheduler and monitoring is
not given, yet.
Adding the listener for a specific signal to stop a start job in a
controlled way and keep the repeat option, would be very nice to have.
When a real looping logic within a job would be realized the delay or a
restart of a transformation may be acceptable in the above scenario.
For looping we may think of a "for/next" job entry implementation with
some options like:
- maximum number of iterations
- idle time before a next cycle
- some conditions to check if it should continue or not (I know the
phrase "some conditions" may be a wide range, e.g. variables to check or
checking a date/time range)
- a break option to end the "for/next" loop premature
- nested "for/next" should be allowed thus we may need an ID to reference
That are my thoughts for now...
Cheers,
Jens
Am 03.05.2011 16:41, schrieb Matt Casters:
> Hi Kettle devs,
>
> It has occurred to me earlier and more recently to others that creating
> loops in jobs is somewhat a cumbersome process.
> So perhaps we can line up the top 5 of most common use-cases and find
> ease-of-use solutions to those?
>
> One use-case is where we loop over a DB result set (query), copy the
> rows to result, set variables and use those for each row in the result set.
> In that specific case I imagine we could wrap the "Table Input" step
> around a job entry, execute that, copy the rows to result, all in one
> job entry.
> Setting the variables is something we could cram into the
> "Transformation" or "Job" job entries without too much of a problem.
> That would mean we could eliminate 2 transformations: one to get the
> result set and one to set the variables inside the loop. All that
> remains are 2 job entries: "Table Input" and "Transformation/Job".
>
> So do me (and our users:-) a favor and let us know your most common
> use-case for loops in a job.
> If there is a pattern we could perhaps come up with a more clever way of
> doing this compared to writing N new job entries for "Table Input",
> "Text File Input" and so on.
>
> Thanks in advance!
>
> Matt
> --
> Matt Casters <mcasters (AT) pentaho (DOT) org <mailto:mcasters (AT) pentaho (DOT) org>>
> Chief Data Integration, Kettle founder, Author of Pentaho Kettle
> Solutions
> <http://www.amazon.com/Pentaho-Kettle-Solutions-Building-Integration/dp/0470635177> (Wiley
> <http://eu.wiley.com/WileyCDA/WileyTitle/productCd-0470635177.html>)
> Pentaho : The Commercial Open Source Alternative for Business Intelligence
>
>
> --
> You received this message because you are subscribed to the Google
> Groups "kettle-developers" group.
> To post to this group, send email to kettle-developers (AT) googlegroups (DOT) com.
> To unsubscribe from this group, send email to
> kettle-developers+unsubscribe (AT) g...oups (DOT) com.
> For more options, visit this group at
> http://groups.google.com/group/kettle-developers?hl=en.
--
You received this message because you are subscribed to the Google Groups "kettle-developers" group.
To post to this group, send email to kettle-developers (AT) googlegroups (DOT) com.
To unsubscribe from this group, send email to kettle-developers+unsubscribe (AT) g...oups (DOT) com.
For more options, visit this group at http://groups.google.com/group/kettle-developers?hl=en.
-
Re: Loops in jobs
Namaste Jens,
I didn't know that anyone "deprecated" looping under the "Start" job entry.
There must have been a good reason to do so. However, I hereby officially
"un-deprecate" it for 4.2.0 after I did extensive testing to make sure no
memory leaks remain.
http://jira.pentaho.com/browse/PDI-5502
The first job entry to use the logic you descibed is the "HL7 MLLP Input"
job entry. That one gets a single record from a HL7 queue and passes the
message to the other job entries. It's very fast too ;-)
The "re-start without stop" logic of steps will actually be possible with
the "Single Threaded Transformation" execution engine. My plan was to drop
that one behind a mapping-like step but it will become a separate step.
The way that it will work is that you have say 1000 rows entering the new
engine. That will process the rows in batches of 1-N.
Certain steps will be restarted every time, like file reading steps or steps
that read a single element off a queue somewhere (your JMS sample). Most
steps will simply keep running. In a single call of the
SingleThreadedTransExecutor.oneIteration(), all N rows in the batch are
pushed through the transformation.
Up until now I hadn't considered this to be a loop, but you are right, this
would work for your situation.
In the mean time I created http://jira.pentaho.com/browse/PDI-6157 and
implemented the work. Actually Sven Boden did years ago when he implemented
parameter support in the Job and Transformation entries.
Cheers,
Matt
2011/5/6 Jens Bleuel <jbleuel (AT) pentaho (DOT) com>
> A little bit unusual, but it happened:
>
> A) Have a looping option in a transformation.
> Yes, I know this is normally out of scope for transformations, but the
> following use case is there in regards to a continues data load / real time
> processing:
> - Query JMS or other queues with some sort of Input step
> - This will stop when all the available data is processed (or for JMS until
> a timeout is reached or just continue infinite - this is the only step I
> know where this "infinite continue" is implemented)
> - Now imagine, after a specific amount of time new data arrive and need to
> be processed asap.
>
> Actually you may need to restart the transformation what costs performance,
> need a looping logic outside of the transformation and all this leads to a
> small delay.
>
> We have for example the JMS consumer step that can read continuously. A
> problem here is: When you want to stop the transformation in a controlled
> way, this is not possible at this time since all steps get the signal to
> stop and rows may silently disappear. A feature request for this: react to a
> signal that is sent only to this step and stop processing in a controlled
> way. A JIRA needs to be created but in the project where this was found, a
> loss of some rows (up to the buffer size) is not critical, really...
>
> From my point of view this request leads to an interesting loop design
> proposal for transformations:
> 1) Have an option for some input steps to just restart after they are
> finished.
> 2) The restart may be delayed for a specific amount of time.
> 3) This step needs to listen to a specific signal to stop. This is
> different from stopping the transformation.
>
> B) This type of looping option is actually possible within jobs with the
> Start job entry (repeat functionality) whereas product management set this
> feature to deprecated since a while. The recommendation was to restart and
> loop by the scheduler or external process. But out of the above given
> reasons (overhead, delays and even avoid overlapping job runs), I still
> think the features of the start job entry are still valid. Especially since
> the link between the scheduler and monitoring is not given, yet.
> Adding the listener for a specific signal to stop a start job in a
> controlled way and keep the repeat option, would be very nice to have.
>
> When a real looping logic within a job would be realized the delay or a
> restart of a transformation may be acceptable in the above scenario.
>
> For looping we may think of a "for/next" job entry implementation with some
> options like:
> - maximum number of iterations
> - idle time before a next cycle
> - some conditions to check if it should continue or not (I know the phrase
> "some conditions" may be a wide range, e.g. variables to check or checking a
> date/time range)
> - a break option to end the "for/next" loop premature
> - nested "for/next" should be allowed thus we may need an ID to reference
>
> That are my thoughts for now...
>
> Cheers,
> Jens
>
> Am 03.05.2011 16:41, schrieb Matt Casters:
>
>> Hi Kettle devs,
>>
>> It has occurred to me earlier and more recently to others that creating
>> loops in jobs is somewhat a cumbersome process.
>> So perhaps we can line up the top 5 of most common use-cases and find
>> ease-of-use solutions to those?
>>
>> One use-case is where we loop over a DB result set (query), copy the
>> rows to result, set variables and use those for each row in the result
>> set.
>> In that specific case I imagine we could wrap the "Table Input" step
>> around a job entry, execute that, copy the rows to result, all in one
>> job entry.
>> Setting the variables is something we could cram into the
>> "Transformation" or "Job" job entries without too much of a problem.
>> That would mean we could eliminate 2 transformations: one to get the
>> result set and one to set the variables inside the loop. All that
>> remains are 2 job entries: "Table Input" and "Transformation/Job".
>>
>> So do me (and our users:-) a favor and let us know your most common
>> use-case for loops in a job.
>> If there is a pattern we could perhaps come up with a more clever way of
>> doing this compared to writing N new job entries for "Table Input",
>> "Text File Input" and so on.
>>
>> Thanks in advance!
>>
>> Matt
>> --
>> Matt Casters <mcasters (AT) pentaho (DOT) org <mailto:mcasters (AT) pentaho (DOT) org>>
>>
>> Chief Data Integration, Kettle founder, Author of Pentaho Kettle
>> Solutions
>> <
>> http://www.amazon.com/Pentaho-Kettle-Solutions-Building-Integration/dp/0470635177>
>> (Wiley
>> <http://eu.wiley.com/WileyCDA/WileyTitle/productCd-0470635177.html>)
>>
>> Pentaho : The Commercial Open Source Alternative for Business Intelligence
>>
>>
>> --
>>
>> You received this message because you are subscribed to the Google
>> Groups "kettle-developers" group.
>> To post to this group, send email to kettle-developers (AT) googlegroups (DOT) com.
>> To unsubscribe from this group, send email to
>> kettle-developers+unsubscribe (AT) g...oups (DOT) com.
>> For more options, visit this group at
>> http://groups.google.com/group/kettle-developers?hl=en.
>>
>
> --
> You received this message because you are subscribed to the Google Groups
> "kettle-developers" group.
> To post to this group, send email to kettle-developers (AT) googlegroups (DOT) com.
> To unsubscribe from this group, send email to
> kettle-developers+unsubscribe (AT) g...oups (DOT) com.
> For more options, visit this group at
> http://groups.google.com/group/kettle-developers?hl=en.
>
>
--
Matt Casters <mcasters (AT) pentaho (DOT) org>
Chief Data Integration, Kettle founder, Author of Pentaho Kettle
Solutions<http://www.amazon.com/Pentaho-Kettle-Solutions-Building-Integration/dp/0470635177>
(Wiley <http://eu.wiley.com/WileyCDA/WileyTitle/productCd-0470635177.html>)
Fonteinstraat 70, 9400 OKEGEM - Belgium - Cell : +32 486 97 29 37
Pentaho : The Commercial Open Source Alternative for Business Intelligence
--
You received this message because you are subscribed to the Google Groups "kettle-developers" group.
To post to this group, send email to kettle-developers (AT) googlegroups (DOT) com.
To unsubscribe from this group, send email to kettle-developers+unsubscribe (AT) g...oups (DOT) com.
For more options, visit this group at http://groups.google.com/group/kettle-developers?hl=en.
-
Re: Loops in jobs
Wow, amazing good news!
Am 06.05.2011 22:29, schrieb Matt Casters:
> Namaste Jens,
>
> I didn't know that anyone "deprecated" looping under the "Start" job
> entry. There must have been a good reason to do so. However, I hereby
> officially "un-deprecate" it for 4.2.0 after I did extensive testing to
> make sure no memory leaks remain.
>
> http://jira.pentaho.com/browse/PDI-5502
>
> The first job entry to use the logic you descibed is the "HL7 MLLP
> Input" job entry. That one gets a single record from a HL7 queue and
> passes the message to the other job entries. It's very fast too ;-)
>
> The "re-start without stop" logic of steps will actually be possible
> with the "Single Threaded Transformation" execution engine. My plan was
> to drop that one behind a mapping-like step but it will become a
> separate step.
> The way that it will work is that you have say 1000 rows entering the
> new engine. That will process the rows in batches of 1-N.
> Certain steps will be restarted every time, like file reading steps or
> steps that read a single element off a queue somewhere (your JMS
> sample). Most steps will simply keep running. In a single call of the
> SingleThreadedTransExecutor.oneIteration(), all N rows in the batch are
> pushed through the transformation.
>
> Up until now I hadn't considered this to be a loop, but you are right,
> this would work for your situation.
>
> In the mean time I created http://jira.pentaho.com/browse/PDI-6157 and
> implemented the work. Actually Sven Boden did years ago when he
> implemented parameter support in the Job and Transformation entries.
>
> Cheers,
> Matt
>
>
> 2011/5/6 Jens Bleuel <jbleuel (AT) pentaho (DOT) com <mailto:jbleuel (AT) pentaho (DOT) com>>
>
> A little bit unusual, but it happened:
>
> A) Have a looping option in a transformation.
> Yes, I know this is normally out of scope for transformations, but
> the following use case is there in regards to a continues data load
> / real time processing:
> - Query JMS or other queues with some sort of Input step
> - This will stop when all the available data is processed (or for
> JMS until a timeout is reached or just continue infinite - this is
> the only step I know where this "infinite continue" is implemented)
> - Now imagine, after a specific amount of time new data arrive and
> need to be processed asap.
>
> Actually you may need to restart the transformation what costs
> performance, need a looping logic outside of the transformation and
> all this leads to a small delay.
>
> We have for example the JMS consumer step that can read
> continuously. A problem here is: When you want to stop the
> transformation in a controlled way, this is not possible at this
> time since all steps get the signal to stop and rows may silently
> disappear. A feature request for this: react to a signal that is
> sent only to this step and stop processing in a controlled way. A
> JIRA needs to be created but in the project where this was found, a
> loss of some rows (up to the buffer size) is not critical, really...
>
> From my point of view this request leads to an interesting loop
> design proposal for transformations:
> 1) Have an option for some input steps to just restart after they
> are finished.
> 2) The restart may be delayed for a specific amount of time.
> 3) This step needs to listen to a specific signal to stop. This is
> different from stopping the transformation.
>
> B) This type of looping option is actually possible within jobs with
> the Start job entry (repeat functionality) whereas product
> management set this feature to deprecated since a while. The
> recommendation was to restart and loop by the scheduler or external
> process. But out of the above given reasons (overhead, delays and
> even avoid overlapping job runs), I still think the features of the
> start job entry are still valid. Especially since the link between
> the scheduler and monitoring is not given, yet.
> Adding the listener for a specific signal to stop a start job in a
> controlled way and keep the repeat option, would be very nice to have.
>
> When a real looping logic within a job would be realized the delay
> or a restart of a transformation may be acceptable in the above
> scenario.
>
> For looping we may think of a "for/next" job entry implementation
> with some options like:
> - maximum number of iterations
> - idle time before a next cycle
> - some conditions to check if it should continue or not (I know the
> phrase "some conditions" may be a wide range, e.g. variables to
> check or checking a date/time range)
> - a break option to end the "for/next" loop premature
> - nested "for/next" should be allowed thus we may need an ID to
> reference
>
> That are my thoughts for now...
>
> Cheers,
> Jens
>
> Am 03.05.2011 16:41, schrieb Matt Casters:
>
> Hi Kettle devs,
>
> It has occurred to me earlier and more recently to others that
> creating
> loops in jobs is somewhat a cumbersome process.
> So perhaps we can line up the top 5 of most common use-cases and
> find
> ease-of-use solutions to those?
>
> One use-case is where we loop over a DB result set (query), copy the
> rows to result, set variables and use those for each row in the
> result set.
> In that specific case I imagine we could wrap the "Table Input" step
> around a job entry, execute that, copy the rows to result, all
> in one
> job entry.
> Setting the variables is something we could cram into the
> "Transformation" or "Job" job entries without too much of a problem.
> That would mean we could eliminate 2 transformations: one to get the
> result set and one to set the variables inside the loop. All that
> remains are 2 job entries: "Table Input" and "Transformation/Job".
>
> So do me (and our users:-) a favor and let us know your most common
> use-case for loops in a job.
> If there is a pattern we could perhaps come up with a more
> clever way of
> doing this compared to writing N new job entries for "Table Input",
> "Text File Input" and so on.
>
> Thanks in advance!
>
> Matt
> --
> Matt Casters <mcasters (AT) pentaho (DOT) org <mailto:mcasters (AT) pentaho (DOT) org>
> <mailto:mcasters (AT) pentaho (DOT) org <mailto:mcasters (AT) pentaho (DOT) org>>>
>
> Chief Data Integration, Kettle founder, Author of Pentaho Kettle
> Solutions
> <http://www.amazon.com/Pentaho-Kettle-Solutions-Building-Integration/dp/0470635177>
> (Wiley
> <http://eu.wiley.com/WileyCDA/WileyTitle/productCd-0470635177.html>)
>
> Pentaho : The Commercial Open Source Alternative for Business
> Intelligence
>
>
> --
>
> You received this message because you are subscribed to the Google
> Groups "kettle-developers" group.
> To post to this group, send email to
> kettle-developers (AT) googlegroups (DOT) com
> <mailto:kettle-developers (AT) googlegroups (DOT) com>.
> To unsubscribe from this group, send email to
> kettle-developers+unsubscribe (AT) g...oups (DOT) com
> <mailto:kettle-developers%2Bunsubscribe (AT) googlegroups (DOT) com>.
> For more options, visit this group at
> http://groups.google.com/group/kettle-developers?hl=en.
>
>
> --
> You received this message because you are subscribed to the Google
> Groups "kettle-developers" group.
> To post to this group, send email to
> kettle-developers (AT) googlegroups (DOT) com
> <mailto:kettle-developers (AT) googlegroups (DOT) com>.
> To unsubscribe from this group, send email to
> kettle-developers+unsubscribe (AT) g...oups (DOT) com
> <mailto:kettle-developers%2Bunsubscribe (AT) googlegroups (DOT) com>.
> For more options, visit this group at
> http://groups.google.com/group/kettle-developers?hl=en.
>
>
>
>
> --
> Matt Casters <mcasters (AT) pentaho (DOT) org <mailto:mcasters (AT) pentaho (DOT) org>>
> Chief Data Integration, Kettle founder, Author of Pentaho Kettle
> Solutions
> <http://www.amazon.com/Pentaho-Kettle-Solutions-Building-Integration/dp/0470635177> (Wiley
> <http://eu.wiley.com/WileyCDA/WileyTitle/productCd-0470635177.html>)
> Fonteinstraat 70, 9400 OKEGEM - Belgium - Cell : +32 486 97 29 37
> Pentaho : The Commercial Open Source Alternative for Business Intelligence
>
>
> --
> You received this message because you are subscribed to the Google
> Groups "kettle-developers" group.
> To post to this group, send email to kettle-developers (AT) googlegroups (DOT) com.
> To unsubscribe from this group, send email to
> kettle-developers+unsubscribe (AT) g...oups (DOT) com.
> For more options, visit this group at
> http://groups.google.com/group/kettle-developers?hl=en.
--
You received this message because you are subscribed to the Google Groups "kettle-developers" group.
To post to this group, send email to kettle-developers (AT) googlegroups (DOT) com.
To unsubscribe from this group, send email to kettle-developers+unsubscribe (AT) g...oups (DOT) com.
For more options, visit this group at http://groups.google.com/group/kettle-developers?hl=en.
-
Re: Loops in jobs
I finished the single threading engine last night with the encouragements of
the gang on ##pentaho (thanks for that):
http://www.ibridge.be/?p=200
Take care,
Matt
2011/5/7 Jens Bleuel <jbleuel (AT) pentaho (DOT) com>
> Wow, amazing good news!
>
> Am 06.05.2011 22:29, schrieb Matt Casters:
>
>> Namaste Jens,
>>
>> I didn't know that anyone "deprecated" looping under the "Start" job
>> entry. There must have been a good reason to do so. However, I hereby
>> officially "un-deprecate" it for 4.2.0 after I did extensive testing to
>> make sure no memory leaks remain.
>>
>> http://jira.pentaho.com/browse/PDI-5502
>>
>> The first job entry to use the logic you descibed is the "HL7 MLLP
>> Input" job entry. That one gets a single record from a HL7 queue and
>> passes the message to the other job entries. It's very fast too ;-)
>>
>> The "re-start without stop" logic of steps will actually be possible
>> with the "Single Threaded Transformation" execution engine. My plan was
>> to drop that one behind a mapping-like step but it will become a
>> separate step.
>> The way that it will work is that you have say 1000 rows entering the
>> new engine. That will process the rows in batches of 1-N.
>> Certain steps will be restarted every time, like file reading steps or
>> steps that read a single element off a queue somewhere (your JMS
>> sample). Most steps will simply keep running. In a single call of the
>> SingleThreadedTransExecutor.oneIteration(), all N rows in the batch are
>> pushed through the transformation.
>>
>> Up until now I hadn't considered this to be a loop, but you are right,
>> this would work for your situation.
>>
>> In the mean time I created http://jira.pentaho.com/browse/PDI-6157 and
>> implemented the work. Actually Sven Boden did years ago when he
>> implemented parameter support in the Job and Transformation entries.
>>
>> Cheers,
>> Matt
>>
>>
>> 2011/5/6 Jens Bleuel <jbleuel (AT) pentaho (DOT) com <mailto:jbleuel (AT) pentaho (DOT) com>>
>>
>>
>> A little bit unusual, but it happened:
>>
>> A) Have a looping option in a transformation.
>> Yes, I know this is normally out of scope for transformations, but
>> the following use case is there in regards to a continues data load
>> / real time processing:
>> - Query JMS or other queues with some sort of Input step
>> - This will stop when all the available data is processed (or for
>> JMS until a timeout is reached or just continue infinite - this is
>> the only step I know where this "infinite continue" is implemented)
>> - Now imagine, after a specific amount of time new data arrive and
>> need to be processed asap.
>>
>> Actually you may need to restart the transformation what costs
>> performance, need a looping logic outside of the transformation and
>> all this leads to a small delay.
>>
>> We have for example the JMS consumer step that can read
>> continuously. A problem here is: When you want to stop the
>> transformation in a controlled way, this is not possible at this
>> time since all steps get the signal to stop and rows may silently
>> disappear. A feature request for this: react to a signal that is
>> sent only to this step and stop processing in a controlled way. A
>> JIRA needs to be created but in the project where this was found, a
>> loss of some rows (up to the buffer size) is not critical, really...
>>
>> From my point of view this request leads to an interesting loop
>> design proposal for transformations:
>> 1) Have an option for some input steps to just restart after they
>> are finished.
>> 2) The restart may be delayed for a specific amount of time.
>> 3) This step needs to listen to a specific signal to stop. This is
>> different from stopping the transformation.
>>
>> B) This type of looping option is actually possible within jobs with
>> the Start job entry (repeat functionality) whereas product
>> management set this feature to deprecated since a while. The
>> recommendation was to restart and loop by the scheduler or external
>> process. But out of the above given reasons (overhead, delays and
>> even avoid overlapping job runs), I still think the features of the
>> start job entry are still valid. Especially since the link between
>> the scheduler and monitoring is not given, yet.
>> Adding the listener for a specific signal to stop a start job in a
>> controlled way and keep the repeat option, would be very nice to have.
>>
>> When a real looping logic within a job would be realized the delay
>> or a restart of a transformation may be acceptable in the above
>> scenario.
>>
>> For looping we may think of a "for/next" job entry implementation
>> with some options like:
>> - maximum number of iterations
>> - idle time before a next cycle
>> - some conditions to check if it should continue or not (I know the
>> phrase "some conditions" may be a wide range, e.g. variables to
>> check or checking a date/time range)
>> - a break option to end the "for/next" loop premature
>> - nested "for/next" should be allowed thus we may need an ID to
>> reference
>>
>> That are my thoughts for now...
>>
>> Cheers,
>> Jens
>>
>> Am 03.05.2011 16:41, schrieb Matt Casters:
>>
>> Hi Kettle devs,
>>
>> It has occurred to me earlier and more recently to others that
>> creating
>> loops in jobs is somewhat a cumbersome process.
>> So perhaps we can line up the top 5 of most common use-cases and
>> find
>> ease-of-use solutions to those?
>>
>> One use-case is where we loop over a DB result set (query), copy
>> the
>> rows to result, set variables and use those for each row in the
>> result set.
>> In that specific case I imagine we could wrap the "Table Input"
>> step
>> around a job entry, execute that, copy the rows to result, all
>> in one
>> job entry.
>> Setting the variables is something we could cram into the
>> "Transformation" or "Job" job entries without too much of a
>> problem.
>> That would mean we could eliminate 2 transformations: one to get
>> the
>> result set and one to set the variables inside the loop. All that
>> remains are 2 job entries: "Table Input" and "Transformation/Job".
>>
>> So do me (and our users:-) a favor and let us know your most common
>> use-case for loops in a job.
>> If there is a pattern we could perhaps come up with a more
>> clever way of
>> doing this compared to writing N new job entries for "Table Input",
>> "Text File Input" and so on.
>>
>> Thanks in advance!
>>
>> Matt
>> --
>> Matt Casters <mcasters (AT) pentaho (DOT) org <mailto:mcasters (AT) pentaho (DOT) org>
>> <mailto:mcasters (AT) pentaho (DOT) org <mailto:mcasters (AT) pentaho (DOT) org>>>
>>
>>
>> Chief Data Integration, Kettle founder, Author of Pentaho Kettle
>> Solutions
>> <
>> http://www.amazon.com/Pentaho-Kettle.../dp/0470635177
>> >
>> (Wiley
>> <http://eu.wiley.com/WileyCDA/WileyTi...470635177.html
>> >)
>>
>> Pentaho : The Commercial Open Source Alternative for Business
>> Intelligence
>>
>>
>> --
>>
>> You received this message because you are subscribed to the Google
>> Groups "kettle-developers" group.
>> To post to this group, send email to
>> kettle-developers (AT) googlegroups (DOT) com
>> <mailto:kettle-developers (AT) googlegroups (DOT) com>.
>>
>> To unsubscribe from this group, send email to
>> kettle-developers+unsubscribe (AT) g...oups (DOT) com
>> <mailto:kettle-developers%2Bunsubscribe (AT) googlegroups (DOT) com>.
>>
>> For more options, visit this group at
>> http://groups.google.com/group/kettle-developers?hl=en.
>>
>>
>> --
>> You received this message because you are subscribed to the Google
>> Groups "kettle-developers" group.
>> To post to this group, send email to
>> kettle-developers (AT) googlegroups (DOT) com
>> <mailto:kettle-developers (AT) googlegroups (DOT) com>.
>>
>> To unsubscribe from this group, send email to
>> kettle-developers+unsubscribe (AT) g...oups (DOT) com
>> <mailto:kettle-developers%2Bunsubscribe (AT) googlegroups (DOT) com>.
>>
>> For more options, visit this group at
>> http://groups.google.com/group/kettle-developers?hl=en.
>>
>>
>>
>>
>> --
>> Matt Casters <mcasters (AT) pentaho (DOT) org <mailto:mcasters (AT) pentaho (DOT) org>>
>> Chief Data Integration, Kettle founder, Author of Pentaho Kettle
>> Solutions
>> <
>> http://www.amazon.com/Pentaho-Kettle-Solutions-Building-Integration/dp/0470635177>
>> (Wiley
>> <http://eu.wiley.com/WileyCDA/WileyTitle/productCd-0470635177.html>)
>> Fonteinstraat 70, 9400 OKEGEM - Belgium - Cell : +32 486 97 29 37
>> Pentaho : The Commercial Open Source Alternative for Business Intelligence
>>
>>
>> --
>> You received this message because you are subscribed to the Google
>> Groups "kettle-developers" group.
>> To post to this group, send email to kettle-developers (AT) googlegroups (DOT) com.
>> To unsubscribe from this group, send email to
>> kettle-developers+unsubscribe (AT) g...oups (DOT) com.
>> For more options, visit this group at
>> http://groups.google.com/group/kettle-developers?hl=en.
>>
>
> --
> You received this message because you are subscribed to the Google Groups
> "kettle-developers" group.
> To post to this group, send email to kettle-developers (AT) googlegroups (DOT) com.
> To unsubscribe from this group, send email to
> kettle-developers+unsubscribe (AT) g...oups (DOT) com.
> For more options, visit this group at
> http://groups.google.com/group/kettle-developers?hl=en.
>
>
--
Matt Casters <mcasters (AT) pentaho (DOT) org>
Chief Data Integration, Kettle founder, Author of Pentaho Kettle
Solutions<http://www.amazon.com/Pentaho-Kettle-Solutions-Building-Integration/dp/0470635177>
(Wiley <http://eu.wiley.com/WileyCDA/WileyTitle/productCd-0470635177.html>)
Fonteinstraat 70, 9400 OKEGEM - Belgium - Cell : +32 486 97 29 37
Pentaho : The Commercial Open Source Alternative for Business Intelligence
--
You received this message because you are subscribed to the Google Groups "kettle-developers" group.
To post to this group, send email to kettle-developers (AT) googlegroups (DOT) com.
To unsubscribe from this group, send email to kettle-developers+unsubscribe (AT) g...oups (DOT) com.
For more options, visit this group at http://groups.google.com/group/kettle-developers?hl=en.
-
Re: Loops in jobs
To come back to Jens' loop question inside a transformation...
A step can choose to implement the "batchComplete()" method.
For example, the Sort rows step implements this.
If you send the rows by 500 into the "single threader" step, they will be
sorted in blocks of 500.
/**
* Calling this method will alert the step that we finished passing a
batch of records to the step.
* Specifically for steps like "Sort Rows" it means that the buffered
rows can be sorted and passed on.
* @throws KettleException In case an error occurs during the processing
of the batch of rows.
*/
public void batchComplete() throws KettleException;
That is, the step needs to know that no more rows are coming or otherwise
the transformation would block. That is since in this example only 500 rows
will ever arrive and the transformation will never be considered Finished
until the parent transformation is finished.
Anyway, have fun with it!
Matt
2011/5/7 Matt Casters <mcasters (AT) pentaho (DOT) org>
> I finished the single threading engine last night with the encouragements
> of the gang on ##pentaho (thanks for that):
>
> http://www.ibridge.be/?p=200
>
> Take care,
> Matt
>
>
> 2011/5/7 Jens Bleuel <jbleuel (AT) pentaho (DOT) com>
>
>> Wow, amazing good news!
>>
>> Am 06.05.2011 22:29, schrieb Matt Casters:
>>
>>> Namaste Jens,
>>>
>>> I didn't know that anyone "deprecated" looping under the "Start" job
>>> entry. There must have been a good reason to do so. However, I hereby
>>> officially "un-deprecate" it for 4.2.0 after I did extensive testing to
>>> make sure no memory leaks remain.
>>>
>>> http://jira.pentaho.com/browse/PDI-5502
>>>
>>> The first job entry to use the logic you descibed is the "HL7 MLLP
>>> Input" job entry. That one gets a single record from a HL7 queue and
>>> passes the message to the other job entries. It's very fast too ;-)
>>>
>>> The "re-start without stop" logic of steps will actually be possible
>>> with the "Single Threaded Transformation" execution engine. My plan was
>>> to drop that one behind a mapping-like step but it will become a
>>> separate step.
>>> The way that it will work is that you have say 1000 rows entering the
>>> new engine. That will process the rows in batches of 1-N.
>>> Certain steps will be restarted every time, like file reading steps or
>>> steps that read a single element off a queue somewhere (your JMS
>>> sample). Most steps will simply keep running. In a single call of the
>>> SingleThreadedTransExecutor.oneIteration(), all N rows in the batch are
>>> pushed through the transformation.
>>>
>>> Up until now I hadn't considered this to be a loop, but you are right,
>>> this would work for your situation.
>>>
>>> In the mean time I created http://jira.pentaho.com/browse/PDI-6157 and
>>> implemented the work. Actually Sven Boden did years ago when he
>>> implemented parameter support in the Job and Transformation entries.
>>>
>>> Cheers,
>>> Matt
>>>
>>>
>>> 2011/5/6 Jens Bleuel <jbleuel (AT) pentaho (DOT) com <mailto:jbleuel (AT) pentaho (DOT) com>>
>>>
>>>
>>> A little bit unusual, but it happened:
>>>
>>> A) Have a looping option in a transformation.
>>> Yes, I know this is normally out of scope for transformations, but
>>> the following use case is there in regards to a continues data load
>>> / real time processing:
>>> - Query JMS or other queues with some sort of Input step
>>> - This will stop when all the available data is processed (or for
>>> JMS until a timeout is reached or just continue infinite - this is
>>> the only step I know where this "infinite continue" is implemented)
>>> - Now imagine, after a specific amount of time new data arrive and
>>> need to be processed asap.
>>>
>>> Actually you may need to restart the transformation what costs
>>> performance, need a looping logic outside of the transformation and
>>> all this leads to a small delay.
>>>
>>> We have for example the JMS consumer step that can read
>>> continuously. A problem here is: When you want to stop the
>>> transformation in a controlled way, this is not possible at this
>>> time since all steps get the signal to stop and rows may silently
>>> disappear. A feature request for this: react to a signal that is
>>> sent only to this step and stop processing in a controlled way. A
>>> JIRA needs to be created but in the project where this was found, a
>>> loss of some rows (up to the buffer size) is not critical, really...
>>>
>>> From my point of view this request leads to an interesting loop
>>> design proposal for transformations:
>>> 1) Have an option for some input steps to just restart after they
>>> are finished.
>>> 2) The restart may be delayed for a specific amount of time.
>>> 3) This step needs to listen to a specific signal to stop. This is
>>> different from stopping the transformation.
>>>
>>> B) This type of looping option is actually possible within jobs with
>>> the Start job entry (repeat functionality) whereas product
>>> management set this feature to deprecated since a while. The
>>> recommendation was to restart and loop by the scheduler or external
>>> process. But out of the above given reasons (overhead, delays and
>>> even avoid overlapping job runs), I still think the features of the
>>> start job entry are still valid. Especially since the link between
>>> the scheduler and monitoring is not given, yet.
>>> Adding the listener for a specific signal to stop a start job in a
>>> controlled way and keep the repeat option, would be very nice to have.
>>>
>>> When a real looping logic within a job would be realized the delay
>>> or a restart of a transformation may be acceptable in the above
>>> scenario.
>>>
>>> For looping we may think of a "for/next" job entry implementation
>>> with some options like:
>>> - maximum number of iterations
>>> - idle time before a next cycle
>>> - some conditions to check if it should continue or not (I know the
>>> phrase "some conditions" may be a wide range, e.g. variables to
>>> check or checking a date/time range)
>>> - a break option to end the "for/next" loop premature
>>> - nested "for/next" should be allowed thus we may need an ID to
>>> reference
>>>
>>> That are my thoughts for now...
>>>
>>> Cheers,
>>> Jens
>>>
>>> Am 03.05.2011 16:41, schrieb Matt Casters:
>>>
>>> Hi Kettle devs,
>>>
>>> It has occurred to me earlier and more recently to others that
>>> creating
>>> loops in jobs is somewhat a cumbersome process.
>>> So perhaps we can line up the top 5 of most common use-cases and
>>> find
>>> ease-of-use solutions to those?
>>>
>>> One use-case is where we loop over a DB result set (query), copy
>>> the
>>> rows to result, set variables and use those for each row in the
>>> result set.
>>> In that specific case I imagine we could wrap the "Table Input"
>>> step
>>> around a job entry, execute that, copy the rows to result, all
>>> in one
>>> job entry.
>>> Setting the variables is something we could cram into the
>>> "Transformation" or "Job" job entries without too much of a
>>> problem.
>>> That would mean we could eliminate 2 transformations: one to get
>>> the
>>> result set and one to set the variables inside the loop. All that
>>> remains are 2 job entries: "Table Input" and "Transformation/Job".
>>>
>>> So do me (and our users:-) a favor and let us know your most
>>> common
>>> use-case for loops in a job.
>>> If there is a pattern we could perhaps come up with a more
>>> clever way of
>>> doing this compared to writing N new job entries for "Table
>>> Input",
>>> "Text File Input" and so on.
>>>
>>> Thanks in advance!
>>>
>>> Matt
>>> --
>>> Matt Casters <mcasters (AT) pentaho (DOT) org <mailto:mcasters (AT) pentaho (DOT) org>
>>> <mailto:mcasters (AT) pentaho (DOT) org <mailto:mcasters (AT) pentaho (DOT) org>>>
>>>
>>>
>>> Chief Data Integration, Kettle founder, Author of Pentaho Kettle
>>> Solutions
>>> <
>>> http://www.amazon.com/Pentaho-Kettle.../dp/0470635177
>>> >
>>> (Wiley
>>> <
>>> http://eu.wiley.com/WileyCDA/WileyTitle/productCd-0470635177.html>)
>>>
>>> Pentaho : The Commercial Open Source Alternative for Business
>>> Intelligence
>>>
>>>
>>> --
>>>
>>> You received this message because you are subscribed to the Google
>>> Groups "kettle-developers" group.
>>> To post to this group, send email to
>>> kettle-developers (AT) googlegroups (DOT) com
>>> <mailto:kettle-developers (AT) googlegroups (DOT) com>.
>>>
>>> To unsubscribe from this group, send email to
>>> kettle-developers+unsubscribe (AT) g...oups (DOT) com
>>> <mailto:kettle-developers%2Bunsubscribe (AT) googlegroups (DOT) com>.
>>>
>>> For more options, visit this group at
>>> http://groups.google.com/group/kettle-developers?hl=en.
>>>
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "kettle-developers" group.
>>> To post to this group, send email to
>>> kettle-developers (AT) googlegroups (DOT) com
>>> <mailto:kettle-developers (AT) googlegroups (DOT) com>.
>>>
>>> To unsubscribe from this group, send email to
>>> kettle-developers+unsubscribe (AT) g...oups (DOT) com
>>> <mailto:kettle-developers%2Bunsubscribe (AT) googlegroups (DOT) com>.
>>>
>>> For more options, visit this group at
>>> http://groups.google.com/group/kettle-developers?hl=en.
>>>
>>>
>>>
>>>
>>> --
>>> Matt Casters <mcasters (AT) pentaho (DOT) org <mailto:mcasters (AT) pentaho (DOT) org>>
>>> Chief Data Integration, Kettle founder, Author of Pentaho Kettle
>>> Solutions
>>> <
>>> http://www.amazon.com/Pentaho-Kettle-Solutions-Building-Integration/dp/0470635177>
>>> (Wiley
>>> <http://eu.wiley.com/WileyCDA/WileyTitle/productCd-0470635177.html>)
>>> Fonteinstraat 70, 9400 OKEGEM - Belgium - Cell : +32 486 97 29 37
>>> Pentaho : The Commercial Open Source Alternative for Business
>>> Intelligence
>>>
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "kettle-developers" group.
>>> To post to this group, send email to kettle-developers (AT) googlegroups (DOT) com.
>>> To unsubscribe from this group, send email to
>>> kettle-developers+unsubscribe (AT) g...oups (DOT) com.
>>> For more options, visit this group at
>>> http://groups.google.com/group/kettle-developers?hl=en.
>>>
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "kettle-developers" group.
>> To post to this group, send email to kettle-developers (AT) googlegroups (DOT) com.
>> To unsubscribe from this group, send email to
>> kettle-developers+unsubscribe (AT) g...oups (DOT) com.
>> For more options, visit this group at
>> http://groups.google.com/group/kettle-developers?hl=en.
>>
>>
>
>
> --
> Matt Casters <mcasters (AT) pentaho (DOT) org>
> Chief Data Integration, Kettle founder, Author of Pentaho Kettle Solutions<http://www.amazon.com/Pentaho-Kettle-Solutions-Building-Integration/dp/0470635177>
> (Wiley<http://eu.wiley.com/WileyCDA/WileyTitle/productCd-0470635177.html>
> )
>
> Fonteinstraat 70, 9400 OKEGEM - Belgium - Cell : +32 486 97 29 37
> Pentaho : The Commercial Open Source Alternative for Business Intelligence
>
>
>
--
Matt Casters <mcasters (AT) pentaho (DOT) org>
Chief Data Integration, Kettle founder, Author of Pentaho Kettle
Solutions<http://www.amazon.com/Pentaho-Kettle-Solutions-Building-Integration/dp/0470635177>
(Wiley <http://eu.wiley.com/WileyCDA/WileyTitle/productCd-0470635177.html>)
Fonteinstraat 70, 9400 OKEGEM - Belgium - Cell : +32 486 97 29 37
Pentaho : The Commercial Open Source Alternative for Business Intelligence
--
You received this message because you are subscribed to the Google Groups "kettle-developers" group.
To post to this group, send email to kettle-developers (AT) googlegroups (DOT) com.
To unsubscribe from this group, send email to kettle-developers+unsubscribe (AT) g...oups (DOT) com.
For more options, visit this group at http://groups.google.com/group/kettle-developers?hl=en.
Posting Permissions
- You may not post new threads
- You may not post replies
- You may not post attachments
- You may not edit your posts
-
Forum Rules