Hitachi Vantara Pentaho Community Forums

Thread: PDI over Hadoop?

  1. #1
    Yuval Oren Guest

    Default PDI over Hadoop?

    Hello,

    I'm a big fan of PDI but have had to switch to Hadoop for some data flows
    for scalability reasons. I miss the nice UI tools of PDI, though. Has anyone
    thought of building a Hadoop implementation of the PDI steps? It's a huge
    project for sure, but it's something Hadoop is sorely lacking.

    Cheers,
    Yuval

    --
    You received this message because you are subscribed to the Google Groups "kettle-developers" group.
    To post to this group, send email to kettle-developers (AT) googlegroups (DOT) com.
    To unsubscribe from this group, send email to kettle-developers+unsubscribe (AT) googlegroups (DOT) com.
    For more options, visit this group at http://groups.google.com/group/kettle-developers?hl=en.

  2. #2
    Matt Casters Guest

    Default Re: PDI over Hadoop?

    Hi Yuval,

    It's certainly not the first time we've heard the question.
    Some brainstorming sessions have already taken place and some steps have been taken in the architecture. Still, this is not something for the 4.0 timeframe.
    We will try to get "something" done later this year.

    As usual, input from your side is important. What would you like to see Kettle do specifically? Generate Hadoop jobs, or have Hadoop run steps?
    The way I see it, folks are mostly interested in the UI part. How do you see this?

    Take care,
    Matt
    --
    Matt Casters <mcasters (AT) pentaho (DOT) org>
    Chief Data Integration
    Fonteinstraat 70, 9400 OKEGEM - Belgium - Cell : +32 486 97 29 37
    Pentaho : The Commercial Open Source Alternative for Business Intelligence




  3. #3
    Yuval Oren Guest

    Default Re: PDI over Hadoop?

    Thanks for the quick reply, Matt. I certainly wasn't expecting this in the
    4.0 timeframe. The devil is in the details, but at a high level, here's what
    I was thinking for a Hadoop-based transformation runner:

    1. Read a transformation definition.
    2. Construct a Hadoop job to perform the equivalent of what Kettle does now.
    3. Execute the job.

    Kettle could have a library of MapReduce classes, one for each
    transformation step. On Hadoop 0.20+, MapReduce classes can actually be
    chained together, so this mirrors Kettle's architecture quite nicely. A
    Kettle transformation = a Hadoop job running a single ChainMapper, and a
    Kettle step = a Hadoop MapReduce class. This means that you wouldn't have to
    generate any Java code; the transformation runner would use Hadoop methods
    to construct the execution graph and then execute a single job.

    Pseudo-code might look a little like this:
    http://hadoop.apache.org/common/docs...ainMapper.html
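
    Hadoop's ChainMapper is a Java API, but the chaining idea can be sketched
    in plain Python. None of the names below are real Kettle or Hadoop
    classes; they're just stand-ins for per-step mappers:

    ```python
    # Sketch: a transformation as a chain of per-step "mappers", mirroring how
    # ChainMapper pipes one mapper's output into the next within a single job.
    # Each step is a function from one input row to zero or more output rows.

    def filter_rows(row):
        # Stand-in for a "Filter rows" step: keep rows where sales > 0.
        if row["sales"] > 0:
            yield row

    def calculator(row):
        # Stand-in for a "Calculator" step: derive a new field.
        yield {**row, "margin": row["sales"] - row["cost"]}

    def run_chain(steps, rows):
        """Run rows through each step in order, like a single chained job."""
        for step in steps:
            rows = [out for row in rows for out in step(row)]
        return rows

    rows = [{"sales": 10, "cost": 4}, {"sales": 0, "cost": 1}]
    result = run_chain([filter_rows, calculator], rows)
    # result == [{"sales": 10, "cost": 4, "margin": 6}]
    ```

    A real runner would register one mapper class per Kettle step with
    ChainMapper instead of composing Python functions, but the data flow is
    the same: one job, many stages, no intermediate jobs in between.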

    I'm sure there are many special cases that would have to be handled. One I
    can think of is input and output. For example, you probably wouldn't want
    every machine in your cluster doing a database query. For such an input,
    you'd probably want the "driver" machine -- the one executing the job -- to
    do the query and write it to the Hadoop DFS so cluster machines can just
    read from a file.
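
    A minimal sketch of that staging idea, using a local file as a stand-in
    for the Hadoop DFS (the function names here are hypothetical):

    ```python
    import csv
    import os
    import tempfile

    def stage_input(fetch_rows, path):
        """Driver side: run the query once and stage the results as a flat
        file (a stand-in for writing to HDFS), so workers read the file
        instead of each hitting the database."""
        with open(path, "w", newline="") as f:
            writer = csv.writer(f)
            for row in fetch_rows():
                writer.writerow(row)

    def worker_read(path):
        # Worker side: just read the staged file, no database access needed.
        with open(path, newline="") as f:
            return [tuple(r) for r in csv.reader(f)]

    # A fake "database query" standing in for the real input step.
    fake_query = lambda: [("alice", "10"), ("bob", "20")]

    path = os.path.join(tempfile.mkdtemp(), "staged.csv")
    stage_input(fake_query, path)
    staged = worker_read(path)
    # staged == [("alice", "10"), ("bob", "20")]
    ```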

    Cheers,
    Yuval




  4. #4
    Biswapesh Chattopadhyay Guest

    Default Re: PDI over Hadoop?

    This is a neat idea. We initially evaluated Kettle but had to go with a
    MapReduce-based solution for scalability reasons; however, we had to
    write XML files by hand to get it working, and it was a big pain. I
    think the way to go here is to take the XML generated by the UI, write
    special Mappers and Reducers corresponding to the different
    transformations, and then have a step which generates the Hadoop job
    from the XML using these components. One issue we have seen, though, is
    that the number of MapReduce passes increases significantly when you do
    that, which makes the ETL quite a bit slower than hand-coded MapReduce
    jobs, so you may want to consider some simple optimizations while
    generating the Hadoop graph.
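
    A rough sketch of the XML-to-job-plan idea. The XML shape and the
    step-to-class registry below are simplified inventions for illustration,
    not the real Kettle .ktr format or real class names:

    ```python
    import xml.etree.ElementTree as ET

    # Hypothetical registry mapping step types (as they'd appear in the
    # UI-generated XML) to hand-written MapReduce component class names.
    STEP_TO_CLASS = {
        "TextFileInput": "TextInputMapper",
        "FilterRows": "FilterMapper",
        "SortRows": "SortReducer",
    }

    def plan_from_xml(xml_text):
        """Walk the transformation XML and emit the MapReduce components,
        in step order, that a job generator would wire together."""
        root = ET.fromstring(xml_text)
        return [STEP_TO_CLASS[s.findtext("type")] for s in root.iter("step")]

    xml_text = """
    <transformation>
      <step><name>in</name><type>TextFileInput</type></step>
      <step><name>filter</name><type>FilterRows</type></step>
      <step><name>sort</name><type>SortRows</type></step>
    </transformation>"""

    plan = plan_from_xml(xml_text)
    # plan == ["TextInputMapper", "FilterMapper", "SortReducer"]
    ```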


  5. #5
    Nicholas Goodman Guest

    Default Re: PDI over Hadoop?

    Perhaps an intermediate model generation to something like Cascading could help as well. The overall object models for PDI xforms and Cascading are not wildly dissimilar.

    However, as the author of the whitepaper that showed compelling PDI results on commodity, cloud-based text file processing (which is a heck of a lot like what happens on Hadoop clusters), I'm in no hurry. MR has a bunch of nice things (recoverability, node recovery, automatic file sharding/distribution, etc.), but PDI can scale out to billions of records and 100+ nodes.

    So... we're all in agreement it's a good idea, but since PDI can already "go big", just not "go big with auto recovery/self-managing work distribution", I don't know that it's a burning requirement. Unless there's something else that MR buys beyond the items I just mentioned?

    Nick




  6. #6
    Biswapesh Chattopadhyay Guest

    Default Re: PDI over Hadoop?

    Commodity hardware + Large scale long running process + No failover /
    recoverability == No reliability. And where I come from, billions is
    considered small volumes :-)


  7. #7
    Yuval Oren Guest

    Default Re: PDI over Hadoop?

    I haven't played around with Kettle's distributed features, so I don't have
    the complete picture here. One thing I really like about Hadoop is that you
    don't have to think (much, anyway) about distributing the processing; it
    just happens. I think that would be possible with Kettle over Hadoop.

    Lack of reliability and recoverability is also a dealbreaker for us and, I
    imagine, many others. My company has enough machines that failures are a
    regular occurrence, and we don't have that many, relatively speaking.

    As Biswapesh points out, though, we'd need to manage the number of MR
    instances. Maybe transformations could be grouped together, and each MR
    class could use Kettle itself to run sub-transformations. As the MR graph is
    built, certain types of steps, such as sorts and groups, would always
    trigger a new sub-transformation.
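
    That splitting rule can be sketched as follows. The step-type names and
    the notion of "blocking" steps are illustrative assumptions, not Kettle
    terminology:

    ```python
    # Sketch: collapse consecutive map-only steps into one sub-transformation,
    # so a new MapReduce pass is only started at "blocking" steps that need a
    # shuffle (sorts, groups, joins).
    BLOCKING = {"sort", "group_by", "join"}

    def plan_jobs(steps):
        """steps: ordered list of step-type strings from a transformation.
        Returns a list of sub-transformations, each run by one MR pass."""
        jobs, current = [], []
        for step in steps:
            current.append(step)
            if step in BLOCKING:      # shuffle boundary: close this pass
                jobs.append(current)
                current = []
        if current:                   # trailing map-only steps form a pass
            jobs.append(current)
        return jobs

    steps = ["filter", "calc", "sort", "lookup", "group_by", "output"]
    jobs = plan_jobs(steps)
    # jobs == [["filter", "calc", "sort"],
    #          ["lookup", "group_by"],
    #          ["output"]]
    ```

    Six steps become three passes instead of six, which is the kind of
    simple fusion Biswapesh suggested to keep the generated plan closer to
    a hand-coded job.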

    Yuval





  8. #8
    Nicholas Goodman Guest

    Default Re: PDI over Hadoop?

    Hi Yuval - hope you're doing well!

    You guys are both correct: for large-scale (where "billions" is considered small), long-running (10hr+), 200-node data crunching, MR and the capabilities I already mentioned (recovery, auto balancing, file sharding, etc.) are great features.

    I think there'd be ways to split Kettle xforms into individual jobs in a metadata-driven way (using the Hadoop chaining pieces Yuval mentioned, or an intermediate meta model like Cascading). I think this would be very, very cool. I think you two should do it! I'm pretty sure Matt won't say no to committing a great Hadoop runtime for PDI.

    Nick

    On Mar 18, 2010, at 2:17 PM, Yuval Oren wrote:

    > I haven't played around with Kettle's distributed features, so I don't have the complete picture here. I do really like in Hadoop that you don't have to think (much, anyway) about distributing the processing. It just happens. I think that would be possible with Kettle over Hadoop.
    >
    > Lack of reliability and recoverability is also a dealbreaker for us and, I imagine, many others. My company has enough machines that failures are a regular occurrence, and we don't have that many, relatively speaking.
    >
    > As Biswapesh points out, though, we'd need to manage the number of MR instances. Maybe transformations could be grouped together, and each MR class could use Kettle itself to run sub-transformations. As the MR graph is built, certain types of steps, such as sorts and groups, would always trigger a new sub-transformation.
    >
    > Yuval
    >
    >
    > On Thu, Mar 18, 2010 at 1:33 PM, Biswapesh Chattopadhyay <biswapesh (AT) gmail (DOT) .com> wrote:
    > Commodity hardware + Large scale long running process + No failover /
    > recoverability == No reliability. And where I come from, billions is
    > considered small volumes :-)
    >
    > On Thu, Mar 18, 2010 at 12:48 PM, Nicholas Goodman
    > <ngoodman (AT) bayontechnologies (DOT) com> wrote:
    > > Perhaps an intermediate model generation to something like Cascadings can be of help as well. The overall object models for PDI xforms and Cascadings are not wildly dissimilar.
    > >
    > > However, as the author of the whitepaper that showed compelling PDI results on commodity, cloud based text file processing (which is a heck of a lot like what happens on Hadoop clusters) I'm in no hurry. MR has a bunch of nice things (recoverability, node recovery, auto file sharding/dist etc) but PDI can scale out to billions of records and100+ nodes.
    > >
    > > So... we're all in agreement it's a good idea, but since PDI can already "go big" just not "go big with auto recovery/self managing work distribution" I don't know it's a burning requirement. Unless - is there something else that MR buys other than the items I just mentioned?
    > >
    > > Nick
    > >
    > > On Mar 18, 2010, at 11:09 AM, Biswapesh Chattopadhyay wrote:
    > >
    > >> This is a neat idea. We had initially evaluated KETTLE but had to go
    > >> for a MapReduce based solution for scalability reasons; however we had
    > >> to write XML files by hand to get it working and it was a big pain. I
    > >> think the way to go here is to get the XML generated by the UI and
    > >> write special Mappers and Reducers corresponding to the different
    > >> transformations; then having a step which generates the Hadoop job
    > >> using these components from the XML. One issue we have seen though is
    > >> that the number of mapreduces increases significantly when you do that
    > >> and that makes the ETL quite a bit slower than hand-coded mapreduces,
    > >> so you may consider developing some simple optimizations while
    > >> generating the hadoop graph.
    > >>
  9. #9
    Matt Casters Guest

    Default PDI over Hadoop!

    My boss Richard and quite a few other people are very interested in exploring ways to leverage PDI in Map/Reduce - Hadoop settings.
    As such, I'll be spending 'some' time over the next couple of months to do a spike on this.

    As usual, I'll be as depth-first (practical) as possible to get something up-and-running to demo the concept in the short term.
    I think that the real challenge will be to leverage as much of the current PDI code-base as possible, but I think there is a lot of potential for success.

    Of course, since I'm quite the novice when it comes to Map/Reduce, I'll be welcoming & accepting all sorts of advice and contributions from you all (as usual).

    Regards,
    Matt
    --
    Matt Casters <mcasters (AT) pentaho (DOT) org>
    Chief Data Integration
    Fonteinstraat 70, 9400 OKEGEM - Belgium - Cell : +32 486 97 29 37
    Pentaho : The Commercial Open Source Alternative for Business Intelligence

    On Friday 19 March 2010 03:11:46 Nicholas Goodman wrote:
    > Hi Yuval - hope you're doing well!
    >
    > You guys are both correct: for large-scale (where "billions" is considered small), long-running (10hr+), 200-node data-crunching MR, the capabilities I already mentioned (recovery, auto balancing, file sharding, etc.) are great features.
    >
    > I think there'd be ways, in a metadata-driven way, to split Kettle xforms into individual jobs (using the Hadoop chaining pieces Yuval mentioned or an intermediate meta model like Cascading). I think this would be very, very cool. I think you two should do it! I'm pretty sure that Matt won't say no to committing a great Hadoop runtime for PDI.
    >
    > Nick
    >
    > On Mar 18, 2010, at 2:17 PM, Yuval Oren wrote:
    >
    > > I haven't played around with Kettle's distributed features, so I don't have the complete picture here. What I really like about Hadoop is that you don't have to think (much, anyway) about distributing the processing. It just happens. I think that would be possible with Kettle over Hadoop.
    > >
    > > Lack of reliability and recoverability is also a dealbreaker for us and, I imagine, many others. My company has enough machines that failures are a regular occurrence, and we don't have that many, relatively speaking.
    > >
    > > As Biswapesh points out, though, we'd need to manage the number of MR instances. Maybe transformations could be grouped together, and each MR class could use Kettle itself to run sub-transformations. As the MR graph is built, certain types of steps, such as sorts and groups, would always trigger a new sub-transformation.
    > >
    > > Yuval
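    The grouping rule Yuval describes above can be sketched in a few lines. This is purely illustrative: the step names and the "shuffle steps close a stage" heuristic are assumptions for the sketch, not Kettle's actual step model.

    ```python
    # Hypothetical sketch: walk a linear list of transformation steps and start a
    # new MapReduce stage (sub-transformation) whenever a step needs a shuffle,
    # such as a sort or a group-by. Step names here are illustrative only.

    SHUFFLE_STEPS = {"sort", "group_by"}

    def split_into_stages(steps):
        """Group consecutive row-at-a-time steps into one stage; a shuffle step
        closes the current stage and opens a new one that begins with it."""
        stages, current = [], []
        for step in steps:
            if step in SHUFFLE_STEPS and current:
                stages.append(current)
                current = []
            current.append(step)
        if current:
            stages.append(current)
        return stages

    steps = ["read", "filter", "sort", "aggregate", "group_by", "write"]
    stages = split_into_stages(steps)
    # → [["read", "filter"], ["sort", "aggregate"], ["group_by", "write"]]
    ```

    Each resulting stage would then map onto one MR job (or one chained mapper sequence), with the shuffle step supplying the job's sort/partition keys.
    
    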
    > >
    > >
    > > On Thu, Mar 18, 2010 at 1:33 PM, Biswapesh Chattopadhyay <biswapesh (AT) gmail (DOT) com> wrote:
    > > Commodity hardware + Large scale long running process + No failover /
    > > recoverability == No reliability. And where I come from, billions is
    > > considered small volumes :-)
    > >
    > > On Thu, Mar 18, 2010 at 12:48 PM, Nicholas Goodman
    > > <ngoodman (AT) bayontechnologies (DOT) com> wrote:
    > > > Perhaps an intermediate model generation to something like Cascadings can be of help as well. The overall object models for PDI xforms and Cascadings are not wildly dissimilar.
    > > >
    > > > However, as the author of the whitepaper that showed compelling PDI results on commodity, cloud-based text file processing (which is a heck of a lot like what happens on Hadoop clusters), I'm in no hurry. MR has a bunch of nice things (recoverability, node recovery, auto file sharding/dist, etc.) but PDI can scale out to billions of records and 100+ nodes.
    > > >
    > > > So... we're all in agreement it's a good idea, but since PDI can already "go big", just not "go big with auto recovery/self-managing work distribution", I don't know that it's a burning requirement. Unless: is there something else that MR buys other than the items I just mentioned?
    > > >
    > > > Nick
    > > >
    > > > On Mar 18, 2010, at 11:09 AM, Biswapesh Chattopadhyay wrote:
    > > >
    > > >> This is a neat idea. We had initially evaluated KETTLE but had to go
    > > >> for a MapReduce based solution for scalability reasons; however we had
    > > >> to write XML files by hand to get it working and it was a big pain. I
    > > >> think the way to go here is to get the XML generated by the UI and
    > > >> write special Mappers and Reducers corresponding to the different
    > > >> transformations; then have a step that generates the Hadoop job
    > > >> using these components from the XML. One issue we have seen though is
    > > >> that the number of mapreduces increases significantly when you do that
    > > >> and that makes the ETL quite a bit slower than hand-coded mapreduces,
    > > >> so you may consider developing some simple optimizations while
    > > >> generating the hadoop graph.
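    The XML-driven approach Biswapesh describes can be sketched roughly as follows. The XML shape, the step type names, and the mapper functions are all made up for illustration; real Kettle .ktr files are far richer than this.

    ```python
    # Hypothetical sketch: read step definitions out of a (much simplified,
    # invented) transformation XML and resolve each step type to a mapper
    # implementation from a registry, as a job builder might before submitting
    # the generated Hadoop job.
    import xml.etree.ElementTree as ET

    def filter_mapper(row):
        return row  # placeholder mapper body

    def sort_mapper(row):
        return row  # placeholder mapper body

    # Registry of special Mappers/Reducers, one per supported step type.
    STEP_REGISTRY = {"FilterRows": filter_mapper, "SortRows": sort_mapper}

    XML = """
    <transformation>
      <step><name>keep positives</name><type>FilterRows</type></step>
      <step><name>order</name><type>SortRows</type></step>
    </transformation>
    """

    def build_pipeline(xml_text):
        """Return the (name, mapper) pairs a job builder would chain together."""
        root = ET.fromstring(xml_text)
        pipeline = []
        for step in root.findall("step"):
            name = step.findtext("name")
            kind = step.findtext("type")
            pipeline.append((name, STEP_REGISTRY[kind]))
        return pipeline

    pipeline = build_pipeline(XML)
    # [name for name, _ in pipeline] == ["keep positives", "order"]
    ```

    In a real implementation the registry lookup would hand back configured Mapper/Reducer classes rather than bare functions, and unknown step types would need an explicit error path.
    
    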
    > > >>
    > > >> On Mon, Mar 15, 2010 at 10:17 AM, Yuval Oren <trumpety (AT) gmail (DOT) com> wrote:
    > > >>> Thanks for the quick reply, Matt. I certainly wasn't expecting this in the
    > > >>> 4.0 timeframe. The devil is in the details, but at a high level, here's what
    > > >>> I was thinking for a Hadoop-based transformation runner:
    > > >>>
    > > >>> 1. Read a transformation definition.
    > > >>> 2. Construct a Hadoop job to perform the equivalent of what Kettle does now.
    > > >>> 3. Execute the job.
    > > >>>
    > > >>> Kettle could have a library of MapReduce classes, one for each
    > > >>> transformation step. On Hadoop 0.20+, MapReduce classes can actually be
    > > >>> chained together, so this mirrors Kettle's architecture quite nicely. A
    > > >>> Kettle transformation = a Hadoop job running a single ChainMapper, and a
    > > >>> Kettle step = a Hadoop MapReduce class. This means that you wouldn't have to
    > > >>> generate any Java code; the transformation runner would use Hadoop methods
    > > >>> to construct the execution graph and then execute a single job.
    > > >>>
    > > >>> Pseudo-code might look a little like this:
    > > >>> http://hadoop.apache.org/common/docs...ainMapper.html
    > > >>>
    > > >>> I'm sure there are many special cases that would have to be handled. One I
    > > >>> can think of is input and output. For example, you probably wouldn't want
    > > >>> every machine in your cluster doing a database query. For such an input,
    > > >>> you'd probably want the "driver" machine -- the one executing the job -- to
    > > >>> do the query and write it to the Hadoop DFS so cluster machines can just
    > > >>> read from a file.
    > > >>>
    > > >>> Cheers,
    > > >>> Yuval
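    The "one mapper per Kettle step, chained into a single job" idea above can be modeled in a few lines, in the spirit of Hadoop's ChainMapper composing mappers into one job. The step functions and row shape here are assumptions for the sketch, not real Kettle steps or Hadoop APIs.

    ```python
    # Hypothetical sketch: a transformation as a chain of per-step mapper
    # functions, run row by row. A mapper returning None drops the row,
    # standing in for a filtering step.

    def select_values(row):
        """'Select values'-style step: keep only the named fields."""
        return {k: row[k] for k in ("id", "amount")}

    def filter_rows(row):
        """'Filter rows'-style step: pass the row through only if it matches."""
        return row if row["amount"] > 0 else None

    def run_chain(rows, steps):
        """Run every row through each step in order, dropping filtered rows."""
        out = []
        for row in rows:
            for step in steps:
                row = step(row)
                if row is None:
                    break
            else:
                out.append(row)
        return out

    rows = [
        {"id": 1, "amount": 10, "junk": "x"},
        {"id": 2, "amount": -5, "junk": "y"},
    ]
    result = run_chain(rows, [select_values, filter_rows])
    # → [{"id": 1, "amount": 10}]
    ```

    The appeal of the chained approach is exactly what Yuval notes: the runner composes existing step implementations into one job's execution graph, so no Java code generation is needed.
    
    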
    > > >>>
    > > >>> On Mon, Mar 15, 2010 at 9:14 AM, Matt Casters <mcasters (AT) pentaho (DOT) org> wrote:
    > > >>>>
    > > >>>> Hi Yuval,
    > > >>>>
    > > >>>> It's certainly not the first time we heard the question.
    > > >>>> Some brainstorming sessions have been done already and some steps have
    > > >>>> been taken in the architecture. Still, this is not something for the 4.0
    > > >>>> timeframe.
    > > >>>> We will try to get "something" done later this year.
    > > >>>>
    > > >>>> As usual, input from your side is important. What would you like to see
    > > >>>> Kettle do specifically? Generate Hadoop jobs or have Hadoop run steps?
    > > >>>> The way I see it, folks are mostly interested in the UI part. How do you
    > > >>>> see this?
    > > >>>>
    > > >>>> Take care,
    > > >>>> Matt
    > > >>>> --
    > > >>>> Matt Casters <mcasters (AT) pentaho (DOT) org>
    > > >>>> Chief Data Integration
    > > >>>> Fonteinstraat 70, 9400 OKEGEM - Belgium - Cell : +32 486 97 29 37
    > > >>>> Pentaho : The Commercial Open Source Alternative for Business Intelligence
    > > >>>>
    > > >>>> On Monday 15 March 2010 17:07:48 Yuval Oren wrote:
    > > >>>>> Hello,
    > > >>>>>
    > > >>>>> I'm a big fan of PDI but have had to switch to Hadoop for some data
    > > >>>>> flows
    > > >>>>> for scalability reasons. I miss the nice UI tools of PDI, though. Has
    > > >>>>> anyone
    > > >>>>> thought of building a Hadoop implementation of the PDI steps? It's a
    > > >>>>> huge
    > > >>>>> project for sure, but it's something Hadoop is sorely lacking.
    > > >>>>>
    > > >>>>> Cheers,
    > > >>>>> Yuval
    > > >>>>>
    > > >>>>>
    > > >>>>
