Hitachi Vantara Pentaho Community Forums

Thread: Data Profiling Feature

  1. #1
    Jeffrey Mo Guest

    Data Profiling Feature

    Hello Kettle developers,

    I'm a developer at SQL Power Group, and I'd like to discuss the data
    profiling feature we've been working on. I understand that our
    development lead, Jonathan Fuerth, has already had some preliminary
    discussions with Matt Casters on this.

    Our team at SQL Power has been working on adding a data profiling
    feature to Kettle. We have a working implementation on the trunk
    version of Kettle. We've tested it successfully on Windows XP and on
    OS X 10.4. Currently, the profiler is accessed through the Database
    Explorer feature, although we could also make it available in other
    places if desired.

    How would you like us to submit this feature for review and testing?
    We can send the source code changes as an Eclipse patch, along with
    the library JAR files it depends on. These include a couple of our own
    libraries, and a couple of JFree libraries (we use JFreeChart to
    display data in graphs).

    Hope to hear back from you soon!

    Jeffrey Mo
    --~--~---------~--~----~------------~-------~--~----~
    You received this message because you are subscribed to the Google Groups "kettle-developers" group.
    To post to this group, send email to kettle-developers (AT) googlegroups (DOT) com
    To unsubscribe from this group, send email to kettle-developers-unsubscribe (AT) g...oups (DOT) com
    For more options, visit this group at http://groups.google.com/group/kettle-developers?hl=en
    -~----------~----~----~----~------~----~------~--~---

  2. #2
    Sven Boden Guest

    Re: Data Profiling Feature

    Cool ... maybe put up a binary version somewhere? It's probably
    easier than a patch with jar files.

    Regards,
    Sven

    On Nov 26, 6:53 pm, Jeffrey Mo <jeff... (AT) sqlpower (DOT) ca> wrote:
    > Hello Kettle developers,
    >
    > I'm a developer at SQL Power Group, and I'd like to discuss the data
    > profiling feature we've been working on. I understand that our
    > development lead, Jonathan Fuerth, has already had some preliminary
    > discussions with Matt Casters on this.
    >
    > Our team at SQL Power has been working on adding a data profiling
    > feature to Kettle. We have a working implementation on the trunk
    > version of Kettle. We've tested it successfully on Windows XP and on
    > OS X 10.4. Currently, the profiler is accessed through the Database
    > Explorer feature, although we could also make it available in other
    > places if desired.
    >
    > How would you like us to submit this feature for review and testing?
    > We can send the source code changes as an Eclipse patch, along with
    > the library JAR files it depends on. These include a couple of our own
    > libraries, and a couple of JFree libraries (we use JFreeChart to
    > display data in graphs).
    >
    > Hope to hear back from you soon!
    >
    > Jeffrey Mo


  3. #3
    Jens Bleuel Guest

    Re: Data Profiling Feature

    It would also be good to see the running solution; maybe you could
    put some slides here.

    At this stage the trunk will become 3.0.1 at some point, and I (and Matt
    stated this too) would like to avoid new functionality in there. (BTW:
    We have trouble with the new version checker that slipped in.) So let's
    keep the trunk / 3.0.1 as a mostly bug-fix release.

    Later on we need to decide how we can implement your new feature. Maybe
    you can give us some insight into what has to be changed within the code
    base. E.g. is it only a new button, or more? (I think more is needed.)

    You mentioned you need JFreeChart and some of your own libraries... what
    happens when another project needs a JFreeChart lib that is different
    from yours, and so on?

    I mention this just so that we are aware of the impact this can have.

    Thanks a lot and best greetings (live from the Frankfurt Kettle training),
    Jens

    Sven Boden wrote:
    >
    > Cool ... maybe put up a binary version somewhere? It's probably
    > easier than a patch with jar files.
    >
    > Regards,
    > Sven

    --
    Jens Bleuel




  4. #4
    Matt Casters Guest

    Re: Data Profiling Feature

    Hi Jeffrey!

    At the end of next week, December 7th, we will release version 3.0.1 and branch
    3.0.1 as well as 3.0.2. At that time we'll flag the trunk as 3.1.0 and we'll
    be able to stuff things in there.
    I'm all for adding the Pentaho Reporting libraries, JFreeChart, etc.
    Kettle is an ETL tool, not a reporting tool. If there will ever be
    a "reporting" step it will run as a plugin with a separate class loader and
    it should not be affected by these libraries too much. (I think ;-))

    When I spoke with Jonathan and his SQL Power team in Orlando in early 2007, I
    was excited because with the work-load we have been under I knew we couldn't
    pull off a profiler ourselves. As such I think it's absolutely wonderful to
    be able to work with Jeffrey, Jonathan and the rest of SQL Power to make this
    happen.

    If you look at the criteria on which open source project managers like Linus
    (for the Linux kernel) accept code donations, at the top spot is always the
    willingness to maintain the code. As such I'm happy that the SQL Power team
    is behind this code and not just a single person as we've had a few bad
    experiences with that in the past (Mod JS & Streaming XML Input).

    So there you have it Jeffrey, I'm very excited and ready to help you pull this
    off. Can you send the patch to me somehow? I'll make a special build so
    that people can play with it. (or you guys can do it too, it's just the "zip"
    ant target) Then we'll release 3.0.1 and open 3.1.0 and we can merge the code
    in for good.

    All the best!

    Matt
    ____________________________________________
    Matt Casters
    Chief Data Integration - Kettle founder
    Pentaho, Open Source Business Intelligence
    http://www.pentaho.org -- mcasters (AT) pentaho (DOT) org
    Tel. +32 (0) 486 97 29 37

    On Tuesday 27 November 2007 10:44:02 Jens Bleuel wrote:
    > It would also be good to see the running solution; maybe you could
    > put some slides here.
    >
    > At this stage the trunk will become 3.0.1 at some point, and I (and Matt
    > stated this too) would like to avoid new functionality in there. (BTW:
    > We have trouble with the new version checker that slipped in.) So let's
    > keep the trunk / 3.0.1 as a mostly bug-fix release.
    >
    > Later on we need to decide how we can implement your new feature. Maybe
    > you can give us some insight into what has to be changed within the code
    > base. E.g. is it only a new button, or more? (I think more is needed.)
    >
    > You mentioned you need JFreeChart and some of your own libraries... what
    > happens when another project needs a JFreeChart lib that is different
    > from yours, and so on?
    >
    > I mention this just so that we are aware of the impact this can have.
    >
    > Thanks a lot and best greetings (live from the Frankfurt Kettle training),
    > Jens

  5. #5
    Jonathan Fuerth Guest

    Re: Data Profiling Feature

    Hi Matt, Jens, Sven, et al.

    Sven wrote:
    > maybe put up a binary version somewhere? It's probably
    > easier than a patch with jar files.


    We're happy to produce a Kettle build incorporating our change, but we
    wouldn't be able to host it publicly for too long on our site, since
    we pay quite a bit for our bandwidth. If you have somewhere to host
    it, Matt, that would be ideal. We'll make a copy available to you
    today!

    Jens wrote:
    > what happens when another project needs a JFreeChart lib that is different
    > from yours, and so on?


    Good point. We're actually not making very deep use of JFreeChart, so
    pretty much any version that includes a 2D pie chart will do the
    trick. Newer is usually better in terms of bug count and overall
    compatibility, but if another chunk of code has managed to tie itself
    to last year's JFreeChart, that won't bother our profiling GUI.

    The bigger issue with JFreeChart is that it uses Java2D and Swing, and
    the SWT-AWT bridge is constantly breaking on OS X every time Apple
    releases a Java update. As of now, the chart works on 10.4.x, but
    doesn't show up on 10.5 (Leopard). There's a rumor that Apple will be
    including something more permanent for bridging SWT-AWT in the Java 6
    for Leopard. I'm not holding my breath though.

    Jens wrote:
    > Later on we need to decide how we can implement your new feature. Maybe
    > you can give us some insight into what has to be changed within the code
    > base. E.g. is it only a new button, or more? (I think more is needed.)


    We've tried to make the footprint on Kettle's code base as minimal as
    possible. The core of the profiling feature itself (no GUI) is stable
    and has been included in our Power*Architect data modeling tool for
    some time now. This profiling core depends on our own library and the
    core of the Architect, but those two jars are not too big.

    So, the actual code change to Kettle is limited to an SWT GUI for
    viewing profile results, a new button on the Database Explorer (but
    we'd like ideas for other places to integrate the profiling feature),
    and a little bit of glue code that takes a Kettle connection, catalog,
    schema, and table name and uses them to build our own model of
    database metadata, which we pass to our profiler API.

    I should mention at this point (to set your expectations), the actual
    profiling functions we apply to the data set are not terribly
    advanced. No university-level statistics here (yet)! The profiling
    API is modular, so it will be easy to add new profiling functions in
    the future. We'll choose the enhancements based on need and/or
    contributions. As it stands, we don't have the time to dream up
    esoteric profiling functions and implement them "because we can."
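
    As a rough illustration of what a modular profiling API could look
    like (a hypothetical sketch, not the SQL Power profiler's actual
    API): each profiling function is a small callable registered under a
    name, and the profiler applies every registered function to a
    column's values. The names PROFILE_FUNCTIONS, profile_function and
    profile_column are invented for this example.

```python
from typing import Any, Callable, Dict, List, Optional

# Hypothetical registry: maps a profiling function's name to its implementation.
PROFILE_FUNCTIONS: Dict[str, Callable[[List[Optional[Any]]], Any]] = {}


def profile_function(name: str):
    """Decorator that registers a profiling function under the given name."""
    def decorator(func):
        PROFILE_FUNCTIONS[name] = func
        return func
    return decorator


@profile_function("row_count")
def row_count(values):
    return len(values)


@profile_function("null_count")
def null_count(values):
    return sum(1 for v in values if v is None)


@profile_function("distinct_count")
def distinct_count(values):
    # NULLs are not counted as a distinct value.
    return len({v for v in values if v is not None})


@profile_function("min")
def min_value(values):
    non_null = [v for v in values if v is not None]
    return min(non_null) if non_null else None


@profile_function("max")
def max_value(values):
    non_null = [v for v in values if v is not None]
    return max(non_null) if non_null else None


def profile_column(values):
    """Apply every registered profiling function to one column's values."""
    return {name: func(values) for name, func in PROFILE_FUNCTIONS.items()}


result = profile_column([3, None, 7, 3, 10])
print(result)
```

    With this shape, adding a new profiling function later means writing
    one more decorated function; profile_column itself does not change.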

    Matt wrote:
    > At the end of next week, December 7th, we will release version 3.0.1 and branch
    > 3.0.1 as well as 3.0.2. At that time we'll flag the trunk as 3.1.0 and we'll
    > be able to stuff things in there.


    Ok, great.

    > When I spoke with Jonathan and his SQL Power team in Orlando in early 2007, I
    > was excited because with the work-load we have been under I knew we couldn't
    > pull off a profiler ourselves. As such I think it's absolutely wonderful to
    > be able to work with Jeffrey, Jonathan and the rest of SQL Power to make this
    > happen.


    Yes, we're happy to be working with you guys on this too.

    We'll get you a release with our changes at some point today.

    -Jonathan


  6. #6
    Matt Casters Guest

    Re: Data Profiling Feature

    Hi Jonathan,

    I'll host the images on Amazon S3. It's as cheap as it gets and delivers good
    service. I've come to like it very much ;-) The milestone images I've put
    there have typically been downloaded a few hundred times and that only costs a
    few $US. (almost 0 in Euro :-))

    As far as the SWT/AWT bridge is concerned, I still have code that translates
    Swing images to SWT images. We could use that and get rid of the bridge
    altogether. It's really not that complicated code.

    For now, feel free to hand me some work for this and I'll make sure the images
    pop up ;-)

    All the best!
    Matt


    On Tuesday 27 November 2007 17:20:44 Jonathan Fuerth wrote:
    > Hi Matt, Jens, Sven, et al.
    >
    > Sven wrote:
    > > maybe put up a binary version somewhere? It's probably
    > > easier than a patch with jar files.

    >
    > We're happy to produce a Kettle build incorporating our change, but we
    > wouldn't be able to host it publicly for too long on our site, since
    > we pay quite a bit for our bandwidth. If you have somewhere to host
    > it, Matt, that would be ideal. We'll make a copy available to you
    > today!
    >
    > Jens wrote:
    > > what happens when another project needs a JFreeChart lib that is
    > > different from yours, and so on?

    >
    > Good point. We're actually not making very deep use of JFreeChart, so
    > pretty much any version that includes a 2D pie chart will do the
    > trick. Newer is usually better in terms of bug count and overall
    > compatibility, but if another chunk of code has managed to tie itself
    > to last year's JFreeChart, that won't bother our profiling GUI.
    >
    > The bigger issue with JFreeChart is that it uses Java2D and Swing, and
    > the SWT-AWT bridge is constantly breaking on OS X every time Apple
    > releases a Java update. As of now, the chart works on 10.4.x, but
    > doesn't show up on 10.5 (Leopard). There's a rumor that Apple will be
    > including something more permanent for bridging SWT-AWT in the Java 6
    > for Leopard. I'm not holding my breath though.
    >
    > Jens wrote:
    > > Later on we need to decide how we can implement your new feature. Maybe
    > > you can give us some insight into what has to be changed within the code
    > > base. E.g. is it only a new button, or more? (I think more is needed.)

    >
    > We've tried to make the footprint on Kettle's code base as minimal as
    > possible. The core of the profiling feature itself (no GUI) is stable
    > and has been included in our Power*Architect data modeling tool for
    > some time now. This profiling core depends on our own library and the
    > core of the Architect, but those two jars are not too big.
    >
    > So, the actual code change to Kettle is limited to an SWT GUI for
    > viewing profile results, a new button on the Database Explorer (but
    > we'd like ideas for other places to integrate the profiling feature),
    > and a little bit of glue code that takes a Kettle connection, catalog,
    > schema, and table name and uses them to build our own model of
    > database metadata, which we pass to our profiler API.
    >
    > I should mention at this point (to set your expectations), the actual
    > profiling functions we apply to the data set are not terribly
    > advanced. No university-level statistics here (yet)! The profiling
    > API is modular, so it will be easy to add new profiling functions in
    > the future. We'll choose the enhancements based on need and/or
    > contributions. As it stands, we don't have the time to dream up
    > esoteric profiling functions and implement them "because we can."
    >
    > Matt wrote:
    > > At the end of next week, December 7th, we will release version 3.0.1 and
    > > branch 3.0.1 as well as 3.0.2. At that time we'll flag the trunk as
    > > 3.1.0 and we'll be able to stuff things in there.

    >
    > Ok, great.
    >
    > > When I spoke with Jonathan and his SQL Power team in Orlando in early
    > > 2007, I was excited because with the work-load we have been under I knew
    > > we couldn't pull off a profiler ourselves. As such I think it's
    > > absolutely wonderful to be able to work with Jeffrey, Jonathan and the
    > > rest of SQL Power to make this happen.

    >
    > Yes, we're happy to be working with you guys on this too.
    >
    > We'll get you a release with our changes at some point today.
    >
    > -Jonathan
    >
    >



    --
    Matt
    ____________________________________________
    Matt Casters
    Chief Data Integration - Kettle founder
    Pentaho, Open Source Business Intelligence
    http://www.pentaho.org -- mcasters (AT) pentaho (DOT) org
    Tel. +32 (0) 486 97 29 37


  7. #7
    Darren Hartford Guest

    RE: Data Profiling Feature

    I just finished a QA project involving sampling, and I came across this
    really useful one-page reference that was invaluable to me and may
    help: http://www.petefreitag.com/item/466.cfm

    My use case was based on 10% sampling, instead of fixed-number-of-rows
    sampling, although I have had a need in the past for
    fixed-number-of-rows sampling regardless of the number of incoming
    records.

    Just sharing the above link to avoid going through each database's own
    documentation :-)
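
    The two sampling modes mentioned above (a percentage of the rows
    versus a fixed number of rows) can be sketched independently of any
    database. This is a minimal client-side illustration, not taken from
    the linked page: sample_fraction and sample_fixed are hypothetical
    names, and the fixed-size variant uses reservoir sampling
    (Algorithm R), which works on a stream whose length is not known in
    advance.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def sample_fraction(rows: Iterable[T], fraction: float, rng: random.Random) -> List[T]:
    """Keep each row independently with probability `fraction` (e.g. 0.10).

    The sample size is only approximately fraction * number_of_rows.
    """
    return [row for row in rows if rng.random() < fraction]


def sample_fixed(rows: Iterable[T], k: int, rng: random.Random) -> List[T]:
    """Reservoir sampling (Algorithm R): pick k rows from a stream.

    Returns all rows if the stream holds fewer than k of them.
    """
    reservoir: List[T] = []
    for i, row in enumerate(rows):
        if i < k:
            reservoir.append(row)      # fill the reservoir first
        else:
            j = rng.randint(0, i)      # inclusive on both ends
            if j < k:
                reservoir[j] = row     # replace with probability k / (i + 1)
    return reservoir


rng = random.Random(42)                # seeded so runs are repeatable
rows = range(1000)
ten_percent = sample_fraction(rows, 0.10, rng)
fixed_50 = sample_fixed(rows, 50, rng)
print(len(ten_percent), len(fixed_50))
```

    Note that sampling on the client still reads every row from the
    database; it only reduces the work done after the rows arrive.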

    -D

    Response to:
    > I'm not sure that the full-scan of the table is the limiting factor here.
    > I tried it with a 1 million rows table and the profiler was busy for about
    > 5 minutes. At the same time, a 3.0 transformation reads *all* the rows
    > from the same table in 7 seconds flat.
    >
    > So one way or another, there is analytical work being done on the MySQL
    > database and I'm not sure that this is the right way to go. (MySQL is
    > using 100% CPU for minutes)
    >
    > Matt



  8. #8
    Jonathan Fuerth Guest

    Re: Data Profiling Feature

    On 12/4/07, Darren Hartford <dhartford (AT) ghsinc (DOT) com> wrote:
    > I just finished a QA project involving sampling, and I came across this
    > really useful one-page piece of information that was invaluable that may
    > help: http://www.petefreitag.com/item/466.cfm


    Thanks, Darren! That's a really valuable resource. We'll try those
    techniques on the databases at our disposal, and if they improve
    performance, we'll incorporate them into the existing profiler.

    I do hate to do platform-specific stuff like this, but when the
    network bandwidth is the limiting factor, there is little choice but
    to start using SQL tricks to reduce the amount of data being moved
    around.
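
    A minimal sketch of the kind of SQL trick meant here, using an
    in-memory SQLite database purely for illustration (the customers
    table and its columns are made up, and this is not the profiler's
    actual query). The sample is drawn inside the database so that only
    the sampled rows cross the connection, and the summary figures come
    back as a single row. The random-ordering function is the
    platform-specific part: SQLite spells it RANDOM(), and other
    databases use different syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, city TEXT)")
conn.executemany(
    "INSERT INTO customers (city) VALUES (?)",
    # Every tenth row gets a NULL city; the rest cycle through 7 city names.
    [(None if i % 10 == 0 else f"city_{i % 7}",) for i in range(1000)],
)

# Sample inside the database: only 100 rows are returned to the client.
sample = conn.execute(
    "SELECT id, city FROM customers ORDER BY RANDOM() LIMIT ?", (100,)
).fetchall()

# Aggregate inside the database: a single summary row is returned.
total, nulls, distinct = conn.execute(
    "SELECT COUNT(*), SUM(city IS NULL), COUNT(DISTINCT city) FROM customers"
).fetchone()

print(len(sample), total, nulls, distinct)
conn.close()
```

    This reduces the number of rows transferred, not the work the
    database does: ordering by a random value still scans and sorts the
    whole table on the server.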

    -Jonathan



Copyright © 2005 - 2019 Hitachi Vantara Corporation. All Rights Reserved.