Hitachi Vantara Pentaho Community Forums

Thread: Data Profiling Feature

  1. #1
    Jeffrey Mo Guest

    Data Profiling Feature

    Hello Kettle developers,

    I'm a developer at SQL Power Group, and I'd like to discuss the data
    profiling feature we've been working on. I understand that our
    development lead, Jonathan Fuerth, has already had some preliminary
    discussions with Matt Casters on this.

    Our team at SQL Power have been working on adding a data profiling
    feature to Kettle. We have a working implementation on the trunk
    version of Kettle. We've tested it successfully on Windows XP and on
    OS X 10.4. Currently, the profiler is accessed through the Database
    Explorer feature, although we could also make it available in other
    places if desired.

    How would you like us to submit this feature for review and testing?
    We can send the source code changes as an Eclipse patch, along with
    the library JAR files it depends on. These include a couple of our own
    libraries, and a couple of JFree libraries (we use JFreeChart to
    display data in graphs).

    Hope to hear back from you soon!

    Jeffrey Mo
    --~--~---------~--~----~------------~-------~--~----~
    You received this message because you are subscribed to the Google Groups "kettle-developers" group.
    To post to this group, send email to kettle-developers (AT) googlegroups (DOT) com
    To unsubscribe from this group, send email to kettle-developers-unsubscribe (AT) g...oups (DOT) com
    For more options, visit this group at http://groups.google.com/group/kettle-developers?hl=en
    -~----------~----~----~----~------~----~------~--~---

  2. #2
    Sven Boden Guest

    Re: Data Profiling Feature

    Cool ... maybe put up a binary version somewhere? It's probably
    easier than a patch with jar files.

    Regards,
    Sven

    On Nov 26, 6:53 pm, Jeffrey Mo <jeff... (AT) sqlpower (DOT) ca> wrote:
    > [...]

  3. #3
    Jens Bleuel Guest

    Re: Data Profiling Feature

    It would also be good to see the running solution; maybe you could
    post some slides here.

    At this stage the trunk will become 3.0.1 at some point, and I (and Matt
    stated this too) would like to avoid new functionality in there. (BTW:
    We have trouble with the new version checker that slipped in.) So let's
    keep the trunk / 3.0.1 as mostly a bug-fix release.

    Later on we need to decide how we can implement your new feature. Maybe
    you can give us some insight into what has to be changed within the code
    base. E.g. is it only a new button or more (I think more is needed)?

    You mentioned you need JFreeChart and some of your own libraries... what
    happens when another project needs a JFreeChart lib that is different
    from yours, and so on?

    I just mention this so that we are aware of the impact this can have.

    Thanks a lot and best greetings (live from the Frankfurt Kettle training),
    Jens

    Sven Boden wrote:
    > [...]

    --
    Jens Bleuel




  4. #4
    Matt Casters Guest

    Re: Data Profiling Feature

    Hi Jeffrey!

    At the end of next week, December 7th, we will release version 3.0.1 and branch
    3.0.1 as well as 3.0.2. At that time we'll flag the trunk as 3.1.0 and we'll
    be able to stuff things in there.
    I'm all for adding the Pentaho Reporting libraries, JFreeChart, etc.
    Kettle is an ETL tool, not a reporting tool. If there will ever be
    a "reporting" step it will run as a plugin with a separate class loader and
    it should not be affected by these libraries too much. (I think ;-))

    When I spoke with Jonathan and his SQL Power team in Orlando in early 2007, I
    was excited because with the work-load we have been under I knew we couldn't
    pull off a profiler ourselves. As such I think it's absolutely wonderful to
    be able to work with Jeffrey, Jonathan and the rest of SQL Power to make this
    happen.

    If you look at the criteria on which open source project managers like Linus
    (for the Linux kernel) accept code donations, at the top spot is always the
    willingness to maintain the code. As such I'm happy that the SQL Power team
    is behind this code and not just a single person as we've had a few bad
    experiences with that in the past (Mod JS & Streaming XML Input).

    So there you have it Jeffrey, I'm very excited and ready to help you pull this
    off. Can you send the patch to me somehow? I'll make a special build so
    that people can play with it. (or you guys can do it too, it's just the "zip"
    ant target) Then we'll release 3.0.1 and open 3.1.0 and we can merge the code
    in for good.

    All the best!

    Matt
    ____________________________________________
    Matt Casters
    Chief Data Integration - Kettle founder
    Pentaho, Open Source Business Intelligence
    http://www.pentaho.org -- mcasters (AT) pentaho (DOT) org
    Tel. +32 (0) 486 97 29 37

    On Tuesday 27 November 2007 10:44:02 Jens Bleuel wrote:
    > [...]

  5. #5
    Jonathan Fuerth Guest

    Re: Data Profiling Feature

    Hi Matt, Jens, Sven, et al.

    Sven wrote:
    > maybe put up a binary version somewhere? It's probably
    > easier than a patch with jar files.


    We're happy to produce a Kettle build incorporating our change, but we
    wouldn't be able to host it publicly for too long on our site, since
    we pay quite a bit for our bandwidth. If you have somewhere to host
    it, Matt, that would be ideal. We'll make a copy available to you
    today!

    Jens wrote:
    > what happens when another project needs a JFreeChart lib that is different
    > from yours, and so on?


    Good point. We're actually not making very deep use of JFreeChart, so
    pretty much any version that includes a 2D pie chart will do the
    trick. Newer is usually better in terms of bug count and overall
    compatibility, but if another chunk of code has managed to tie itself
    to last year's JFreeChart, that won't bother our profiling GUI.

    The bigger issue with JFreeChart is that it uses Java2D and Swing, and
    the SWT-AWT bridge is constantly breaking on OS X every time Apple
    releases a Java update. As of now, the chart works on 10.4.x, but
    doesn't show up on 10.5 (Leopard). There's a rumor that Apple will be
    including something more permanent for bridging SWT-AWT in the Java 6
    for Leopard. I'm not holding my breath though.

    Jens wrote:
    > Later on we need to decide how we can implement your new feature. Maybe
    > you can give us some insight into what has to be changed within the code
    > base. E.g. is it only a new button or more (I think more is needed)?


    We've tried to make the footprint on Kettle's code base as minimal as
    possible. The core of the profiling feature itself (no GUI) is stable
    and has been included in our Power*Architect data modeling tool for
    some time now. This profiling core depends on our own library and the
    core of the Architect, but those two jars are not too big.

    So, the actual code change to Kettle is limited to an SWT GUI for
    viewing profile results, a new button on the Database Explorer (but
    we'd like ideas for other places to integrate the profiling feature),
    and a little bit of glue code that takes a Kettle connection, catalog,
    schema, and table name and uses them to build our own model of the
    database metadata, which we pass to our profiler API.

    I should mention at this point (to set your expectations), the actual
    profiling functions we apply to the data set are not terribly
    advanced. No university-level statistics here (yet)! The profiling
    API is modular, so it will be easy to add new profiling functions in
    the future. We'll choose the enhancements based on need and/or
    contributions. As it stands, we don't have the time to dream up
    esoteric profiling functions and implement them "because we can."
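
    To illustrate what a modular profiling API can look like, here is a
    minimal Java sketch. It is an illustration under stated assumptions, not
    the SQL Power profiler API: the names ProfileFunction, NullCount,
    DistinctCount, and profile() are hypothetical, and the column is a plain
    in-memory list rather than a database table.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class ProfileSketch {

    /** Hypothetical plug-in point: one statistic computed over a column's values. */
    interface ProfileFunction {
        String name();
        Object apply(List<?> columnValues);
    }

    /** Counts the null values in the column. */
    static class NullCount implements ProfileFunction {
        public String name() { return "nullCount"; }
        public Object apply(List<?> values) {
            int nulls = 0;
            for (Object v : values) {
                if (v == null) nulls++;
            }
            return nulls;
        }
    }

    /** Counts the distinct non-null values in the column. */
    static class DistinctCount implements ProfileFunction {
        public String name() { return "distinctCount"; }
        public Object apply(List<?> values) {
            Set<Object> seen = new HashSet<>();
            for (Object v : values) {
                if (v != null) seen.add(v);
            }
            return seen.size();
        }
    }

    /** Runs every registered function and collects the results by name. */
    static Map<String, Object> profile(List<?> values, List<ProfileFunction> functions) {
        Map<String, Object> result = new LinkedHashMap<>();
        for (ProfileFunction f : functions) {
            result.put(f.name(), f.apply(values));
        }
        return result;
    }

    public static void main(String[] args) {
        List<ProfileFunction> functions = new ArrayList<>();
        functions.add(new NullCount());
        functions.add(new DistinctCount());
        List<String> column = Arrays.asList("a", "b", null, "a");
        System.out.println(profile(column, functions)); // {nullCount=1, distinctCount=2}
    }
}
```

    In a design like this, adding a new statistic means writing one more
    ProfileFunction implementation and registering it; the runner itself
    stays unchanged.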

    Matt wrote:
    > At the end of next week, December 7th, we will release version 3.0.1 and branch
    > 3.0.1 as well as 3.0.2. At that time we'll flag the trunk as 3.1.0 and we'll
    > be able to stuff things in there.


    Ok, great.

    > When I spoke with Jonathan and his SQL Power team in Orlando in early 2007, I
    > was excited because with the work-load we have been under I knew we couldn't
    > pull off a profiler ourselves. As such I think it's absolutely wonderful to
    > be able to work with Jeffrey, Jonathan and the rest of SQL Power to make this
    > happen.


    Yes, we're happy to be working with you guys on this too.

    We'll get you a release with our changes at some point today.

    -Jonathan


  6. #6
    Matt Casters Guest

    Re: Data Profiling Feature

    Hi Jonathan,

    I'll host the images on Amazon S3. It's as cheap as it gets and delivers good
    service. I've come to like it very much ;-) The milestone images I've put
    there have typically been downloaded a few hundred times and that only costs a
    few $US. (almost 0 in Euro :-))

    As far as the SWT/AWT bridge is concerned, I still have code that translates
    Swing images to SWT images. We could use that and get rid of the bridge
    altogether. It's really not that complicated code.
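
    As a rough illustration of avoiding the bridge (not Matt's actual code,
    which is not shown in this thread), one approach is to render the Java2D
    content off-screen into a BufferedImage and hand the raw pixels to SWT.
    JFreeChart offers createBufferedImage(width, height) for the rendering
    step. The sketch below covers only the AWT half, using standard JDK
    classes; the class and method names are hypothetical, and the SWT half is
    only outlined in a comment because it needs the SWT jar on the classpath.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

public class AwtToSwtSketch {

    /** Draws some Java2D content off-screen; a stand-in for rendering a chart. */
    static BufferedImage renderOffscreen(int width, int height) {
        BufferedImage image = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = image.createGraphics();
        try {
            g.setColor(Color.WHITE);
            g.fillRect(0, 0, width, height);     // white background
            g.setColor(Color.RED);
            g.fillRect(0, 0, width / 2, height); // left half red
        } finally {
            g.dispose();
        }
        return image;
    }

    /** Extracts packed 0xRRGGBB pixels, row by row, with the alpha byte cleared. */
    static int[] toRgbPixels(BufferedImage image) {
        int w = image.getWidth();
        int h = image.getHeight();
        int[] pixels = image.getRGB(0, 0, w, h, null, 0, w);
        for (int i = 0; i < pixels.length; i++) {
            pixels[i] &= 0xFFFFFF;
        }
        return pixels;
    }

    public static void main(String[] args) {
        int[] pixels = toRgbPixels(renderOffscreen(4, 2));
        System.out.printf("%06X %06X%n", pixels[0], pixels[3]); // FF0000 FFFFFF

        // SWT half (outline only, not compiled here): build an
        // org.eclipse.swt.graphics.ImageData with a direct
        // PaletteData(0xFF0000, 0xFF00, 0xFF), copy the pixels in row by row
        // (for example with ImageData.setPixels), and wrap the result in
        // new Image(display, imageData).
    }
}
```

    Because nothing here embeds an AWT component inside an SWT widget, the
    approach does not depend on the SWT-AWT bridge at all; the trade-off is
    that the chart becomes a static picture.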

    For now, feel free to hand me some work for this and I'll make sure the images
    pop up ;-)

    All the best!
    Matt


    On Tuesday 27 November 2007 17:20:44 Jonathan Fuerth wrote:
    > [...]

  7. #7
    Jeffrey Mo Guest

    Re: Data Profiling Feature

    Hi Matt,

    I've placed a file on one of our servers for you to download.

    http://trillian.sqlpower.ca/kettle_profile_patch.tar.gz

    We'll put it up there temporarily so that you guys can get a chance to try
    it out.

    When you unzip the archive, you should see a directory 'kettle_profile_patch'.
    Inside are the following:

    * In /lib, I placed the two Kettle JAR files recompiled with our profiling
    code in them.
    I should note that for some reason, building a Kettle distributable without
    our code changes results in a kettle-engine jar that is 10 MB, while building
    with our code changes results in a 2.9 MB jar. I'm not clear what the reason
    is for the size difference.

    * In /libext/sqlpower/ we have architect_lib.jar and sqlpower_library.jar
    which provide the profiling backend. We license these under the new BSD
    license.

    * In /libext/jfree/ we have the two JFree libraries that we use:
    jcommon-1.0.0.jar and jfreechart-1.0.1.jar

    * kettle_profile_patch.txt - an Eclipse patch file that should add the
    necessary Java code to add the profiling feature. Note

    I created this patch from the Kettle trunk updated from around 2:30 pm EST.
    You would have to apply this patch using Eclipse.
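
    One way to look into the jar size difference mentioned above is to compare
    the entry lists of the two builds, since a jar is a zip archive. The
    following is a minimal sketch using only java.util.zip; the class name
    JarDiffSketch and its methods are hypothetical helpers, not part of Kettle
    or its build.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Enumeration;
import java.util.TreeSet;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class JarDiffSketch {

    /** Returns the sorted entry names of a jar file. */
    static TreeSet<String> entryNames(Path jar) throws IOException {
        TreeSet<String> names = new TreeSet<>();
        try (ZipFile zip = new ZipFile(jar.toFile())) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                names.add(entries.nextElement().getName());
            }
        }
        return names;
    }

    /** Returns the entries present in the larger jar but absent from the smaller one. */
    static TreeSet<String> missingFrom(Path larger, Path smaller) throws IOException {
        TreeSet<String> missing = entryNames(larger);
        missing.removeAll(entryNames(smaller));
        return missing;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.out.println("usage: java JarDiffSketch <larger.jar> <smaller.jar>");
            return;
        }
        for (String name : missingFrom(Paths.get(args[0]), Paths.get(args[1]))) {
            System.out.println(name);
        }
    }
}
```

    If the entry lists turn out to be identical, the difference would have to
    come from somewhere else, such as compression settings, which this sketch
    does not inspect.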

    I've been able to get the feature to work on OS X 10.4 by just copying over
    the jar files into an unzipped Kettle release directory. You use the
    Database Explorer, then find the table you wish to profile, and then click
    the new 'Profile Table' button on the right. A new window should pop up with
    a graphical summary of the data in that table, along with another tab in the
    same window with a tabular summary of the data.

    I should note some issues I've run into:

    1) Trying to run the patch by copying the JAR files into a Kettle directory
    doesn't seem to work on Windows XP. The application does nothing
    when you press the 'Profile Table' button the first time, and then
    the application exits without any warning if you press it a second
    time. Yet, if you apply the patch in Eclipse and add the necessary
    jars to the build path, it seems to work.

    2) The feature doesn't work in OS X 10.5 (Leopard). From what I've read,
    there appear to be some issues with using AWT in SWT on Leopard. However, if
    we use the code Matt mentioned to convert AWT images to SWT, it may work
    better, and it may be something worth trying.

    So feel free to download the archive, try patching Kettle, and use the
    profiling feature. Let Jonathan or me know if there are any problems,
    questions, comments, or anything else.

    Cheers,
    Jeffrey

    On 27/11/2007, Matt Casters <mcasters (AT) pentaho (DOT) org> wrote:
    > [...]

  8. #8
    Jeffrey Mo Guest

    Re: Data Profiling Feature

    Hello again,

    I should also mention one more issue:

    When I tried testing the profiler on Ubuntu 7.10, some of the labels did not
    appear in the window. Instead, I saw a grey box in the middle. You'll notice
    it if you try running the profiler under GNOME on Linux. This issue does not
    occur on either OS X 10.4 or Windows XP.

    I should note that our team is relatively new to SWT (most of our
    applications use Swing), so please do point out if there's anything we're
    doing improperly in our SWT code.

    Cheers,
    Jeffrey

    On 27/11/2007, Jeffrey Mo <jeffrey (AT) sqlpower (DOT) ca> wrote:
    >
    > Hi Matt,
    >
    > I've placed a file on one of our servers for you to download.
    >
    > http://trillian.sqlpower.ca/kettle_profile_patch.tar.gz
    >
    > We'll put it up there temporarily so that you guys can get a chance to try
    > it out.
    >
    > When you unzip the archive, you should a directory 'kettle_profile_patch'.
    > Inside are the following:
    >
    > * In /lib, I placed the two kettle JAR files recompiled with our profiling
    > code in it.
    > I should note that for some reason, building a Kettle distributable
    > without our code changes result in a kettle-engine jar that is 10 MB, while
    > building with our code changes results in a 2.9 MB jar. I'm not clear what
    > the reason is for the size difference.
    >
    > * In /libext/sqlpower/ we have architect_lib.jar and sqlpower_library.jar
    > which provide the profiling backend. We license these under the new BSD
    > license.
    >
    > * In /libext/jfree/ we have the two JFree libraries that we use:
    > jcommon-1.0.0.jar and jfreechart-1.0.1.jar
    >
    > * kettle_profile_patch.txt - an Eclipse patch file that should add the
    > necessary Java code to add the profiling feature. Note
    >
    > I created this patch from the Kettle trunk updated from around 2:30 pm
    > EST. You would have to apply this patch using Eclipse.
    >
    > I've been able to get the feature to work on OS X 10.4 by just copying
    > over the jar files into an unzipped Kettle release directory. You use the
    > Database Explorer, then find the table you wish to profile, and then click
    > the new 'Profile Table' button on the right. A new window should pop up with
    > a graphical summary of the data in that table, along with another tab in the
    > same window with a tabular summary of the data.
    >
    > I should note some issues I've run into:
    >
    > 1) Trying to run the patch by copying the Jar file into a Kettle directory
    > doesn't seem to work on Windows XP. The application would just not do
    > anything when you press the 'Profile Table' button the first time, and then
    > the application would just exit without any warning if you press it a second
    > time. Yet, if you apply the patch in Eclipse and copy and add the necessary
    > jars to the build path, it seems to work.
    >
    > 2) The feature doesn't work in OS X 10.5 (Leopard). From what I've read,
    > there appear to be some issues with using AWT in SWT on Leopard. However, if
    > we use the code Matt mentioned to convert AWT images to SWT, it may work
    > better, and it may be something worth trying.
    >
    > So feel free to download the archive, try patching Kettle, and using the
    > profiling feature. Let Jonathan or me know if there are any problems,
    > questions, comments, or anything else.
    >
    > Cheers,
    > Jeffrey
    >
    > On 27/11/2007, Matt Casters <mcasters (AT) pentaho (DOT) org> wrote:
    > >
    > >
    > >
    > > Hi Jonathan,
    > >
    > > I'll host the images on Amazon S3. It's as cheap as it gets and delivers
    > > good service. I've come to like it very much ;-) The milestone images
    > > I've put there have typically been downloaded a few hundred times and
    > > that only costs a few $US. (almost 0 in Euro :-))
    > >
    > > As far as the SWT/AWT bridge is concerned, I still have code that
    > > translates Swing images to SWT images. We could use that and get rid of
    > > the bridge altogether. It's really not that complicated code.
    > >
    > > For now, feel free to hand me some work for this and I'll make sure the
    > > images pop up ;-)
    > >
    > > All the best!
    > > Matt
    > >
    > >
    > > On Tuesday 27 November 2007 17:20:44 Jonathan Fuerth wrote:
    > > > Hi Matt, Jens, Sven, et al.
    > > >
    > > > Sven wrote:
    > > > > maybe put up a binary version somewhere? It's probably
    > > > > easier than a patch with jar files.
    > > >
    > > > We're happy to produce a Kettle build incorporating our change, but we
    > > > wouldn't be able to host it publicly for too long on our site, since
    > > > we pay quite a bit for our bandwidth. If you have somewhere to host
    > > > it, Matt, that would be ideal. We'll make a copy available to you
    > > > today!
    > > >
    > > > Jens wrote:
    > > > > what happens when another project needs a JFreeChart lib that is
    > > > > different from yours a.s.o. .
    > > >
    > > > Good point. We're actually not making very deep use of JFreeChart, so
    > > > pretty much any version that includes a 2D pie chart will do the
    > > > trick. Newer is usually better in terms of bug count and overall
    > > > compatibility, but if another chunk of code has managed to tie itself
    > > > to last year's JFreeChart, that won't bother our profiling GUI.
    > > >
    > > > The bigger issue with JFreeChart is that it uses Java2D and Swing, and
    > > > the SWT-AWT bridge is constantly breaking on OS X every time Apple
    > > > releases a Java update. As of now, the chart works on 10.4.x, but
    > > > doesn't show up on 10.5 (Leopard). There's a rumor that Apple will be
    > > > including something more permanent for bridging SWT-AWT in the Java 6
    > > > for Leopard. I'm not holding my breath though.
    > > >
    > > > Jens wrote:
    > > > > Later on we need to decide how we can implement your new feature.
    > > > > Maybe you can give us an insight into what has to be changed within
    > > > > the code base. E.g. is it only a new button or more (I think more is
    > > > > needed).
    > > >
    > > > We've tried to make the footprint on Kettle's code base as minimal as
    > > > possible. The core of the profiling feature itself (no GUI) is stable
    > > > and has been included in our Power*Architect data modeling tool for
    > > > some time now. This profiling core depends on our own library and the
    > > > core of the Architect, but those two jars are not too big.
    > > >
    > > > So, the actual code change to Kettle is limited to an SWT GUI for
    > > > viewing profile results, a new button on the Database Explorer (but
    > > > we'd like ideas for other places to integrate the profiling feature),
    > > > and a little bit of glue code that takes a Kettle connection, catalog,
    > > > schema, and table name and uses it to come up with our own model of
    > > > database metadata which we pass to our profiler API.
    > > >
    > > > I should mention at this point (to set your expectations), the actual
    > > > profiling functions we apply to the data set are not terribly
    > > > advanced. No university-level statistics here (yet)! The profiling
    > > > API is modular, so it will be easy to add new profiling functions in
    > > > the future. We'll choose the enhancements based on need and/or
    > > > contributions. As it stands, we don't have the time to dream up
    > > > esoteric profiling functions and implement them "because we can."
    > > >
    > > > Matt wrote:
    > > > > At the end of next week, December 7th, we will release version 3.0.1 and
    > > > > branch 3.0.1 as well as 3.0.2. At that time we'll flag the trunk as
    > > > > 3.1.0 and we'll be able to stuff things in there.
    > > >
    > > > Ok, great.
    > > >
    > > > > When I spoke with Jonathan and his SQL Power team in Orlando in early
    > > > > 2007, I was excited because with the work-load we have been under I knew
    > > > > we couldn't pull off a profiler ourselves. As such I think it's
    > > > > absolutely wonderful to be able to work with Jeffrey, Jonathan and the
    > > > > rest of SQL Power to make this happen.
    > > >
    > > > Yes, we're happy to be working with you guys on this too.
    > > >
    > > > We'll get you a release with our changes at some point today.
    > > >
    > > > -Jonathan
    > > >
    > > >

    > >
    > >
    > > --
    > > Matt
    > > ____________________________________________
    > > Matt Casters
    > > Chief Data Integration - Kettle founder
    > > Pentaho, Open Source Business Intelligence
    > > http://www.pentaho.org -- mcasters (AT) pentaho (DOT) org
    > > Tel. +32 (0) 486 97 29 37
    > >


  9. #9
    Matt Casters Guest

    Default Re: Data Profiling Feature

    Thanks Jeffrey,

    I have the file. I'll look into the Linux thing as well. I'm using it
    constantly these days ;-)
    It's getting late over here, I'll create the images tomorrow.

    All the best,

    Matt




    --
    Matt
    ____________________________________________
    Matt Casters
    Chief Data Integration - Kettle founder
    Pentaho, Open Source Business Intelligence
    http://www.pentaho.org -- mcasters (AT) pentaho (DOT) org
    Tel. +32 (0) 486 97 29 37


  10. #10
    Matt Casters Guest

    Default Re: Data Profiling Feature

    For those that want to play around with the profiler, I've uploaded an image
    to:

    http://kettle3.s3.amazonaws.com/Kett...1-profiler.zip

    Screenshots so you can see what it is about:

    http://www.kettle.be/images/temp/pro...e-explorer.png
    http://www.kettle.be/images/temp/pro...graph-view.png
    http://www.kettle.be/images/temp/pro...table-view.png

    Some problems I noticed (simple stuff):
    - The explorer dialog blocks. It would be nice to throw it in a separate
    thread, separate from the explorer dialog (much like in PA :-)) but with
    immediate feedback.
    - There is no provision to do sampling: it would be nice to be presented with
    a sampling dialog allowing you to read only the first X rows etc.

    Other than that I think it's a good starting point.

    All the best,

    Matt




    --
    Matt
    ____________________________________________
    Matt Casters
    Chief Data Integration - Kettle founder
    Pentaho, Open Source Business Intelligence
    http://www.pentaho.org -- mcasters (AT) pentaho (DOT) org
    Tel. +32 (0) 486 97 29 37


  11. #11
    Jonathan Fuerth Guest

    Default Re: Data Profiling Feature

    On 12/3/07, Matt Casters <mcasters (AT) pentaho (DOT) org> wrote:
    > Some problems I noticed (simple stuff):
    > - The explorer dialog blocks. It would be nice to throw it in a separate
    > thread, separate from the explorer dialog (much like in PA :-)) but with
    > immediate feedback.


    Yes, I totally agree. We're all new to SWT here, so we didn't want to
    do something stupid with threads right out of the gate. We should
    have time this week to learn how to do proper multithreaded SWT
    programming, and we'll fix the blocking explorer.
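
    To illustrate the general shape of the fix discussed here, below is a
    minimal sketch (not the actual Kettle or SQL Power code) of running a slow
    profile on a background thread and handing the result back to the UI
    thread. The class and method names (BackgroundProfile, profileAsync) are
    hypothetical. The UI thread is modelled as a plain Executor; in an SWT
    application that role would typically be filled by Display.asyncExec,
    passed in as display::asyncExec.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.function.Consumer;
import java.util.function.Supplier;

/** Hypothetical helper: keeps slow profiling work off the UI thread. */
class BackgroundProfile {

    /**
     * @param profileJob the slow work (for example a full table scan); runs on workerPool
     * @param workerPool any background executor
     * @param uiThread   posts a Runnable to the UI thread (assumption: in SWT,
     *                   display::asyncExec can be passed here)
     * @param showResult UI callback; it is only ever invoked through uiThread
     */
    static <T> CompletableFuture<Void> profileAsync(
            Supplier<T> profileJob,
            Executor workerPool,
            Executor uiThread,
            Consumer<T> showResult) {
        return CompletableFuture
                .supplyAsync(profileJob, workerPool)    // heavy work, off the UI thread
                .thenAcceptAsync(showResult, uiThread); // widget updates, on the UI thread
    }
}
```

    The caller returns immediately, so the explorer dialog stays responsive,
    and the returned future can be used to show progress or report errors.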

    Would you (and others) be interested in a port of the "profile
    manager" GUI too? Or is one profile at a time sufficient?

    > - There is no provision to do sampling: it would be nice to be presented with
    > a sampling dialog allowing you to read only the first X rows etc.


    Yes, I've discussed this at length in the past. The problem here is
    how to get a representative sample of the data in the table. If you
    just read the first N rows of the table, you can short-circuit the
    full table scan. That's good for performance, but bad for accuracy:
    You can't get an accurate national opinion poll result by phoning
    everyone who lives on one particular street. I expect database tables
    will tend to be the same way. Most platforms I've read about tend to
    keep rows together that were inserted at the same time (updates are
    likely to shuffle your rows around though). So on average, you'd be
    weighting the stats toward data that was inserted into the table early
    in its life and never updated.

    Another approach would be to choose an arbitrary column to order by,
    then read the first N rows from that result. This is going to be
    worse for performance, since the backend will have to perform a sort
    that considers every row in the table before it can even return the
    first row. I think the question of whether or not it improves the
    quality of the sample depends on what data is in the table, and which
    column you pick. Obviously the min, max, and mean for the column
    you've sorted by will be badly skewed. The data for other columns may
    be skewed too. For instance, if you sorted by phone number in a
    U.S. customer table, you'd find that everyone lives in New York (area
    code 212). Sorting by customer ID would give you your oldest
    customers, etc. Sorting by name might actually be the most reasonable
    in terms of fair distribution.

    Another approach would be to try and consider only every Nth row from
    the table, leaving the rows in their natural order. I think this
    would be best for sampling accuracy, but I don't know of any database
    that could do this without a full table scan. So for performance,
    it's the same as just computing the exact values (assuming a full
    table scan is I/O bound and not CPU bound).
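
    For illustration, the every-Nth-row idea can be sketched on the client
    side in Java. This is only a sketch under assumptions: the class and
    method names are hypothetical, and a plain Iterator stands in for a JDBC
    result set. The loop still visits every row, which is the full-scan cost
    described above.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical helper (not part of the profiler): keeps every Nth row of a stream. */
final class EveryNthSampler {
    /** Returns rows 0, n, 2n, ... in their natural order. Every row is still read. */
    static <T> List<T> sample(Iterator<T> rows, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be >= 1");
        }
        List<T> kept = new ArrayList<>();
        long index = 0;
        while (rows.hasNext()) {
            T row = rows.next();      // the row is fetched whether or not it is kept
            if (index % n == 0) {
                kept.add(row);
            }
            index++;
        }
        return kept;
    }
}
```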

    Probably the best approach would be to read all the rows within every
    Nth disk page of the table. This would give you the performance
    benefit of being 1/N of a full table scan, and I expect it would also
    produce a reasonably fair sampling of the data. The problem is, I
    don't know of any databases that expose this level of control to JDBC
    (or for that matter, to any user-level API). It is my understanding
    that this is how PostgreSQL performs sampling during an ANALYZE
    operation, but I haven't looked at the code, so take that with a grain
    of salt.

    Which brings me to what could be the best compromise solution,
    although it's the most work to create and maintain from our point of
    view. We could enhance our profiling configuration files a bit more
    to not just say which aggregate functions are legal for which data
    types on a particular platform, but to optionally provide the full
    query that will give back the answer. This way, on platforms that
    expose a well-documented data dictionary complete with profiler
    statistics (such as PostgreSQL), we could do an ANALYZE followed by
    reading the stats it produced. The ANALYZE should be as fast as
    possible, since it can use internal tricks.
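
    A rough sketch of what that could look like for PostgreSQL, assuming the
    standard pg_stats view (columns attname, null_frac, n_distinct,
    most_common_vals) and a JDBC connection supplied by the caller. The
    class and method names are hypothetical, the identifier quoting is
    simplified, and the values read back are the planner's estimates, not
    exact counts.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

/** Hypothetical sketch: refresh planner statistics, then read them back. */
final class PlannerStatsSketch {
    /** pg_stats exposes one row per column; the array column is cast to text for display. */
    static final String STATS_QUERY =
        "SELECT attname, null_frac, n_distinct, most_common_vals::text "
      + "FROM pg_stats WHERE schemaname = ? AND tablename = ?";

    /** ANALYZE takes no bind parameters, so the identifiers are quoted by hand. */
    static String analyzeSql(String schema, String table) {
        return "ANALYZE " + quote(schema) + "." + quote(table);
    }

    /** Simplified identifier quoting: wrap in double quotes and double any embedded quote. */
    static String quote(String identifier) {
        return "\"" + identifier.replace("\"", "\"\"") + "\"";
    }

    /** Runs ANALYZE on the table, then prints the estimates it stored for each column. */
    static void printStats(Connection con, String schema, String table) throws SQLException {
        try (Statement st = con.createStatement()) {
            st.execute(analyzeSql(schema, table));
        }
        try (PreparedStatement ps = con.prepareStatement(STATS_QUERY)) {
            ps.setString(1, schema);
            ps.setString(2, table);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // A negative n_distinct is a fraction of the row count, not an absolute number.
                    System.out.printf("%s: null_frac=%.3f n_distinct=%.1f common=%s%n",
                        rs.getString(1), rs.getDouble(2), rs.getDouble(3), rs.getString(4));
                }
            }
        }
    }
}
```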

    > Other than that I think it's a good starting point.


    Great. We'll fix the threading problem. If we have time this week
    (we might), we can look at experimenting with using planner statistics
    in the profiler back end. Which platforms do most people use Kettle
    with? We find most of our customers want to use MS SQL Server these
    days, and our salespeople don't even try to talk them out of it.

    -Jonathan


  12. #12
    Matt Casters Guest

    Default Re: Data Profiling Feature

    Hi Jonathan,

    I absolutely wasn't asking for an immediate fix on the threading, as per usual
    what I'm trying to do is ask for feedback from the community. Usually, once
    the software is open sourced, these things evolve over time based on that
    feedback.

    On the sampling, I think we can do a lot better than that.

    As some of you might have noticed, Mark Hall, the main man behind Weka, has
    published a few new step plugins for 3.0:

    http://wiki.pentaho.org/display/EAI/...ation+Plug-Ins

    Especially steps like "Reservoir Sampling" are interesting.

    "The reservoir sampling plugin is a tool that allows you to sample a fixed
    number of rows from an incoming Kettle data stream when the total number of
    incoming rows is not known in advance. All rows have equal chance of being
    selected (uniform sampling). This step is particularly useful when used in
    conjunction with the ARFF output step in order to generate a suitable sized
    data set to be used by WEKA. The reservoir sampling step uses algorithm 'R'
    by Vitter (Vitter 1985)."

    Generating a transformation on the fly to grab the data should be pretty
    trivial actually. (we do it for preview etc)

    I'm not sure that the full-scan of the table is the limiting factor here. I
    tried it with a 1-million-row table and the profiler was busy for about 5
    minutes. At the same time, a 3.0 transformation reads *all* the rows from
    the same table in 7 seconds flat.

    So one way or another, there is analytical work being done on the MySQL
    database and I'm not sure that this is the right way to go. (MySQL is using
    100% CPU for minutes)

    What happens when we have 10 million rows or more? A good sampling solution
    is IMHO crucial in that case. Using transformations we would have
    interesting options to parallelise the profiling itself over CPUs and later
    even over clusters.

    I haven't seen the code yet, but I'm guessing that things like this probably
    require some major changes. Please understand that *again*, I'm not saying
    we should do all that right away, but we should think big and start small ;-)

    All the best,

    Matt






    --
    Matt
    ____________________________________________
    Matt Casters
    Chief Data Integration - Kettle founder
    Pentaho, Open Source Business Intelligence
    http://www.pentaho.org -- mcasters (AT) pentaho (DOT) org
    Tel. +32 (0) 486 97 29 37


  13. #13
    Darren Hartford Guest

    Default RE: Data Profiling Feature

    I just finished a QA project involving sampling, and I came across this
    invaluable one-page reference that may
    help: http://www.petefreitag.com/item/466.cfm

    My use case was based on 10% sampling, instead of fixed-number-of-rows
    sampling, although I have had a need in the past for
    fixed-number-of-rows sampling regardless of the number of incoming
    records.

    Just sharing the above link to avoid going through each database's own
    documentation :-)

    -D

    Response to:
    > I'm not sure that the full-scan of the table is the limiting factor here. I
    > tried it with a 1 million rows table and the profiler was busy for about 5
    > minutes. At the same time, a 3.0 transformation reads *all* the rows from
    > the same table in 7 seconds flat.
    >
    > So one way or another, there is analytical work being done on the MySQL
    > database and I'm not sure that this is the right way to go. (MySQL is using
    > 100% CPU for minutes)
    >
    > Matt




  14. #14
    Jonathan Fuerth Guest

    Default Re: Data Profiling Feature

    On 12/4/07, Matt Casters <mcasters (AT) pentaho (DOT) org> wrote:
    > I absolutely wasn't asking for an immediate fix on the threading, as per usual
    > what I'm trying to do is ask for feedback from the community. Usually, once
    > the software is open sourced, these things evolve over time based on that
    > feedback.


    Okay, I understand what you're saying here. Your criticism here was
    certainly valid, though! It's annoying when the UI freezes under any
    circumstances, but especially when there's not even any immediate
    feedback.

    > On the sampling, I think we can do a lot better than that.


    Great! I'm not particularly happy with the performance myself.

    > "The reservoir sampling plugin is a tool that allows you to sample a fixed
    > number of rows from an incoming Kettle data stream when the total number of
    > incoming rows is not known in advance. All rows have equal chance of being
    > selected (uniform sampling). This step is particularly useful when used in
    > conjunction with the ARFF output step in order to generate a suitable sized
    > data set to be used by WEKA. The reservoir sampling step uses algorithm 'R'
    > by Vitter (Vitter 1985)."


    This looks good (I grabbed the plugin and perused the code). The R
    algorithm itself is only about 5 lines of code, and what it's doing
    certainly does make sense. However, it requires not only a full table
    scan, but to actually transfer the data of every row over the network.
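
    For reference, a minimal Java sketch of Algorithm R follows. It is not
    the plugin's code: the class and method names are hypothetical, and an
    Iterator stands in for the incoming row stream. Note that every row has
    to pass through the loop, which is why the whole table must be read and
    shipped to the client.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

/** Hypothetical sketch of reservoir sampling (Algorithm R); not the plugin's implementation. */
final class ReservoirSketch {
    /** Returns up to k rows, each input row having equal probability of being kept. */
    static <T> List<T> sample(Iterator<T> rows, int k, Random rng) {
        if (k < 0) {
            throw new IllegalArgumentException("k must be >= 0");
        }
        List<T> reservoir = new ArrayList<>(k);
        long seen = 0;
        while (rows.hasNext()) {
            T row = rows.next();
            seen++;
            if (reservoir.size() < k) {
                reservoir.add(row);                          // fill phase: keep the first k rows
            } else {
                long j = (long) (rng.nextDouble() * seen);   // uniform index in [0, seen)
                if (j < k) {
                    reservoir.set((int) j, row);             // replace a slot with probability k/seen
                }
            }
        }
        return reservoir;
    }
}
```

    The total row count never needs to be known up front, and memory use is
    bounded by k regardless of table size.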

    > I'm not sure that the full-scan of the table is the limiting factor here. I
    > tried it with a 1 million rows table and the profiler was busy for about 5
    > minutes. At the same time, a 3.0 transformation reads *all* the rows from
    > the same table in 7 seconds flat.


    That's definitely my favourite option when you know the database is
    local (either on the same machine or over a gigabit LAN) but
    unfortunately the Internet isn't gigabit everywhere yet (we only have
    10 Mbit/s, and we pay a lot for that). It's the nature of our work that
    we do a lot of our profiling over VPNs to our customers' database
    servers. In that case, it is orders of magnitude faster to ask the
    remote database to perform the analysis, then transfer one row over
    the network.

    > What happens when we have 10 million rows or more? A good sampling solution
    > is IMHO crucial in that case. Using transformations we would have
    > interesting options to parallelise the profiling itself over CPUs and later
    > even over clusters.


    Good point, but the network between the database server and the
    cluster is still the limiting factor.

    By far the costliest thing our profiler does is the "top N values."
    This is a "select column, count(*) from table group by column order
    by count(*) desc" query. You need one per column in the table, because
    each of those queries involves sorting the data set by that column.
    If you tried a compound query to ask for the top N values in every
    column at the same time, you'd explode the database.

    So, in the top N case, sampling would help tremendously, since it
    reduces the size of the set that needs to be sorted. Right now, we're
    sorting the entire data set once for every column!
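
    To illustrate why a sample helps here: once a sample of a column has
    been pulled to the client, the "top N values" can be computed with one
    counting pass over the sample plus a sort of the distinct values only.
    The sketch below assumes that setup; the class and method names are
    hypothetical, and the counts describe the sample, not the full table.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: most frequent values of one column, computed over sampled values. */
final class TopValuesSketch {
    /** Returns up to n (value, count) pairs, highest count first. Order among ties is unspecified. */
    static <T> List<Map.Entry<T, Integer>> topN(Iterable<T> values, int n) {
        Map<T, Integer> counts = new HashMap<>();
        for (T v : values) {
            counts.merge(v, 1, Integer::sum);   // one pass: count occurrences per distinct value
        }
        List<Map.Entry<T, Integer>> entries = new ArrayList<>(counts.entrySet());
        // Only the distinct values are sorted, not every sampled row.
        entries.sort((a, b) -> b.getValue().compareTo(a.getValue()));
        return new ArrayList<>(entries.subList(0, Math.min(n, entries.size())));
    }
}
```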

    > I haven't seen the code yet, but I'm guessing that things like this probably
    > require some major changes. Please understand that *again*, I'm not saying
    > we should do all that right away, but we should think big and start small ;-)


    Well, it's not trivial of course, but it's also relatively painless.
    The core of the profiling system is fairly flexible, so it would be
    easy to introduce an SPI concept, which would make it possible to plug
    in completely different service providers that perform the actual
    profiling. We could create the SPI interfaces, refactor the existing
    profiling code to be the reference implementation, and also implement
    the client-side provider (which could use the reservoir sampling
    technique and calculate the stats in Java). This way, users could
    have a choice based on how good the network between their workstation
    and database is (and how performant the DBMS is compared to the
    workstation).
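
    As a very rough sketch of the SPI idea (all interface, class, and method
    names below are hypothetical and do not come from the existing profiler
    code): one small interface, a client-side provider that computes stats
    in Java over values it has already fetched, and a chooser that picks a
    provider based on the network.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical SPI: each profiling strategy implements this one method. */
interface ProfileProvider {
    /** Returns statistic name -> value for a single column. */
    Map<String, Object> profileColumn(String table, String column);
}

/** Client-side provider: computes stats in Java over values already fetched (e.g. a sample). */
final class ClientSideProvider implements ProfileProvider {
    /** Hypothetical source of column values; a real one might wrap JDBC or a Kettle transformation. */
    interface RowSource {
        List<String> columnValues(String table, String column);
    }

    private final RowSource source;

    ClientSideProvider(RowSource source) {
        this.source = source;
    }

    @Override
    public Map<String, Object> profileColumn(String table, String column) {
        List<String> values = source.columnValues(table, column);
        int nulls = 0;
        Set<String> distinct = new HashSet<>();
        for (String v : values) {
            if (v == null) {
                nulls++;
            } else {
                distinct.add(v);
            }
        }
        Map<String, Object> stats = new LinkedHashMap<>();
        stats.put("rowCount", values.size());
        stats.put("nullCount", nulls);
        stats.put("distinctCount", distinct.size());
        return stats;
    }
}

/** Picks a provider; the boolean stands in for whatever network test a user would configure. */
final class ProviderChooser {
    static ProfileProvider choose(boolean fastNetwork,
                                  ProfileProvider clientSide,
                                  ProfileProvider databaseSide) {
        return fastNetwork ? clientSide : databaseSide;
    }
}
```

    A database-side provider would implement the same interface by issuing
    the aggregate queries, so callers would not need to know which strategy
    is in use.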

    Also, I understand Kettle now features remote job execution. This
    would be another interesting way of solving the slow network problem,
    since you could ask a remote Kettle on the same LAN as the database to
    perform the "client side" profiling, then send the results back to
    you. Maybe this could be a 3rd SPI implementation!

    Thanks for your feedback, and opening this discussion. The profiler
    is getting better already!

    Cheers,

    -Jonathan


  15. #15
    Jonathan Fuerth Guest

    Default Re: Data Profiling Feature

    On 12/4/07, Darren Hartford <dhartford (AT) ghsinc (DOT) com> wrote:
    > I just finished a QA project involving sampling, and I came across this
    > really useful one-page piece of information that was invaluable that may
    > help: http://www.petefreitag.com/item/466.cfm


    Thanks, Darren! That's a really valuable resource. We'll try those
    techniques on the databases at our disposal, and if they improve
    performance, we'll incorporate them into the existing profiler.

    I do hate to do platform-specific stuff like this, but when the
    network bandwidth is the limiting factor, there is little choice but
    to start using SQL tricks to reduce the amount of data being moved
    around.

    -Jonathan


  16. #16
    Matt Casters Guest

    Default Re: Data Profiling Feature

    On Tuesday 04 December 2007 17:57:48 Jonathan Fuerth wrote:
    > Also, I understand Kettle now features remote job execution.

  17. #17
    Darren Hartford Guest

    Default RE: Data Profiling Feature

    Just doing a followup, whatever happened to this?

    -----Original Message-----
    From: kettle-developers (AT) googlegroups (DOT) com [mailto:kettle-developers (AT) googlegroups (DOT) com] On Behalf Of Matt Casters
    Sent: Tuesday, December 04, 2007 1:00 PM
    To: kettle-developers (AT) googlegroups (DOT) com
    Subject: Re: Data Profiling Feature



    On Tuesday 04 December 2007 17:57:48 Jonathan Fuerth wrote:
    > Also, I understand Kettle now features remote job execution.

  18. #18
    Jonathan Fuerth Guest

    Default Re: Data Profiling Feature

    Hi Darren,

    The profiler as discussed here, including the option to use the heavy
    database queries or the reservoir approach, is still available as part
    of the Power*Architect. It's split into core functionality (which
    lives in the ca.sqlpower.architect.profile.* packages) and a Swing GUI
    (in ca.sqlpower.architect.swingui). It's all open source (GPLv3), so
    you can either use the Architect to do your profiling, or you could
    develop a Kettle plugin based on the code.

    The SWT GUI we created for the purpose of embedding those same core
    profiling features into Kettle is probably kicking around somewhere.
    Matt ended up not accepting our profiling stuff into Kettle, partly
    due to a misunderstanding about our code being open source (it is) and
    partly due to a difference in vision between us and Matt/Pentaho about
    what the use case for the profiling would be.

    If you'd like a copy of the SWT interface we made, I should be able to
    dig that up for you. Another option, if you're so inclined, would be
    to use some sort of SWT/Swing wrapper and just recycle the existing
    Swing GUI.

    -Jonathan

    PS: I did experiment with the "how to select a random row" stuff you
    pointed me at. Unfortunately, those queries cause the database backend
    to do a full table sort before spitting out the random row(s), so they
    don't perform better than what we're already doing.

    On Tue, Apr 21, 2009 at 3:56 PM, Darren Hartford <dhartford (AT) ghsinc (DOT) com> wrote:
    >
    > Just doing a followup, whatever happened to this?
    >
    > -----Original Message-----
    > From: kettle-developers (AT) googlegroups (DOT) com [mailto:kettle-developers (AT) googlegroups (DOT) com] On Behalf Of Matt Casters
    > Sent: Tuesday, December 04, 2007 1:00 PM
    > To: kettle-developers (AT) googlegroups (DOT) com
    > Subject: Re: Data Profiling Feature
    >
    >
    >
    > On Tuesday 04 December 2007 17:57:48 Jonathan Fuerth wrote:
    >> Also, I understand Kettle now features remote job execution.

  19. #19
    Darren Hartford Guest

    Default RE: Data Profiling Feature

    Thanks Jonathan for the quick response (and to /dev/null with those darn table scans!). I'm behind Matt's decision about GPL v LGPL (not necessarily an 'open source' license discussion, but something else), but don't want to create drama here that would get in the way of evaluating profiling tool options from a user standpoint.

    I've started looking at a couple of others that are also open source (of varying license types) and wouldn't mind grabbing a GPL or equivalent version of Power Architect to add to the evaluation.

    I saw one data profiler claiming to integrate with Pentaho, but I won't mention the name in case there are unfinished discussions.

    I'll post findings back to the forum (as opposed to the mailing list) if anyone is interested -- it's not directly related to Kettle, but useful to those in the ETL world, and might spark some ideas for Pentaho/Kettle.

    -D

    p.s. this may be case-by-case dependent, but for some of the heavy research/profiling it might be worth moving the selected/random/snippet dataset to a separate column-oriented database like MonetDB, LucidDB, or Infobright (in my personal order of preference for embeddability in this scenario) and doing the research/profiling there. These should better handle the per-column table-scan scenarios since that's the nature of the beast. My two coppers.


  20. #20
    Nicholas Goodman Guest

    Default Re: Data Profiling Feature

    On Apr 21, 2009, at 2:00 PM, Jonathan Fuerth wrote:

    > Matt ended up not accepting our profiling stuff into Kettle, partly
    > due to a misunderstanding about our code being open source (it is) and



    Well, all open source isn't created equal. GPLv3 won't be committed
    into Kettle, regardless of the vision stuff. In fact, even other
    projects that Pentaho "owns" like Weka don't go into Kettle because of
    the GPL, i.e., Weka integration is done as plugins instead of being
    committed as part of the project.

    That being said, I'd love to see data profiling steps, and your stuff
    as a GPLv3 plugin would be great. I'd be happy to pitch in with this
    effort as well. I happily use PowerArchitect profiling on my projects
    and would love to help see more PDI users using PowerArchitect
    profiling.

    Nick


Copyright © 2005 - 2019 Hitachi Vantara Corporation. All Rights Reserved.