Hitachi Vantara Pentaho Community Forums

Thread: PDI Environment

  1. #1
    Sven Boden Guest

    Default PDI Environment

    I added a new JIRA a while ago: http://jira.pentaho.org/browse/PDI-342

    If anyone can think of extra requirements, or if changes are needed...
    shout.

    Regards,
    Sven



  2. #2
    Matt Casters Guest

    Default Re: PDI Environment

    Hi Sven,

    I have been thinking about this a bit and I think that there are a few pieces
    of the puzzle missing in our Kettle appliance:

    1) In order to start/stop/pause/resume/preview/debug/monitor jobs and
    transformations you need to uniquely identify them. Not only that: typically
    you want to "post" these to the appliance. A set of management tools, with
    security, is needed to do just that. Interestingly, the Pentaho platform
    already has the Solutions Repository for exactly this purpose.
    BTW, it's a common misconception that the Pentaho platform only runs
    on an application server. (On the contrary.)
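
    A minimal sketch of what such a "post" could look like from the client side,
    in Java. Everything here is hypothetical: the /kettle/postTransformation
    endpoint, the name parameter, and the port are made up for illustration, not
    an existing Kettle or Pentaho API.

    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    // Hypothetical client that uploads a transformation file to the appliance.
    // Endpoint, parameter names and authentication are illustrative only.
    public class PostTransformation {
        public static void main(String[] args) throws Exception {
            URL url = new URL("http://etl-server:8080/kettle/postTransformation?name=load_sales");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("POST");
            conn.setDoOutput(true);
            conn.setRequestProperty("Content-Type", "text/xml");
            // Real security would come from the platform; e.g. basic auth:
            // conn.setRequestProperty("Authorization", "Basic " + encodedCredentials);

            InputStream in = new FileInputStream("load_sales.ktr");
            OutputStream out = conn.getOutputStream();
            byte[] buf = new byte[8192];
            for (int n; (n = in.read(buf)) > 0; ) {
                out.write(buf, 0, n);
            }
            out.close();
            in.close();

            System.out.println("Server replied: " + conn.getResponseCode());
        }
    }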

    2) You don't just want to post/transfer/manage transformations, but also shell
    scripts, reference files, shared database connections, images, etc.

    3) I want to add the notion of "Resources" and "Resource Groups" to make our
    appliance more intelligent. A resource can be a transformation, job, shell
    script, or database connection. Transferring and managing resources in groups
    will make our life easier. We can back up resource groups, zip them, share
    them, e-mail them, etc. We can add various options to do locking on
    resources too, with time-outs, try/retry, versioning, dependency checking,
    etc. The locking in itself will allow us to run jobs and transformations in
    parallel with automatic dependency checking and order-of-execution
    calculation.
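
    To make the locking idea concrete, a minimal sketch of per-resource locks
    with a time-out, using only java.util.concurrent; the ResourceLocks class is
    hypothetical, not an existing Kettle class.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.locks.ReentrantLock;

    // Hypothetical registry handing out per-resource locks with a time-out,
    // so two executions can't grab the same transformation/job at once.
    public class ResourceLocks {
        private final Map<String, ReentrantLock> locks =
            new ConcurrentHashMap<String, ReentrantLock>();

        private ReentrantLock lockFor(String resourceId) {
            locks.putIfAbsent(resourceId, new ReentrantLock());
            return locks.get(resourceId);
        }

        /** Try to lock a resource, giving up after the time-out (the try/retry case). */
        public boolean lock(String resourceId, long timeoutSeconds) throws InterruptedException {
            return lockFor(resourceId).tryLock(timeoutSeconds, TimeUnit.SECONDS);
        }

        /** Only call after lock() returned true. */
        public void unlock(String resourceId) {
            lockFor(resourceId).unlock();
        }
    }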

    4) I think it's very important to be able to see the current
    logging of a specific running job or transformation. However, at the moment,
    because of the Log4J system we use, all the logging is thrown onto one big
    pile. That means that you will see logging from various jobs and
    transformations mixed into the same logging "bucket". A serious architecture
    change will have to take place, not unlike making the variables local, to do
    this.
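
    One possible direction, sketched against Log4j 1.x (what Kettle uses): tag
    every log line with an execution id through the MDC and let a custom appender
    file lines into per-execution buffers. The "execId" key and the
    PerExecutionAppender class are hypothetical.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    import org.apache.log4j.AppenderSkeleton;
    import org.apache.log4j.spi.LoggingEvent;

    // Hypothetical appender: each running job/transformation puts a unique id
    // on the MDC (MDC.put("execId", id)) before logging; the appender then
    // files the line into that execution's own buffer instead of one big pile.
    public class PerExecutionAppender extends AppenderSkeleton {
        private static final Map<String, StringBuffer> BUFFERS =
            new ConcurrentHashMap<String, StringBuffer>();

        public static void register(String execId) {
            BUFFERS.put(execId, new StringBuffer());
        }

        /** What a "show me the log of job X" screen would read from. */
        public static String logOf(String execId) {
            StringBuffer buf = BUFFERS.get(execId);
            return buf == null ? "" : buf.toString();
        }

        protected void append(LoggingEvent event) {
            Object execId = event.getMDC("execId");
            StringBuffer buf = execId == null ? null : BUFFERS.get(execId.toString());
            if (buf != null) {
                buf.append(layout.format(event)); // StringBuffer is thread-safe
            }
        }

        public void close() {}
        public boolean requiresLayout() { return true; }
    }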

    5) A new security system is needed for all of this. Fortunately, Pentaho has
    a pluggable, highly configurable security system in open source to connect to
    LDAP/AD and other standard and custom-built security systems.

    --
    Matt
    ____________________________________________
    Matt Casters
    Chief Data Integration - Kettle founder
    Pentaho, Open Source Business Intelligence
    http://www.pentaho.org -- mcasters (AT) pentaho (DOT) org
    Tel. +32 (0) 486 97 29 37



  3. #3
    Sven Boden Guest

    Default Re: PDI Environment

    Hey Matt,

    1, 4 and 5 are correct; 2 and 3 I would consider details for the
    moment (but do think about them).

    For the current state of Kettle, I think that as far as the framework
    goes we're pretty good now after your 3.0 rewrite ;-) Steps and job
    entries are becoming pretty mature and complete, and they will surely
    still grow. The only big part missing to be able to compete "1-on-1" with
    commercial products is a server part. And by 1-on-1 I mean even
    disregarding the price tag: for most shortlistings it's just a comparison
    of which features all the available tools have (mostly not even finding
    out whether one is better than the other, just that you have it)...
    and all commercial tools have a server component (except maybe Sunopsis
    in the past, but they marketed ELT).

    As for the existing server components that I know:
    - SAS: overdoes a lot of the metadata and security side, so that's
    no good. And operations-wise they're also not that good. But they can
    claim they have a server component.
    - DataStage: has it about right for operational monitoring, but
    they're no good at security, metadata, ...
    - Informatica: I loved their security system, but I haven't been
    following them lately.

    The most important parts are:
    1) A server component (e.g. running somewhere far away in a server
    farm). People can attach to the server component using an IP address and
    port; once logged in they can browse job names, transformation names,
    the last state of execution, schedules, and logfiles, and do some job
    administration. The server component can be started automatically via
    the system scripts. Think of one UNIX-savvy person installing a
    server component, and afterwards business analysts, developers, and
    users without UNIX knowledge using that server component.
    Personally I wouldn't mind if the current database repositories were
    moved to a VFS-based solution repository. Somehow the server component
    needs access to the jobs/transformations. The execution logs need to
    be written locally on the server, as there may be several people looking
    at the logs (and log maintenance could keep e.g. the last 10 runs, the
    last 4 days, ...).
    As the client-server protocol I would use HTTP for sure, as that's the
    easiest to push through firewalls (I've had problems in the past getting
    DataStage approved to run over a firewall).
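
    To illustrate how small such an HTTP entry point can be, a sketch using the
    JDK's built-in com.sun.net.httpserver. The /status path and the hard-coded
    job line are made up; a real server component would read these from the
    repository and add authentication.

    import java.io.IOException;
    import java.io.OutputStream;
    import java.net.InetSocketAddress;

    import com.sun.net.httpserver.HttpExchange;
    import com.sun.net.httpserver.HttpHandler;
    import com.sun.net.httpserver.HttpServer;

    // Hypothetical status endpoint: attach with a browser on http://host:8099/status.
    public class EtlStatusServer {
        public static void main(String[] args) throws IOException {
            HttpServer server = HttpServer.create(new InetSocketAddress(8099), 0);
            server.createContext("/status", new HttpHandler() {
                public void handle(HttpExchange exchange) throws IOException {
                    // A real implementation would list job names, the last
                    // state of execution, schedules and logfiles from the
                    // repository; one canned line stands in for that here.
                    byte[] body = "load_sales: FINISHED (last run 04:15)\n".getBytes();
                    exchange.sendResponseHeaders(200, body.length);
                    OutputStream out = exchange.getResponseBody();
                    out.write(body);
                    out.close();
                }
            });
            server.start();
            System.out.println("ETL status server listening on port 8099");
        }
    }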

    2) An internal job scheduler that runs jobs/transformations. I would
    guess normal crontab functionality, plus maybe some extras like "skip
    job run if the previous one is still running". People should be able to
    see which jobs are running, look at past log files, and put a "tail" on
    the log files of the currently running jobs. I would also prefer a
    queue-based view of what is going to be started in the next hours (which
    I've only seen in Cognos, and not in ETL tools).
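
    The "skip if the previous one is still running" rule is a one-flag guard in
    plain Java; in this hypothetical sketch, whatever fires the schedule (a
    crontab entry, a timer, a manual "run now") calls trigger():

    import java.util.concurrent.atomic.AtomicBoolean;

    // Hypothetical guard around one job: a trigger is skipped when the
    // previous execution of the same job is still busy.
    public class GuardedJob {
        private final AtomicBoolean running = new AtomicBoolean(false);

        public void trigger() {
            if (!running.compareAndSet(false, true)) {
                System.out.println("previous run still busy -- skipping this trigger");
                return;
            }
            try {
                runJob(); // stand-in for launching the Kettle job
            } finally {
                running.set(false);
            }
        }

        private void runJob() { /* ... */ }
    }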

    3) Security-wise, the most important thing would be the roles:
    administrator (do everything), monitor (just see logs), operator (see
    logs, and start/stop/restart jobs, schedule jobs), implementer (import
    jobs, export jobs, schedule jobs). Important here is the split between
    operator and implementer (for SOX freezes, e.g.).
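
    A sketch of how those four roles could be encoded as permission sets (all
    names illustrative); the point is that operator and implementer share
    scheduling but not each other's remaining rights:

    import java.util.EnumSet;

    // Hypothetical role model for the server component. Note the SOX-relevant
    // split: the operator runs and schedules jobs but cannot import/export
    // them; the implementer imports/exports and schedules but cannot run them.
    public class Roles {
        enum Permission { VIEW_LOGS, START_STOP_JOBS, SCHEDULE_JOBS, IMPORT_EXPORT_JOBS, ADMINISTER }

        enum Role {
            ADMINISTRATOR(EnumSet.allOf(Permission.class)),
            MONITOR(EnumSet.of(Permission.VIEW_LOGS)),
            OPERATOR(EnumSet.of(Permission.VIEW_LOGS, Permission.START_STOP_JOBS, Permission.SCHEDULE_JOBS)),
            IMPLEMENTER(EnumSet.of(Permission.IMPORT_EXPORT_JOBS, Permission.SCHEDULE_JOBS));

            private final EnumSet<Permission> permissions;

            Role(EnumSet<Permission> permissions) { this.permissions = permissions; }

            boolean may(Permission p) { return permissions.contains(p); }
        }

        public static void main(String[] args) {
            // false: importing jobs is reserved for the implementer.
            System.out.println(Role.OPERATOR.may(Permission.IMPORT_EXPORT_JOBS));
        }
    }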

    Most Kettle people would be very happy with just the 3 items above.

    Additionally:
    4) Team development: multiple people accessing one server component.
    5) Version control.
    6) Being able to synchronize repositories, incrementally/fully,
    exporting/importing jobs, ... Consider e.g. a development server, a
    model-office server, and a production server, and moving jobs/
    transformations/database connections between the three of them.
    7) Automatic crash recovery: when an option is set for a job in the
    schedule, the server would restart that job automatically upon improper
    closing of the server component.
    8) Multi-vendor support.

    Regards,
    Sven



  4. #4
    Darren Hartford Guest

    Default RE: PDI Environment

    From a user/business standpoint regarding Kettle server/appliance:

    1) Start/stop/pause/resume/preview/debug/monitor jobs, absolutely
    (monitor = resource monitoring, which Matt mentioned a little bit
    later... if this specific job is running at 100% CPU for 10 minutes, you
    may want to do something).

    I'm using the Pentaho BI Platform as my Kettle 'Server' and this is
    working OK; some things could be improved, as mentioned in this thread.

    2) Scheduling, definitely. Again, partial success with the scheduler
    within the Pentaho BI Platform (Quartz) using Kettle xaction jobs. More
    success with Pentaho 1.6, having been stuck on Pentaho 1.2 scheduler
    issues.

    3) Security - yes, but I cannot speak to it. SOX and similar compliance
    issues are important and can be resolved with good security/auditing
    support.

    3a) I think the metadata editor already has the feature to 'publish' to
    the server using credentials and an additional 'publish' password. This,
    and the ability to publish directly from the IDE/editor, is great.
    Expanding it to full-featured Kettle jobs/transformations would ease
    use/training/configuration. See 4/5/6.

    4/5/6) SVN/CVS/Jackrabbit version control, versioning, and multi-user
    access, yup. I would lean towards Jackrabbit or a similar JCR, as I think
    there were already discussions about this with Pentaho with regard to
    user workspaces, and it would be a good 'service' for file management.
    Another option is WebDAV, but that is just based on my experience, and
    JCR seems to be picking up momentum (and JCR = Jackrabbit, Alfresco,
    Nuxeo, etc.).

    7) Automatic restart -- I had to write a custom servlet that checks
    the status and re-initiates the job based on certain failures.
    Having a default piece for this would be great.
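
    For reference, the watchdog loop such a servlet runs can be sketched in a
    few lines; fetchStatus() and startJob() are placeholders for whatever
    status/launch API the server would expose:

    // Hypothetical watchdog: poll a job's status and re-initiate it on
    // selected failures, roughly what the custom servlet did.
    public class RestartWatchdog implements Runnable {
        public void run() {
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    if (fetchStatus("load_sales").equals("FAILED_RECOVERABLE")) {
                        startJob("load_sales"); // re-initiate only on certain failures
                    }
                    Thread.sleep(60 * 1000L); // check once a minute
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }

        // Placeholders for the server's status/launch API.
        static String fetchStatus(String jobName) { return "RUNNING"; }
        static void startJob(String jobName) { /* ... */ }
    }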

    7a) Regarding restarting jobs, and load management in general -- a
    user-assigned 'expense' or similar attribute and an associated
    configuration would, I think, go a long way. Example: for two jobs with
    expense 'very-high', you might configure the server to only run one
    'very-high' job at a time to avoid bottlenecks/deadlocks on resources
    (CPU/memory/network I/O on the server). Maybe a max of 5 concurrent
    'medium'-expense jobs, etc.
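
    The 'expense' idea maps naturally onto counting semaphores, one per expense
    class. A minimal sketch, with illustrative class names and limits:

    import java.util.HashMap;
    import java.util.Map;
    import java.util.concurrent.Semaphore;

    // Hypothetical load manager: each expense class gets a concurrency budget,
    // e.g. one 'very-high' job at a time, at most five 'medium' jobs.
    public class ExpenseThrottle {
        private final Map<String, Semaphore> budgets = new HashMap<String, Semaphore>();

        public ExpenseThrottle() {
            budgets.put("very-high", new Semaphore(1));
            budgets.put("medium", new Semaphore(5));
        }

        /** Blocks until a slot in the job's expense class is free, then runs it. */
        public void run(String expense, Runnable job) throws InterruptedException {
            Semaphore budget = budgets.get(expense);
            budget.acquire();
            try {
                job.run();
            } finally {
                budget.release();
            }
        }
    }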

    7b) In a similar scenario with regard to load management: creating
    queues (possibly related to the above 'expense' attribute) so that
    certain jobs are run serially, either for time/ordering reasons or for
    resource reasons. I've done this with other custom ETL processes, using
    JMS to only do one at a time, FIFO style. However -- the caveat is that
    these queues need to be somewhat aware of time-scheduled jobs, as those
    should have priority. The queue approach could be the core solution for
    7a.
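
    Such a serial queue with priority for time-scheduled jobs could look like
    this: a single worker thread draining a PriorityBlockingQueue (all names
    illustrative):

    import java.util.concurrent.PriorityBlockingQueue;

    // Hypothetical serial runner: jobs execute one at a time, FIFO within a
    // priority level, but time-scheduled jobs jump ahead of ad-hoc ones.
    public class SerialJobQueue {
        static class QueuedJob implements Comparable<QueuedJob> {
            final boolean scheduled;   // true when fired by the scheduler
            final long enqueuedAt;
            final Runnable work;

            QueuedJob(boolean scheduled, Runnable work) {
                this.scheduled = scheduled;
                this.enqueuedAt = System.nanoTime();
                this.work = work;
            }

            public int compareTo(QueuedJob other) {
                if (scheduled != other.scheduled) {
                    return scheduled ? -1 : 1; // time-scheduled jobs first
                }
                // then FIFO by arrival time
                return enqueuedAt < other.enqueuedAt ? -1
                     : enqueuedAt > other.enqueuedAt ? 1 : 0;
            }
        }

        private final PriorityBlockingQueue<QueuedJob> queue =
            new PriorityBlockingQueue<QueuedJob>();

        public void submit(boolean scheduled, Runnable work) {
            queue.put(new QueuedJob(scheduled, work));
        }

        /** One worker thread draining the queue keeps execution strictly serial. */
        public void startWorker() {
            new Thread(new Runnable() {
                public void run() {
                    while (true) {
                        try { queue.take().work.run(); }
                        catch (InterruptedException e) { return; }
                    }
                }
            }).start();
        }
    }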

    10?) More formal and integration-oriented notification processes on the
    results of transformations, e.g. reports via the Pentaho BI Platform user
    workspace/e-mail, web-service event notification, maybe RSS/Atom for
    newsreader-style notifications.

    My two coppers,
    -D





  5. #5
    Matt Casters Guest

    Default Re: PDI Environment

    Hi Darren & Sven,

    I had a long talk with the guys from Talend concerning their independent
    position. They chose an independent strategy not because nobody wanted to
    buy them, but because they think they are stronger alone, more focused.
    Personally, I think Kettle has evolved a lot faster with the help from
    Pentaho, and I think that right now, especially with the feature request
    list in this thread, there are a lot of integration possibilities. The
    Pentaho platform has been developing at a quick pace on its very own, and
    as such I think we should take the best from that side and use it to our
    best advantage.
    Security, reporting, analysis, auditing, the solutions repository,
    posting to the server, etc.: those are all components that already exist.
    I'm planning a trip to Orlando in a month or so, and these items will be
    high on the agenda. Personally I think that post-3.0GA we should be able
    to get these things working very fast.

    I think we can prove conclusively that 1+1 can be more than 2. ;-)

    All the best,

    Matt

    --
    Matt
    ____________________________________________
    Matt Casters
    Chief Data Integration - Kettle founder
    Pentaho, Open Source Business Intelligence
    http://www.pentaho.org -- mcasters (AT) pentaho (DOT) org
    Tel. +32 (0) 486 97 29 37


  6. #6
    Sven Boden Guest

    Default Re: PDI Environment

    I'll take whatever way I can to get: a server component, internal
    scheduling, remote log viewing.

    For integration, some things will be a no-brainer to reuse: general
    security modules, ... Other parts are probably not going to be
    that easy to "co-use". Maybe that's where Talend has more of a point
    in being independent. From my own past experience there's a big gap
    between e.g. reporting and ETL, which, if some parts were co-used, could
    be like a boat where one half of the crew wants to go left and the
    other half wants to go right.

    Best regards,
    Sven


  7. #7
    Matt Casters Guest

    Default Re: PDI Environment

    On Thursday 01 November 2007 22:30:08 Sven Boden wrote:
    > being independent. From my own past experience there's a big gap
    > between e.g. reporting and ETL, which, if some parts were co-used, could
    > be like a boat where one half of the crew wants to go left and the
    > other half wants to go right.


    I was merely referring to the fact that there are plenty of components to
    re-use in the Pentaho platform to do operational reporting and analysis
    with.

    Obviously I agree that there is little overlap between, say, ETL and
    reporting.

    Have a great weekend!

    Matt
    ____________________________________________
    Matt Casters
    Chief Data Integration - Kettle founder
    Pentaho, Open Source Business Intelligence
    http://www.pentaho.org -- mcasters (AT) pentaho (DOT) org
    Tel. +32 (0) 486 97 29 37

