Hitachi Vantara Pentaho Community Forums

Thread: Advice on continuous data integration

  1. #1
    Join Date
    Mar 2016
    Posts
    2

    Default Advice on continuous data integration

    Hello, we have an IVR application that saves call information in XML structures in real time. This data is extracted, transformed, and loaded into database tables. Our client's requirement is to have the data
    loaded into the database on a near-real-time basis (meaning within a few minutes). From reading the PDI (Kettle) documentation, I understand that transformation jobs have to be run stand-alone
    with the Community Edition (using the OS scheduler if we want them to repeat) or through the DI Server scheduler with the Enterprise Edition. It appears that to meet the requirement of a
    near-real-time load, we would have to schedule a job to run every couple of minutes, and given the volume of data I am not sure a transformation job could complete the ETL process in that window,
    meaning we would most likely end up with multiple instances running concurrently. This doesn't seem to be a very efficient way of handling our ETL requirements. Does anyone on the forum
    have experience with this kind of environment, or any advice for us? Is there any way we could set this up so that PDI would fit our requirement?
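
    If we did schedule at a tight interval, one guard against overlapping instances would be to wrap the kitchen invocation in an exclusive lock so a new run exits early while the previous one is still going. A minimal Java sketch of that idea; the lock-file path, kitchen location, and job file below are placeholders, not our actual setup:

        import java.io.File;
        import java.io.RandomAccessFile;
        import java.nio.channels.FileLock;

        // Sketch: skip this run entirely if a previous run still holds the lock.
        // The lock-file path, kitchen.sh location, and .kjb path are placeholders.
        public class EtlRunGuard {
            public static void main(String[] args) throws Exception {
                File lockFile = new File("/tmp/ivr-etl.lock");
                try (RandomAccessFile raf = new RandomAccessFile(lockFile, "rw")) {
                    FileLock lock = raf.getChannel().tryLock();
                    if (lock == null) {  // lock held by another process: a run is still active
                        System.out.println("Previous run still active; exiting.");
                        return;
                    }
                    // Run the PDI job and wait for it to finish.
                    Process p = new ProcessBuilder(
                            "/opt/pentaho/kitchen.sh",
                            "-file=/etl/jobs/load_ivr_calls.kjb")
                            .inheritIO().start();
                    System.out.println("kitchen exited with " + p.waitFor());
                    lock.release();
                }
            }
        }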

    Thank you in advance!

  2. #2

    Default

    Perhaps Jenkins could help for your use-case: https://wiki.jenkins-ci.org/display/...S/Meet+Jenkins

  3. #3
    Join Date
    Mar 2016
    Posts
    2

    Default Reply to suggestion of using Jenkins for continuous data integration

    Quote Originally Posted by and78386
    Perhaps Jenkins could help for your use-case: https://wiki.jenkins-ci.org/display/...S/Meet+Jenkins
    Hello and78386, thank you for the response.

    I looked at the wiki page you pointed to... I think Jenkins is used more for continuous integration and delivery of projects. It looks like it's more a tool for helping develop, test, and deploy changes than for
    actually running an application continuously in a production environment.

    I appreciate your help with this!

    Mbeers

  4. #4

    Default

    Hi Mbeers,

    I am not aware of any data integration tool that provides distributed processing of a given job out of the box. As you mentioned in your previous post, a possible solution would be to run the job in several instances and partition them on the input, or to use a distributed processing engine such as Spark Streaming or Storm. The latter are probably overkill for your case; they are better suited to situations where the amount of data exceeds the processing power of a single machine. I do not know whether they are supported in Pentaho, but they definitely are in Talend. From Talend, you could also check Talend ESB for a more message-passing-oriented architecture.

    Here is another link about Pentaho and continuous loading of tweets. Hope it helps.

  5. #5
    Join Date
    Jul 2009
    Posts
    476

    Default

    I have a job that does near-real-time ETL, using Community Edition. The Start step is set to repeat every 1 second. (Right-click the Start step in the job, check Repeat, and enter 1 in the "Interval in seconds" box, or choose a different resting period to your liking.) Every time the job succeeds, it goes back to the Start step and repeats.

    When I want processing to stop, I update a row in a table that the job reads each time it finishes. When the job sees that the row indicates it should stop, the job aborts, so it doesn't repeat again.

    This works well for us, but we do have to monitor it for situations like bad data, or when it falls way behind the source system.
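
    In PDI itself that end-of-run check is just a Table Input feeding an Abort step; expressed as plain JDBC, the logic is roughly the following. The table name, columns, and connection details here are illustrative, not our actual schema:

        import java.sql.Connection;
        import java.sql.DriverManager;
        import java.sql.ResultSet;
        import java.sql.Statement;

        // Sketch of the end-of-run check: read a stop flag from a control table
        // and abort the repeat loop when it is set. Table, column names, and the
        // JDBC URL are illustrative (assumes the matching driver is on the classpath).
        public class StopFlagCheck {
            public static boolean stopRequested() throws Exception {
                String url = "jdbc:postgresql://dbhost/etl";
                try (Connection con = DriverManager.getConnection(url, "etl_user", "secret");
                     Statement st = con.createStatement();
                     ResultSet rs = st.executeQuery(
                             "SELECT stop_requested FROM etl_control WHERE job_name = 'ivr_load'")) {
                    return rs.next() && rs.getBoolean("stop_requested");
                }
            }
        }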

  6. #6

    Default

    If I understand correctly, the main issue is doing all checks and transformations in a single job that would finish in less than 1 s? In my previous post I was apparently wrong about distributing the load, so maybe you can check the clustering option and divide the input set across separate slave nodes.

  7. #7
    Join Date
    Jul 2009
    Posts
    476

    Default

    mertez,

    One second is the resting interval (that I defined) between repeated job runs. No matter how long the job takes to run, PDI will always wait one second before running it again. For example, our sequence of job executions could look like this:

    Job takes 40 seconds
    Rest 1 second
    Job takes 35 seconds
    Rest 1 second
    Job takes 2 seconds
    Rest 1 second
    ...and so on...

  8. #8
    Join Date
    Aug 2011
    Posts
    360

    Default

    Maybe you should use a JMS system to transform the data in a near-streaming fashion (in CE you'll have to write the JMS code yourself; see the sketch below). Like:
    1. One job runs continuously (every 1 s, as robj said) and pushes new files into a JMS queue.
    2. Another job runs forever, consuming the queue and transforming the data.

    Maybe you can also use a sequence of JMS queues to do the transformations one at a time:
    each stage consumes messages, transforms them, then pushes them to the next queue.
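
    A rough sketch of that hand-written JMS part, assuming ActiveMQ as the broker (the broker URL and queue name are placeholders, and any JMS 1.1 provider would look much the same):

        import javax.jms.Connection;
        import javax.jms.ConnectionFactory;
        import javax.jms.JMSException;
        import javax.jms.MessageConsumer;
        import javax.jms.Session;
        import javax.jms.TextMessage;
        import org.apache.activemq.ActiveMQConnectionFactory;

        // Stage 1 pushes new XML file names onto a queue; stage 2 blocks on the
        // queue forever and hands each file to the transformation. Broker URL
        // and queue name are placeholders.
        public class JmsStages {
            static final String BROKER = "tcp://localhost:61616";
            static final String QUEUE = "ivr.new-files";

            public static void push(String fileName) throws JMSException {
                ConnectionFactory cf = new ActiveMQConnectionFactory(BROKER);
                Connection con = cf.createConnection();
                try {
                    Session s = con.createSession(false, Session.AUTO_ACKNOWLEDGE);
                    s.createProducer(s.createQueue(QUEUE))
                     .send(s.createTextMessage(fileName));
                } finally {
                    con.close();
                }
            }

            public static void consumeForever() throws JMSException {
                ConnectionFactory cf = new ActiveMQConnectionFactory(BROKER);
                Connection con = cf.createConnection();
                con.start(); // start message delivery
                Session s = con.createSession(false, Session.AUTO_ACKNOWLEDGE);
                MessageConsumer c = s.createConsumer(s.createQueue(QUEUE));
                while (true) {
                    TextMessage m = (TextMessage) c.receive(); // blocks until the next file
                    System.out.println("transforming " + m.getText());
                    // ...invoke the PDI transformation for this file here...
                }
            }
        }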
