Hitachi Vantara Pentaho Community Forums

Thread: Kettle (PDI) & Talend

  1. #1

    Default Kettle (PDI) & Talend

    Hi everyone,
    I'm using both open source ETL tools, Kettle (PDI) and Talend. In Talend I use the tMap component to do joins, lookups, and different filters to generate different outputs. I didn't find an equivalent in Kettle. Am I missing something, or does it not exist?
    tMap is useful because in my case I have data coming from different tables in an Oracle database, and to generate data for my warehouse I use almost the same query for my four measures (I do a filter, then an aggregation, then a count(); the only differences are the filters, and the counts become my measures).
    Another point: to do a join I use tMap, but in Kettle I write my SQL query in an Oracle Table Input step (so no use of metadata, which is difficult for people who don't know SQL). I tried the Join step once, but it was very slow.
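
    To make the pattern concrete, here is a small Python/SQLite sketch (the table, statuses, and filters are made up, not my real schema): the four measures share the same base query and differ only in the filter that feeds each count.

```python
# Hypothetical example of the repeated pattern: same base query, only the
# filter behind each count changes for each of the four measures.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, status TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [("EU", "open", 10.0), ("EU", "closed", 20.0),
     ("US", "open", 5.0), ("US", "cancelled", 7.5)],
)

# One pass instead of four near-identical queries: each measure is a
# COUNT over a different CASE filter.
row = conn.execute("""
    SELECT
      COUNT(CASE WHEN status = 'open'      THEN 1 END) AS m_open,
      COUNT(CASE WHEN status = 'closed'    THEN 1 END) AS m_closed,
      COUNT(CASE WHEN status = 'cancelled' THEN 1 END) AS m_cancelled,
      COUNT(CASE WHEN region = 'EU'        THEN 1 END) AS m_eu
    FROM orders
""").fetchone()
print(row)  # (2, 1, 1, 2)
```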

    Any suggestions?
    Thanks in advance
    hamma

  2. #2
    Join Date
    Nov 1999
    Posts
    9,729

    Default

    It's actually the other way around: the folks from Talend are copying our Steps and Job entries (Wait for *, Dimensions, etc).
    So obviously you can do these things.

    However, we do not try to put all functionality into a single operator. From a design and maintenance perspective that has always seemed like the wrong choice to me. I guess if you're generating a single big Perl loop in the back, then tMap is one way to solve a problem. We went with parallel computing and data isolation approaches instead.

    As to your questions: you will find that our Database Lookup, Merge Join, and Filter steps can do the tricks you want nicely.
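
    To sketch what I mean (plain Python generators, not PDI's actual implementation; the field and lookup data are made up): each step is its own small operator that streams rows to the next, rather than one big do-everything component.

```python
# Each "step" is an isolated operator, roughly like
# Filter rows -> Stream Lookup in a PDI transformation.
def filter_rows(rows, predicate):
    # Analogous to the "Filter rows" step: pass through matching rows only.
    for row in rows:
        if predicate(row):
            yield row

def stream_lookup(rows, lookup, key, target):
    # Analogous to the in-memory "Stream Lookup" step: enrich each row.
    for row in rows:
        yield dict(row, **{target: lookup.get(row[key])})

rows = [{"country": "BE", "amount": 10},
        {"country": "FR", "amount": 0},
        {"country": "BE", "amount": 3}]
names = {"BE": "Belgium", "FR": "France"}

pipeline = stream_lookup(
    filter_rows(rows, lambda r: r["amount"] > 0),
    names, key="country", target="country_name")
result = list(pipeline)
print(result)
```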

    All the best,

    Matt
    Last edited by MattCasters; 09-28-2007 at 04:04 AM.

  3. #3

    Default

    Thanks for the quick reply, but what exactly do you mean by parallel computing and data isolation? Is there something for this in Kettle that is missing in Talend? I would like to know the different approaches and the underlying architectures.
    I also have a general question about data warehouse loading: is it better to build a single job that does almost everything, or many small transformations and jobs, with one job to execute them all?
    Also, in the Spoon documentation I read that "if you want to do a join among tables in the same database, the best way is to write your SQL query". Does that mean the Join step lacks performance?
    Thanks in advance

    Best regards
    hamma

  4. #4
    Join Date
    Nov 1999
    Posts
    9,729

    Default

    Hi Hamma,

    A few weeks ago I wrote a blog entry called "Making the case for Kettle" that touches on why certain things are best left to the database to handle. The answer is quite simple: because databases were designed to join, they are always going to be fastest at it. There are a lot of technical reasons behind this (random access vs. streaming-based processing), but that's the reality. By the way, that goes for all streaming-based ETL tools, including Informatica, PDI, Talend and a few others.
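
    As a concrete (hypothetical) illustration with SQLite standing in for Oracle: when both tables live in the same database, you put a single SQL join in the Table Input step, the database does the join and aggregation, and the ETL tool only streams the already-joined rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders    (customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO orders VALUES (1, 100.0), (1, 50.0), (2, 75.0);
""")

# The kind of query you would put in a Table Input step: the database
# performs the join (it can use indexes and random access), and the ETL
# tool only receives the joined, aggregated result.
rows = conn.execute("""
    SELECT c.name, SUM(o.amount) AS total
    FROM customers c JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name ORDER BY c.name
""").fetchall()
print(rows)  # [('Acme', 150.0), ('Globex', 75.0)]
```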

    Parallel computing: these days, new computers more often than not have multiple CPUs in them. PDI takes advantage of this by spawning off worker threads: one for each step in your transformation. As such, we try to make maximum use of the available CPU power of your machine. History has shown us that parallel processing almost always beats single-threaded programs in performance. I have no idea whether Talend runs in different threads and frankly I'm too busy to care too much about it ;-)
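
    A toy sketch of that threading model (plain Python, nothing like PDI's real engine): each step runs in its own thread and hands rows to the next step through a queue, so an "input" step and a "calculator" step execute concurrently.

```python
import queue
import threading

SENTINEL = object()  # marks the end of the row stream

def producer(out_q):
    # "Input" step: emits rows in its own thread.
    for i in range(5):
        out_q.put(i)
    out_q.put(SENTINEL)

def transformer(in_q, out_q):
    # "Calculator" step: runs concurrently with the producer.
    while (row := in_q.get()) is not SENTINEL:
        out_q.put(row * 10)
    out_q.put(SENTINEL)

q1, q2 = queue.Queue(), queue.Queue()
threads = [threading.Thread(target=producer, args=(q1,)),
           threading.Thread(target=transformer, args=(q1, q2))]
for t in threads:
    t.start()

# The main thread plays the "output" step, draining the last queue.
results = []
while (row := q2.get()) is not SENTINEL:
    results.append(row)
for t in threads:
    t.join()
print(results)  # [0, 10, 20, 30, 40]
```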

    Data and function isolation: we designed PDI to completely isolate the functionality of one step from another. That leads to very good predictability of the actual output of a transformation. It also allows you to see in one overview (the transformation) how each step in the process is being executed. A "lookup" is a different operator from a "filter". A "database lookup" is different from an in memory "Stream Lookup" too. It's important to make that distinction, especially when you are dealing with large transformations. The ability to immediately see where a certain function is being performed is then very important.

    As for the size of your transformation, don't listen to me; listen to Dan Linstedt, who says on his blog:
    Pitfall 13) Trying to do "too much" inside of a single data flow, increasing complexity and dropping performance

    It's not just about performance either, again this is about being able to maintain your transformations and jobs a year from now.

    All the best,

    Matt

  5. #5

    Default

    Thanks for the detailed reply; it really helps me understand PDI's approach. The simplicity, clarity, completeness, and light user interface are things I appreciate in PDI.

    Thanks for this great work!

    Best regards
    hamma


Copyright © 2005 - 2019 Hitachi Vantara Corporation. All Rights Reserved.