Hitachi Vantara Pentaho Community Forums

Thread: Kettle Performance on 5-10GB Input Text file to XML output

  1. #1
    Join Date
    Jul 2007
    Posts
    8

    Kettle Performance on 5-10GB Input Text file to XML output

    Hi,

    We are planning to use Kettle to convert text files to XML files. The input text file may be as large as 10GB. How well does Kettle perform in this scenario?

    Any input is appreciated.


    Thanks,
    Sree

  2. #2
    Join Date
    Nov 1999
    Posts
    9,729

    What kind of text file is it? Fixed width or CSV, many columns or few, etc.?

  3. #3
    Join Date
    Jul 2007
    Posts
    8

    I get both fixed-width and CSV files, with the number of columns ranging from 100 to 300.

    Thanks,
    Sree

  4. #4
    Join Date
    Nov 1999
    Posts
    9,729

    The difference between CSV and fixed-width files is that the latter can be read in parallel, either by multiple hosts (SAN / clustered) or by multiple step copies (SMP).
    That lets read performance scale up to the maximum that the disks can deliver.
    In one instance we were able to read in the gigabyte-per-second range. Unfortunately we can't tell you everything about it yet, but I hope we will be able to do so soon.

    Obviously, it depends heavily on your setup and situation too, but I would think that you can build nice things with the latest Kettle versions.

    By the way, I've commented on these things before on my blog, for example here: http://www.ibridge.be/?p=78

    All the best,

    Matt
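
The point above about fixed-width files can be illustrated outside Kettle. Because every record has the same byte length, each reader can compute where its share of the file starts and seek straight to it. A CSV reader generally cannot do this without scanning, since record lengths vary. The following is only a minimal Python sketch of that idea, not Kettle's implementation; the function names, the `copies` parameter, and the assumption that every record (including its line terminator) is exactly `record_len` bytes are all hypothetical.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def record_ranges(file_size, record_len, copies):
    """Return one (start_byte, record_count) pair per reader copy."""
    total = file_size // record_len          # number of whole records in the file
    base, extra = divmod(total, copies)      # spread any remainder over the first copies
    ranges, first = [], 0
    for i in range(copies):
        count = base + (1 if i < extra else 0)
        ranges.append((first * record_len, count))
        first += count
    return ranges


def read_range(path, start, count, record_len):
    """Seek directly to a record boundary and read `count` records."""
    with open(path, "rb") as f:
        f.seek(start)
        data = f.read(count * record_len)
    return [data[i:i + record_len] for i in range(0, len(data), record_len)]


def parallel_read(path, record_len, copies):
    """Read the whole file with `copies` concurrent readers, preserving record order."""
    ranges = record_ranges(os.path.getsize(path), record_len, copies)
    with ThreadPoolExecutor(max_workers=copies) as pool:
        parts = pool.map(lambda r: read_range(path, r[0], r[1], record_len), ranges)
        return [rec for part in parts for rec in part]
```

For example, an 80-byte file of 8-byte records split over 3 copies gives the ranges `(0, 4)`, `(32, 3)` and `(56, 3)`. Whether this actually speeds anything up depends on the storage being able to serve several readers at once.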

  5. #5
    Join Date
    Jul 2007
    Posts
    8

    Thanks a lot for the link and answers.

    I will post my results back here once I have done my POC.

    Regards,
    Sree

  6. #6
    Join Date
    Jul 2007
    Posts
    2,498

    Originally posted by MattCasters:
    The difference between CSV and fixed-width files is that the latter can be read in parallel, either by multiple hosts (SAN / clustered) or by multiple step copies (SMP).
    That lets read performance scale up to the maximum that the disks can deliver.


    Wait, please tell me that you don't want all that input to generate ONE SINGLE XML file.
    Pedro Alves
    Meet us on ##pentaho, a FreeNode irc channel

  7. #7
    Join Date
    Nov 1999
    Posts
    9,729

    Exactly my thoughts ;-)
