Hitachi Vantara Pentaho Community Forums

Thread: parallel reading of csv file

  1. #1

    Default parallel reading of csv file

    Hi,

    I have a CSV file with 50 million records (a 700 MB file). The transformation is very simple: read data from the CSV file and load the records into a MySQL table.

    The total execution time of this transformation is about 30 minutes.

    Even though I checked the 'Running in Parallel' box, performance has not improved. It still takes 30 minutes to run.

    Could anyone help me understand how to enable parallel reading of a CSV file in PDI?

    I am executing this transformation on my local machine (Windows 7, 4 GB RAM), PDI version 4.0.1 Community Edition.

    Any help will be highly appreciated.

    Regards,
    Ritesh
    Last edited by riteshskumar; 02-25-2011 at 01:22 AM.

  2. #2
    Join Date
    Aug 2008
    Posts
    563

    Default

    Try specifying multiple copies of the CSV Input step. The 'Running in parallel' option only takes effect when the step runs in more than one copy; each copy then reads its own section of the file.
    Best regards,
    Diethard
    ===============
    Visit my Pentaho blog, which offers tutorials mainly on Kettle, Report Designer and Mondrian
    ===============

  3. #3
    Join Date
    Aug 2008
    Posts
    563

    Default

    Also note that the speed of your transformation will also depend on how quickly MySQL can accept the inserts. If MySQL cannot insert the rows as fast as Kettle can provide them, the whole process has to slow down, so running multiple copies of the CSV Input step will not automatically increase throughput ...
    Best regards,
    Diethard

  4. #4
    Join Date
    Nov 1999
    Posts
    9,729

    Default

    With another operating system (anything except Windows), you could have used named pipes to stream the data into MySQL (with the MySQL Bulk Loader step).
    Jens recently loaded 400,000 rows/s into MySQL that way. MySQL buckled under the load, but that's another story ;-)
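
    For anyone curious what that looks like outside of Kettle: the trick is to create a FIFO, point LOAD DATA INFILE at it, and stream the CSV into the pipe from the other end. A minimal sketch, assuming a local MySQL server, a placeholder database testdb, a table mytable whose columns match the CSV, and a source file data.csv (all of these names are illustrative):

        #!/bin/sh
        # Create a named pipe (FIFO) for MySQL to read from.
        mkfifo /tmp/mysql_load.fifo

        # Start the bulk load in the background; LOAD DATA blocks until
        # the writing end of the pipe is closed. The server needs the
        # FILE privilege and read access to the FIFO.
        mysql -e "LOAD DATA INFILE '/tmp/mysql_load.fifo'
                  INTO TABLE mytable
                  FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
                  LINES TERMINATED BY '\n'" testdb &

        # Stream the CSV into the pipe; no temporary copy touches disk.
        cat data.csv > /tmp/mysql_load.fifo

        wait                      # wait for LOAD DATA to finish
        rm /tmp/mysql_load.fifo   # remove the FIFO

    That is essentially what the MySQL Bulk Loader step does behind the scenes, hence the requirement for an OS with named pipe support.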

  5. #5
    Join Date
    Mar 2010
    Posts
    159

    Default

    Dang developers firehosing the database....

  6. #6
    Join Date
    Apr 2008
    Posts
    4,696

    Default

    Quote Originally Posted by MattCasters
    MySQL buckled under the load but that's another story ;-)
    I'd like to see that story in a blog post or something...

  7. #7
    Join Date
    Nov 1999
    Posts
    9,729

    Default

    I once presented a session at the MySQL User Conference where I discussed the finer points.
    The point back then was that the MySQL bulk load needed to be chopped up into smaller pieces because the index memory on the server was limited. I don't know whether that limitation still exists in recent versions of MySQL; I haven't kept up. But apparently mayhem like that finds the weakest point on the server.
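
    To make "chopped up into smaller pieces" concrete, here is a minimal sketch along those lines (reusing the illustrative testdb and mytable names from the earlier post): split the source file and issue one LOAD DATA per chunk, so the server can flush its index buffers between statements instead of absorbing all 50 million rows in one pass.

        #!/bin/sh
        # Split the big file into 1-million-row chunks: chunk_aa, chunk_ab, ...
        split -l 1000000 data.csv chunk_

        # Load one chunk per statement; LOCAL INFILE must be enabled on
        # both the client and the server for this variant.
        for f in chunk_*; do
            mysql -e "LOAD DATA LOCAL INFILE '$f'
                      INTO TABLE mytable
                      FIELDS TERMINATED BY ','
                      LINES TERMINATED BY '\n'" testdb
        done

        rm chunk_*   # remove the temporary chunks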
