Thread: Kettle 4.2.0 Stable: Table Input does full table read

  1. #1
    Join Date
    Jan 2007
    Posts
    485

    Default Kettle 4.2.0 Stable: Table Input does full table read

    Hi all,

    I am working with a large census data set (roughly 14.9 million rows by 103 columns of variable data types) in MySQL, which I managed to upload and transform from a text file. The large table in MySQL has no index.

    I am currently trying to create a transformation to summarize/aggregate the above large dataset/table, with some 19 columns of data and one measure, by doing a sort, then a group by, and some other transform steps. The problem I am encountering is that the initial Table Input step reads the full table up front, without showing any progress in the "Execution Results" pane; after some 2 or 3 hours the server connection crashes for some strange reason.

    I have also tried getting the DB to do the heavy work by specifying the sort and group by in the Table Input step; I get the same server connection crash after some 2 or 3 hours.

    I have alternatively tried to create the summarization/aggregation via an SQL script directly in MySQL: same problem.

    I am working on Windows Ultimate 64-bit, with MySQL 5.5.16 64-bit as well; my box is a Core i7 with 8 GB of RAM, and I have modified the spoon.bat memory allocation to 2 GB.

    I originally tried the above against MS SQL Server 2008 R2 and encountered similar problems.

    Can anyone advise what configuration in Kettle or in MySQL I should be adjusting to allow this extensive process to run correctly?

    Kind regards, DMurray3

  2. #2
    Join Date
    Sep 2009
    Posts
    809

    Default

    hey DMurray3,

    If you're reading all the data, you need to ask the DB to return results in chunks instead of doing a full read before returning anything. What you describe sounds to me like the JDBC driver accumulating all rows in memory and eventually exhausting available RAM. If that is the issue, you can fix it by enabling the result streaming cursor (it is an option in the database connection dialog). If enabling that leads to strange errors, replace the MySQL driver jar in libext/JDBC with the one from PDI 4.1. The upgraded 5.x driver causes trouble with some versions of the MySQL server.
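
    For illustration, here is a minimal standalone sketch of what that streaming option amounts to at the JDBC level with MySQL Connector/J (a forward-only, read-only statement with a fetch size of Integer.MIN_VALUE). The connection URL, credentials, and the census_table name are placeholders, not anything from your actual setup:

        import java.sql.Connection;
        import java.sql.DriverManager;
        import java.sql.ResultSet;
        import java.sql.SQLException;
        import java.sql.Statement;

        public class StreamingReadSketch {
            public static void main(String[] args) throws SQLException {
                // Placeholder connection details -- adjust for your own server and schema.
                String url = "jdbc:mysql://localhost:3306/census";
                try (Connection conn = DriverManager.getConnection(url, "user", "password");
                     // A forward-only, read-only statement with fetch size Integer.MIN_VALUE
                     // tells MySQL Connector/J to stream rows one at a time instead of
                     // buffering the whole result set in memory.
                     Statement stmt = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY,
                                                           ResultSet.CONCUR_READ_ONLY)) {
                    stmt.setFetchSize(Integer.MIN_VALUE);
                    try (ResultSet rs = stmt.executeQuery("SELECT * FROM census_table")) {
                        long rows = 0;
                        while (rs.next()) {
                            rows++; // process one row at a time; memory use stays flat
                        }
                        System.out.println("Rows read: " + rows);
                    }
                }
            }
        }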

    HTH

    Cheers
    Slawo
    Do you want to know more? Check out Adventures with Open Source BI

  3. #3
    Join Date
    Apr 2008
    Posts
    2,610

    Default

    Quote Originally Posted by DMurray3 View Post
    The large table in MySQL has no index.
    This will cause you problems...
    Even if you are only returning a few of the rows, the database has to do a full table scan to find them.

    Add some indexes, and then pare down the amount of data that you are trying to return (e.g. SELECT COUNT(...) FROM table WHERE col3 = n GROUP BY col3), even if it's a lot less than what you want to have. Build out your aggregated data from there, and you should have a lot more success.
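
    For illustration, a minimal JDBC sketch of that approach follows: add an index on the grouping column and materialize the aggregate once, then work against the much smaller summary table. The connection details and the census_table, col3, and census_summary names are placeholders, not the actual schema:

        import java.sql.Connection;
        import java.sql.DriverManager;
        import java.sql.SQLException;
        import java.sql.Statement;

        public class BuildAggregateSketch {
            public static void main(String[] args) throws SQLException {
                // Placeholder connection details and object names -- adjust for your schema.
                String url = "jdbc:mysql://localhost:3306/census";
                try (Connection conn = DriverManager.getConnection(url, "user", "password");
                     Statement stmt = conn.createStatement()) {
                    // Index the grouping column so the GROUP BY can use the index
                    // instead of sorting the entire raw table on every run.
                    stmt.execute("CREATE INDEX idx_census_col3 ON census_table (col3)");

                    // Materialize the aggregate once; later analysis then runs against
                    // the much smaller summary table instead of the raw data.
                    stmt.execute("CREATE TABLE census_summary AS "
                               + "SELECT col3, COUNT(*) AS row_count "
                               + "FROM census_table "
                               + "GROUP BY col3");
                }
            }
        }
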
    **THIS IS A SIGNATURE - IT GETS POSTED ON (ALMOST) EVERY POST**
    I'm no expert.
    Take my comments at your own risk.

    PDI user since PDI 3.1
    PDI on Windows 7 & Linux

    Please keep in mind (and this may not apply to this thread):
    No forum member is going to do your work for you. We will help you sort out how to do a specific part of the work, as best we can, in the timelines that our work will allow us.
    Signature Updated: 2014-06-30

  4. #4
    Join Date
    Nov 1999
    Posts
    9,689

    Default

    Google gives 1,710,000 results for a query with MySQL JDBC "out of memory".

    The first time I ever tried to do anything meaningful with a big table, I ran out of memory with Kettle on MySQL.
    It then took me around 2 years to find a proper workaround in the form of cursor emulation or result streaming.
    It's hardly documented and is something of a hack, since MySQL lacks proper support for cursors in the engine itself.

    I don't think indexes influence that process at all, although cranking the MySQL JDBC driver back to 3.1 (from PDI 4.1 or 4.2.2) might solve the issue.
    Matt Casters, Chief Data Integration
    Pentaho, Open Source Business Intelligence
    http://www.pentaho.org -- mcasters@pentaho.org

    Author of the book Pentaho Kettle Solutions by Wiley. Also available as e-Book and on the Kindle reading applications (iPhone, iPad, Android, Kindle devices, ...)

    Join us on IRC server Freenode.net, channel ##pentaho

  5. #5
    Join Date
    Jan 2007
    Posts
    485

    Default

    Hi Slawo,

    You hit the problem right on the button. Using the streaming cursor on my MySQL connection worked like a charm, with no need to change the JDBC connector!

    I followed your suggestions, was able to avoid the continuous crashes, and got my transform to process the full 14+ million rows in 29 minutes... which brought smiles and tears.

    Will the same approach work independent of the DB I use (let's say with MS SQL Server 2008)?

    Again, many thanks for your interest and the excellent assist.

    Kind regards, DMurray


    Quote Originally Posted by slawomir.chodnicki View Post
    hey DMurray3,

    If you're reading all the data, you need to ask the DB to return results in chunks instead of doing a full read before returning anything. What you describe sounds to me like the JDBC driver accumulating all rows in memory and eventually exhausting available RAM. If that is the issue, you can fix it by enabling the result streaming cursor (it is an option in the database connection dialog). If enabling that leads to strange errors, replace the MySQL driver jar in libext/JDBC with the one from PDI 4.1. The upgraded 5.x driver causes trouble with some versions of the MySQL server.

    HTH

    Cheers
    Slawo
