Hitachi Vantara Pentaho Community Forums

Thread: Massive Performance Problem with Transformation

  1. #1

    Default Massive Performance Problem with Transformation

    Hi,

    please take a look at my attachment, it shows my transformation.

    With the Excel Input I read 50k records into the transformation.
    The Group By is needed for the row count of the Excel file.
    I generate one row with one sequence.
    Both streams are merged with Join Rows.
    -> This takes about 2 minutes until all rows are finished.

    After the Join Rows I add another Oracle sequence and copy the stream to 2 Table Outputs (dummies in this example).

    But now the main problem:

    Up to 10k records the speed is medium and OK (300-400 records/sec).
    After 10k it drops to only 2 rows/sec.

    The transformation is started locally (CPU 99% and memory 212MB).

    I increased -Xmx to 1024M in Kettle.l4j.ini -> nothing happened.

    Who can help me?
    Attached Images

  2. #2
    Join Date
    May 2006
    Posts
    4,882

    Default

    Provide the xls and the transformation and I can play a little in a couple of hours.

    Btw, does the transformation with dummies have the same problem, or is the problem Table Output related?

    Btw, with the SVN trunk version of PDI you could probably get rid of 2 Select Values steps.

    Regards,
    Sven

  3. #3

    Default

    Quote Originally Posted by sboden View Post
    Provide the xls and the transformation and I can play a little in a couple of hours.

    Btw, does the transformation with dummies have the same problem, or is the problem Table Output related?

    Btw, with the SVN trunk version of PDI you could probably get rid of 2 Select Values steps.

    Regards,
    Sven
    Why is that?

    2. Both have the same problem (Table Output and dummies).
    3. When will that version come out as a final release? I don't know what SVN is.

    One problem could be the createByteArray for every row in the Script Values step. Now I use 2 Java Script Values steps.
    One for:

    trans_Status = CONTINUE_TRANSFORMATION;
    if (getProcessCount("r") > 1) {
        trans_Status = SKIP_TRANSFORMATION;
    }

    and one for:

    // Read the file into a JavaScript blob equivalent (byte array).
    file = new Packages.java.io.File(I_FILE_NAME);
    fileInputStream = new Packages.java.io.FileInputStream(file);

    var I_FILE = Packages.org.pentaho.di.core.Const.createByteArray(file.length());

    // org.pentaho.di.core
    fileInputStream.read(I_FILE, 0, file.length());
    fileInputStream.close();

    Now I get an out-of-memory heap error...

  4. #4
    Join Date
    May 2006
    Posts
    4,882

    Default

    Attach the complete xls (a sanitized version) and your transformation and we can see for ourselves.

    For the rest, blobs in ETL... while sometimes necessary, they are usually more problematic than they're worth. What I usually try to do is work with a pointer to the blob and keep the blob itself out of the processing as much as possible. It could be, for example, that in your case the garbage collector doesn't collect your blobs fast enough.
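
    A minimal sketch of what this pointer-to-blob idea could look like in a Modified Java Script Value step, just before the Table Output that stores the blob. It is only an illustration, not a tested solution; the Filepath field and the first-row check are taken from the code elsewhere in this thread, everything else is an assumption.

    // Sketch: the stream only carries the path; the file contents are loaded
    // at the last possible moment, for the single row that stores the blob.
    var path = Filepath.getString();

    trans_Status = CONTINUE_TRANSFORMATION;
    if (getProcessCount("r") > 1) {
        // All other rows pass through without ever touching the file contents.
        trans_Status = SKIP_TRANSFORMATION;
    } else {
        // Load the blob only here, right before it is written out.
        var blobFile = new Packages.java.io.File(path);
        var blobStream = new Packages.java.io.FileInputStream(blobFile);
        var I_FILE = Packages.org.pentaho.di.core.Const.createByteArray(blobFile.length());
        blobStream.read(I_FILE, 0, blobFile.length());
        blobStream.close();
    }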

    Regards,
    Sven

  5. #5
    Join Date
    Nov 1999
    Posts
    9,729

    Default

    Up to 10k records the speed is medium and OK (300-400 records/sec).
    After 10k it drops to only 2 rows/sec.
    This is typical for a database lookup (Insert/Update, DB Lookup, Dim lookup, Combi Lookup, etc) in those cases where there isn't an index on the target table.

  6. #6

    Default

    Quote Originally Posted by sboden View Post
    Attach the complete xls (a sanitized version) and your transformation and we can see for ourselves.

    For the rest, blobs in ETL... while sometimes necessary, they are usually more problematic than they're worth. What I usually try to do is work with a pointer to the blob and keep the blob itself out of the processing as much as possible. It could be, for example, that in your case the garbage collector doesn't collect your blobs fast enough.

    Regards,
    Sven.
    Sorry, I cannot attach the xls. The data is not for public release.
    I add only one blob per transformation, regardless of how many records it contains,
    so the blob itself is not causing my performance problems.

    EDIT!: I found a big problem with the source code in my Java Script Values step. I need exactly one row to import the blob, but
    the step reads 10000 rows and writes only 1! How can I optimize my source code? (It has changed a little bit now.)

    // Declare variables
    var I_FILE_NAME = "";
    var Filesize = "";
    var Filename = "";
    var Fileextension = "";
    var temp = "";

    Filesize = getFileSize(Filepath.getString());
    Filename = getShortFilename(Filepath.getString());
    Fileextension = getFileExtension(Filepath.getString());

    trans_Status = CONTINUE_TRANSFORMATION;

    if (getProcessCount("r") == 1) {
        // get the full path to the file
        I_FILE_NAME = Filepath.getString();

        // Read into a javascript blob equivalent.
        file = new Packages.java.io.File(I_FILE_NAME);
        fileInputStream = new Packages.java.io.FileInputStream(file);

        var I_FILE = Packages.org.pentaho.di.core.Const.createByteArray(file.length());

        fileInputStream.read(I_FILE, 0, file.length());
        fileInputStream.close();
    }

    if (getProcessCount("r") > 1) {
        trans_Status = SKIP_TRANSFORMATION;
    }
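
    One possible, untested rearrangement of the script above: since only the first row does any real work, the skip check can come first, so that getFileSize, getShortFilename and getFileExtension are not evaluated for the other 9999 rows. This is just a sketch of the same logic reordered, under the same assumptions about the Filepath field.

    trans_Status = CONTINUE_TRANSFORMATION;

    // Skip everything except the first row as early as possible.
    if (getProcessCount("r") > 1) {
        trans_Status = SKIP_TRANSFORMATION;
    } else {
        var I_FILE_NAME = Filepath.getString();

        // File metadata is only needed for the one row that is kept.
        var Filesize = getFileSize(I_FILE_NAME);
        var Filename = getShortFilename(I_FILE_NAME);
        var Fileextension = getFileExtension(I_FILE_NAME);

        // Read the file into a byte array for the blob field.
        var file = new Packages.java.io.File(I_FILE_NAME);
        var fileInputStream = new Packages.java.io.FileInputStream(file);
        var I_FILE = Packages.org.pentaho.di.core.Const.createByteArray(file.length());
        fileInputStream.read(I_FILE, 0, file.length());
        fileInputStream.close();
    }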

    Quote Originally Posted by MattCasters View Post
    This is typical for a database lookup (Insert/Update, DB Lookup, Dim lookup, Combi Lookup, etc) in those cases where there isn't an index on the target table.
    I have performance problems with dummies too!
    Apart from that, I now work with my real output tables. All tables have an index on the primary key. I don't use Insert/Update, DB Lookup and so on. I only use 2 Oracle sequences and 3 Table Outputs: 1 Table Output with one record (the blob, no problems there).
    The other 2 Table Outputs plus the Table Output mapping are terribly slow.

    The Excel file now has 10k records. I found out that I can influence this with the transformation properties.

    Transformation properties -> Miscellaneous -> Nr of rowsets = 5000 and Feedback size = 5000

    Importing 10k with these properties takes over 12 minutes.

    When I set Nr of rowsets = 10000 and Feedback size = 10000,

    importing 10k takes 90 seconds.

    I cannot set these properties to 50000 because I get the out-of-memory heap error again.

    Btw: when my transformation aborts at 4k, not all rows are rolled back in the database. That is really bad! Any options?
    Last edited by KettleFan; 12-03-2008 at 03:29 AM.

  7. #7
    Join Date
    Nov 1999
    Posts
    9,729

    Default

    Nobody here told you to increase the rowset buffer size. It's a bad idea (it lowers parallelism), so stop doing that.

    As for the "Add Sequence" step: that is probably the slowest part of the whole transformation.
    Try to parallelize that somehow.

  8. #8

    Default

    Quote Originally Posted by MattCasters View Post
    Nobody here told you to increase the rowset buffer size. It's a bad idea (it lowers parallelism), so stop doing that.

    As for the "Add Sequence" step: that is probably the slowest part of the whole transformation.
    Try to parallelize that somehow.
    I experimented with all the options. Lower parallelism is no problem for me. But I have decreased the rowset buffer size back to the default.

    Another question:

    In the Java Script Value:

    if (getProcessCount("r") > 1) {
        trans_Status = SKIP_TRANSFORMATION;
    }

    I only want the first row because that is my blob row; I want to save the input file into a blob field only once.
    Now I have a problem when I select more than one file in the "Excel Input": with 4-5 Excel files I still get only one row for the blob.
    I need one blob row per Excel input file. For example, with 5 Excel files I need 5 rows with blobs.
    Does anybody have a suggestion for how I can change this?
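
    One possible approach, only a rough sketch and not a tested solution: instead of counting processed rows, compare the current Filepath with the value from the previous row and keep just the first row of each file. This assumes the rows of one file arrive together, that the Filepath field is present on every row, and that prevFilepath is initialized once, for example in a separate start script of the Modified Java Script Value step.

    // Start script (assumed, runs once): var prevFilepath = "";

    // Transform script: pass one row per distinct file, skip the rest.
    trans_Status = SKIP_TRANSFORMATION;

    var currentPath = Filepath.getString();
    if (currentPath != prevFilepath) {
        // First row of a new file: remember it and load its blob.
        prevFilepath = currentPath;

        var file = new Packages.java.io.File(currentPath);
        var fileInputStream = new Packages.java.io.FileInputStream(file);
        var I_FILE = Packages.org.pentaho.di.core.Const.createByteArray(file.length());
        fileInputStream.read(I_FILE, 0, file.length());
        fileInputStream.close();

        trans_Status = CONTINUE_TRANSFORMATION;
    }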
