Hitachi Vantara Pentaho Community Forums

Thread: Loss of information - Table Output

  1. #1
    Join Date
    Mar 2006
    Posts
    114

    Loss of information - Table Output

    I found a problem in my transformation when using the "Table Output" step with the options "Use batch update for inserts" and "Commit size" = 100. With these options I get a large gain in processing performance, but I end up losing some important data.
    For example, in a transformation of 1,000,000,000 lines, some of those lines duplicate data that already exists in the database. During the transformation, CONSTRAINT errors occur. Apparently, when a commit of 100 lines hits this error, let's say with 80 valid lines and 10 invalid ones, it ends up losing 20 percent of the information, and I don't know why it throws out the rest of the data.
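
    Roughly what I think is happening underneath, sketched at the JDBC level (this is not the actual Table Output code; the connection string, table and column names are made up):

        import java.sql.*;

        public class BatchSketch {
            public static void main(String[] args) throws Exception {
                // Hypothetical Oracle connection, purely for illustration.
                Connection con = DriverManager.getConnection(
                        "jdbc:oracle:thin:@host:1521:orcl", "user", "password");
                con.setAutoCommit(false);

                PreparedStatement ps = con.prepareStatement(
                        "INSERT INTO target_table (id, val) VALUES (?, ?)");
                for (int id = 1; id <= 100; id++) {
                    ps.setInt(1, id);              // suppose one of these ids already exists
                    ps.setString(2, "row " + id);
                    ps.addBatch();                 // rows are only buffered at this point
                }
                try {
                    ps.executeBatch();             // the whole batch is sent in one round trip
                    con.commit();
                } catch (BatchUpdateException e) {
                    // A single duplicate key makes the driver report the batch as failed,
                    // and the valid rows of that batch can be lost together with the bad one.
                    con.rollback();
                }
            }
        }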

    What could be done in this case?
    What I really need is to insert all the data; there is no need to update anything.

    regards

  2. #2
    Join Date
    Jul 2007
    Posts
    1,013


    I'm not sure I'm fully understanding your problem. Are you saying that you have a unique constraint in the DB? Then it's only normal that it will not let you insert duplicates.

    If you have no such constraint then the Table Output should just insert the data no matter what. If this is not the case then we'll probably need some more information, perhaps the transformation log and/or the transformation itself -if possible- so we can take a look.

    Cheers!

  3. #3
    Join Date
    Mar 2006
    Posts
    114


    Exactly.
    The real problem is that when the step commits 100 lines and one of them already exists in the database, the "Table Output" step discards all 100 lines, when it should have been only 1 line.


    Maybe that makes it clearer.

    regards

    conca

  4. #4
    Join Date
    Jul 2007
    Posts
    1,013


    Ahhh, I understand now. Then it makes sense: the whole 100-line batch is probably getting dumped when one of the lines is a duplicate.

    That's as far as my knowledge goes on this one; I don't know of a way to overcome that problem. Someone wiser will need to step up :P

  5. #5
    Join Date
    May 2006
    Posts
    4,882


    It's database related... I've seen the behaviour in other tools as well. Not using batches is the only solution found so far.

    Regards,
    Sven

  6. #6
    Join Date
    Nov 1999
    Posts
    9,729


    You can try to enable error handling for the step to see what happens.
    For most databases, batch updating works fine although PostgreSQL is still the trouble child in some configurations.
    I think that in 3.1.0 we have a good solution for that database as well.

  7. #7
    Join Date
    Mar 2006
    Posts
    114


    I am using Oracle and Spoon 3.1.0, and I enabled error handling on the step, sending the error rows to a Dummy step.

    Well, since my problem is the discarding of lines when a CONSTRAINT error occurs, and there is no need to update, the step that receives the error rows could be another "Table Output" that ignores the errors.
    Perhaps that would be the best solution: avoid losing the information and still get the gain in performance?
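
    Something like this is the layout I have in mind (just a sketch of the hops; I am assuming the second "Table Output" runs without batch updates, or with "Ignore insert errors" checked):

        [Input] --> [Table Output, batch update on]
                          |
                          |  error handling hop (rows from the failed batches)
                          v
                    [Table Output #2, no batch update / ignore insert errors]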

    Regards,
    Conca

  8. #8
    Join Date
    Nov 1999
    Posts
    9,729


    What "loss of information" are you referring to anyway? Are you suggesting we should ignore all insert errors just like that?

    mmmm.

  9. #9
    Join Date
    Mar 2006
    Posts
    114


    Matt,

    Quote Originally Posted by MattCasters View Post
    What "loss of information" are you referring to anyway?
    The loss of information I was referring to is in this comment I made:
    "The real problem is that when the step commits 100 lines and one of them already exists in the database, the 'Table Output' step discards all 100 lines, when it should have been only 1 line."

    Quote Originally Posted by MattCasters View Post
    Are you suggesting we should ignore all insert errors just like that?
    Maybe I was not clear when talking about "ignoring the errors". Well, I was referring to the Table Output option "Ignore insert errors".

  10. #10
    Join Date
    Nov 1999
    Posts
    9,729


    Like Sven said, the way batch inserts typically work is that you send the lines in batches (100 rows at a time) over the network.
    This increases performance (10-40%) but in your case it also means that you end up in an all-or-nothing situation with the "ignore insert errors" option.

    Like I said, it's better to use error handling like this:

    http://www.ibridge.be/?p=32

    We could also extend "ignore insert errors" by analysing the error rows reported by the database, but that is not implemented yet.
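
    At the JDBC level, a crude version of that analysis could look something like the sketch below. The insertWithFallback helper is hypothetical and is not what the step does today; it just shows the idea of retrying a failed batch row by row so that only the real duplicates are dropped.

        import java.sql.*;

        public class BatchFallbackSketch {
            // Hypothetical helper: try the fast batch insert first and, if the batch
            // fails, roll it back and re-insert the rows one by one so that only the
            // genuinely failing rows are skipped (or handed over to error handling).
            static void insertWithFallback(Connection con, String sql, Object[][] rows)
                    throws SQLException {
                con.setAutoCommit(false);
                PreparedStatement ps = con.prepareStatement(sql);
                for (Object[] row : rows) {
                    for (int c = 0; c < row.length; c++) {
                        ps.setObject(c + 1, row[c]);
                    }
                    ps.addBatch();
                }
                try {
                    ps.executeBatch();               // fast path: one round trip
                } catch (BatchUpdateException e) {
                    con.rollback();                  // throw away the half-done batch
                    ps.clearBatch();
                    for (Object[] row : rows) {      // slow path: row by row
                        for (int c = 0; c < row.length; c++) {
                            ps.setObject(c + 1, row[c]);
                        }
                        try {
                            ps.executeUpdate();
                        } catch (SQLException dup) {
                            // a real constraint violation: skip just this row
                        }
                    }
                }
                con.commit();
                ps.close();
            }
        }

    The slow path only kicks in for the batches that actually contain a bad row, so most of the performance gain of batching is kept.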

    Matt
