Hitachi Vantara Pentaho Community Forums
Results 1 to 5 of 5

Thread: Problem with transformation

  1. #1
    Join Date
    May 2007
    Posts
    22

    Default Problem with transformation

    Hi,
    We are testing with production data. Our transformation writes to a staging (csv) file from the source db. The file has about 500,000 records. Then in the next transformation we read from the csv file and point it to a dimension component. We have an error where after the 35508 record no records get inserted to the dimension. When checking the logs we find at the time of the time of the reading of the 35508 the log says the connection is closed. We are not using connection pooling. We dont have any other parelal flows so it seems the dimnsin component wont reinitialize the connection if it gets droped. Do yo you think using pooling will sor t out the problem? If so why? It seems the dimension component will still have to reconnect if the the connection it has gets droped?

    Thanks,
    Gavin

  2. #2
    Join Date
    May 2006
    Posts
    4,882

    Default

    Check the logs of your database to see what happened.

    No automatic reconnects are done in any component (also not when using connection pooling) as that would get in the way of restartability. If the connect is dropped on the way, consider your ETL has failed and restart processing.

    How sure would you be of your data when the connection is automatically reconnect. what about commits, ....

    Regards,
    Sven
    Last edited by sboden; 07-20-2007 at 02:41 AM.

  3. #3
    Join Date
    May 2007
    Posts
    22

    Default

    Hi Sven,
    Thanks for the amazingly quick reply . I was thinking along the bellow lines to solve this problem.

    1. Save the number of records read in to the staging file.
    2. Save the number of records that were stored in to the dimension table.
    3. Just before the job finishes (both transformations are part of one job) check whether the two counts match.
    4. If not run the dimension transformation again. But, I need to remove allready processed data from the csv file.

    That’s where the problem is how do I don that ?

    The other option some one else suggested was setting the reconnect flag in the mysql connection string have try though I doubt it would work

    Thanks,
    Gavin

  4. #4
    Join Date
    May 2006
    Posts
    4,882

    Default

    1. Save the number of records read in to the staging file.
    2. Save the number of records that were stored in to the dimension table.
    3. Just before the job finishes (both transformations are part of one job) check whether the two counts match.
    4. If not run the dimension transformation again. But, I need to remove already processed data from the csv file.

    That’s where the problem is how do I don that ?
    That's the whole problem... there's no really good general way 4. It depends on what you exactly store, ...
    For e.g. if your dimensions are type 1 you just rerun everything from your staging area. Type 2, in some cases you should also be able to rerun the whole input, in other hybrid cases it's more difficult.
    What you hit upon now is technically the most hard part in ETLs. Even Kimball mostly glosses that over as "technical details" ... no easy answers I'm afraid.

    The other option some one else suggested was setting the reconnect flag in the mysql connection string have try though I doubt it would work
    Well, I doubt it will work as well. And anyway, 1 day you will hit the problem that your database really goes down for a couple of hours while processing, so you have to cover that case anyway.

    Regards,
    Sven
    Last edited by sboden; 07-20-2007 at 07:07 AM.

  5. #5
    Join Date
    May 2006
    Posts
    4,882

    Default

    One solution e.g. is to load your file to a staging area and assign an ever increasing sequence value to it. And save that sequence number in your dimensions as a type 1 attribute. If you need to rerun... select the max sequence number of your dimension and then run from that number onwards for the input.

    It all depends.

    Regards,
    Sven

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  
Privacy Policy | Legal Notices | Safe Harbor Privacy Policy

Copyright © 2005 - 2019 Hitachi Vantara Corporation. All Rights Reserved.