Hitachi Vantara Pentaho Community Forums

Thread: Batch Numbering of file Imports with Postgres 8.x

  1. #1
    Join Date
    Jun 2007
    Posts
    233

    Batch Numbering of file Imports with Postgres 8.x

    Hi Guys,

    I just wanted to pass an idea before your eyes for your considered feedback. I am in the process of creating a small monster datafile import, where I have a few thousand files to play with. What I need to do for 'quality control' is to apply batch number(s) to each file in the process, and then use this as a 'marker' for each incoming row to match it to the batch. If there is a problem we can identify the file, the row, and in the end the mongrel who gave it to us :-)

    What I was going to do was create a job with two steps (plus the start of course). The first is a Get File Names step that dumps the filenames to a table with a 'serial' data-type column, which generates the batch number for each file.
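
    As a rough sketch of that first table (names here are only placeholders, not anything from the actual job), in Postgres 8.x it could look something like:

        -- Placeholder names; the 'serial' column hands out the batch number
        CREATE TABLE file_batch (
            batch_id  serial PRIMARY KEY,        -- auto-generated batch number
            filename  text NOT NULL,             -- path captured by the Get File Names step
            loaded_at timestamp DEFAULT now()    -- when the file was registered
        );

        -- Each filename the job inserts picks up the next batch number automatically
        INSERT INTO file_batch (filename) VALUES ('/import/incoming/example_file.csv');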

    The second step is to read the table with the filenames and batch number(s), set some variables, read the files (using the 'execute for each row' approach with a job), and pump the incoming data to a target table with the associated batch number next to each row.
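
    The target table would then simply carry the batch number alongside the payload, for example (again only placeholder names and columns):

        CREATE TABLE import_target (
            batch_id integer NOT NULL REFERENCES file_batch (batch_id),
            row_num  integer,    -- row number within the source file, if captured
            col_a    text,       -- stand-ins for the actual payload columns from the files
            col_b    numeric
        );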

    I was thinking of adding some data validation later to the process when I see how it is working out, but for now I think this approach is okay. Does anyone have a better way of approaching this or any experience with problems using this approach? I would be interested to hear / discuss what others have seen and done :-)

    Always grateful for the feedback.

    The Frog
    Everything should be made as simple as possible, but not simpler - Albert Einstein

  2. #2
    Join Date
    Nov 1999
    Posts
    9,729


    It's a good approach. You could probably do it in a faster way with some short-cuts, but it's going to suffer from transparency problems.
    I would carry that "batch id" / "file id" along with me as far as I could, even to the target tables.
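
    To illustrate why carrying it that far pays off: with the placeholder tables sketched above, a suspect row can be traced straight back to its source file with a simple join, e.g.:

        -- Which file (and therefore which supplier) did the bad rows come from?
        SELECT t.row_num, fb.batch_id, fb.filename, fb.loaded_at
        FROM   import_target t
        JOIN   file_batch  fb ON fb.batch_id = t.batch_id
        WHERE  t.col_a IS NULL;    -- stand-in for whatever the quality check is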

  3. #3
    DEinspanjer Guest


    Your approach sounds fine. As Matt said, you could use some shortcuts (like passing all the files directly to the Text File Input step, having that step output the filename and row number, then using a Combination Lookup/Update to store the filename in a dimension table to get your "batch id"), but I would really only look at that if this were a one-time fix-up rather than an ETL process that will become a regular thing.
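
    In SQL terms, what that Combination Lookup/Update shortcut boils down to is roughly the following (placeholder names again; the step does the lookup/insert for you, this is only to show the idea, and Postgres 8.x has no ON CONFLICT so it is written as two statements):

        CREATE TABLE file_dim (
            batch_id serial PRIMARY KEY,
            filename text UNIQUE NOT NULL
        );

        -- Insert the filename only if it is not already in the dimension...
        INSERT INTO file_dim (filename)
        SELECT '/import/incoming/example_file.csv'
        WHERE NOT EXISTS (
            SELECT 1 FROM file_dim WHERE filename = '/import/incoming/example_file.csv'
        );

        -- ...then read back the surrogate key to use as the "batch id"
        SELECT batch_id FROM file_dim
        WHERE  filename = '/import/incoming/example_file.csv';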

    I posted an example of a subtransformation that I've used in the past to process files and record both a fingerprint of each file and its last processing state. This lets you easily resume an aborted ETL run, picking up with the next queued file, and also easily re-process files that have changed. You might want to take a look at it. It might have some useful nuggets in there.
    http://forums.pentaho.org/showthread.php?t=62794
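
    The linked example isn't reproduced here, but the general shape of such a file-state table (purely a sketch with made-up names) would be something like:

        CREATE TABLE file_state (
            filename     text PRIMARY KEY,
            fingerprint  text,         -- e.g. an MD5 checksum, recomputed to spot changed files
            status       text,         -- e.g. 'queued', 'processing', 'done', 'failed'
            processed_at timestamp
        );

        -- Resuming an aborted run: grab the next file that isn't finished yet
        SELECT filename FROM file_state
        WHERE  status <> 'done'
        ORDER  BY filename
        LIMIT  1;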
