Hitachi Vantara Pentaho Community Forums

Thread: Checking whether an input file has already been loaded

  1. #1
    Join Date: May 2009, Posts: 21

    Checking whether an input file has already been loaded

    Can someone tell me the best way of establishing whether a source txt file has already been processed?

    Currently I have a Get File Names step finding all the filenames in a folder, so I'm guessing the way to accomplish this would be to output the stream to a txt file and then check whether the current filename from the Get File Names step already exists in that text file.

    Is there a step that does this kind of check more efficiently?

    Thanks

  2. #2
    Join Date: Jul 2009, Posts: 15

    Quote Originally Posted by paul.kramer:
    Can someone tell me the best way of establishing whether a source txt file has already been processed?

    Hi,
    You could use a Stream Lookup and a Filter Rows step: the Stream Lookup matches each incoming filename against the list of already-processed names, and Filter Rows keeps only the filenames that found no match.

    Something like this: Get File Names, Text file input -> Stream Lookup -> Filter Rows
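
    In plain code, the logic of that transformation looks roughly like the sketch below (a minimal illustration only, not Kettle itself; the file processed_files.txt and the folder input_folder are assumed names):

        import os

        # Names already processed, one per line (the Text file input side)
        with open("processed_files.txt") as f:
            processed = set(line.strip() for line in f)

        # All candidate filenames in the folder (the Get File Names side)
        candidates = os.listdir("input_folder")

        # Stream Lookup + Filter Rows: keep only names not seen before
        new_files = [name for name in candidates if name not in processed]

        for name in new_files:
            print("to load:", name)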

  3. #3
    Join Date: Feb 2009, Posts: 296

    Using a text file might not be the best of ideas.
    We usually save this information in the data warehouse itself, in some kind of metadata table.

    Reasons for this are:

    • independence from changes in the ETL execution environment
    • better locking/concurrency handling
    • easier to query remotely

    We even go as far as connecting these records to the fact/dimension entries we've loaded.

    This technique also allows you to have multiple states like "I know this file", "I'm working on this file", "This file is crappy - go check it!" or "I've loaded it all good".

    There is a lot of potential to add to this.
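
    As a rough illustration of such a metadata table (a sketch only, using SQLite and invented names like file_load_log; a real warehouse schema and the exact set of states are up to you):

        import sqlite3

        conn = sqlite3.connect("warehouse.db")  # stand-in for the real DW connection
        conn.execute("""
            CREATE TABLE IF NOT EXISTS file_load_log (
                filename   TEXT PRIMARY KEY,
                state      TEXT NOT NULL,   -- e.g. KNOWN / LOADING / FAILED / LOADED
                changed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
            )
        """)

        def already_loaded(filename):
            # True only if the file has been fully and successfully loaded
            row = conn.execute(
                "SELECT state FROM file_load_log WHERE filename = ?", (filename,)
            ).fetchone()
            return row is not None and row[0] == "LOADED"

        def set_state(filename, state):
            # Insert the file on first sight, replace on later state changes
            conn.execute(
                "INSERT OR REPLACE INTO file_load_log (filename, state) VALUES (?, ?)",
                (filename, state),
            )
            conn.commit()

    In Kettle itself you could get the same effect with a Database lookup step against that table before loading, and an Insert / Update step to move a file through the states.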
    Fabian,
    doing ETL with his hands bound on his back

  4. #4
    Join Date: May 2009, Posts: 21

    Loading the processed filenames into a metadata table within the DW has worked well for us so far. However, we now have around 30,000 raw CSV files in a folder which need to be fed into the Kettle job each time a transformation is scheduled, and this many files is causing the job to fail with a Java heap space error.

    I can add a 'move files' step to the job to move the processed files to an archive directory, but that still wouldn't help with the initial run on all 30,000 files. I've already increased the memory size in spoon.bat to 1024 MB, which still isn't sufficient. Any ideas?

  5. #5
    Join Date: May 2009, Posts: 21

    I forgot to add that the job falls over when the Get File Names step passes rows to the database-insert transformations (a 'Get rows from result' step is used there to obtain the rows from the previous transformation).

  6. #6
    Join Date: Feb 2009, Posts: 296

    If it's a one-time initial load I'd just go the manual way: move all the files to a temporary directory and put them back into your input directory in chunks of a few hundred or a few thousand files.

    That should be okay for a one-off.
    And I suggest spreading the input across sub-directories when you move it: 30,000 inodes in a single directory is no fun for most filesystems, and there are only going to be more of them...
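
    A minimal chunking sketch (the directory names staging and input and the chunk size are assumptions, and both directories are expected to exist):

        import shutil
        from pathlib import Path

        SOURCE = Path("staging")   # where all 30,000 files were parked
        TARGET = Path("input")     # the directory the Kettle job reads from
        CHUNK_SIZE = 500           # how many files to hand to the job per run

        # Move one chunk of files back into the input directory.
        files = sorted(SOURCE.glob("*.csv"))[:CHUNK_SIZE]
        for f in files:
            shutil.move(str(f), str(TARGET / f.name))

        remaining = len(list(SOURCE.glob("*.csv")))
        print(f"moved {len(files)} files, {remaining} remaining in staging")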
    Fabian,
    doing ETL with his hands bound on his back
