Hitachi Vantara Pentaho Community Forums

Thread: Staging multiple data files - some with headers some without headers

  1. #1


    Hello all,
    Here's my issue: every day I receive multiple data files in the same format (datafile_1, datafile_2, datafile_3 and so on) that I want to stage. I am automating the process of picking up the files and staging them into a table. The problem is that only the first file (datafile_1) has a header row; the remaining files do not. Is there a way to use the same Transformation to stage all these files? I'd hate to maintain two versions of the staging Job for something like this. I would appreciate any suggestions, ideas or experience with such a scenario, or a pointer to any neat Kettle functionality that handles it. The header row begins with a # (pound symbol), and no data line begins with #.

    Thanks!
    Last edited by Inder; 08-21-2012 at 06:19 PM.

  2. #2
    Join Date
    Apr 2008
    Posts
    1,771


    Easy.
    Use a Text file input step and import all rows from all files as text, without a header.
    Add a filter step and remove all lines with a # at the beginning.

    Note: you need to define the field names in the Text file input step first.
    All your rows will be treated as coming from one file only.

    Mick
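
    For readers who think in code, the read-everything-then-filter idea above can be sketched in Python. This is only an illustration of the logic, not PDI itself; the function name `read_data_lines` and its parameters are made up for this example.

```python
from typing import Iterable, Iterator


def read_data_lines(paths: Iterable[str], header_marker: str = "#") -> Iterator[str]:
    """Yield every non-header line from all files as one continuous stream.

    Hypothetical helper: mirrors "read all files without a header, then
    filter out rows that start with #". It is not part of Kettle/PDI.
    """
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            for raw in handle:
                line = raw.rstrip("\r\n")
                # Per the thread, only header rows start with the marker,
                # so dropping them is safe. Blank lines are skipped as well.
                if line and not line.startswith(header_marker):
                    yield line
```

    Because the files are chained into one stream, it does not matter which of them carries the header row.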

  3. #3


    Thanks a lot Mick, that should do it. A couple of follow-up questions, if I could pick your brain:
    1) Regarding the filter: would it be best to use a "Java Filter" step to find the rows that begin with a #?
    2) My input file has all sorts of data types that I want to preserve. Will I need to declare every field in "Text file input" as a String to begin with, and then convert each to the right data type after filtering out the unwanted rows?
    Last edited by Inder; 08-22-2012 at 01:32 PM.

  4. #4
    Join Date
    Apr 2008
    Posts
    4,696


    Could you post a couple of sample files?

    There's an approach that I'd like to try (using filter in the Text Input), but would want to test with valid files first.

  5. #5


    Thanks! Sure, I've attached 2 sample files with far fewer columns than my real file. datafile_1 has a header (starting with #) and datafile_2 has no header. The column delimiter is |~|.
    Would appreciate any help and cool tricks.


  6. #6
    Join Date
    Apr 2008
    Posts
    4,696


    I was able to get this to work as follows:

    Use Text File Input
    In Filename, select just your first sample
    Leave headers turned on for now
    Set the delimiter to |~|
    Go to Fields tab, and tell PDI to "Get Headers"
    OK to the sample size
    OK to the Text File Input.
    Preview (your data for File 1 should look OK at this point. If it doesn't, go back and fix it.)

    Reopen the Text File Input
    On the Content tab, uncheck the Header box
    On the Filters tab, enter # as the filter text and 0 as the position (if the first character of the line is #, ignore this line)
    Click OK and preview. File 1 data should still be OK.

    Reopen the Text File Input
    Change the filename boxes to specify all the appropriate files (either by name or by wildcard)
    OK and preview.
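
    As a rough Python analogue of the settings above (filter text # at position 0, delimiter |~|, fields typed up front), the logic might look like the sketch below. It is not PDI code; `is_filtered`, `parse_rows` and the field names `id`, `name`, `amount` are assumptions made up for the example, since the real columns are only in the attached files.

```python
from typing import Callable, Dict, Iterable, Iterator, Sequence, Tuple

DELIMITER = "|~|"  # column delimiter given in the thread

# A field is a (name, converter) pair; names here are hypothetical.
Field = Tuple[str, Callable[[str], object]]


def is_filtered(line: str, filter_text: str = "#", position: int = 0) -> bool:
    """True when filter_text occurs at the given position of the line."""
    return line.startswith(filter_text, position)


def parse_rows(lines: Iterable[str], fields: Sequence[Field],
               delimiter: str = DELIMITER) -> Iterator[Dict[str, object]]:
    """Skip filtered (header) lines, split the rest, and type each value."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line or is_filtered(line):
            continue  # header row or blank line: ignore it
        values = line.split(delimiter)
        # Types are applied while reading, so no later conversion step is needed.
        yield {name: convert(value)
               for (name, convert), value in zip(fields, values)}


# Hypothetical sample: one header line followed by two data lines.
sample = ["#id|~|name|~|amount", "1|~|widget|~|9.5", "2|~|gadget|~|12"]
fields = [("id", int), ("name", str), ("amount", float)]
rows = list(parse_rows(sample, fields))
```

    Filtering at read time is what lets the field types be declared once at the start, instead of reading everything as strings and converting after a separate filter step.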
    Last edited by gutlez; 08-22-2012 at 03:05 PM.

  7. #7


    Outstanding! That does exactly what I need. I can specify data types from the very beginning, and it is probably more efficient than adding a filter as an extra step. Many thanks to you (and to whoever at Pentaho came up with the Filters feature on the Text file input step, very useful). I appreciate the time you took to help a newbie.
    -Inder


Copyright © 2005 - 2019 Hitachi Vantara Corporation. All Rights Reserved.