Hitachi Vantara Pentaho Community Forums
Results 1 to 5 of 5

Thread: Append multiple CSV files having same fields (but in random order) in a unique Table

  1. #1
    Join Date
    Dec 2015
    Posts
    5

    Default Append multiple CSV files having same fields (but in random order) in a unique Table

    Hello,

    I am struggling with a task, maybe you guys can help me out here:

    I have a bunch of CSVs files (999+) inside a directory. I would like to append all inside a DB Table. All these files share the same fields (same names, same datatypes) but *(and there is the catch)*, these fields are arranged in different order in each file. Suppose that there are N sets of layouts (A, B, C, ..., N), as pictured bellow:

    CSV Headers:
    -------------
    file 1 [layout A]: col_1 , col_2 , col_3 , col_4
    file 2 [layout B]: col_2 , col_1 , col_3 , col_4
    file 3 [layout C]: col_2 , col_4 , col_3 , col_1
    file 4 [layout A]: col_1 , col_2 , col_3 , col_4
    ... ...
    file 999 [layout N]: col_4 , col_3 , col_2 , col_1

    Output Table:
    --------------
    [layout A]: col_1 , col_2 , col_3 , col_4

    I suppose that I'd have to use the "ETL Metadata Injection" step (the datatypes, length, etc are well-defined), but the difficulty is to define a CSV layout automatically, based on the header, in a step. I could even read all fields as String and after I could use "Select Values" to convert them, but at all situations the problem arises when different name-ordered streams arises in a step.

    Can anyone give me a hint to solve this?
    I Thank you in advance.

    Regards,

    Flavio.
    Last edited by fdaher; 12-09-2015 at 07:27 PM. Reason: typos

  2. #2
    Join Date
    Apr 2008
    Posts
    1,771

    Default

    How many fields do you have?
    If you have only 4/5 fields than you can parse the first line, see which sequence it is and direct that file to one transformation.
    Obviously you will have to create as many transformations as there are possible sequences.

    But you should probably invest some time in checking the Metadata Injection step.
    -- Mick --

  3. #3
    Join Date
    Dec 2015
    Posts
    5

    Default

    Thanks for the reply, Mick!

    The CSV file has exactly 22 fields. Your suggestion (parsing csv header then direct to correspondent transformation) is good, but it would work, I presume, just for small values of N... small enough so that is possible to write all the N transformations/inputs. The thing is, these CSV files could virtually come in any of the 22 factorial posibilities... that is more than 10^20 !!

    I could define a 'table' containing field names, and all metadata relevant information (that is already well defined for all fields). Then, if I create a Job that iterate through the CSV files, using the parsed header to order the 'metadata table' accordingly, then I am ready to pass the correct metadata to the CSV Input step. And to do so, I imagine that will have to use the 'Metadata Injection' step... that one that I really need to explore deeper, as you pointed

    Well, that Job make sense to you? Do you think that something is missing there?

    Thank you again... any help would be appreciated.

    Regards,

    Flavio.

  4. #4
    Join Date
    Apr 2008
    Posts
    1,771

    Default

    Hi Flavio,

    as you pointed out, my suggestion would work only if you have a limited number of combinations.

    My only suggestion is to test Metadata Injection step - unfortunately I have hardly used it so I cannot help on that but if you do some testing and have a specific issue there will be people in this forum who will be able to help you.
    -- Mick --

  5. #5
    Join Date
    Dec 2015
    Posts
    5

    Default

    Ok Mick,
    I thank you for your attention.

    Regards,
    Flavio.

Tags for this Thread

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  
Privacy Policy | Legal Notices | Safe Harbor Privacy Policy

Copyright © 2005 - 2019 Hitachi Vantara Corporation. All Rights Reserved.