Hitachi Vantara Pentaho Community Forums

Thread: Regex for file-inputs

  1. #1

    Regex for file-inputs

    Hey Community,

    today I wondered whether it is possible to access numerous files (in my case CSV files) in numerous directories via a regex. As the folders all have the same structure (something like yyyy-dd-mm), and so do the files in these directories, I want to load all these files into a single table.

    The following is what I have already tried:
    file/dir: <dir>/testDir/
    regular Expression: \\d{4}-\\d{2}-\\d{2}\\\\.*.csv

    where inside the testDir several sub-dirs with the given pattern exist.


    Any help is appreciated

  2. #2
    Join Date
    Apr 2008
    Posts
    1,771

    In a transformation you can use Text File Input and add multiple directories with regex to filter files.
    http://wiki.pentaho.com/display/EAI/Text+File+Input

    In a job you can use a Get File Names and then send those names to a text File Input Step.
    http://wiki.pentaho.com/display/EAI/Get+File+Names

    Mick

  3. #3

    And what exactly do you mean by "multiple directories with regex to filter files"? As I already mentioned, my actual problem is finding the one regex that matches multiple directories.

  4. #4
    Join Date
    Jun 2012
    Posts
    5,534

    Quote, originally posted by HimBromBeere:
    regular Expression: \\d{4}-\\d{2}-\\d{2}\\\\.*.csv

    The regex .+\.csv catches files with the csv extension.
    With the regex you can only cover the file name, not the folder name.
    Last edited by marabu; 09-06-2012 at 03:39 AM.
    So long, and thanks for all the fish.
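
The point about file names can be tried outside PDI with a small Python sketch. It assumes the wildcard is compared against the bare file name as a whole; the helper `matches_name` and the sample file names are made up for illustration, and the patterns are written with single backslashes.

```python
import re

# Pattern from the reply above: one or more characters, a literal dot, then "csv".
escaped = re.compile(r".+\.csv")
# File part of the pattern from the first post: the dot before "csv" is unescaped.
unescaped = re.compile(r".*.csv")
# Pattern that also tries to describe the date folder and a backslash separator.
with_dir = re.compile(r"\d{4}-\d{2}-\d{2}\\.*\.csv")

def matches_name(pattern, file_name):
    """Return True if the whole file name matches the pattern."""
    return pattern.fullmatch(file_name) is not None

print(matches_name(escaped, "data.csv"))    # True
print(matches_name(escaped, "data_csv"))    # False
print(matches_name(unescaped, "data_csv"))  # True: the bare dot matches any character
print(matches_name(with_dir, "data.csv"))   # False: the folder part is never in the name
```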

  5. #5

    "With the regex you can only cover the file name, not the folder name."
    Aaah, OK... that's what I had already assumed. So how do I proceed now? I tried the "get subfolder names" step, but it seems to run endlessly (stopping it only puts it into "Halting", which never finishes either). All I did was enter the parent folder and click "add", since I assume that this step extracts all subfolders within the parent folder.

  6. #6
    Join Date
    Apr 2008
    Posts
    4,696

    Keep in mind that you can chain several get file names steps together to build a multilevel regex...
    Get Filenames 1: /testDir/ ; wildcard: \\d{4}-\\d{2}-\\d{2}
    Add Constant: FileWildcard: .*.csv
    Get Filenames 2: Defined in field: filename field: path wildcard field: FileWildcard

    It's all about building your workflow:
    Step 1) Build a list of directories matching your pattern
    Step 2) Build a list of files in the directories from Step 1 which match the wanted pattern

    Then adapt the workflow to the limitations / features of the steps.
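
Outside of PDI, the same two-step idea can be sketched in Python. This is only an illustration of the workflow, not of how the Get File Names step is implemented; the function names are hypothetical and the patterns use a single backslash and an escaped dot.

```python
import re
from pathlib import Path

DIR_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")  # date-like folder names
FILE_PATTERN = re.compile(r".*\.csv")           # files ending in .csv

def list_matching_dirs(parent):
    """Step 1: sub-directories of parent whose name matches the date pattern."""
    return sorted(p for p in Path(parent).iterdir()
                  if p.is_dir() and DIR_PATTERN.fullmatch(p.name))

def list_matching_files(directories):
    """Step 2: files inside those directories whose name matches the file pattern."""
    return sorted(f for d in directories for f in d.iterdir()
                  if f.is_file() and FILE_PATTERN.fullmatch(f.name))
```

Calling `list_matching_files(list_matching_dirs("testDir"))` would then return the CSV files one level below the date folders.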

  7. #7

    Thanks for the help.

    As I had already assumed, I can't get the whole file structure (dir + fileName) with one single step. The hint from gutlez solved the issue (as far as I can tell), but since there are more than 80 files inside the subdirectories, the step that collects all the files takes very long.

    I'll let you know if something happens...

  8. #8
    Join Date
    Nov 1999
    Posts
    9,729

    Sorry for stating the obvious, but you could simply grab all the files in all sub-folders (Get File Names) and then filter out the ones you want (using, for example, a RegEx step or a Filter Rows step).
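
As a rough sketch of that "grab everything, then filter" approach outside PDI, the following Python walks the whole tree and keeps only paths that match one regex. The function name and the choice to match against the path relative to the parent folder (with "/" as separator) are assumptions made for the example.

```python
import os
import re

# Matched against the path relative to the parent folder, using "/" as separator:
# a date-like folder, then a file name ending in .csv directly inside it.
PATH_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}/[^/]+\.csv")

def find_files(parent):
    """Collect every file below parent, then keep those whose relative path matches."""
    kept = []
    for current_dir, _subdirs, file_names in os.walk(parent):
        for name in file_names:
            full_path = os.path.join(current_dir, name)
            relative = os.path.relpath(full_path, parent).replace(os.sep, "/")
            if PATH_PATTERN.fullmatch(relative):
                kept.append(full_path)
    return sorted(kept)
```

The trade-off is that every file is listed before the filter is applied, which can be slow on large trees.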

  9. #9

    Thanks for your reply.

    Your suggestion seems to work as far as I can tell (reading all 80.000 files takes a while).
    But now there is another problem: the files seem to have different structures. The order of the attributes differs from file to file, and some files contain more attributes than others (but I only want to transfer the "core" set of attributes that is present in all files). Is there any way to pick the attributes by name and write them into the right columns of the table?

    P.S.: Why are none of my posts being published? I wrote 2 today, but neither of them is in the forum right now. Wondering if this one succeeds...
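
To make the "pick attributes by name" idea concrete, here is a minimal Python sketch, not a PDI solution. It assumes comma-separated files with a header row; the column names in `CORE_COLUMNS` and the function name are placeholders.

```python
import csv

CORE_COLUMNS = ["id", "name", "value"]  # hypothetical core attribute names

def read_core_rows(csv_path, core_columns=CORE_COLUMNS):
    """Yield one list per data row, with values ordered like core_columns."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)  # maps each value to its header name
        missing = [c for c in core_columns if c not in (reader.fieldnames or [])]
        if missing:
            raise ValueError(f"{csv_path} lacks columns: {missing}")
        for row in reader:
            # Extra columns are ignored; the column order in the file does not matter.
            yield [row[c] for c in core_columns]
```

Because the values are looked up by header name, files with a different column order or additional columns produce rows in the same layout.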

  10. #10

    Thanks for your reply.

    Your suggestion seems to work as far as I can tell (reading all 80.000 files takes a while).
    But now there is another problem: the files seem to have different structures. The order of the attributes differs from file to file, and some files contain more attributes than others (but I only want to transfer the "core" set of attributes that is present in all files). Is there any way to pick the attributes by name and write them into the right columns of the table?

    P.S.: Why are none of my posts being published? I wrote 2 today, but neither of them is in the forum right now. I can't imagine I posted any bad words, and if I did, please let me know so I can avoid them...
    Wondering if this one succeeds...

  11. #11

    Thanks for the reply.

    Your tip worked fine for me. But I changed one little thing: I had assumed I have to escape the backslashes (as I do in Java), which is why I used \\d for a digit. Deleting one of the backslashes solved the issue, so I finally ended up with \d{4}-\d{2}-\d{2}.

    But what about the following folder structure: \d{4}-\d{2}-\d{2}\\kml\\.*.kml
    Here there is one more subfolder between the actual files (KML) and the parent directories (representing the date). I assume I need 3 "get file name" steps, but I can't get it to work.

    This is what I have tried so far:
    "add constant"-Step: value = kml/
    "add constant"-Step: value = .*

    Any ideas will be appreciated
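
For the three-level case (date folder, kml folder, KML files), the chaining idea can be sketched in Python with one pattern per path level. The function `match_levels` is a made-up name for illustration and says nothing about how the PDI steps need to be wired.

```python
import re
from pathlib import Path

def match_levels(parent, level_patterns):
    """Descend one directory level per pattern; the last pattern selects files."""
    *dir_patterns, file_pattern = [re.compile(p) for p in level_patterns]
    current = [Path(parent)]
    for pattern in dir_patterns:
        # Keep only the sub-directories whose name matches this level's pattern.
        current = [child for d in current for child in d.iterdir()
                   if child.is_dir() and pattern.fullmatch(child.name)]
    return sorted(child for d in current for child in d.iterdir()
                  if child.is_file() and file_pattern.fullmatch(child.name))
```

A call such as `match_levels("testDir", [r"\d{4}-\d{2}-\d{2}", r"kml", r".*\.kml"])` would list the KML files two levels below the parent folder.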

  12. #12

    Thanks for your reply, but it didn't work for me. I get an error saying that the field I declared in the "add constant" step is unknown:
    We can not find Field [{0}] in the input stream! [Mask]
    where [Mask] is the name of the constant, with type String and value "\d{4}-\d{2}-\d{2}"

  13. #13

    Thanks for your help,

    now it finally works, but I had to make one last change: I had assumed that I have to escape the digit class (\d) inside the regex (as I do in Java too), so I had added a backslash to every \d. This was unnecessary, and now it works with this pattern: \d{4}-\d{2}-\d{2}
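
The effect of the extra backslash can be reproduced with a short Python sketch. Raw strings are used so that the regex engine receives exactly the characters that would be typed into the wildcard field; the folder name is a made-up sample.

```python
import re

folder_name = "2012-06-09"

# Typed with doubled backslashes: as a regex, "\\d" means a literal backslash
# followed by the letter d, so "\\d{4}" expects a backslash and four d's.
double_escaped = re.compile(r"\\d{4}-\\d{2}-\\d{2}")
# Typed with single backslashes: "\d" is the digit class.
single_escaped = re.compile(r"\d{4}-\d{2}-\d{2}")

print(double_escaped.fullmatch(folder_name) is not None)  # False
print(single_escaped.fullmatch(folder_name) is not None)  # True
```

The doubled form is only needed inside a Java string literal, where the compiler consumes one level of backslashes before the regex engine sees the pattern.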


Copyright © 2005 - 2019 Hitachi Vantara Corporation. All Rights Reserved.