Hitachi Vantara Pentaho Community Forums
Results 1 to 18 of 18

Thread: Handling tab seperated values and comma separated values in text file input

  1. #1

    Default Handling tab seperated values and comma separated values in text file input

    Hi,

    I am using text file input step to read .txt file. This file can be either tab separated .txt file or comma separated .txt file.
    Both files have same fields. How to handle this in etl?
    we do not know whether user will pass tab separated or comma separated file to the etl transformation.


    Thanks
    Ajinkya

  2. #2

    Default knowing the metadata

    You should always be knowing about the metadata. if not try using metadata injection step


    Quote Originally Posted by Ajinkya View Post
    Hi,

    I am using text file input step to read .txt file. This file can be either tab separated .txt file or comma separated .txt file.
    Both files have same fields. How to handle this in etl?
    we do not know whether user will pass tab separated or comma separated file to the etl transformation.


    Thanks
    Ajinkya
    Regards,
    Dileep
    Mail ID

  3. #3

    Default

    Hi,
    I did not understand how metadata injection step will help?
    Can you please elaborate?

    Thanks,
    Ajinkya

  4. #4
    Join Date
    May 2016
    Posts
    282

    Default

    You create some logic to determine which separator the file is using, and then you inject the separator as metadata.
    Regards
    OS: Ubuntu 16.04 64 bits
    Java: Openjdk 1.8.0_131
    Pentaho 6.1 CE

  5. #5

    Default

    Hi
    I am stuck in logic to determine separator.
    I used text file input step, fixed format (so that I can read whole line as single field), limit 1.
    So that I can read only 1st row of the file.
    Then I need to check whether separator in this first row is tab or comma "," or "|" . Once confirmed I'll set separator in a variable which I'll use in my original text file input step in another transformation.
    Question is How could I check/determine separator in the first row? I tried filter rows step but did not actually worked.

    You could suggest me if you have any other logic/way for doing this.
    Thanks
    Ajinkya
    Last edited by Ajinkya; 06-26-2018 at 06:45 AM.

  6. #6
    Join Date
    Aug 2016
    Posts
    290

    Default

    You would have to count?

    Examples:
    1) "abc,def,ghi" --> count 2 commas
    2) "abc|def|ghi" --> count 2 tabs

    This logic would not work if the values between the separators also can include the separators.
    Alternatively, just format the file first, replacing all tabs with commas and then you can handle everything the same way.

  7. #7

    Default

    Hi,

    No I do not need count.

    Please suggest me other ways. I do not want to manually replace separator in the file and then use as input to the transformation.

    Thanks
    Ajinkya

  8. #8
    Join Date
    Aug 2016
    Posts
    290

    Default

    I'm not saying you do it manually. I'm saying you could do it automatically.

    Why do you not need to count?

    If you don't want to count, I guess your only left with the option to see whether the two different chars exists in the line.

    Examples:
    1) "abc,def,ghi" --> comma exists
    2) "abc|def|ghi" --> tab exists

  9. #9

    Default

    Hi

    Can you please explain me step wise? I am badly stuck with the logic.


    Thanks
    Ajinkya

  10. #10
    Join Date
    Aug 2016
    Posts
    290

    Default

    If you replace all separators, you don't have to do any logic to determine what separators is in the file.

    If you can't replace all separators, you need to determine what the separator is before extracting data, yes?

    Step 1: Determine what separator is in use
    Step 2: Read the file using the separator from Step 1.

  11. #11

    Default

    hi,
    I have tried both ways getting errors in both

    a)replacing all separators.(In following example replacing separator "|" with tab)

    Step 1: Text file input, set filetype in content tab as fixed(so that I have only one field)
    Step 2: Replace String step, Use RegEx = Y , replace \| by \\t
    but its replacing | by \t and not tab.
    Alternative for this I tried using modified Java script value step
    var NewHeader = replace(Field1,"|","\t");
    even this did not work.

    b)Second way to determine separator
    Step 1: Text file input, set filetype in content tab as fixed(so that I have only one field), limit 1 so that I read only one row.
    Step 2: Filter rows check for | contains in row
    if yes then set var_Separator variable as |
    else set var_Separator variable as tab.
    Now problem here is we can not set same variable twice in a transformation. And also how do we set var_separator value to tab?

    Please help
    Last edited by Ajinkya; 06-28-2018 at 01:48 AM.

  12. #12
    Join Date
    Aug 2016
    Posts
    290

    Default

    I don't know about how to set a variable to tab.

    But you need to determine separator type in a separate transformation which must run before using the separator in a subsequent job/transformation to extract data.

    Basically inside a job:

    Start --> Transformation A (determine separator) --> Transformation B (Read file, extract data) --> Success

  13. #13

    Default

    Yes I am following same as you mentioned,
    but I am having issue in assigning varaible a value in Transformation A ,
    if separator is | then set variables step, if tab then set variables 2 step, if comma then set variables step 3.
    But this does not work because we are assigning same variable 3 times.
    I want to know how to fix this.

  14. #14
    Join Date
    Aug 2016
    Posts
    290

    Default

    I would only have a single transformation to set the variable:

    1) Transformation nr1 to set separator variable
    Read first line in file, determine what separator is in use, set variable accordingly (for example: SEPARATOR="|").
    I'm not sure how you gonna insert a tab here.
    2) Transformation nr2 to extract data from file based on the variable set in step 1.

    I'm not sure how I can explain this any different.

  15. #15
    Join Date
    Nov 2009
    Posts
    688

    Default

    In the samples folder ..\samples\jobs\run_all\Run all sample transformations.kjb show how to use a variable with different values
    1. Job1 with transformation1 and job2 (execute for every input row)
    2. Transformation1 get a list of different filenames
    3. Job2 with 2 transformations
    4. Transformation Set Variable: Read first line of tekst file (name comes from transformation1) and determine separator and set the variable
    5, transformation that read the file with the variable set in the previous transformation

  16. #16

    Default

    Hi
    Yes I am following the same.
    My question was how to determine the separator?
    In above thread you can see the ways I tried for determining separator

  17. #17

    Default

    Hi
    First of all thanks for helping.

    I found code to be used in modified javascript value step which determines separator.

    var delimiters = [ ',', ';', '|', '\t' ];
    var counters = new Array();
    var i;
    for (i=0;i<Field1.length;i++) {
    var c=Field1.charAt(i);
    for (d=0;d<delimiters.length;d++) {
    if (c==delimiters[d]) {
    if (counters[c]==null) {
    counters[c]=1;
    } else {
    counters[c]++;
    }
    }
    }
    }
    var maxC=0;
    var max=0;
    for (i=0;i<delimiters.length;i++) { if (counters[delimiters[i]=]=>max) {
    maxC=i;
    max=counters[delimiters[i]];
    }
    }

    var delimiter=""+delimiters[maxC];
    var count = counters[delimiters[maxC]];


    This gives me the separator used in the file.
    I set this value in a variable and used it in another transformation.
    reference: http://www.ibridge.be/?p=273
    Last edited by Ajinkya; 06-29-2018 at 01:40 AM.

  18. #18
    Join Date
    Nov 2009
    Posts
    688

    Default

    You know the number of columns. Then count the number of a separator in the first line of the txt file.
    You can count the number of , with a "User Defined Java Expression". See for example: https://forums.pentaho.com/threads/7...s-in-a-column/
    Then you know if , is a separator

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  
Privacy Policy | Legal Notices | Safe Harbor Privacy Policy

Copyright © 2005 - 2019 Hitachi Vantara Corporation. All Rights Reserved.