Hitachi Vantara Pentaho Community Forums
Results 1 to 10 of 10

Thread: How to remove extra pipes in text sources

  1. #1

    Default How to remove extra pipes in text sources

    Hello,

    I hope someone can help me.

    I have a text source file, let says with this structure separated with pipes (|):

    ID|COLOR|MODEL|VEHICLE

    And I need to save the data to a table, but some times the data comes with some extra pipes and this causes the data to move one or several places, for example:

    Code:
    1|RED|2014|VW
    
    2||RED|2013|GM
    
    3|BLUE||2017||||GM
    
    4|GOLD||2016|||GM|
    
    5|GOLD|   |2016| ||GM|
    Is there a way to remove those extra pipes to accomodate the data?

    Any help would be appreciated!

    Thanks!

  2. #2
    Join Date
    May 2016
    Posts
    280

    Default

    That's a hard input to treat, the standard answer would be "reject the file or the incorrect lines at least" you need to get a proper input with a defined structure to process it.
    Anyway, if the input always has 4 fields, you can try cleaning each line as a whole using regular expressions and/or text operation steps, for example:

    • Clean white spaces (do your color and model fields contain white spaces also? Or is easier and you can delete all white spaces in the line?)
    • Replace || with |

    Once you have your line with the proper format, you can process it as usual.

    If your input doesn't always have the 4 fields, or the fields are mixed, so you can't define a logic to clean the lines, then either you talk with the people providing the input file so they generate a standard output, or if that's not possible, process the lines you can, and generate an error file with the lines you can't process.
    Regards
    OS: Ubuntu 16.04 64 bits
    Java: Openjdk 1.8.0_131
    Pentaho 6.1 CE

  3. #3
    Join Date
    Nov 2009
    Posts
    688

    Default

    Ask for enclosures around the separator

  4. #4
    Join Date
    Aug 2016
    Posts
    289

    Default

    That looks like a real challenge! You got 4 attributes (ID, COLOR, MODEL, VEHICLE). Let's call them a1, a2, a3, a4. The pipes are not the issue here (unless the attribute values can consist of pipes also). One could simply replace them with commas or treat them like commas. The format is the problem here. You seem to always have a complete set of all 4 attributes, but there may be any number of additional empty attributes! The attributes also seem to always be ordered in the same sequence.

    a1,a2,a3,a4
    a1,,a2,a3,a4
    a1,a2,,a3,,,,a4
    a1,a2,,a3,,,a4,
    a1,a2,,a3, ,,a4,

    It should be feasible to handle this in java code:
    1) continue looking for attributes until you found all 4
    2) count the number of attributes so you know which one the current/next will be
    3) skip empty attributes

  5. #5

    Default

    Hello, thank you!

    In this case, the example i posted here is just 4 columns, the original file have more than 30!

    All the fields come with spaces "RED DEVIL", "GENERAL MOTOR COMPANY", etc. So, I think i cant remove all the spaces

    If I replace || with | how can I identify if the next column is really empty and not an txtra pipe?

    Thank you Ana!

    Quote Originally Posted by Ana GH View Post
    That's a hard input to treat, the standard answer would be "reject the file or the incorrect lines at least" you need to get a proper input with a defined structure to process it.
    Anyway, if the input always has 4 fields, you can try cleaning each line as a whole using regular expressions and/or text operation steps, for example:

    • Clean white spaces (do your color and model fields contain white spaces also? Or is easier and you can delete all white spaces in the line?)
    • Replace || with |

    Once you have your line with the proper format, you can process it as usual.

    If your input doesn't always have the 4 fields, or the fields are mixed, so you can't define a logic to clean the lines, then either you talk with the people providing the input file so they generate a standard output, or if that's not possible, process the lines you can, and generate an error file with the lines you can't process.
    Regards

  6. #6

    Default

    Hello,

    How can I count the columns with Java (JS right?)?

    After a full review of the file, all the columns comes with the same lenght, for example, if the color have 10 chars, even if empty comes with 10 spaces. Maybe this can help?

    Thanks!

    Quote Originally Posted by Sparkles View Post
    That looks like a real challenge! You got 4 attributes (ID, COLOR, MODEL, VEHICLE). Let's call them a1, a2, a3, a4. The pipes are not the issue here (unless the attribute values can consist of pipes also). One could simply replace them with commas or treat them like commas. The format is the problem here. You seem to always have a complete set of all 4 attributes, but there may be any number of additional empty attributes! The attributes also seem to always be ordered in the same sequence.

    a1,a2,a3,a4
    a1,,a2,a3,a4
    a1,a2,,a3,,,,a4
    a1,a2,,a3,,,a4,
    a1,a2,,a3, ,,a4,

    It should be feasible to handle this in java code:
    1) continue looking for attributes until you found all 4
    2) count the number of attributes so you know which one the current/next will be
    3) skip empty attributes

  7. #7
    Join Date
    Apr 2008
    Posts
    4,690

    Default

    Quote Originally Posted by garsan View Post
    If I replace || with | how can I identify if the next column is really empty and not an extra pipe?
    You should really go back to your source system with that issue.
    What happens when you have a field with no data, and a field that has erroneous pipes in it? (ie. multiple places where you have ||, but one is real, and one is not)
    How would you do this if you were doing it manually? Are there some key fields (like column 3 and column 4 in your first example) that you could use in a RegEx type pattern?

  8. #8
    Join Date
    Aug 2016
    Posts
    289

    Default

    Would something like this work?

    String line = "5|GOLD| |2016| ||GM|";
    String[] fields = line.split("|");
    ArrayList<String> result = new ArrayList();

    for(int i = 0; i < fields.length; i++)
    {
    .....String field = fields[i];
    .....if(field contains something other than spaces)
    .....{
    ..........result.add(fields[i]);
    .....}
    }
    Last edited by Sparkles; 06-05-2018 at 08:06 AM.

  9. #9
    Join Date
    Dec 2013
    Posts
    13

    Default

    Its easy. first set the value for maximum number of pipes. then calculate the no of pipes in each line using loop concept. delete the line which doesnt match the max no of pipes variable value and move the lines which are matching to a new file. let me know if you need help in shell script coding.
    Thanks & Regards
    Karan Arora

  10. #10
    Join Date
    Aug 2016
    Posts
    289

    Default

    Easier than the 10 lines of code above? Why hassle with os-dependent shell script or additional temp files?

Tags for this Thread

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  
Privacy Policy | Legal Notices | Safe Harbor Privacy Policy

Copyright © 2005 - 2019 Hitachi Vantara Corporation. All Rights Reserved.