Hitachi Vantara Pentaho Community Forums
Results 1 to 3 of 3

Thread: merge "split field to rows" with "Regex evaluation"?

  1. #1
    Join Date
    May 2013
    Posts
    24

    Default merge "split field to rows" with "Regex evaluation"?

    Hi!
    I have a simple file to parse, and I am trying to do it with kettle rather than in php
    The structure of the file is something like that:

    Code:
    12498278         NAME LASTNAME                                                      |                                |
    AFF. SCADUTO ->   17.895,63    2.419,12    2.050,00   9   59         %        28,00 |                28,00           |      28,00
    AFFIDATO ----->               PERIODO DI AFFIDAMENTO 11/04/13 - 08/10/13  DBT       |                                |
    RECUPERATO --->       80,00                                                         |                                |
    LIQ DAL 30/05/13      80,00 30/05/13 T.9 %REC  0,3 PRV 35,0 SCA  0 REC  0     28,00 |                                |
    ____________________________________________________________________________________|________________________________|______________
    12534009         NAME2 LASTNAME2                                                    |                                |
    AFF. SCADUTO ->      958,03      284,70      150,00   9    8       22%       112,00 |               112,00           |     112,00
    AFFIDATO ----->               PERIODO DI AFFIDAMENTO 11/04/13 - 08/10/13  DBT       |                                |
    RECUPERATO --->      320,00                                                         |                                |
    LIQ DAL 30/05/13     160,00 15/04/13 T.9 %REC 22,9 PRV 35,0 SCA  0 REC  0     56,00 |                                |
    LIQ DAL 30/05/13     160,00 15/04/13 T.9 %REC 22,9 PRV 35,0 SCA  0 REC  0     56,00 |                                |
    ____________________________________________________________________________________|________________________________|______________
    12662540         XX XXXX XXX                                                        |                                |
    AFF. SCADUTO ->    2.530,21      101,19      300,00   9   19        1%        17,50 |                17,50           |      17,50
    AFFIDATO ----->               PERIODO DI AFFIDAMENTO 11/04/13 - 08/10/13  DBT       |                                |
    RECUPERATO --->       50,00                                                         |                                |
    LIQ DAL 31/05/13      50,00 31/05/13 T.9 %REC  1,7 PRV 35,0 SCA  0 REC  0     17,50 |                                |
    ____________________________________________________________________________________|________________________________|______________
    12779770         XXXXXXXXXXXXXXXX / ACCONTO PDCS                                    |                                |
    AFF. SCADUTO ->    5.268,30      827,66      650,00   9   38        2%        49,00 |                49,00           |      49,00
    AFFIDATO ----->               PERIODO DI AFFIDAMENTO 11/04/13 - 08/10/13  DBT       |                                |
    RECUPERATO --->      140,00                                                         |                                |
    LIQ DAL 15/05/13      70,00 14/05/13 T.9 %REC  2,0 PRV 35,0 SCA  0 REC  0     24,50 |                                |
    LIQ DAL 20/05/13      70,00 20/05/13 T.9 %REC  2,0 PRV 35,0 SCA  0 REC  0     24,50 |                                |
    ____________________________________________________________________________________|________________________________|______________
    now, this simple regex here:

    Code:
    ^([0-9]{3,16}) *\*?(.+?)(?=(  | /)).+?AFF\. SCADUTO.+?\| *([^ ]+) *\| *[^ \r\n]+.+?(LIQ.+?_)
    splits correctly the file in chunks, and in the same time produces capture groups 1-5 which are the fields I need for each record.
    I expected the "Regex evaluation" field to perform this duty:

    - split the field in as many rows as there are matches
    - add to each rows the required capturing groups

    I found out that this is not possible, and I have to use the "split field to rows" using something like that
    Code:
    [\r\n]+[_|]+[\r\n]+
    and only then, evaluate each chunk with regex evaluation.
    This is rather unconvenient, as the "split field to rows" does not accept the advanced regex options (multiline, case sensitivity and so on), is hard to test and forces the use of two regexes when one would have been enough.
    What do you guys think?

  2. #2
    Join Date
    Jun 2012
    Posts
    5,534

    Default

    You're welcome to show your expected output inline, but please, put your sample input data in a file and attach it.

    PS: I wouldn't use regular expressions to regain data from this report, but then, I once knew a guy who tried to implement some interactive tool based on Easytrieve (a report generator) - and he succeeded.
    Attached Files Attached Files
    Last edited by marabu; 06-07-2013 at 01:58 PM.
    So long, and thanks for all the fish.

  3. #3
    Join Date
    May 2013
    Posts
    24

    Default

    Marabu, thanks a lot for your example: you've clearly put in some effort to do it, and I really appreciate it.
    I couldn't attach the whole input txt file because it contains real people names, but in the future I'll do as you suggest.
    About your approach, imho it is a clean and "academic" way of solving the problem, but it's also very verbose, much more than a regex of 50like characters.
    Anyway, back to the subject of the thread: how about merging the regex evaluation step with the split field to rows step?

    The transformation would become as simple as that

    1. split the input file in as many rows as there are matches of the regex, adding captured groups to the stream as named fields
    2. possibly split again one of the captured groups (this would deal with the "LIQ. [...]" rows, which appear n times for each row)

    and voila, 3 easy steps, no java code, nothing! Just a couple of regexes.
    They may look intimidating and surely are, but just use regexbuddy and you will make mincemeat of files like this.
    Also note that the structure of the input file is not as simple as the portion I have pasted here, as it is a report meant for printing so records are randomly broken by repeated headers, page breaks, footers...which again could be easily chopped away by another regex
    Last edited by PietroB; 06-10-2013 at 05:18 AM.

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  
Privacy Policy | Legal Notices | Safe Harbor Privacy Policy

Copyright © 2005 - 2019 Hitachi Vantara Corporation. All Rights Reserved.