Hitachi Vantara Pentaho Community Forums

Thread: how to process a non-structured file

  1. #1

    Default how to process a non-structured file

    Hi

    I'm trying to process a bunch of text files from a RADIUS server. Here is a sample:

    Mon Apr 21 00:00:13 2014
    Acct-Status-Type = Interim-Update
    NAS-Port-Type = Wireless-802.11
    User-Name = "user@name.com"
    NAS-Port = 2149596816
    Acct-Session-Id = "80203e90"
    Event-Timestamp = "Apr 21 2014 00:00:13 UTC"
    Acct-Input-Octets = 2745995
    Acct-Output-Octets = 17889908
    Acct-Input-Gigawords = 0
    Acct-Output-Gigawords = 0
    Acct-Input-Packets = 19376
    Acct-Output-Packets = 20912
    Acct-Session-Time = 7022
    Timestamp = 1398038413

    Mon Apr 21 00:00:14 2014
    Acct-Status-Type = stop
    NAS-Port-Type = Wireless-802.11
    User-Name = "user@name.com"
    NAS-Port = 2149596816
    Acct-Session-Id = "80267e90"
    Event-Timestamp = "Apr 21 2014 00:00:13 UTC"
    Acct-Input-Octets = 2746795
    Acct-Output-Octets = 17885408
    Acct-Input-Gigawords = 0
    Acct-Output-Gigawords = 0
    Acct-Input-Packets = 19345
    Acct-Output-Packets = 23342
    Acct-Session-Time = 70
    Timestamp = 1345668413



    My problem is that a chunk sometimes has 18 lines of text, sometimes 14... and the chunks don't always contain the same fields.


    Any hint would be much appreciated.

    Thanks in advance

  2. #2
    Join Date
    Apr 2008
    Posts
    1,771

    Default

    I think it all depends on what you want to retain from the original text file.
    I would process one file at a time, maybe filtering the files by how many rows (fields) they have and creating different streams.
    -- Mick --

  3. #3
    Join Date
    Nov 2008
    Posts
    777

    Default

    The Row denormaliser step is the answer. You can even specify a value to insert if the field is missing.
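    For readers unfamiliar with the step, the pivot it performs can be illustrated in plain Python (a sketch, not PDI code; the field list is taken from the sample data in the first post):

    ```python
    # Rough Python analogue of the Row denormaliser step: pivot the
    # key/value rows of one group into a single record, substituting a
    # default value for any field that is missing from the group.

    FIELDS = ["Acct-Status-Type", "User-Name", "Acct-Session-Id",
              "Acct-Input-Octets", "Acct-Output-Octets"]

    def denormalise(rows, missing=None):
        """rows: list of (key, value) pairs belonging to ONE group."""
        present = dict(rows)
        return {field: present.get(field, missing) for field in FIELDS}

    record = denormalise([("Acct-Status-Type", "stop"),
                          ("User-Name", "user@name.com")])
    # Acct-Session-Id and the octet counters come back as the default.
    ```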
    Attached Files
    pdi-ce-4.4.0-stable
    Java 1.7 (64 bit)
    MySQL 5.6 (64 bit)
    Windows 7 (64 bit)

  4. #4

    Default

    Hi Darrell

    I'm afraid I did not make myself clear. I do have a bunch of files, but every file is full of sets like the one in my first post, so that structure is repeated many times INSIDE each file. Let's call it a "set of data".
    Sometimes there are a thousand "sets" per file. Every set has more or less the same fields, but this can vary per occurrence within a file: sometimes 14 fields per set, sometimes 18.

    Take into account that Acct-Session-Id is supposed to be a unique identifier per set, so you can use it to flag every record.

    My problem is:
    1.- how to split every set (I guess the best option is to use the empty line as the splitter)
    2.- how to insert every value into the proper field (and NULL if the field does not exist), cycling through the sets until the end of the file
    3.- how to run it for every file.


    Thanks in advance
    Last edited by Joselitux; 04-29-2014 at 03:01 PM.

  5. #5
    Join Date
    Nov 2008
    Posts
    777

    Default

    I see. The key to solving the problem is finding the "edge", i.e., identifying the boundaries between the sets of key-value pairs. I have attached a modified version of my sample transformation that, I think, provides a solution to your new requirements. Here is a summary of the changes:
    1. The text file input step just reads text files matching the regex pattern one-after-another and outputs a continuous stream of rows. It doesn't matter if a file has multiple sets of data or not.
    2. I inserted a User Defined Java Expression step to find the edge. From your sample data, the first key in each group of data is "Acct-Status-Type" so I used that for detection of the edge. You mentioned an empty line (which may or may not be very reliable when going from file to file) but whatever the logic is, just set the new_group_flag field to 1 when the edge is detected and 0 otherwise.
    3. The Group by step performs a Cumulative Sum on the new_group_flag field. Thus, the group number is incremented by 1 at each edge.
    4. The incrementing group field is now used as the "Group field" in the Row denormaliser step. Everything else performs as before.
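    Steps 2 and 3 (the edge flag plus the cumulative sum) can be illustrated outside PDI like this (a sketch of the logic only; "Acct-Status-Type" as the edge marker comes straight from the sample data):

    ```python
    def number_groups(rows):
        """rows: (key, value) pairs in stream order, possibly spanning
        many sets. Flags each edge row with new_group_flag = 1 and keeps
        a running (cumulative) sum of the flag as the group number."""
        group, out = 0, []
        for key, value in rows:
            new_group_flag = 1 if key == "Acct-Status-Type" else 0
            group += new_group_flag          # cumulative sum = group number
            out.append((group, key, value))
        return out
    ```

    Every row now carries a group number, so a denormaliser keyed on that field rebuilds one record per set.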
    Attached Files
    pdi-ce-4.4.0-stable
    Java 1.7 (64 bit)
    MySQL 5.6 (64 bit)
    Windows 7 (64 bit)

  6. #6

    Default

    Thanks Darrell

    I will try your suggestions and keep you posted.


    Regards

  7. #7

    Default

    Hi Darrell


    I've got an error on processing the Event-Timestamp field.

    Couldn't convert string [Apr 21 2014 00:00:13 UTC] to a date using format [MMM dd yyyy HH:mm:ss z]

    any suggestion?


    Regards
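    One locale-independent workaround (a hedged sketch in Python, not PDI configuration) is to strip the trailing zone abbreviation before parsing and re-attach UTC afterwards, since zone-name parsing with a mask like MMM dd yyyy HH:mm:ss z is where such conversions usually break:

    ```python
    from datetime import datetime, timezone

    def parse_event_timestamp(raw):
        """Parse a value like 'Apr 21 2014 00:00:13 UTC' by splitting off
        the zone abbreviation and parsing the rest with an explicit mask."""
        stamp, _, zone = raw.rpartition(" ")
        if zone != "UTC":
            raise ValueError("unexpected time zone: " + zone)
        return datetime.strptime(stamp, "%b %d %Y %H:%M:%S").replace(
            tzinfo=timezone.utc)
    ```

    Note that %b is itself locale-dependent, so this assumes an English locale, which matches the sample data.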

  8. #8
    Join Date
    Nov 2008
    Posts
    777

    Default

    I remember having a little trouble with that conversion but your sample data currently runs fine with my latest transformation. What version of Spoon are you running? What version of Java?
    pdi-ce-4.4.0-stable
    Java 1.7 (64 bit)
    MySQL 5.6 (64 bit)
    Windows 7 (64 bit)

  9. #9

    Default

    Hi Darrell

    I'm using PDI 4.4.0 stable and Java 1.6, but don't worry: I've processed that field inside the database, so no prior data conversion is needed.

    I'm still wondering what the trick would be to use the empty line as the "edge" (the system admin told me he can give me just one file instead of a bunch of them).


    thanks

  10. #10
    Join Date
    Nov 2008
    Posts
    777

    Default

    Another version is attached. This one will detect an edge at either a blank line or a new file. In addition, duplicate blank lines are handled properly. I added another input file to test the new detection logic.

    Here is a summary of the latest changes:
    1. The Text file input step now passes through both the short filename and all blank lines to aid in edge detection.
    2. An Analytic Query step has been added to look up the filename at the previous row to aid in edge detection.
    3. In the UDJE step, an edge is now detected if either the key field is null (blank line) or the filename has changed (multiple files are still allowed).
    4. A Filter rows step has been added after the Group by step to remove the blank lines.
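    The combined edge test in step 3 amounts to the following (a Python sketch of the logic only; the row shape is illustrative):

    ```python
    def flag_edges(rows):
        """rows: (filename, key) per input row, in stream order.
        An edge is flagged when the row is blank (empty key) or the
        filename differs from the previous row's; duplicate blank lines
        just create empty groups that the later filter discards."""
        flags, prev_file = [], None
        for filename, key in rows:
            flags.append(1 if (not key or filename != prev_file) else 0)
            prev_file = filename
        return flags
    ```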
    Attached Files
    pdi-ce-4.4.0-stable
    Java 1.7 (64 bit)
    MySQL 5.6 (64 bit)
    Windows 7 (64 bit)


Copyright © 2005 - 2019 Hitachi Vantara Corporation. All Rights Reserved.