Hitachi Vantara Pentaho Community Forums
Results 1 to 7 of 7

Thread: Newbie looking for PDI assistance: Regex string split & 2 other issues

  1. #1
    Join Date
    Dec 2016
    Posts
    27

    Default Newbie looking for PDI assistance: Regex string split & 2 other issues

    Hi,

    I've discovered PDI/Spoon for the first time this weekend and I like it. I'd like to use this tool to replace a who load of my crappy vba transformation and parsing code. The workflow, labeling and visibility of the effects of each step in PDI are something I really admire.

    I've been replicating my work on one of the "easy" Excel files from the 60 or so I transformed and normalised for a recent project (The purpose was to clean up files before putting them in a homogenized, consolidated, reconciled database). I made quite good progress in a day today using PDI/Spoon butI have 3 requirements I'd like some guidance on in terms of how to replicate using Spoon.

    1. Split a string field based on the first occurrence of a space i.,e ' '
    I think need to go to school regarding Regex. I've skirted this a few times in the last few years but I probably need to actually understand it now. I tried using "replace in String", Y" to use regex & '/[^ ]*/' in search but I'm not having any luck. What should I be doing differently?

    2. "Decumulation"
    I've invented a new word - sorry about that. I've got a few files from this project which show for a collection of accounts, balances that are non-zero cumulative instead of "normal" incremental. I'd instead like them to be "normal" incremental. I tried exporting to a H2 table and then importing from the same H2 table to get to "execute SQL script" but since then discovered the execute SQL is not designed for data manipulation. That was my "get out of jail" move... So I need another way to strip out the cumulative effect from a time series of numbers. How else can I do this?

    To make this question harder- for some I received certain account lines had cumulative data Except that in between some months the account value fell to zero (sweeps,manual entries etc). So I would want to say "take the current month value for x account away from the last non-zero occurrence". Does this take me into javascript, and can it be done in spoon?

    3. "Pushdown"/"Pushup"
    Sorry - another made up word. Client accounting data is basically pretty ****ty. A pattern I saw in more than one of these files was that there are certain lines so important you want to push these values down (in a new field) until you hit another value you know about, and then push down that. For instance you have heading "Revenues" and you want to tag the small accounts that sit below this as belonging to "Revenues". The next important line item you get to is "Cost of Goods Sold" and you then want to push this value down in the same new field until you reach say "R&D Expenses". I use the word "Pushup" to describe the scenario where those key values in the file from the client sit below the line items you want to tag - meaning you start from the bottom and work up. Basically you want to use one or the other approaches for any given file like that.


    Any assistance appreciated


    Regards,


    Andy
    Last edited by mek; 12-18-2016 at 07:33 PM.

  2. #2
    Join Date
    Jun 2012
    Posts
    5,534

    Default

    You have a string containing a list of words separated by a space character?
    Without knowing the valid characters for the first word I would use the first pattern:

    Name:  216424.png
Views: 214
Size:  21.3 KB

    You have input rows where repeating column values are suppressed?
    There's a field option "Repeat" you can use to reconstruct the values.

    For your "decumulation" problem I would need a data sample to understand.
    So long, and thanks for all the fish.

  3. #3
    Join Date
    May 2016
    Posts
    282

    Default

    Hi Andy,
    For your first requirement take a look to the IndexOf formula in the Modified Java Script Value step (in the Scripting steps). On the left side, if you extend Transform functions -> String Functions -> indexOf (there are two, I think you only need the first one with two variables) right click on the function name and select Sample to see how it works.
    You can get a more general idea on how to use the Modified Java Script Value step taking a look at the samples directory of your PDI installation, there are sample transformations and jobs very useful when you are starting to use PDI.
    Regards

  4. #4
    Join Date
    May 2016
    Posts
    282

    Default

    For the rest of your requirements, they are too general to give you a quick answer, for your second requirement I would take a look to the Analytic Query Step in Statistics steps, the LAG function would give you the value of the previous row in your actual row and then you can get the difference between both values.
    For your third requirement, sounds like a question recently asked in this forum, take a look to the threads last week, someone reading an excel file asked that same question.
    Regards

  5. #5
    Join Date
    Jun 2012
    Posts
    5,534

    Default

    You want to extract the first word of a list of words separated by a space character?

    Name:  216424.png
Views: 179
Size:  21.3 KB

    In your input rows repeating column values are suppressed?
    You can use field option "Repeat" to reconstruct values.

    For "Decumulation" you should provide a data sample.

    PS: Funny - I made this same post over 8 hours ago but it waits for approval by a moderator ...
    So long, and thanks for all the fish.

  6. #6
    Join Date
    May 2016
    Posts
    282

    Default

    Nice marabu, I haven't used RegEx much, so I tend to circumvent it with other solutions, I'll keep your answer in case I need something similar in the future.

  7. #7
    Join Date
    Dec 2016
    Posts
    27

    Default

    Thanks Marabu/Ana,

    I was was able to extract this data with the guidance you provided. The group by step provides a host of count labeling options which I found most useful. I may have a follow-up query though which relates to the cumulate/decumulate objective I described but I think I know how to go about this now, again indirectly by using the group by function.

    Regards,


    Andy

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  
Privacy Policy | Legal Notices | Safe Harbor Privacy Policy

Copyright © 2005 - 2019 Hitachi Vantara Corporation. All Rights Reserved.