Hitachi Vantara Pentaho Community Forums
Results 1 to 9 of 9

Thread: Extracting data from a headings in a text file

  1. #1
    Join Date
    Nov 2016
    Posts
    6

    Default Extracting data from a headings in a text file

    Hello,

    I am trying to extract data from a file in the following format into a tabular form, with the text in red repeated for each record beneath it in blue, until the next red value (where it is repeated for that section. The number of blue rows varies for each section.
    There are some other text lines (green) that can be discarded

    I have played around with regular expressions to try and separate and repeat the red part- but have reached a bit of a wall
    If someone can help, I would be very grateful. I'm ok splitting up the blue text into columns- it is the getting the red part out and repeating that has me stumped

    thank you in advance

    Andrew

    Input
    CHAINAGE 790
    ---------------------
    POINT -----X----- -----Y------ ----Z----- ---OFFSET-- -LABEL CUT-
    1 331178.782 6245798.563 -39.117 -10.305 EH11
    2 331178.958 6245798.408 -38.190 -10.070 WB11
    3 331178.958 6245798.408 -35.610 -10.070 WT11
    4 331179.083 6245798.297 -39.109 -9.903 C711
    5 331179.134 6245798.252 -38.289 -9.835 C411


    CHAINAGE 791
    ---------------------
    POINT -----X----- -----Y------ ----Z----- ---OFFSET-- -LABEL CUT-
    1 331179.440 6245799.308 -39.113 -10.305 EH11
    2 331179.616 6245799.152 -38.185 -10.070 WB11
    3 331179.616 6245799.152 -35.605 -10.070 WT11
    4 331179.741 6245799.042 -39.105 -9.903 C711
    5 331179.792 6245798.997 -38.285 -9.835 C411
    6 331179.905 6245798.897 -38.285 -9.685 C311
    7 331179.950 6245798.857 -38.846 -9.625 C211


    Desired output
    790 1 331178.782 6245798.563 -39.117 -10.305 EH11
    790 2 331178.958 6245798.408 -38.190 -10.070 WB11
    790 3 331178.958 6245798.408 -35.610 -10.070 WT11
    790 4 331179.083 6245798.297 -39.109 -9.903 C711
    790 5 331179.134 6245798.252 -38.289 -9.835 C411
    791 1 331179.440 6245799.308 -39.113 -10.305 EH11
    791 2 331179.616 6245799.152 -38.185 -10.070 WB11
    791 3 331179.616 6245799.152 -35.605 -10.070 WT11
    791 4 331179.741 6245799.042 -39.105 -9.903 C711
    791 5 331179.792 6245798.997 -38.285 -9.835 C411
    791 6 331179.905 6245798.897 -38.285 -9.685 C311
    791 7 331179.950 6245798.857 -38.846 -9.625 C211

  2. #2
    Join Date
    Nov 2009
    Posts
    685

    Default

    Check if the line starts with CHAINAGE and send that line to a seperate step. Then remove the text part CHAINAGE so you only have the red number. Then join "Cartesion join" that step with the rest of the line

  3. #3
    Join Date
    Aug 2016
    Posts
    279

    Default

    This is the sort of thing which is easy and fun to do in java (and much faster than Cartesian join which must wait for all rows before producing output).

    Example:
    String filePath = "C:\\test.txt";
    BufferedReader reader;
    try {
    reader = new BufferedReader(new FileReader(filePath));
    String line = reader.readLine();
    String header = null;
    while (line != null)
    {
    // read next line
    if(line.startsWith("CHAINAGE "))
    {
    header = line.substring("CHAINAGE ".length(), line.length());
    }
    else if(header != null && line.length() > 0)
    {
    char c = line.charAt(0);
    if(Character.isDigit(c))
    {
    System.out.println(header + " " + line);
    }
    }
    line = reader.readLine();
    }
    reader.close();
    } catch (IOException e) {
    e.printStackTrace();
    }
    Just remember the imports:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    Output is exactly as you specified:
    790 1 331178.782 6245798.563 -39.117 -10.305 EH11
    790 2 331178.958 6245798.408 -38.190 -10.070 WB11
    790 3 331178.958 6245798.408 -35.610 -10.070 WT11
    790 4 331179.083 6245798.297 -39.109 -9.903 C711
    790 5 331179.134 6245798.252 -38.289 -9.835 C411
    791 1 331179.440 6245799.308 -39.113 -10.305 EH11
    791 2 331179.616 6245799.152 -38.185 -10.070 WB11
    791 3 331179.616 6245799.152 -35.605 -10.070 WT11
    791 4 331179.741 6245799.042 -39.105 -9.903 C711
    791 5 331179.792 6245798.997 -38.285 -9.835 C411
    791 6 331179.905 6245798.897 -38.285 -9.685 C311
    791 7 331179.950 6245798.857 -38.846 -9.625 C211
    Last edited by Sparkles; 02-25-2019 at 08:41 AM.

  4. #4
    Join Date
    Apr 2008
    Posts
    4,671

    Default

    I know that there's an example on this forum somewhere by a poster by the name of marabu.

    It uses a trinary operator looking for a line start flag, and sets a column value to 1 or 0 based on that start.
    Then uses a running total function to create group flags.
    Last edited by gutlez; 02-26-2019 at 12:33 PM.

  5. #5
    Join Date
    Nov 2009
    Posts
    685

  6. #6
    Join Date
    Apr 2008
    Posts
    4,671

    Default

    Quote Originally Posted by johanhammink View Post
    Nope... Not that one.
    That is how to join two lines into one.

  7. #7
    Join Date
    Apr 2008
    Posts
    4,671

    Default

    This is similar to what marabu posted, but recreated for this thread.

    KTR File:
    Code:
    <?xml version="1.0" encoding="UTF-8"?><transformation>
      <info>
        <name>Demo</name>
        <description/>
        <extended_description/>
        <trans_version/>
        <trans_type>Normal</trans_type>
        <directory>/</directory>
        <parameters>
        </parameters>
        <size_rowset>10000</size_rowset>
        <sleep_time_empty>50</sleep_time_empty>
        <sleep_time_full>50</sleep_time_full>
        <unique_connections>N</unique_connections>
        <feedback_shown>Y</feedback_shown>
        <feedback_size>50000</feedback_size>
        <using_thread_priorities>Y</using_thread_priorities>
        <shared_objects_file/>
        <capture_step_performance>N</capture_step_performance>
        <step_performance_capturing_delay>1000</step_performance_capturing_delay>
        <step_performance_capturing_size_limit>100</step_performance_capturing_size_limit>
        <created_user>-</created_user>
        <created_date>2019/02/26 11:47:44.979</created_date>
        <modified_user>-</modified_user>
        <modified_date>2019/02/26 11:47:44.979</modified_date>
        <key_for_session_key/>
        <is_key_private>N</is_key_private>
      </info>
      <notepads>
      </notepads>
      <order>
        <hop>
          <from>Data grid</from>
          <to>Filter rows</to>
          <enabled>Y</enabled>
        </hop>
        <hop>
          <from>Filter rows</from>
          <to>User defined Java expression</to>
          <enabled>Y</enabled>
        </hop>
        <hop>
          <from>User defined Java expression</from>
          <to>Group by</to>
          <enabled>Y</enabled>
        </hop>
      </order>
      <step>
        <name>Data grid</name>
        <type>DataGrid</type>
        <description/>
        <distribute>Y</distribute>
        <custom_distribution/>
        <copies>1</copies>
        <partitioning>
          <method>none</method>
          <schema_name/>
        </partitioning>
        <fields>
          <field>
            <name>InputLine</name>
            <type>String</type>
            <format/>
            <currency/>
            <decimal/>
            <group/>
            <length>-1</length>
            <precision>-1</precision>
            <set_empty_string>N</set_empty_string>
          </field>
        </fields>
        <data>
          <line>
            <item>CHAINAGE 790</item>
          </line>
          <line>
            <item>---------------------</item>
          </line>
          <line>
            <item>POINT -----X----- -----Y------ ----Z----- ---OFFSET-- -LABEL CUT-</item>
          </line>
          <line>
            <item>1 331178.782 6245798.563 -39.117 -10.305 EH11</item>
          </line>
          <line>
            <item>2 331178.958 6245798.408 -38.190 -10.070 WB11</item>
          </line>
          <line>
            <item>3 331178.958 6245798.408 -35.610 -10.070 WT11</item>
          </line>
          <line>
            <item>4 331179.083 6245798.297 -39.109 -9.903 C711</item>
          </line>
          <line>
            <item>5 331179.134 6245798.252 -38.289 -9.835 C411</item>
          </line>
          <line>
            <item/>
          </line>
          <line>
            <item>CHAINAGE 791</item>
          </line>
          <line>
            <item>---------------------</item>
          </line>
          <line>
            <item>POINT -----X----- -----Y------ ----Z----- ---OFFSET-- -LABEL CUT-</item>
          </line>
          <line>
            <item>1 331179.440 6245799.308 -39.113 -10.305 EH11</item>
          </line>
          <line>
            <item>2 331179.616 6245799.152 -38.185 -10.070 WB11</item>
          </line>
          <line>
            <item>3 331179.616 6245799.152 -35.605 -10.070 WT11</item>
          </line>
          <line>
            <item>4 331179.741 6245799.042 -39.105 -9.903 C711</item>
          </line>
          <line>
            <item>5 331179.792 6245798.997 -38.285 -9.835 C411</item>
          </line>
          <line>
            <item>6 331179.905 6245798.897 -38.285 -9.685 C311</item>
          </line>
          <line>
            <item>7 331179.950 6245798.857 -38.846 -9.625 C211</item>
          </line>
        </data>
        <attributes/>
        <cluster_schema/>
        <remotesteps>
          <input>
          </input>
          <output>
          </output>
        </remotesteps>
        <GUI>
          <xloc>144</xloc>
          <yloc>96</yloc>
          <draw>Y</draw>
        </GUI>
      </step>
      <step>
        <name>User defined Java expression</name>
        <type>Janino</type>
        <description/>
        <distribute>Y</distribute>
        <custom_distribution/>
        <copies>1</copies>
        <partitioning>
          <method>none</method>
          <schema_name/>
        </partitioning>
        <formula>
          <field_name>CHAINAGE</field_name>
          <formula_string>InputLine.startsWith("CHAINAGE") ? 1 : 0</formula_string>
          <value_type>Integer</value_type>
          <value_length>-1</value_length>
          <value_precision>-1</value_precision>
          <replace_field/>
        </formula>
        <attributes/>
        <cluster_schema/>
        <remotesteps>
          <input>
          </input>
          <output>
          </output>
        </remotesteps>
        <GUI>
          <xloc>416</xloc>
          <yloc>96</yloc>
          <draw>Y</draw>
        </GUI>
      </step>
      <step>
        <name>Filter rows</name>
        <type>FilterRows</type>
        <description/>
        <distribute>Y</distribute>
        <custom_distribution/>
        <copies>1</copies>
        <partitioning>
          <method>none</method>
          <schema_name/>
        </partitioning>
        <send_true_to/>
        <send_false_to/>
        <compare>
          <condition>
            <negated>N</negated>
            <leftvalue>InputLine</leftvalue>
            <function>IS NOT NULL</function>
            <rightvalue/>
          </condition>
        </compare>
        <attributes/>
        <cluster_schema/>
        <remotesteps>
          <input>
          </input>
          <output>
          </output>
        </remotesteps>
        <GUI>
          <xloc>272</xloc>
          <yloc>96</yloc>
          <draw>Y</draw>
        </GUI>
      </step>
      <step>
        <name>Group by</name>
        <type>GroupBy</type>
        <description/>
        <distribute>Y</distribute>
        <custom_distribution/>
        <copies>1</copies>
        <partitioning>
          <method>none</method>
          <schema_name/>
        </partitioning>
        <all_rows>Y</all_rows>
        <ignore_aggregate>N</ignore_aggregate>
        <field_ignore/>
        <directory>%%java.io.tmpdir%%</directory>
        <prefix>grp</prefix>
        <add_linenr>Y</add_linenr>
        <linenr_fieldname>Line</linenr_fieldname>
        <give_back_row>N</give_back_row>
        <group>
          </group>
        <fields>
          <field>
            <aggregate>CHAINGROUP</aggregate>
            <subject>CHAINAGE</subject>
            <type>CUM_SUM</type>
            <valuefield/>
          </field>
        </fields>
        <attributes/>
        <cluster_schema/>
        <remotesteps>
          <input>
          </input>
          <output>
          </output>
        </remotesteps>
        <GUI>
          <xloc>544</xloc>
          <yloc>96</yloc>
          <draw>Y</draw>
        </GUI>
      </step>
      <step_error_handling>
      </step_error_handling>
      <slave-step-copy-partition-distribution>
      </slave-step-copy-partition-distribution>
      <slave_transformation>N</slave_transformation>
      <attributes/>
    </transformation>

  8. #8
    Join Date
    May 2016
    Posts
    270

    Default

    That's a creative way to upload a ktr when we are not more allowed to attach files
    OS: Ubuntu 16.04 64 bits
    Java: Openjdk 1.8.0_131
    Pentaho 6.1 CE

  9. #9
    Join Date
    Nov 2016
    Posts
    6

    Default

    Hello

    thank you for all the answers and comments.
    I ended up using the marabu approach, as per @gutlez example

    The key part is that it uses a java expression to set a flag, then groups by that flag
    Code:
    InputLine.startsWith("CHAINAGE") ? 1 : 0

    thanks again,


    Andrew

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  
Privacy Policy | Legal Notices | Safe Harbor Privacy Policy

Copyright © 2005 - 2017 Pentaho Corporation. All Rights Reserved.