Hitachi Vantara Pentaho Community Forums
Results 1 to 6 of 6

Thread: How to organize the data in a excel file?!

  1. #1
    Join Date
    Nov 2015
    Posts
    3

    Question How to organize the data in a excel file?!

    In a text file I get this:
    seller; make; model; year; condition; price;
    professional; audi; tt; 2000; used; 5000€;
    seller; make; model; year; condition; price; fuel;
    private; bmw; 118d; 2010; used; 15000€; diesel;
    seller; make; model; year; price; fuel; kilometres;
    private; audi; a3; 2007; 7500€; gas; 188 000km;
    ...

    and I try to get this result, in output excel file:
    | seller ----------| make | model | year | condition | price -----| fuel ----| Kilometres |
    | professional | audi ---| tt ------| 2000 | used ------| 5000€ ---| ----------| -----------------|
    | private --------| bmw --| 118d -| 2010 | used ------| 15000€ -| diesel -| -----------------|
    | private --------| audi ---| a3 ----| 2007 | --------------| 7500€ ---| gas ----| 188 000km |
    ...

    I've tried some components but without success ... Does anyone know if there is any component that does this transformation?!!
    Last edited by zemateus; 11-27-2015 at 10:02 AM.

  2. #2
    Join Date
    Apr 2008
    Posts
    4,696

    Default

    You're going to have to do a handful of different things to make this work.

    1) Number all the lines (this can be done in your Text File Input)
    2) Calculate the "group" of header and data (Round(Row#/2))
    3) Break the lines in the group into numbered columns (Group:1,Column:1,Line:1="seller", Group:1,Column:1,Line:2="professional")
    3a) If you import each row as one column, you can use "Split Field to Rows" to break the lines into one field per row
    3b) Use the "Value Changing Sequence" to reset your column numbers for each row of data
    4) Create a Key/Value pair based on Group and column <Strike>(The "Group By" step will help here)</strike> The Denormalize step will help here.
    5) Denormalize Key/Value pairs based on Group and known expected columns

    Try to build out what I'm suggesting above, and share what you've built - the community will help you learn how to make it work.
    If the above doesn't make sense, reply, and I'll use your example data to show what is happening in each step.
    Last edited by gutlez; 11-27-2015 at 09:14 PM. Reason: Correction on Step 4

  3. #3
    Join Date
    Nov 2015
    Posts
    3

    Thumbs up

    Quote Originally Posted by gutlez View Post
    You're going to have to do a handful of different things to make this work.

    1) Number all the lines (this can be done in your Text File Input)
    2) Calculate the "group" of header and data (Round(Row#/2))
    3) Break the lines in the group into numbered columns (Group:1,Column:1,Line:1="seller", Group:1,Column:1,Line:2="professional")
    3a) If you import each row as one column, you can use "Split Field to Rows" to break the lines into one field per row
    3b) Use the "Value Changing Sequence" to reset your column numbers for each row of data
    4) Create a Key/Value pair based on Group and column (The "Group By" step will help here)
    5) Denormalize Key/Value pairs based on Group and known expected columns

    Try to build out what I'm suggesting above, and share what you've built - the community will help you learn how to make it work.
    If the above doesn't make sense, reply, and I'll use your example data to show what is happening in each step.


    Hi gutlez,
    Thanks a lot for you answer and to try to help me.

    Following your instructions I built a scheme and I got stuck in the second step. "2) Calculate the "group" of header and data (Round(Row#/2))"?
    I don't know if i understand very well,... Can you check what I'm doing wrong?! If put all the steps that are needed?!!
    Attached Files Attached Files

  4. #4
    Join Date
    Apr 2008
    Posts
    4,696

    Default

    You've done really well!

    Change the encoding on your Text File Input to UTF8, or you'll get really frustrated that Matrícula won't get read correctly. Also, change the field separator to something that will not appear in your input, so that we can play with the data a little bit. You can read a whole line as one field, and then use other steps to fix it later. This is sometimes a very useful technique.

    I probably worded things badly, and something fell apart in translation.
    You have Row numbers turned on, which is Step 1 of my suggestion.
    Next, you have to calculate which rows go together.
    I also used a calculator step for this:
    Code:
    New field Calculation Field A Field B Field C Value type Length Precision Remove Conversion mask Decimal symbol Grouping symbol Currency symbol
    Two Set field to constant value A 2   Integer   Y    
    One Set field to constant value A 1   Integer   Y    
    RowGroup A + B nblines One  Integer   Y    
    Group A / B RowGroup Two  Integer   N
    So now you have data that looks something like:
    Code:
    Content | nblines | Group
    Anunciante; Matrícula; Portas; Combustível; Quilómetros; Ano; Modelo; Condição;| 1 | 1
    Particular ; 40-21-VG ; 4-5 ; Diesel ; 180.000 ; 2.003 ; A4 ; Usado ;| 2 | 1
    Anunciante; Portas; Combustível; Quilómetros; Ano;| 3 | 2
    Particular ; 1-3 ; Diesel ; 181.000 ; 1.998 ;| 4 | 2
    Here is where I added a Value Changing Sequence to identify header rows and data rows.

    Code:
    Content | nblines | Group | Header
    Anunciante; Matrícula; Portas; Combustível; Quilómetros; Ano; Modelo; Condição; |1 | 1| 1
    Particular ; 40-21-VG ; 4-5 ; Diesel ; 180.000 ; 2.003 ; A4 ; Usado ; |2 | 1| 2
    Anunciante; Portas; Combustível; Quilómetros; Ano; |3 | 2| 1
    Particular ; 1-3 ; Diesel ; 181.000 ; 1.998 ; |4 | 2| 2
    Anunciante; Portas; Combustível; Quilómetros; Ano; Modelo; Condição; |5 | 3| 1
    Particular ; 4-5 ; Diesel ; 158.000 ; 2.005 ; 206 ; Usado ; |6 | 3| 2
    If you then split fields to rows using the ; as a separator, and turn on the "Include Rownum in Output" option you're almost ready to denormalize into Key/Value pairs.
    Code:
    Content | nblines | Group | Header | Data | ColumnNum
    Anunciante; Matrícula; Portas; Combustível; Quilómetros; Ano; Modelo; Condição; |1 | 1| 1| Anunciante| 1
    Anunciante; Matrícula; Portas; Combustível; Quilómetros; Ano; Modelo; Condição; |1 | 1| 1| Matrícula| 2
    Anunciante; Matrícula; Portas; Combustível; Quilómetros; Ano; Modelo; Condição; |1 | 1| 1| Portas| 3
    Anunciante; Matrícula; Portas; Combustível; Quilómetros; Ano; Modelo; Condição; |1 | 1| 1| Combustível| 4
    Anunciante; Matrícula; Portas; Combustível; Quilómetros; Ano; Modelo; Condição; |1 | 1| 1| Quilómetros| 5
    Anunciante; Matrícula; Portas; Combustível; Quilómetros; Ano; Modelo; Condição; |1 | 1| 1| Ano| 6
    Anunciante; Matrícula; Portas; Combustível; Quilómetros; Ano; Modelo; Condição; |1 | 1| 1| Modelo| 7
    Anunciante; Matrícula; Portas; Combustível; Quilómetros; Ano; Modelo; Condição; |1 | 1| 1| Condição| 8
    How about I leave you here for the moment, and you can build what I've outlined in this post, and then we can work further?

    Final Output:
    Code:
    Anunciante | Matrícula | Portas | Combustível | Quilómetros | Ano | Modelo | Condição
    Particular | 40-21-VG | 4-5| Diesel | 180.000 | 2.003 | A4 | Usado
    Particular || 1-3 | Diesel | 181.000 | 1.998 ||  
    Particular || 4-5 | Diesel | 158.000 | 2.005 | 206|  Usado
    Last edited by gutlez; 11-27-2015 at 09:27 PM.

  5. #5
    Join Date
    Nov 2015
    Posts
    3

    Thumbs up

    Hi gutlez,
    one more time, thanks for the news instructions you gave me.





    And once again I followed the directions ... and I think that I got the intended results of these steps. After that, I tried hard to get the final result that I want. But without success. If you don't mind, I need more a little help. And if you want to check I have added the file with the new changes.
    Attached Files Attached Files

  6. #6
    Join Date
    Apr 2008
    Posts
    4,696

    Default

    Try turning on the "Reset RowNum at each input row"

    I didn't use any group by steps in the version I did.

    Text File Input
    Calculator
    Value Changing Sequence
    Split Fields to Rows
    String Operations (trim out any extra spaces!)
    Sort Rows (Group, Column)
    Denormalize (Group on Group, Column. Key Field=Header. 1=Key, 2=Value)
    Denormalize (Group on Group. Key Field=Key. List all known fields)
    Excel Output

    My Final Denormalizer:
    Code:
    Target fieldname Value fieldname Key value Type Format Length Precision Currency Decimal Group Null if Aggregation
    Anunciante Value Anunciante String        -
    Matrícula Value Matrícula String        -
    Portas Value Portas String        -
    Combustível Value Combustível String        -
    Quilómetros Value Quilómetros String        -
    Ano Value Ano String        -
    Modelo Value Modelo String        -
    Condição Value Condição String        -

Tags for this Thread

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  
Privacy Policy | Legal Notices | Safe Harbor Privacy Policy

Copyright © 2005 - 2019 Hitachi Vantara Corporation. All Rights Reserved.