Hitachi Vantara Pentaho Community Forums
Results 1 to 5 of 5

Thread: kettle-step “Group by” with SQL ?

  1. #1

    Default kettle-step “Group by” with SQL ?


    I extract my data from a file and load it into a MySQL table for following processing.
    Just before the loading into the table I use the kettle- step “Group by”. So I identify multiple records (2x, 3x, 4x…concerning 2 key- fields) and load the records with this additional information.

    So I have in the MySQL table one additional field “rank_Nr” from the kettle “Group by” step (given under: “Line number field name”).

    In the next day I extract again data from a new file and load into the same MySQL table and so on… (the table becomes daily bigger!)

    Usually the records with the same key (multiple records) are from different days (different loading procedures).

    How can I keep my field “rank_Nr” in the MySQL table updated, considering all records- the new and the old (i.e. the just loaded and the yesterday loaded records)?? (I mean in the most performance way; the table has records more then 3-5 GB!)

    keyField_1 keyField_2 otherField_1 rank_Nr loadDay
    A__________B________ X_________ 1_____ 25.03.09
    C__________D________ Y_________ 1_____ 25.03.09
    C__________ D________ Z_________ 2 ____ 25.03.09
    C__________ D________ V_________ 1 ____ 26.03.09
    rank_Nr of the record from 26.03.09 has to be 3 and not 1!

    Thank you in advice!

  2. #2
    Join Date
    Nov 1999


    Ranks 1 and 2 are in the database table.
    Rank 3 is new.

    However, PDI doesn't know about 1 and 2 in the "Group By" step.

    So perhaps you can simply read the rows from the table and add them before the "Group By" step. You should probably also sort them on the keys + load date.

    Something like that.

  3. #3


    If I understand correctly I have:
    1) to read the entire data (all records) from the table
    2) then add to the stream (mix with the new data)
    3) then, sort + Pentaho “Group By” – step
    4) and at the end load the entire data (new and from the table) into the table again (before make the table empty…)

    I hoped, that maybe perhaps in 4) somehow I have to load into the table only the new data (with the correct Rank Nr. =3) ? Do we have a solution?

    One more question:
    It is perhaps possible to solve my problem with SQL??
    I mean something like that: “update myTable set rank_Nr numeration or count rank_Nr where key fields identical…”

    Thank you in advance!

  4. #4
    Join Date
    Nov 1999


    Peter, you're having these problems because you created database tables with aggregated information in them.
    If you would create 2 tables, one with the raw data and then optionally do the ranking exercise your problems would go away.

    You are correct, you don't have to load everything all over again, only the new records.

    There are ranking/analytical operators in certain databases although not in all of them.

  5. #5


    Thank you very match!

    Perhaps instead of using of certain ranking/analytical operators in the database (in my case MySQL), I can also use the Pentaho "Filter rows" - step. So after ranking exercise and before loading, I can filter all the rows with loadDay <> today (i.e. older) to a dummy-step. So I will load only the new data?

    Is that what you mean?
    Would be with a good performance too? (Or using of certain ranking/analytical operators in the database would be more performat?)

    I didn't understand why do I need 2 tables?
    Or maybe it would be impossible to read from a table and to write in the same table in one pentaho transaction? The tables with the aggregated information is about 1-2 million records! But I need the aggregated information.

    Thank you in advance!

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
Privacy Policy | Legal Notices | Safe Harbor Privacy Policy

Copyright © 2005 - 2019 Hitachi Vantara Corporation. All Rights Reserved.