View Full Version : Group by Step not very useful?

12-14-2005, 12:53 AM
I am trying to build a transformation which looks up all invoice positions for several customers in our ERP database. After sorting the data by customer id, I wanted to use the "Group by" step to insert a summary line for each customer. The problem is, that the "Group by" step only delivers the fields defined in "The fields that make up the group" into the output, so only the customer id will be delivered, even if I select "Include all rows". All the other fields get lost... What am I doing wrong?

12-14-2005, 01:52 AM

You're not doing anything wrong.
The group by functionality is the same as on any database: you get the grouping fields + the grouping function results in the same row:

SELECT fieldA, fieldB, sum(fieldC)
FROM table1
GROUP BY fieldA, fieldB

I wanted to avoid having to cache possibly large amounts of rows on the client in order to achieve the result you are describing. Because you see, in that case we can only start sending rows to the next step if all rows of a single group have been processed.

But, I know what you want to do. You want to attach the sum to every line so that you can calculate percentages etc. How about re-attaching the sum to the original rows with a StreamLookup?
That way you get the same functionality.
Make sure to do this in 2 transformations and do the caching in a relational database.

Or you can use a reporting tool like Excel, Business Objects, Cognos, ... that will do the trick for you.
However, they only work with data in memory, they don't stream and as such limit the size of the documents that can be processed. The scenario of milions of rows per group is unworkable for these tools, even if there would be only a couple of groups.

I'll tell you what, you're not the only one who asked for it, so even though I think it's a reporting functionality I'll try to add it. The caching will be limited to 5000 rows in memory, after that the rows will go to temp space (configurable, the same as Sort).
Please check the following tracker to see if implementation is finished (probably by the end of the week ;-))

Change Request - [# 1355] GroupBy: cache rows so result can be added to original row.

12-14-2005, 05:36 AM
Thanks for your answer. I see that you don't want to cache all rows in memory. The only problem with it is the fact that if you only return the grouping fields, the result is not very useful especially when specifying "Include all rows". The solution to make it similar to the Sort step sounds good so I am looking forward to it :-)

12-14-2005, 12:03 PM
Test-code is available, grab the kettle.jar and try it out.