Newbie request

07-02-2007, 03:52 PM
I am an experienced computer user and programmer but am new to data mining. Right now I'm trying to work out whether a particular package will do what I'm after. Please let me know whether you think Pentaho and its family members would help me with the following scenario. It's a professional services firm situation, but I've translated it into a product catalog analogy, which is probably easier for outsiders to understand.

I have a product catalog with about 20000 items. Each of these has a launch date when it is first available for purchase, but sales often don't begin for at least a year. Then there's a period of activity, maybe several periods, and eventually the product is withdrawn and goes off the catalog. For each product I can use my existing software to map sales volume against time, but what I really want is to analyse all 20000 items together and come up with some sort of average sales vs. time profile. Even better would be to split that modelling by product category to see if specific categories share underlying patterns.

Is this 20000-way parallel analysis the sort of thing that Pentaho could do for me, or am I in the wrong ballpark?
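
To make the question concrete, here is a minimal Python sketch of the averaging described above. It is not Pentaho or Weka code. It assumes sales are available as (product, month, units) records and that each product has a known launch month and category; the function name, the record layout and the toy data are all made up for illustration. Each product's sales are re-indexed by months since launch and then averaged per category. One simplification to be aware of: every product in a category counts towards the average at every age, including products that are too new or already withdrawn.

```python
from collections import defaultdict

def average_profiles(sales, launch_month, category):
    """Average units sold per product, by months since launch, for each category.

    sales: iterable of (product_id, month_index, units) records (assumed layout)
    launch_month: dict product_id -> month index of launch
    category: dict product_id -> category name
    """
    totals = defaultdict(lambda: defaultdict(float))  # category -> age -> units
    for product_id, month, units in sales:
        age = month - launch_month[product_id]  # align every product on its launch
        if age >= 0:
            totals[category[product_id]][age] += units

    # Number of products per category (simplification: all count at every age).
    counts = defaultdict(int)
    for product_id in launch_month:
        counts[category[product_id]] += 1

    profiles = {}
    for cat, by_age in totals.items():
        horizon = max(by_age) + 1
        profiles[cat] = [by_age.get(age, 0.0) / counts[cat] for age in range(horizon)]
    return profiles

# Hypothetical toy data: three products in two categories.
launch_month = {"A": 10, "B": 20, "C": 5}
category = {"A": "tools", "B": "tools", "C": "toys"}
sales = [("A", 12, 4.0), ("A", 13, 6.0), ("B", 22, 2.0), ("B", 23, 10.0), ("C", 5, 3.0)]

profiles = average_profiles(sales, launch_month, category)
for cat, profile in sorted(profiles.items()):
    print(cat, profile)
```

With 20000 products the same loop is a single pass over the sales records, so the volume itself should not be an obstacle for this step.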

07-16-2007, 06:28 PM

Analysis of relationships between product categories over time sounds like the most interesting scenario to investigate. How many product categories are there? If there are at most a few hundred product categories, then once average sales volumes have been computed for each category it should be possible to apply a regression scheme from Weka (such as linear regression or model trees) to find a relationship between one product category (the target) and all the others. Of course, this has to be repeated for each product category (i.e. treating each in turn as the target) if you are interested in finding a descriptive pattern for each.
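
As a rough illustration of the "each category in turn as the target" loop, here is a small sketch in plain Python rather than Weka. It swaps in ordinary least squares solved through the normal equations in place of Weka's regression schemes, and it assumes the per-category average sales have already been computed as equal-length series. The function names and the toy numbers are hypothetical. Strongly collinear categories would make the system singular and would need something more robust.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def fit_each_category(series):
    """Fit one linear model per category, using all other categories as inputs.

    series: dict category -> list of average sales per period (equal lengths)
    Returns dict target -> {"intercept": ..., other_category: coefficient, ...}
    """
    names = sorted(series)
    periods = len(series[names[0]])
    models = {}
    for target in names:
        others = [n for n in names if n != target]
        rows = [[1.0] + [series[n][t] for n in others] for t in range(periods)]
        y = series[target]
        k = len(others) + 1
        # Normal equations: (X^T X) w = X^T y
        xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
        xty = [sum(r[i] * y[t] for t, r in enumerate(rows)) for i in range(k)]
        models[target] = dict(zip(["intercept"] + others, solve(xtx, xty)))
    return models

# Hypothetical toy data, built so that music = 1 + 2*books + games exactly.
series = {
    "books": [1.0, 2.0, 3.0, 4.0, 5.0],
    "games": [2.0, 1.0, 4.0, 3.0, 6.0],
    "music": [5.0, 6.0, 11.0, 12.0, 17.0],
}
models = fit_each_category(series)
for target, coefficients in models.items():
    print(target, coefficients)
```

With a few hundred categories this fits a few hundred small models, one per target, which matches the repeated application described above.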

Another option might be to try market basket analysis (although this will require discretizing the sales volumes into a small number of bins for each product category).
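
To show what the discretizing step could look like, here is a minimal sketch in plain Python, not Weka's own association rule learners. It assumes equal-width binning into three levels and treats each time period as one "basket" of items such as "tools=high". It then only looks at pairs of items, which is a simplification of full market basket analysis. The function names, thresholds and toy data are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations

def discretize(values, labels=("low", "mid", "high")):
    """Equal-width binning of a numeric series into len(labels) bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / len(labels) or 1.0  # avoid a zero width for flat series
    return [labels[min(int((v - lo) / width), len(labels) - 1)] for v in values]

def pair_rules(series, min_support=0.4, min_confidence=0.8):
    """Find rules "lhs -> rhs" between binned category levels that co-occur.

    series: dict category -> list of sales volumes per period (equal lengths)
    Returns a list of (lhs, rhs, support, confidence) tuples.
    """
    binned = {cat: discretize(vals) for cat, vals in series.items()}
    periods = len(next(iter(binned.values())))
    # One basket per period, holding one "category=level" item per category.
    baskets = [frozenset(f"{cat}={binned[cat][t]}" for cat in binned)
               for t in range(periods)]
    item_count = Counter(item for basket in baskets for item in basket)
    pair_count = Counter(pair for basket in baskets
                         for pair in combinations(sorted(basket), 2))
    rules = []
    for (x, y), n in pair_count.items():
        support = n / periods
        if support < min_support:
            continue
        for lhs, rhs in ((x, y), (y, x)):
            confidence = n / item_count[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return rules

# Hypothetical toy data: two categories that rise and fall together.
volumes = {"tools": [1, 2, 9, 10, 9], "toys": [0, 1, 8, 9, 10]}
for lhs, rhs, support, confidence in pair_rules(volumes):
    print(f"{lhs} -> {rhs} (support={support:.2f}, confidence={confidence:.2f})")
```

The number of bins matters: too few hides structure, while too many leaves each item with too little support to produce rules.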