Market Forecasting Model (Mark take a look at this)



perryrico
06-09-2009, 07:58 AM
Hi Mark,

I am Perry. I was impressed with Mark's marksmanship in Weka. As such, I would like to share my data mining headache and seek advice from my fellow forum members, especially Mark. I do not know if my question is too simple or too complicated.

Objective: Predict the likelihood that a lead will respond to a given marketing campaign. My class is simply YES or NO, the outcome recorded in my historical data.

My Solution: I've developed models in the Explorer using the following algorithms: J48, LMT, and MultilayerPerceptron.

Question:
1) Have I used the most appropriate models for the Objective?
2) I used the Explorer to develop the model, but based on most forum threads in Pentaho, people seem to use the Experimenter. I do not have any idea about the Experimenter.
- What is the Experimenter in a nutshell anyway?
- What am I missing if I do not use the Experimenter?
- Could you give me a link where I could start learning the Experimenter and its concepts?
3) My historical dataset for training the model has an imbalanced class distribution. Let's say 90% responded NO to the marketing and 10% responded YES. How do I improve the accuracy of my model?

P.S. I will greatly appreciate any advice on my headache. Thank you very much in advance.

Mark
06-09-2009, 07:19 PM
Hi Perry,

Data mining applications are usually experimental in nature, so I can't say for your data what the best method to use would be. You have done some evaluation with the Explorer (probably using cross-validation) so you might have a rough idea as to which method, out of the ones you've been looking at, performs the best. This, of course, is heavily dependent on which metric you choose to look at for evaluation (more on that later).

Whilst a single run of cross-validation gives you a rough idea as to how methods compare and how they will perform on future data, it doesn't necessarily tell you whether method A is really better than method B (in a statistical sense). This is where the Experimenter comes in. It allows you to apply a set of different learning algorithms (or perhaps the same learning algorithm but with different parameter settings) to one or more data sets. Typically multiple runs of 10-fold cross-validation are performed (the de facto practice in machine learning is 10 x 10-fold cross-validation). The Experimenter automatically collects the evaluation statistics computed over the folds of cross-validation and presents the results in a table-like format, complete with statistical significance testing (i.e. paired t-test results).
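
As a rough illustration of what the Experimenter automates, here is a minimal sketch of repeated 10-fold cross-validation using Weka's Java API. The file name leads.arff is hypothetical; note that the Experimenter additionally runs the corrected paired t-test across these runs for you, which this sketch does not.

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import java.util.Random;

public class RepeatedCV {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("leads.arff"); // hypothetical data set
        data.setClassIndex(data.numAttributes() - 1);   // class is the last attribute
        // 10 runs of 10-fold cross-validation, each with a different random seed
        for (int run = 1; run <= 10; run++) {
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new J48(), data, 10, new Random(run));
            System.out.println("Run " + run + ": " + eval.pctCorrect() + "% correct");
        }
    }
}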

There is an entire chapter on using the Experimenter in the WekaManual.pdf that came with your Weka distribution (3.6.x or 3.7.x). Older versions of Weka came with an Experimenter tutorial.

Learning and evaluating when the class distribution is heavily skewed is a whole topic in itself. I can't really do it justice here in a few paragraphs. I'd suggest reading the Data Mining book by Witten and Frank that was written to accompany the Weka software. In your case, the class distribution is not extremely skewed - 10% in the minority (YES) class is not too bad. Hence, I'd guess that the tree learners that you've applied have actually learned a tree structure (rather than just a single node that predicts the majority class). In skewed class distribution cases there are several approaches to take:

1) Create training data that balances the class distribution through oversampling or undersampling (see the first sketch after this list). This is particularly useful for tree/rule learners.

2) If you know the misclassification costs involved in your application, use Weka's CostSensitiveClassifier in cost-sensitive learning mode (for rule/tree learners) or in minimum expected cost prediction mode for other learners (i.e. ones that produce good probability estimates). See the second sketch after this list.

3) Choose methods that produce good probability estimates (logistic regression, LMT, bagged unpruned decision trees, etc.) and focus on the ranking performance of the methods. That is, which methods tend to rank true positives higher than false positives when predictions are ordered by the probability of the true (YES) class. The Area Under the ROC Curve (AUC) is a good summary metric to look at here when you do not necessarily know the costs involved in misclassification. See the third sketch after this list.
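
For option 1), a minimal sketch of balancing the training data with Weka's supervised instance filters (again, the file name leads.arff is hypothetical):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.instance.Resample;
import weka.filters.supervised.instance.SpreadSubsample;

public class BalanceClasses {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("leads.arff"); // hypothetical data set
        data.setClassIndex(data.numAttributes() - 1);

        // Oversample (with replacement) towards a uniform class distribution
        Resample over = new Resample();
        over.setBiasToUniformClass(1.0); // 1.0 = fully uniform distribution
        over.setInputFormat(data);
        Instances balancedUp = Filter.useFilter(data, over);

        // Or undersample the majority class down to a 1:1 ratio
        SpreadSubsample under = new SpreadSubsample();
        under.setDistributionSpread(1.0); // maximum class ratio of 1:1
        under.setInputFormat(data);
        Instances balancedDown = Filter.useFilter(data, under);
    }
}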
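
For option 2), a sketch of wrapping J48 in CostSensitiveClassifier. The cost values here are made up purely for illustration, and which row/column corresponds to YES depends on the order of the class labels in your ARFF file:

import weka.classifiers.CostMatrix;
import weka.classifiers.meta.CostSensitiveClassifier;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CostSensitive {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("leads.arff"); // hypothetical data set
        data.setClassIndex(data.numAttributes() - 1);

        // 2x2 cost matrix with entries (true class, predicted class).
        // Illustrative costs: one kind of mistake is 5x worse than the other.
        CostMatrix costs = new CostMatrix(2);
        costs.setCell(0, 1, 5.0); // true class 0 predicted as class 1
        costs.setCell(1, 0, 1.0); // true class 1 predicted as class 0

        CostSensitiveClassifier csc = new CostSensitiveClassifier();
        csc.setClassifier(new J48());
        csc.setCostMatrix(costs);
        // false = cost-sensitive learning (reweights the training data);
        // true  = minimum expected cost prediction
        csc.setMinimizeExpectedCost(false);
        csc.buildClassifier(data);
    }
}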
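
For option 3), a sketch that cross-validates bagged unpruned J48 trees and reports the AUC for the YES class (leads.arff and the class label "YES" are assumptions about your data):

import weka.classifiers.Evaluation;
import weka.classifiers.meta.Bagging;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import java.util.Random;

public class RankingAUC {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("leads.arff"); // hypothetical data set
        data.setClassIndex(data.numAttributes() - 1);

        // Bagged unpruned trees tend to produce better probability estimates
        J48 unpruned = new J48();
        unpruned.setUnpruned(true);
        Bagging bagger = new Bagging();
        bagger.setClassifier(unpruned);

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(bagger, data, 10, new Random(1));
        int yesIndex = data.classAttribute().indexOfValue("YES");
        System.out.println("AUC for YES: " + eval.areaUnderROC(yesIndex));
    }
}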

Hope this helps.

Cheers,
Mark.

perryrico
06-09-2009, 10:20 PM
Wow, thanks Mark. I found your response very informative. I have just learned that there are a lot of things yet to consider in prediction.