Suggestion: how about adding an ID field in WEKA's Instance class?



haiyeong
06-03-2008, 10:30 AM
Hi there,

In the WEKA Experimenter, the output can be configured to include the Prediction/Target together with IDs. However, the ID field can only be set to one of the attributes in the dataset. This is not useful, because the ID field is usually removed from the dataset before it is used to train a classifier; otherwise, including an ID field in the training data would most likely affect the classifier's performance. I think it would be better to have an ID field in the Instance class, like:

protected String m_ID;

In this way, we separate the ID from the rest of the fields, which are used in the training/testing process, and the output from the WEKA experiment (Prediction/Target/ID) makes more sense. What's your opinion?

Thanks for your effort in maintaining this very nice forum.

--Haiyong Xu

Mark
06-03-2008, 05:43 PM
Hi there,

Have you taken a look at the section in the Weka wiki on how to use ID attributes?

http://weka.sourceforge.net/wekadoc/index.php/en:Troubleshooting#Instance_ID

Essentially, you can keep them in your data and use a FilteredClassifier to remove them only for the purposes of learning the model (they remain in the data for outputting along with predictions or for visualization).

Cheers,
Mark.

haiyeong
06-04-2008, 09:41 PM
Thanks Mark.

The "FilteredClassifier" can solve the ID problem, but I don't think it works for my question.

In "weka.experiment.ClassifierSplitEvaluator", the private member "m_attID" is used to indicate which ID field to output in the cross-validation experiment. In this case, I couldn't figure out how to bypass its internal mechanism to generate results with Prediction/Target together with the ID field. Could you give me a hint? Thanks.

--Haiyong

Mark
06-05-2008, 12:18 AM
I'm not sure I understand what you are trying to do. The Experimenter always produces summary results (not predictions for individual test instances). The instances that are created by the Experimenter are the summary results computed for test folds, hold-out sets etc. When you select the options for outputting targets, predictions and an ID field, it creates String attributes in the resulting instances that contain a list of IDs or predictions/targets of each instance in the test fold/hold-out set, with each element separated by a "|" character.
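To make the "|"-separated format concrete, here is a minimal plain-Java sketch of how the three String values from one result row could be split back into per-instance records. It is not Weka code: the class name FoldOutputParser, the Row type and the sample values in main are all made up for illustration, and it assumes the three lists have the same length and that no ID or label contains a "|" itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Weka.
class FoldOutputParser {

    /** One test instance's ID, target and prediction. */
    static final class Row {
        final String id;
        final String target;
        final String prediction;

        Row(String id, String target, String prediction) {
            this.id = id;
            this.target = target;
            this.prediction = prediction;
        }

        @Override
        public String toString() {
            return id + ": target=" + target + ", predicted=" + prediction;
        }
    }

    /** Splits the three "|"-joined strings and pairs them up position by position. */
    static List<Row> parse(String ids, String targets, String predictions) {
        // "|" is a regex metacharacter, so quote it; -1 keeps trailing empty fields.
        String sep = Pattern.quote("|");
        String[] idParts = ids.split(sep, -1);
        String[] targetParts = targets.split(sep, -1);
        String[] predictionParts = predictions.split(sep, -1);

        if (idParts.length != targetParts.length || idParts.length != predictionParts.length) {
            throw new IllegalArgumentException("The three lists must have the same length");
        }

        List<Row> rows = new ArrayList<>();
        for (int i = 0; i < idParts.length; i++) {
            rows.add(new Row(idParts[i], targetParts[i], predictionParts[i]));
        }
        return rows;
    }

    public static void main(String[] args) {
        // Made-up values standing in for one test fold's output.
        for (Row row : parse("id7|id12|id31", "yes|no|yes", "yes|yes|yes")) {
            System.out.println(row);
        }
    }
}
```

Pairing by position only works because the ID list comes from the same test instances, in the same order, as the targets and predictions.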

Why do you need to "bypass" this mechanism of generating IDs, targets and predictions etc?

haiyeong
06-05-2008, 11:57 AM
If you look at the source code of "ClassifierSplitEvaluator", you will find that the ID field has to be one of the attributes in the dataset (an Instances object). If we remove the ID field before feeding the dataset to a classifier, then "ClassifierSplitEvaluator" cannot get the ID information for each instance in the test fold, which makes the Prediction meaningless.

Mark
06-05-2008, 06:29 PM
Quote (haiyeong): If we remove the ID field before feeding the dataset to a classifier, then "ClassifierSplitEvaluator" cannot get the ID information for each instance in test-fold, which makes the Prediction meaningless.

Arghh! I feel like I'm starting to go mad :-) One last time - use the FilteredClassifier in your experiments. The FilteredClassifier was created so that a COPY of the original training data can be modified before being passed to the base classifier. If you use a FilteredClassifier with the filter set to weka.filters.unsupervised.attribute.Remove (to remove the ID attribute), then the copy of the data passed to the classifier will have the ID removed. HOWEVER, the original data will still have the ID in it - this is the data from which the ClassifierSplitEvaluator will create the String attribute that contains the list of IDs for the current test fold.
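The copy-versus-original distinction can be modelled in a few lines of plain Java. This is only a toy sketch and not Weka code: rows are plain String arrays, and the names FilteredCopyDemo, dropColumn and filteredCopy are hypothetical stand-ins for the Remove filter and for the copy of the data that the base classifier sees.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model of "filter a copy, keep the original"; not Weka's implementation.
class FilteredCopyDemo {

    /** Returns a new row without the column at the given 1-based index; the input row is left untouched. */
    static String[] dropColumn(String[] row, int oneBasedIndex) {
        if (oneBasedIndex < 1 || oneBasedIndex > row.length) {
            throw new IllegalArgumentException("Index must be in 1.." + row.length);
        }
        String[] out = new String[row.length - 1];
        int j = 0;
        for (int i = 0; i < row.length; i++) {
            if (i != oneBasedIndex - 1) {
                out[j++] = row[i];
            }
        }
        return out;
    }

    /** Builds the filtered copy that the base learner would be given. */
    static List<String[]> filteredCopy(List<String[]> data, int idColumn) {
        List<String[]> copy = new ArrayList<>();
        for (String[] row : data) {
            copy.add(dropColumn(row, idColumn));
        }
        return copy;
    }

    public static void main(String[] args) {
        // Made-up rows: ID, one feature, class label.
        List<String[]> original = new ArrayList<>();
        original.add(new String[] {"id1", "sunny", "yes"});
        original.add(new String[] {"id2", "rainy", "no"});

        List<String[]> seenByLearner = filteredCopy(original, 1);

        System.out.println(Arrays.toString(seenByLearner.get(0))); // prints [sunny, yes]
        System.out.println(Arrays.toString(original.get(0)));      // prints [id1, sunny, yes]
    }
}
```

The evaluator in this picture reads IDs from the original list, while the learner only ever sees the filtered copy.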

Does this handle your situation, or am I still missing something?

Cheers,
Mark

haiyeong
06-06-2008, 10:35 AM
Arghh! I feel like I'm starting to go mad, too :-) I have tried the "FilteredClassifier", and it does not work at all. The following is the output from the command line:

$ java -cp ../classes/ weka.classifiers.meta.FilteredClassifier -F "weka.filters.unsupervised.attribute.Remove -R 0" -W weka.classifiers.trees.J48 -t weather.arff -T test.arff
java.lang.IllegalArgumentException: Invalid range list at 0
at weka.core.Range.setFlags(Range.java:316)
at weka.core.Range.setUpper(Range.java:88)
at weka.filters.unsupervised.attribute.Remove.setInputFormat(Remove.java:197)
at weka.classifiers.meta.FilteredClassifier.buildClassifier(FilteredClassifier.java:388)
at weka.classifiers.Evaluation.evaluateModel(Evaluation.java:977)
at weka.classifiers.Classifier.runClassifier(Classifier.java:295)
at weka.classifiers.meta.FilteredClassifier.main(FilteredClassifier.java:469)



This is the output using Weka's Experimenter. It triggered the same exception.

CrossValidationResultProducer: setting additional measures for split evaluator
java.lang.IllegalArgumentException: Invalid range list at 0
weka.core.Range.setFlags(Unknown Source)
weka.core.Range.setUpper(Unknown Source)
weka.filters.unsupervised.attribute.Remove.setInputFormat(Unknown Source)
weka.classifiers.meta.FilteredClassifier.buildClassifier(Unknown Source)
weka.experiment.ClassifierSplitEvaluator.getResult(Unknown Source)
weka.experiment.CrossValidationResultProducer.doRun(Unknown Source)
weka.experiment.Experiment.nextIteration(Unknown Source)
weka.gui.experiment.RunPanel$ExperimentRunner.run(Unknown Source)

at weka.core.Range.setFlags(Unknown Source)
at weka.core.Range.setUpper(Unknown Source)
at weka.filters.unsupervised.attribute.Remove.setInputFormat(Unknown Source)
at weka.classifiers.meta.FilteredClassifier.buildClassifier(Unknown Source)
at weka.experiment.ClassifierSplitEvaluator.getResult(Unknown Source)
at weka.experiment.CrossValidationResultProducer.doRun(Unknown Source)
at weka.experiment.Experiment.nextIteration(Unknown Source)
at weka.gui.experiment.RunPanel$ExperimentRunner.run(Unknown Source)
Done...


Has anybody successfully used the "FilteredClassifier" before, and if so, how? Thanks.

The FilteredClassifier I used is the one in WEKA version 3.5.7.

Mark
06-06-2008, 05:08 PM
There is nothing wrong with the FilteredClassifier. The Remove filter uses attribute indexes starting from 1, not 0.

You can also use "first" and "last" (without the surrounding quotes) to refer to the first and last attributes respectively.
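As a rough illustration of the indexing convention (and of why "-R 0" is rejected), here is a small plain-Java sketch that turns a 1-based index, first or last into a 0-based array position. It is only a model of the rule described above, not the code in weka.core.Range; the class and method names are hypothetical, and the error messages are my own.

```java
// Hypothetical model of 1-based attribute indexing; not Weka's implementation.
class AttributeIndex {

    /**
     * Converts "first", "last" or a 1-based number into a 0-based position
     * for a dataset with numAttributes attributes.
     */
    static int toZeroBased(String spec, int numAttributes) {
        if (numAttributes < 1) {
            throw new IllegalArgumentException("Dataset has no attributes");
        }
        String s = spec.trim();
        if (s.equals("first")) {
            return 0;
        }
        if (s.equals("last")) {
            return numAttributes - 1;
        }
        int oneBased;
        try {
            oneBased = Integer.parseInt(s);
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("Not an attribute index: " + spec, e);
        }
        // Valid indexes run from 1 to numAttributes, so 0 is out of range.
        if (oneBased < 1 || oneBased > numAttributes) {
            throw new IllegalArgumentException(
                "Index " + spec + " is outside 1.." + numAttributes);
        }
        return oneBased - 1;
    }

    public static void main(String[] args) {
        int numAttributes = 5; // assumed size, for illustration only
        System.out.println(toZeroBased("1", numAttributes));    // prints 0
        System.out.println(toZeroBased("last", numAttributes)); // prints 4
        try {
            toZeroBased("0", numAttributes);
        } catch (IllegalArgumentException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

So in the command above, "-R 1" (or "-R first") is what removes the first attribute.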

Cheers,
Mark.

haiyeong
06-06-2008, 11:01 PM
Now it works. And it works for my case very well. Thanks so much.

May I know the reason for *not* including an ID field in the class "Instance", rather than introducing this "FilteredClassifier"? What's the design consideration?

Mark
06-07-2008, 12:10 AM
The FilteredClassifier is a general purpose meta classifier and not just for dealing with ID attributes. It allows you to apply any filter for preprocessing data before it gets to the base classifier. It allows for the "proper" application of supervised filtering techniques (such as discretization) with a classifier under cross-validation, so that the learning scheme doesn't have access to information gleaned from test folds (as compared to first applying discretization to a data set, and then performing cross-validation with a learner).

Whilst it is certainly possible to have an ID field in the Instance class, it would involve touching a lot of classes in Weka to achieve the same functionality as we have at the moment with ID attributes. Using a standard Attribute for an ID field means that we can seamlessly use it in existing Weka visualization classes (scatterplot and scatterplot matrix). Furthermore, you can do sophisticated filtering on IDs by using weka.filters.unsupervised.instance.SubsetByExpression.

Cheers,
Mark.