suggestion: how about adding an ID field in WEKA's Instance class?
A WEKA experiment can be configured to output the Prediction/Target together with IDs. However, the ID field can only be set to one of the attributes in the dataset. This is not useful, because the ID field is usually removed from the dataset before it is used to train a classifier; otherwise, including an ID field in the training data would most likely affect the classifier's performance. I think it would be better to have an ID field in the Instance class, like:
protected String m_ID;
In this way, we separate the ID from the rest of the fields, which are used in the training/testing process, and the output from a WEKA experiment (Prediction/Target/ID) makes more sense. What's your opinion?
Thanks for your effort to maintain this very nice forum.
Have you taken a look at the section in the Weka wiki on how to use ID attributes?
Essentially, you can keep them in your data and use a FilteredClassifier to remove them only for the purposes of learning the model (they remain in the data for outputting along with predictions or for visualization).
the problem is in "ClassifierSplitEvaluator"
The "FilteredClassifier" can solve the ID problem, but I don't think it works for my question.
In "weka.experiment.ClassifierSplitEvaluator", the private member "m_attID" indicates which attribute is output as the ID field in a cross-validation experiment. I couldn't figure out how to bypass this internal mechanism to generate Prediction/Target results together with the ID field. Could you give me a hint? Thanks.
I'm not sure I understand what you are trying to do. The Experimenter always produces summary results (not predictions for individual test instances). The instances that are created by the Experimenter are the summary results computed for test folds, hold-out sets etc. When you select the options for outputting targets, predictions and an ID field, it creates String attributes in the resulting instances that contain a list of IDs or predictions/targets of each instance in the test fold/hold-out set, with each element separated by a "|" character.
Why do you need to "bypass" this mechanism of generating IDs, targets and predictions etc?
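The "|"-separated String attributes described above could be split back into one row per test instance with something like the following. This is only a sketch using plain Java (no WEKA classes); the class name FoldPredictions, the method align and the sample values are made up for illustration, and it assumes the three lists have the same number of elements in the same order.

```java
import java.util.ArrayList;
import java.util.List;

public class FoldPredictions {

    /** Splits "|"-separated ID, prediction and target lists and pairs them up by position. */
    public static List<String[]> align(String ids, String predictions, String targets) {
        // "|" is a regex metacharacter, so it must be escaped; -1 keeps trailing empty fields.
        String[] id = ids.split("\\|", -1);
        String[] pred = predictions.split("\\|", -1);
        String[] target = targets.split("\\|", -1);
        if (id.length != pred.length || id.length != target.length) {
            throw new IllegalArgumentException("ID, prediction and target lists differ in length");
        }
        List<String[]> rows = new ArrayList<>();
        for (int i = 0; i < id.length; i++) {
            rows.add(new String[] {id[i], pred[i], target[i]});
        }
        return rows;
    }

    public static void main(String[] args) {
        // Made-up values standing in for the three String attributes of one result instance.
        for (String[] row : align("p17|p42|p08", "1.0|0.0|1.0", "1.0|1.0|1.0")) {
            System.out.println(String.join("\t", row));
        }
    }
}
```

Each returned row holds the ID, prediction and target of one instance in the test fold, so the ID stays attached to its prediction even though the classifier never saw it.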
If you look at the source code of "ClassifierSplitEvaluator", you will find that the ID field has to be one of the attributes in the dataset (an Instances object). If we remove the ID field before feeding the dataset to a classifier, then "ClassifierSplitEvaluator" cannot get the ID information for each instance in the test fold, which makes the predictions meaningless.
Arghh! I feel like I'm starting to go mad :-) One last time - use the FilteredClassifier in your experiments. The FilteredClassifier was created so that a COPY of the original training data can be modified before being passed to the base classifier. If you use a FilteredClassifier with the filter set to be weka.filters.unsupervised.attribute.Remove (to remove the ID attribute), then the copy of the data passed to the classifier will have the ID removed. HOWEVER, the original data will still have the ID in it - this is the data from which the ClassifierSplitEvaluator will create the String attribute that contains the list of IDs for the current test fold.
Does this handle your situation, or am I still missing something?
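For reference, the FilteredClassifier setup described above might look roughly like this in code. It is a minimal sketch, not taken from the thread: it needs weka.jar on the classpath, the wrapper class IdAwareClassifier and its build method are hypothetical names, J48 is only an example base classifier, and the ID is assumed to be the first attribute.

```java
import weka.classifiers.Classifier;
import weka.classifiers.meta.FilteredClassifier;
import weka.classifiers.trees.J48;
import weka.filters.unsupervised.attribute.Remove;

public class IdAwareClassifier {

    /** Wraps a base classifier so that the ID attribute is dropped only from the copy it learns from. */
    public static FilteredClassifier build(String idAttributeIndex, Classifier base) {
        Remove remove = new Remove();
        // Attribute indices are 1-based range strings, e.g. "1" or "first".
        remove.setAttributeIndices(idAttributeIndex);

        FilteredClassifier fc = new FilteredClassifier();
        fc.setFilter(remove);     // applied to a copy of the data before training/prediction
        fc.setClassifier(base);   // the base learner never sees the ID attribute
        return fc;
    }

    public static void main(String[] args) {
        // Assumption: the ID is the first attribute of the dataset.
        FilteredClassifier fc = build("1", new J48());
        System.out.println(fc.getFilter().getClass().getName());
    }
}
```

The resulting FilteredClassifier is what you would hand to the experiment in place of the bare classifier; the dataset itself keeps its ID attribute, so the ClassifierSplitEvaluator can still read it.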