
Selecting Attributes with Weka



PuneetK
03-25-2009, 09:27 AM
Hello, hope you can help me.

I have two sets with 250 attributes each: a separate training set (1600 instances) and a test set (400 instances). I want to select the 10-50 attributes that achieve the best accuracy. Which attribute evaluator and search method combination would you recommend to find the best attributes and achieve the best accuracy in my case?


Can I somehow consider the training set and the test set separately in the attribute selection mode?
(My first trials with Wrapper and BestFirst, with 10-fold cross-validation on the training set and also with "Use full training set" as the attribute selection mode, achieved no good results.)

Thank you.

tdidomenico
03-25-2009, 11:21 AM
You will probably want to use the "Ranker" search method, combined with one of the attribute evaluators available for that method. "Ranker" will give you a list of the attributes, ordered by their score according to the evaluator.
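
If you end up doing this from code rather than the Explorer, a minimal sketch (assuming your data is already in an ARFF file called train.arff, which is just a placeholder name, and that the class is the last attribute) would look something like this:

import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RankAttributes {
    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("train.arff");   // placeholder file name
        train.setClassIndex(train.numAttributes() - 1);

        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new InfoGainAttributeEval()); // scores each attribute individually
        selector.setSearch(new Ranker());                   // orders attributes by that score
        selector.SelectAttributes(train);

        System.out.println(selector.toResultsString());    // ranked list with merit values
    }
}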

Cheers!

PuneetK
03-25-2009, 12:02 PM
Thank you for your answer. I want to try the following setup:

Attribute Evaluator: ClassifierSubsetEval with JRip as the classifier and the test set as the hold-out file

Search Method: RankSearch with WrapperSubsetEval as the attribute evaluator, using JRip as the classifier

Attribute Selection Mode: Use full training set

Is this a correct/acceptable choice for my problem, or did I misunderstand some of the options?

Mark
03-25-2009, 04:50 PM
Hi PuneetK,

These setups sound fine to me. I have just one cautionary message - be careful with the use of your holdout set. Is this data going to be used eventually to test the performance of your final classifier after feature selection has been performed (i.e. is it the test set that you mention in your first message)? If so, then the problem with this approach is that the attribute selection procedure will have seen the test data (i.e. the search for the best set of attributes will have been guided by performance on this data) and your final classifier performance could be overly optimistic. The best approach is to use a third data set (validation set) for guiding the search that is separate from the final test set. You can achieve this by splitting the training set into two separate data sets.
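
As a rough sketch of what I mean by splitting (the file name and split percentage are just placeholders, and you'd want to shuffle with a fixed seed so the split is reproducible):

import java.util.Random;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SplitTraining {
    public static void main(String[] args) throws Exception {
        Instances all = DataSource.read("train.arff");            // placeholder file name
        all.setClassIndex(all.numAttributes() - 1);
        all.randomize(new Random(1));                             // shuffle before splitting

        int valSize = (int) Math.round(all.numInstances() * 0.25); // e.g. 25% for validation
        Instances validation = new Instances(all, 0, valSize);
        Instances train = new Instances(all, valSize, all.numInstances() - valSize);

        // Search for attributes on 'train', guide the search with 'validation',
        // and keep the original test set untouched for the final evaluation.
        System.out.println("Train: " + train.numInstances()
            + ", validation: " + validation.numInstances());
    }
}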

Cheers,
Mark.

PuneetK
04-06-2009, 08:47 AM
Dear Mark and tdidomenico,

I've tried Ranker + InfoGain (deleting the 10-15 worst attributes at a time, a kind of backward elimination, and checking the results with some classifiers), and I've tried my settings from the last post, and got no good results in either case. Maybe you can recommend something else? I need something that checks the accuracy contribution of each attribute and adds it to the attribute list only if it improves accuracy. That sounds exactly like a wrapper/WrapperSubsetEval, but with which options? (e.g. which search method should I prefer, which classifier, which attribute selection mode, and which additional options?)

Thank you for your help!

Mark
04-06-2009, 05:36 PM
Hi PuneetK,

Did the RankSearch with Wrapper + base classifier not give a reasonable result? Was it worse (significantly) than not doing feature selection? Is the goal to improve accuracy, or to not degrade accuracy by using a smaller subset of the original features?

The main reason that RankSearch is a good choice is that you have a fairly large number of attributes. Other search methods are quadratic in the number of attributes, which, when combined with the Wrapper and a base classifier, leads to at least a cubic runtime. Another search method to try is LinearForwardSelection:

http://wiki.pentaho.com/display/DATAMINING/LinearForwardSelection

This method is an extension of standard forward selection/best first that allows for either a fixed set (i.e. select no more than n features) or a fixed width (consider only adding a feature from the top n ranked features to the current subset at each step) approach to be used. Both these options result in a faster search than standard forward selection (they give similar and sometimes better results due to less overfitting).
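
For reference, a RankSearch + Wrapper setup along these lines could be coded roughly as follows (a sketch only: the file name is a placeholder, InfoGain produces the ranking inside RankSearch, and the Wrapper with JRip evaluates subsets taken from that ranking):

import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.RankSearch;
import weka.attributeSelection.WrapperSubsetEval;
import weka.classifiers.rules.JRip;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class WrapperRankSearch {
    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("train.arff");     // placeholder file name
        train.setClassIndex(train.numAttributes() - 1);

        // Wrapper evaluation: score candidate subsets by JRip's cross-validated accuracy.
        WrapperSubsetEval wrapper = new WrapperSubsetEval();
        wrapper.setClassifier(new JRip());
        wrapper.setFolds(5);

        // RankSearch ranks the attributes first, then evaluates subsets of increasing size.
        RankSearch search = new RankSearch();
        search.setAttributeEvaluator(new InfoGainAttributeEval());

        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(wrapper);
        selector.setSearch(search);
        selector.SelectAttributes(train);
        System.out.println(selector.toResultsString());
    }
}

If LinearForwardSelection is available in your Weka installation, it can be swapped in for RankSearch in the same way.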

Cheers,
Mark.

PuneetK
04-07-2009, 06:32 AM
Thank you Mark, I'll try with Ranksearch and LinearForwardSelection.

My goal is to improve prediction accuracy on a separate (supplied) test dataset. This test set has fewer instances than the training set, and they are partly different ones.

Maybe my problem is that I misinterpret the results.

First I used (as I already wrote earlier):
Attribute Evaluator: ClassifierSubsetEval with JRip as the classifier and the test set as the hold-out file

and Search Method: RankSearch with WrapperSubsetEval as the attribute evaluator, using JRip as the classifier

Attribute Selection Mode: Use full training set
--
As a result I got 105 attributes. What does this mean? Are these the optimal 105 attributes when testing with JRip on this dataset? How can I interpret this result with respect to my separate training and test sets?
----
Then I used a Ranker with InfoGain, got the results immediately, and tried to delete attributes and measure the accuracy (after every 10 deleted) with 5 different classifiers. There was no improvement in accuracy (I deleted attributes down to 10; all results stayed near 50%).
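
(For what it's worth, a sketch of that kind of check in code, assuming separate train/test ARFF files, JRip as the classifier, and purely illustrative attribute indices to remove:)

import weka.classifiers.Evaluation;
import weka.classifiers.rules.JRip;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class RemoveAndEvaluate {
    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("train.arff");  // placeholder
        Instances test = DataSource.read("test.arff");    // placeholder
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        // Drop the worst-ranked attributes (1-based indices, purely illustrative).
        Remove remove = new Remove();
        remove.setAttributeIndices("5,12,27");
        remove.setInputFormat(train);
        Instances trainReduced = Filter.useFilter(train, remove);
        Instances testReduced = Filter.useFilter(test, remove);

        // Train on the reduced training set and measure accuracy on the reduced test set.
        JRip jrip = new JRip();
        jrip.buildClassifier(trainReduced);
        Evaluation eval = new Evaluation(trainReduced);
        eval.evaluateModel(jrip, testReduced);
        System.out.println(eval.toSummaryString());
    }
}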

I ran the Ranker with 10-fold cross-validation as the attribute selection mode, because there is no option to choose a hold-out set there. I don't know whether it makes a great difference (maybe I should choose Full training set instead).


Best regards,
PuneetK

Mark
04-08-2009, 07:12 PM
First of all let me clarify your data set situation :-)

Do you have three sets of data (training, holdout and testing)? If you only have two sets of data, and the testing data is used as the holdout set for your first configuration, then the results will be tuned to your test set.

Your second setup (using RankSearch and the Wrapper) is OK since it only uses the training data to perform feature selection.

Which of the two setups produced the 105 attribute result? If it was the first setup, then this is the best 105 attributes with respect to the performance of JRip on the holdout set. If it was the second setup, then this is the best 105 attributes from the ranked list chosen by cross-validation on the training data with respect to the performance of JRip. Which attribute evaluator was used with RankSearch to produce the ranking in this setup?

The easiest way to evaluate a bunch of different configurations is to make use of Weka's Experimenter. Here you can specify a set of algorithms to apply to a set of data sets. The Experimenter can display the results produced by repeated cross-validation or repeated holdout testing along with t-tests for significant differences in the results. Weka has an AttributeSelectedClassifier (in the meta package) that allows you to specify a base classifier and an attribute selection scheme to use. You can set up multiple copies of this classifier (each with different parameter settings) in order to compare the performance of different attribute selection schemes and parameter settings.
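
As a sketch of the AttributeSelectedClassifier idea in code (the evaluator, search method and parameters here are just examples, not a recommendation):

import java.util.Random;
import weka.attributeSelection.BestFirst;
import weka.attributeSelection.CfsSubsetEval;
import weka.classifiers.Evaluation;
import weka.classifiers.meta.AttributeSelectedClassifier;
import weka.classifiers.rules.JRip;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class AttributeSelectedDemo {
    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("train.arff");   // placeholder file name
        train.setClassIndex(train.numAttributes() - 1);

        // Attribute selection is performed inside the classifier, on training data only.
        AttributeSelectedClassifier asc = new AttributeSelectedClassifier();
        asc.setClassifier(new JRip());
        asc.setEvaluator(new CfsSubsetEval());  // swap in any subset evaluator here
        asc.setSearch(new BestFirst());

        // 10-fold cross-validation: selection is repeated on each training fold,
        // so the held-out fold never influences which attributes get chosen.
        Evaluation eval = new Evaluation(train);
        eval.crossValidateModel(asc, train, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}

Setting up several copies of this in the Experimenter, each with different evaluator/search settings, then lets you compare them with proper significance tests.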

You said that your separate test set is somewhat different from the training set. This is OK as long as the basic assumption that both training and testing data come from the same underlying distribution holds (otherwise generalization is not possible).

Hope this helps.

Cheers,
Mark.

PuneetK
04-09-2009, 05:29 AM
Thank you Mark, yes this helps. I'll also try the Experimenter.

1. I have only two different sets (training and test)
---
2. No, this was one setup, not two; I used all of these as sub-options, which I assume was pretty wrong.
This setup produced 105 attributes as its result:
As attribute evaluator I used: ClassifierSubsetEval with JRip as the classifier and the test set as the hold-out file

and as search method I used: RankSearch with WrapperSubsetEval as the attribute evaluator, using JRip as the classifier

Attribute Selection Mode: Use full training set
--
I don't know whether I can choose a hold-out set with the Wrapper somehow; I suppose this is only possible with ClassifierSubsetEval as the attribute evaluator.
In my newer experiments I use InfoGain as the attribute evaluator with RankSearch; I hope this is a good choice.

And a last question (sorry :)):
If I choose a hold-out set, what should I select as the attribute selection mode in my case: Cross-validation or Use full training set? I can't picture how 10-fold CV works with a separate training set plus a hold-out/test set.

PuneetK
04-14-2009, 05:09 AM
Yes, it worked now with the hold-out set. As Mark said, the result is overly optimistic: I got 7-10 attributes, and with them the accuracy was highest. Of course this is overfitting, so I should think about another solution... Thank you for your help.

Serge29
09-08-2009, 02:37 PM
Sorry, a quick question on this topic: what is the difference between Linear Forward Selection and simple best-first forward selection? Thank you!

Mark
09-08-2009, 04:33 PM
Hi Serge,

Best first is essentially a beam search. That is, forward selection with limited backtracking. Linear forward selection is an extension of forward selection/best first that speeds up the process by considering only a subset of the total number of remaining attributes at each iteration of the search. Basically, it ranks the attributes individually before the search begins and then either:

1) considers only k of the top ranked attributes and performs a standard forward selection on this top k (fixed set), or

2) in each iteration of the forward selection considers adding to the best subset only attributes from the top k. In this mode, as an attribute is added to the final subset from the top k ranked attributes, k is increased by 1 by adding the next best attribute from the complete ranked list (fixed width).

Both of these approaches result in a significant speed-up when combined with a Wrapper type evaluation (i.e. using a classifier to evaluate the worth of an attribute subset) compared to standard forward selection/best first.
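
If it helps, here is a toy sketch (plain Java, not the actual Weka implementation) of the fixed-width idea, with a made-up merit() function standing in for a wrapper evaluation:

import java.util.*;

public class FixedWidthSketch {

    // Pretend attributes 2, 5 and 7 are the truly useful ones (purely illustrative).
    static final Set<Integer> RELEVANT = new HashSet<>(Arrays.asList(2, 5, 7));

    // Stand-in for a wrapper evaluation such as cross-validated accuracy.
    static double merit(Set<Integer> subset) {
        long hits = subset.stream().filter(RELEVANT::contains).count();
        return hits - 0.01 * subset.size();   // small penalty for larger subsets
    }

    public static void main(String[] args) {
        // Attributes ranked individually before the search starts.
        List<Integer> ranking = Arrays.asList(2, 9, 5, 1, 7, 0, 3, 4, 6, 8);
        int pool = 3;                          // initial width k
        Set<Integer> current = new HashSet<>();
        double best = merit(current);

        boolean improved = true;
        while (improved) {
            improved = false;
            int bestAttr = -1;
            // Candidates are only the top 'pool' ranked attributes.
            for (int a : ranking.subList(0, pool)) {
                if (current.contains(a)) continue;
                Set<Integer> trial = new HashSet<>(current);
                trial.add(a);
                double m = merit(trial);
                if (m > best) { best = m; bestAttr = a; improved = true; }
            }
            if (improved) {
                current.add(bestAttr);
                // Fixed width: extend the candidate pool by one ranked attribute per addition.
                pool = Math.min(pool + 1, ranking.size());
            }
        }
        System.out.println("Selected: " + current + "  merit=" + best);
    }
}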

For more information, see this paper:

http://wiki.pentaho.com/download/attachments/3801465/guetlein_et_al.pdf?version=1

Cheers,
Mark.

anastasovskigoce
04-03-2012, 08:27 PM
Hi all,

I'd like to ask Mark one question.

When I select 10-fold cross-validation and use InfoGainAttributeEval with the Ranker search method, how does this all work? Do you hold out one fold for testing and use the rest for training?

Thank you for the prompt reply.
Goce

Mark
04-03-2012, 10:22 PM
There is no "testing" component in this case. For each fold, 1/10th of the data is simply held out and the attribute evaluation (info gain in this case) is computed on the remaining 9/10ths. The purpose is just to get a feeling for how stable the metrics and attribute selection techniques are under slight changes in the data distribution.
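
Roughly speaking, it amounts to something like this sketch (data.arff is a placeholder; Weka's own implementation also takes care of seeding and stratification):

import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class FoldwiseRanking {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data.arff");     // placeholder file name
        data.setClassIndex(data.numAttributes() - 1);

        int folds = 10;
        for (int fold = 0; fold < folds; fold++) {
            // 9/10ths of the data; the remaining 1/10th is simply left out.
            Instances split = data.trainCV(folds, fold);

            AttributeSelection selector = new AttributeSelection();
            selector.setEvaluator(new InfoGainAttributeEval());
            selector.setSearch(new Ranker());
            selector.SelectAttributes(split);

            double[][] ranked = selector.rankedAttributes(); // rows of {attribute index, merit}
            System.out.println("Fold " + fold + ", top attribute index: " + (int) ranked[0][0]);
        }
    }
}

Comparing the rankings across the folds gives you the stability picture mentioned above.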

Cheers,
Mark.

krishna27
04-22-2014, 05:02 AM
Hi mark,

I am trying to use the Weka source code for attribute ranking. I want only a list of attributes ordered by rank as output. I understand that there is a Ranker module in Weka that I can use to get the ranking, but I want to know the exact set of operations (search, evaluation, etc.) that I need to perform before I can do the ranking. I have tried coding it, but I am naive and inexperienced at using the Weka code for ranking, so please help me with this.

public void attributeranker(Instances table) throws Exception {
    weka.attributeSelection.Ranker atrank = new weka.attributeSelection.Ranker();
    weka.attributeSelection.GainRatioAttributeEval grat = new weka.attributeSelection.GainRatioAttributeEval();
    weka.attributeSelection.AttributeSelection atsel = new weka.attributeSelection.AttributeSelection();

    System.out.println(atsel.selectedAttributes());

    double gainratio = grat.evaluateAttribute(1);
    System.out.println("evaluation value 1: " + gainratio);

    double[][] rankedattrs = atrank.rankedAttributes();
    for (int i = 0; i < rankedattrs.length; i++)
        for (int j = 0; j < rankedattrs.length; j++)
            System.out.println(rankedattrs[i][j]);
}

the exception error is:

Exception in thread "main" java.lang.Exception: Attribute selection has not been performed yet!
at weka.attributeSelection.AttributeSelection.selectedAttributes(AttributeSelection.java:156)
at krishna.processing.reducts.Myreductalgorithm.attributeranker(Myreductalgorithm.java:132)
at krishna.examples.implementer.main(implementer.java:70)

I don't know the exact implementation, but if you could guide me on this, it would be awesome.


People please help me.

Mark
04-23-2014, 05:22 AM
After constructing your AttributeSelection object you'll need to do:

Instances myTrainingData = ... // training data obtained from somewhere

atsel.setEvaluator(grat);               // the evaluator scores each attribute (GainRatioAttributeEval here)
atsel.setSearch(atrank);                // the Ranker turns those scores into an ordered list
atsel.SelectAttributes(myTrainingData); // runs the selection; builds the evaluator internally
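
Once SelectAttributes() has run, the ranking can be read back, e.g.:

System.out.println(atsel.toResultsString());  // full attribute selection report
double[][] ranked = atsel.rankedAttributes(); // rows of {attribute index, merit}, best first

The exception you saw is thrown because selectedAttributes() was called before SelectAttributes() had performed the search; the same goes for rankedAttributes(), and evaluateAttribute() needs buildEvaluator() (which SelectAttributes() calls for you) to have run first.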

Cheers,
Mark.