Unproportioned classes.



neub
02-01-2008, 12:16 PM
Hello,

I have two classes (c0 and c1) to identify, but they do not occur in the same ratio (roughly 10:1).

If I use an SVM (or another algorithm) as the classifier, I get a low error rate on my testing set (about 90% accuracy), but not the classification I want:

Result on testing set:
----------------------------------
TP: 0.99 FP: 0.98 (c0 -> 1000 instances)
TP: 0.02 FP: 0.02 (c1 -> 100 instances)
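
For reference, these rates come from a run along these lines (a minimal sketch; the ARFF file names are made up, and I use Weka's SMO as the SVM):

import weka.classifiers.Evaluation;
import weka.classifiers.functions.SMO;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ImbalanceEval {
    public static void main(String[] args) throws Exception {
        // hypothetical file names; the test set holds 1000 c0 / 100 c1 instances
        Instances train = DataSource.read("train.arff");
        Instances test = DataSource.read("test.arff");
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        SMO svm = new SMO();                  // Weka's SVM implementation
        svm.buildClassifier(train);

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(svm, test);

        // per-class TP/FP rates, as in the table above (index 0 = c0, 1 = c1)
        System.out.printf("c0  TP: %.2f  FP: %.2f%n",
                eval.truePositiveRate(0), eval.falsePositiveRate(0));
        System.out.printf("c1  TP: %.2f  FP: %.2f%n",
                eval.truePositiveRate(1), eval.falsePositiveRate(1));
    }
}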

I've tried weka.filters.supervised.instance.Resample with the bias to uniform class set to 1.0. The resulting training set is composed of 1000 instances of each of c0 and c1.
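
In code, that resampling step is roughly the following (continuing the sketch above; the sample size percentage is my assumption for growing the 1100-instance set to about 1000 instances per class):

import weka.filters.Filter;
import weka.filters.supervised.instance.Resample;

Resample resample = new Resample();
resample.setBiasToUniformClass(1.0);   // -B 1.0: bias towards a uniform class distribution
resample.setSampleSizePercent(182.0);  // assumption: ~2000 instances from the 1100-instance set,
                                       // i.e. roughly 1000 per class after the uniform bias
resample.setRandomSeed(1);             // arbitrary seed
resample.setInputFormat(train);
Instances balanced = Filter.useFilter(train, resample);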

This gives me a much more desirable score:

Result on testing set:
----------------------------------
TP: 0.80 FP: 0.30 (c0 -> 1000 instances)
TP: 0.70 FP: 0.20 (c1 -> 100 instances)

However, a 20% FP rate is still too much (maybe that's the limit of my data).

I was wondering whether it isn't strange to resample the data in this way.

Moreover, I've tried weka.filters.supervised.instance.SpreadSubsample -M 1.0 -X 0.0 -S 5. This subsamples the training set down to 100 instances of each of c0 and c1.
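
The same filter from Java, for reference (a sketch; train is the Instances object from the earlier snippet, and the setters mirror the -M, -X and -S options above):

import weka.filters.Filter;
import weka.filters.supervised.instance.SpreadSubsample;

SpreadSubsample spread = new SpreadSubsample();
spread.setDistributionSpread(1.0);  // -M 1.0: at most a 1:1 ratio between class counts
spread.setMaxCount(0.0);            // -X 0.0: no absolute cap on the per-class count
spread.setRandomSeed(5);            // -S 5
spread.setInputFormat(train);
Instances subsampled = Filter.useFilter(train, spread);  // ~100 instances of each class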

The result is quite similar to the first one, but the computation is much faster.

My question is: why do I need to do this resampling/subsampling at all? Is there no weight assignment that could accomplish the same task? And if there is nothing of the sort, which method is better, given that both give more or less the same result?

It may be important to know that the number of support vectors is much higher when the set is resampled (2000 training instances) than when it is subsampled (200 training instances).
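
To make the question concrete, this is the kind of weight assignment I have in mind instead of resampling (a sketch only; I'm assuming c1 is class index 1, and as far as I know SMO implements WeightedInstancesHandler, so the weights should be taken into account):

// give each c1 (minority) instance 10x weight instead of duplicating instances
for (int i = 0; i < train.numInstances(); i++) {
    if ((int) train.instance(i).classValue() == 1) {  // assumption: c1 is class index 1
        train.instance(i).setWeight(10.0);            // mirrors the 10:1 class ratio
    }
}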

neub
02-06-2008, 07:07 AM
I did not think of looking in the wekalist archives. Moreover, what I called "unproportioned" is usually called "unbalanced"...

Here is the advice that I've followed:

https://list.scms.waikato.ac.nz/pipermail/wekalist/2005-June/004341.html

and a list of other advice (http://www.google.com/custom?q=wekalist+unbalanced+data&domains=list.scms.waikato.ac.nz&sa=Search&ie=ISO-8859-1&oe=ISO-8859-1&safe=active&hl=en&sitesearch=list.scms.waikato.ac.nz)

I'm still confused about the SVM classifier, which should be able to adjust weights before classification...!
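
From those threads, I gather that the weight adjustment can be done with a cost-sensitive wrapper around the SVM rather than inside it; something like this, I think (a sketch, untested; the 10:1 costs are just my guess mirroring the class ratio):

import weka.classifiers.CostMatrix;
import weka.classifiers.functions.SMO;
import weka.classifiers.meta.CostSensitiveClassifier;

CostSensitiveClassifier csc = new CostSensitiveClassifier();
// rows = actual class, columns = predicted class:
// misclassifying an actual c1 as c0 costs 10, the reverse costs 1
csc.setCostMatrix(CostMatrix.parseMatlab("[0.0 1.0; 10.0 0.0]"));
csc.setClassifier(new SMO());
csc.setMinimizeExpectedCost(false);  // false: reweight the training data by cost
csc.buildClassifier(train);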

Harri Saarikoski
02-06-2008, 07:31 AM
Hi.



I have two classes (c0 and c1) to identify, but they do not occur in the same ratio (roughly 10:1). If I use an SVM (or another algorithm) as the classifier, I get a low error rate on my testing set (about 90% accuracy), but not the classification I want:

Result on testing set:
----------------------------------
TP: 0.99 FP: 0.98 (c0 -> 1000 instances)
TP: 0.02 FP: 0.02 (c1 -> 100 instances)

I've tried weka.filters.supervised.instance.Resample with the bias to uniform class set to 1.0. The resulting training set is composed of 1000 instances of each of c0 and c1.

This gives me a much more desirable score:

Result on testing set:
----------------------------------
TP: 0.80 FP: 0.30 (c0 -> 1000 instances)
TP: 0.70 FP: 0.20 (c1 -> 100 instances)


-- A question: why would you want the TP rate for the class with less training data to be any higher, given that computing the overall accuracy in both cases gives

- for the original data: 0.99*1000 + 0.02*100 = 992 correct out of 1100 (90.2%)
- for the rebalanced data: 0.80*1000 + 0.70*100 = 870 correct out of 1100 (79.1%)
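
In other words, overall accuracy is just the class-size-weighted average of the per-class TP rates. As a small sketch:

// overall accuracy from per-class TP rates and class counts
static double accuracy(double tp0, int n0, double tp1, int n1) {
    return (tp0 * n0 + tp1 * n1) / (n0 + n1);
}
// accuracy(0.99, 1000, 0.02, 100) = 0.902  (original data)
// accuracy(0.80, 1000, 0.70, 100) = 0.791  (rebalanced data)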

The former is a better result in terms of overall accuracy, and 90% is hard to beat. Why would you want to undersample if the original proportions are the likely proportions in any new test data as well?



However, a 20% FP rate is still too much (maybe that's the limit of my data).

I was wondering whether it isn't strange to resample the data in this way.

Moreover, I've tried weka.filters.supervised.instance.SpreadSubsample -M 1.0 -X 0.0 -S 5. This subsamples the training set down to 100 instances of each of c0 and c1. The result is quite similar to the first one, but the computation is much faster.

My question is: why do I need to do this resampling/subsampling at all? Is there no weight assignment that could accomplish the same task? And if there is nothing of the sort, which method is better, given that both give more or less the same result?

It may be important to know that the number of support vectors is much higher when the set is resampled (2000 training instances) than when it is subsampled (200 training instances).

The number of support vectors increases with the size of the training data.


"In my experience, and many discussions with other in several workshops, undersampling is a bad idea, we should provide the learner as much instances as we can. Also, learning in unbalanced data sets keeping the skewed distribution is also a bad idea, from an engineering point of view: if the learner works optimally under 50/50 distribution, just provide it to the learner."

I agree with this post: there is an optimal breaking point in undersampling (i.e. only do it if you have plenty of instances per class).

Did I read your results correctly?
Harri