ttoulliu2002

07-24-2007, 12:17 AM

Hi:

I am using Weka 3.5.6 for a classification problem. The training set has 38 samples with 7129 attributes. I used 10-fold cross-validation when building the model, and the ROC area is above 0.9. But when I evaluated the model on an independent test set of 34 samples with the same 7129 attributes, I got a very low ROC area of about 0.5. I have tried all the classification algorithms, and all of them show the same issue: low ROC area on the test set.

It looks like overfitting: the model does well in cross-validation but fails on the test set. I am pasting one of my results below. How can I resolve this issue?

Scheme: weka.classifiers.functions.SMO -C 1.0 -L 0.0010 -P 1.0E-12 -N 0 -V -1 -W 1 -K "weka.classifiers.functions.supportVector.PolyKernel -C 250007 -E 1.0"

Relation: cancer

Instances: 38

Attributes: 7130

[list of attributes omitted]

Test mode: 10-fold cross-validation

=== Classifier model (full training set) ===

SMO

Kernel used:

Linear Kernel: K(x,y) = <x,y>

Classifier for classes: A, B

BinarySMO

Machine linear: showing attribute weights, not support vectors.

0.0054 * (normalized) X1

+ 0.0017 * (normalized) X2

- 0.4598

Number of kernel evaluations: 647 (94.086% cached)

=== Stratified cross-validation ===

=== Summary ===

Correctly Classified Instances          36               94.7368 %
Incorrectly Classified Instances         2                5.2632 %
Kappa statistic                          0.8648
K&B Relative Info Score               3271.0797 %
K&B Information Score                   28.7188 bits      0.7558 bits/instance
Class complexity | order 0              33.192  bits      0.8735 bits/instance
Class complexity | scheme             2148      bits     56.5263 bits/instance
Complexity improvement     (Sf)      -2114.808  bits    -55.6528 bits/instance
Mean absolute error                      0.0526
Root mean squared error                  0.2294
Relative absolute error                 12.6005 %
Root relative squared error             50.395  %
Total Number of Instances               38

=== Detailed Accuracy By Class ===

TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
1         0.182     0.931       1        0.964       0.909      A
0.818     0         1           0.818    0.9         0.909      B

=== Confusion Matrix ===

  a  b   <-- classified as
 27  0 |  a = A
  2  9 |  b = B

=== Re-evaluation on test set ===

User supplied test set

Relation: cancer

Instances: unknown (yet). Reading incrementally

Attributes: 7130

=== Summary ===

Correctly Classified Instances          19               55.8824 %
Incorrectly Classified Instances        15               44.1176 %
Kappa statistic                          0
Mean absolute error                      0.4412
Root mean squared error                  0.6642
Total Number of Instances               34

=== Detailed Accuracy By Class ===

TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
1         1         0.559       1        0.717       0.5        A
0         0         0           0        0           0.5        B

=== Confusion Matrix ===

  a  b   <-- classified as
 19  0 |  a = A
 15  0 |  b = B
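
As a sanity check on the numbers above, here is a minimal Python sketch that recomputes the accuracy and kappa statistic from the two confusion matrices. It is not part of Weka; the function name `accuracy_and_kappa` is just an illustrative choice, and it assumes rows are actual classes and columns are predicted classes, as in the Weka output.

```python
def accuracy_and_kappa(matrix):
    """Return (accuracy, Cohen's kappa) for a square confusion matrix.

    Assumption: rows are actual classes, columns are predicted classes.
    """
    n = len(matrix)
    total = sum(sum(row) for row in matrix)

    # Observed agreement: fraction of instances on the diagonal.
    observed = sum(matrix[i][i] for i in range(n)) / total

    # Agreement expected by chance, from the row and column totals.
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2

    if expected == 1.0:
        # Degenerate case: only one class appears as both actual and predicted.
        return observed, 0.0
    return observed, (observed - expected) / (1.0 - expected)


# Confusion matrices copied from the output above.
cv_matrix = [[27, 0], [2, 9]]     # 10-fold cross-validation
test_matrix = [[19, 0], [15, 0]]  # independent test set

for name, matrix in (("cross-validation", cv_matrix), ("test set", test_matrix)):
    acc, kappa = accuracy_and_kappa(matrix)
    print(f"{name}: accuracy = {acc:.4f}, kappa = {kappa:.4f}")
```

In the test-set matrix every instance falls in column a, so the classifier predicts class A for all 34 samples. That is consistent with the kappa of 0 and the ROC area of 0.5 reported above.
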

Thanks
