add cluster with weka



francy_faraone
10-16-2007, 07:02 AM
Hello.
I have a problem: after running the SimpleKMeans clustering algorithm, I want to add the cluster attribute to the initial data set. The problem is that the results from SimpleKMeans do not correspond to the results I get after applying the unsupervised attribute filter "AddCluster". The results I am referring to are shown below:


Clustered instances from SimpleKMeans:
0 108
1 79
2 71
3 76
4 149
(total 483)


Clustered instances after applying the filter:
cluster1 86
cluster2 79
cluster3 93
cluster4 83
cluster5 142
(total 483)

I used the same settings for both the clustering algorithm and the filter, but the distribution of the instances is different. Why? Which result should I trust?
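One thing worth ruling out is a pure relabeling of clusters, since the two outputs number them differently (0 to 4 versus cluster1 to cluster5). A minimal sketch of that check, using only the counts listed above; the helper name `same_distribution` is made up for illustration and is not part of Weka:

```python
def same_distribution(counts_a, counts_b):
    """True if two lists of cluster sizes match once cluster labels are ignored."""
    return sorted(counts_a) == sorted(counts_b)


simple_kmeans = [108, 79, 71, 76, 149]  # clusters 0..4 from SimpleKMeans
add_cluster = [86, 79, 93, 83, 142]     # cluster1..cluster5 from the filter

print(sum(simple_kmeans), sum(add_cluster))            # 483 483
print(same_distribution(simple_kmeans, add_cluster))   # False
```

Both runs cover all 483 instances, but the sizes differ even after sorting, so the two runs really did group the instances differently.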

Kind regards.

Maria Francesca.

Mark
10-16-2007, 06:32 PM
Hi there,

I can't seem to reproduce the problem using the current development version of Weka (3.5.6). I've run SimpleKMeans on the iris data (with class removed) generating 5 clusters from the Explorer, the command line, and the AddCluster filter. All three give me the same distribution of instances into clusters. What version of Weka are you using?

Cheers,
Mark.

francy_faraone
10-17-2007, 09:04 AM
My version of Weka is 3.5.2. Do you think it would be better for me to download your version?

Thank you very much.

Mark
10-17-2007, 04:34 PM
I've re-done my experiment using 3.5.2 and still don't have a problem (I also tried 5 clusters on the glass data set as well). Is there any chance you can send me your data set?

Cheers,
Mark.

francy_faraone
10-18-2007, 03:33 AM
Dear Mark, I tried again and now it works. Thank you very much! I'll download the new version of Weka anyway.

donandre
06-28-2012, 06:00 AM
Originally posted by Mark:
"I've re-done my experiment using 3.5.2 and still don't have a problem (I also tried 5 clusters on the glass data set as well). Is there any chance you can send me your data set?"

Sorry to revive this old thread, but I ran into the same problem and want to give some background on WHY it sometimes works and sometimes doesn't. Maybe someone else who finds this thread through Google (like me) will find this useful. I'm using Weka 3.6.7, which at the time of this post is the latest stable version, so this "bug" is still present. It's actually not a bug but a matter of not using Weka correctly. I have to add, however, that Weka makes it difficult to use correctly, so I think something could be done to avoid this in the future.

The problem revolves around the attribute you have chosen as your "class" in the "Preprocess" tab. After an import (e.g. of a CSV), Weka automatically picks the last attribute as the class. Clustering data typically has no class attribute, however, so you always want to set this to "No class". The difficulty is that Weka resets the class to the last attribute whenever you perform an action, such as applying a "Randomize" filter or using "Undo". It never seems to stick with "No class".
Now, why is there confusion? Let's say you forget to set "No class", or you think you have set it but are not aware that Weka has reset it in the meantime. Then you go to the "Cluster" tab and find that it has its own, separate list of ignored attributes. The clustering algorithms there don't care at all about which class attribute is set in "Preprocess". You run some clusterings and then return to "Preprocess" to add the cluster (actually that's post-processing, but okay). Now the "AddCluster" filter DOES care about the class attribute! It will give you different results whenever the "ignored attributes" list in the Cluster tab and the filter's "ignoredAttributeIndices" PLUS the class attribute do not describe the same set of excluded input attributes.
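To illustrate the effect, here is a rough sketch in plain Python. It is NOT Weka's SimpleKMeans (no normalization, no random seed; the first k rows simply serve as initial centroids), and the toy data and the function names `kmeans` and `cluster_sizes` are made up for this post. It only shows that clustering on all attributes and clustering with the last attribute skipped can split the same rows differently:

```python
from collections import Counter


def kmeans(rows, k, iterations=20):
    """Plain k-means; the first k rows are the initial centroids (deterministic)."""
    centroids = [list(row) for row in rows[:k]]
    assignment = [0] * len(rows)
    for _ in range(iterations):
        # Assign every row to its nearest centroid (squared Euclidean distance).
        assignment = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(row, centroids[c])))
            for row in rows
        ]
        # Move each centroid to the mean of the rows assigned to it.
        for c in range(k):
            members = [row for row, a in zip(rows, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


def cluster_sizes(rows, used_columns, k):
    """Cluster on the selected columns only and return the sorted cluster sizes."""
    projected = [[row[i] for i in used_columns] for row in rows]
    return sorted(Counter(kmeans(projected, k)).values())


# Toy data: column 0 is a regular attribute, column 1 plays the role of the
# last attribute that gets picked as "class" by default.
rows = [(0.0, 0.0), (1.0, 10.0), (0.4, 0.0), (9.0, 0.0), (10.0, 10.0), (9.5, 0.0)]

print(cluster_sizes(rows, [0, 1], 2))  # all attributes used: [2, 4]
print(cluster_sizes(rows, [0], 2))     # last attribute skipped: [3, 3]
```

Same rows, same k, same starting points, yet the cluster sizes differ purely because one run sees the last attribute and the other does not.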

So some suggestions for WEKA:

The class combobox should stick to the selection made and not reset itself to the default last attribute. This is really annoying!
I also think WEKA should place the class combobox a bit more prominently, maybe in the Attributes group box where the "All/None/Invert/Pattern" buttons are. It's easily overlooked because it's squeezed between two graphs.
I further think that in the Cluster tab the class attribute should automatically be added to the ignored attributes list. I sometimes forget about this.

Mark
06-29-2012, 04:26 AM
Probably we should change the behavior of the AddCluster filter - it should behave the same way regardless of whether a class attribute is set in the data or not. Since it has an option to specify the attribute to ignore, this would then make it the same as how the Cluster tab operates.

Many unsupervised filters do skip over the class attribute if it is set in the data. This is particularly useful - for example, you wouldn't usually want the Normalize filter (which is unsupervised) to normalize the class attribute in your data. For this reason (and for all supervised filters too) we have a class combo box on the Preprocess panel. Note that the class set in the Preprocess panel is applicable only to the Preprocess panel (the Classify panel and Select Attributes panel have their own class combo boxes).

Whenever a new set of instances is set on the Preprocess panel (via loading a data set or applying a filter) the default is to set the class to the last attribute in the data. In many cases it would not be possible to maintain the last setting - e.g. a new dataset entirely is loaded; the filter applied removes various attributes (including the previously set class); the filter applied completely transforms the attribute space, etc.

Cheers,
Mark.

donandre
07-02-2012, 02:14 AM
Originally posted by Mark:
"Probably we should change the behavior of the AddCluster filter - it should behave the same way regardless of whether a class attribute is set in the data or not. Since it has an option to specify the attribute to ignore, this would then make it the same as how the Cluster tab operates."

Yes, it doesn't seem logical that one ignores the class and the other does not.


Originally posted by Mark:
"Whenever a new set of instances is set on the Preprocess panel (via loading a data set or applying a filter) the default is to set the class to the last attribute in the data. In many cases it would not be possible to maintain the last setting - e.g. a new dataset entirely is loaded; the filter applied removes various attributes (including the previously set class); the filter applied completely transforms the attribute space, etc."

Okay, I understand that loading a new dataset invalidates the class attribute, but I don't understand how shuffling the instances affects which attribute is set to be the class attribute.
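To make the distinction concrete, here is a sketch of a policy that keeps the selection whenever the previously chosen attribute still exists and only falls back to the last attribute otherwise. This is purely hypothetical and not how Weka behaves; `resolve_class_index` and the attribute lists are invented for illustration:

```python
def resolve_class_index(old_attributes, old_class_index, new_attributes):
    """Pick the class index after the data changed; None stands for "No class".

    Hypothetical policy, not Weka's actual behavior.
    """
    if old_class_index is None:
        return None  # the user chose "No class", so keep it that way
    old_name = old_attributes[old_class_index]
    if old_name in new_attributes:
        # The attribute survived the filter (e.g. Randomize): keep it selected.
        return new_attributes.index(old_name)
    # The attribute is gone (removed or transformed): default to the last one.
    return len(new_attributes) - 1


attrs = ["sepallength", "sepalwidth", "petallength", "petalwidth"]

# Shuffling instances leaves the attributes untouched, so the choice can be kept.
print(resolve_class_index(attrs, None, attrs))  # None
print(resolve_class_index(attrs, 1, attrs))     # 1

# A filter removed the old class attribute, so fall back to the last attribute.
print(resolve_class_index(attrs, 1, ["sepallength", "petallength", "petalwidth"]))  # 2
```

A filter like "Randomize" falls into the first two cases, which is why resetting the class there seems unnecessary to me.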