View Full Version : K-means in weka: which distance measure?

10-20-2006, 08:06 AM
Hi all
I’m a student of Economics at Macerata University in Italy. I’m preparing my final thesis on market basket analysis, and I need to apply clustering with Weka to transactional datasets composed of binary categorical variables. My intent is to group the transactions into homogeneous clusters and then apply association rules to each of the obtained clusters. My questions are very simple, but I’m not an expert with this kind of data mining tool:
1) First, I need to know which distance measure is used by default in the k-means algorithm, because there is no option to choose it and I can’t find it documented anywhere. Is it perhaps the Euclidean distance?
2) Second, I would also like to know whether that measure fits well with binary (0/1) data or not.

Any suggestions and proposals are welcome!
Please help me soon!


10-20-2006, 09:08 AM

Weka uses the Euclidean distance for the k-means algorithm. If you need a different one, you would have to touch the code and override the 'distance()' method of the class 'weka.clusterers.SimpleKMeans'.

Boolean data is mapped to 0 for false and 1 for true. Any numeric data is normalized to fit into the range [0, 1] as well.
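That normalization step can be sketched in plain Java. This is only an illustration of min-max scaling into [0, 1], not Weka's actual code; the class and method names here are made up:

```java
// Sketch of min-max normalization: each numeric attribute is rescaled
// into [0, 1] so that no single attribute dominates the distance.
// Illustrative only -- not Weka's implementation.
public class MinMaxNormalize {
    public static double[] normalize(double[] values) {
        double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;
        for (double v : values) {
            if (v < min) min = v;
            if (v > max) max = v;
        }
        double range = max - min;
        double[] out = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            // a constant attribute (range == 0) carries no information; map it to 0
            out[i] = (range == 0) ? 0.0 : (values[i] - min) / range;
        }
        return out;
    }

    public static void main(String[] args) {
        double[] prices = {10.0, 20.0, 40.0};
        // rescales to {0.0, 1/3, 1.0}
        System.out.println(java.util.Arrays.toString(normalize(prices)));
    }
}
```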

Some attributes might be more important than others, so after they have been normalized, it is common practice to weight them (or, more primitively, to multiply each attribute by a factor):

dist = sqrt (weight1 * (instance1_att1 - instance2_att1)^2 + .. )

But Weka does not offer that option in the K-Means implementation.

So if all your attributes are equally important, Weka will return the correct cluster configuration; if you need to weight them to make your model fit reality better, you'll have to touch the sources.
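The weighted distance formula above can be sketched in plain Java. This is only an illustration of the formula, not Weka's actual implementation, and the class and method names are invented for the example:

```java
// Sketch of the weighted Euclidean distance described above:
//   dist = sqrt(sum_i  w_i * (a_i - b_i)^2)
// over attributes that have already been normalized into [0, 1].
// Illustrative only -- not Weka code.
public class WeightedEuclidean {
    public static double distance(double[] a, double[] b, double[] weights) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += weights[i] * diff * diff;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        double[] x = {1.0, 0.0, 1.0};
        double[] y = {0.0, 0.0, 1.0};
        double[] w = {1.0, 1.0, 1.0};  // equal weights reduce to plain Euclidean
        System.out.println(distance(x, y, w));  // instances differ in one attribute -> 1.0
    }
}
```

With all weights equal to 1 this reduces to the plain Euclidean distance, which is what Weka's SimpleKMeans effectively computes.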

Whether the Euclidean distance is the correct metric for you depends on what you want to do. Because the differences are squared, it penalizes instances more heavily the more 'different' they are.

For binary data, you don't have to worry at all. For values of zero and one, it does not matter whether you take a square root, a cubic root, or none at all: the result is still the same.
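That point can be shown concretely: for 0/1 values each squared difference is itself 0 or 1, so the Euclidean distance between two binary vectors is just the square root of the number of disagreeing attributes (the Hamming distance). A small sketch (names invented for the example):

```java
// Demonstrates that on binary data the squaring inside the Euclidean
// distance changes nothing: each per-attribute difference is -1, 0, or 1,
// so diff^2 == |diff|, and the Euclidean distance is simply
// sqrt(Hamming distance). Illustrative only -- not Weka code.
public class BinaryDistance {
    public static int hamming(int[] a, int[] b) {
        int count = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i] != b[i]) count++;
        }
        return count;
    }

    public static double euclidean(int[] a, int[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;  // for binary data, diff * diff == |diff|
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        int[] t1 = {1, 0, 1, 1, 0};
        int[] t2 = {1, 1, 0, 1, 0};
        System.out.println(hamming(t1, t2));    // 2 disagreeing attributes
        System.out.println(euclidean(t1, t2));  // sqrt(2)
    }
}
```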


10-21-2006, 01:58 AM
Thank you, Thomas, for your fast answer.
My transactional dataset has attributes on the columns, representing the 160 different products of an e-commerce website, and transactions on the rows (3290 of them). So each transaction is a series of 1s and 0s representing the presence or absence of the different products. I have no information about the products (attributes), so for me they are all equally important. I asked about the Euclidean distance because some people told me that, for computing clusters on binary data in a large dataset, it would be better to use a similarity measure like the Jaccard coefficient. Is that true? Or how can I argue that the Euclidean distance is also right and doesn't lead me to an incorrect cluster configuration?

Thank you so much!

10-22-2006, 05:08 AM
And again, there is no clear 'better' here, as it depends on the situation. The Jaccard coefficient is the right one if your flags are not symmetric. That is the case if, for instance, one of the states 'true' or 'false' carries no meaning in the context of your question while the other one does.

(Symmetric is defined as: the information contained in 'true' is the same as the information contained in 'false'. Information here refers to Shannon's information entropy.)

Let's look at a real-world example. Assume we want to look at the feedback we received from our customers. (Customers can give positive and negative feedback at the same time, or no feedback at all.)
Now we have the following dataset:

int Customer,
int Product,
boolean Positive_Feedback,
boolean Negative_Feedback

Now, if our booleans were symmetric, the absence of positive or negative feedback would mean that everything is fine. But in the real world, it is more likely that our customers simply gave no feedback at all. So if 'Positive_Feedback' is true, we can tell that the customer gave us a thumbs-up. But if 'Positive_Feedback' is false, we can't tell anything about the customer's opinion: in that case, the flag simply tells us that no data is available.

So in your case: if the absence of a product from a basket is just as informative as its presence, use the Euclidean distance. Otherwise, use the Jaccard coefficient instead.
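For comparison, the Jaccard coefficient on two binary transactions counts only shared 1s (products bought by both) as evidence of similarity and ignores shared 0s entirely. A minimal sketch (this is not a Weka option; SimpleKMeans would need to be modified to use it, and the names here are invented):

```java
// Sketch of the Jaccard coefficient for binary transactions:
//   similarity = |A intersect B| / |A union B|
// over the attributes set to 1. Shared 0s (products neither customer
// bought) do not affect the result. Illustrative only -- not Weka code.
public class Jaccard {
    public static double similarity(int[] a, int[] b) {
        int both = 0, either = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i] == 1 && b[i] == 1) both++;
            if (a[i] == 1 || b[i] == 1) either++;
        }
        // convention: two all-zero baskets are treated as identical
        return (either == 0) ? 1.0 : (double) both / either;
    }

    public static void main(String[] args) {
        int[] t1 = {1, 1, 0, 0, 0};
        int[] t2 = {1, 0, 1, 0, 0};
        // one shared product, three distinct products -> similarity 1/3,
        // regardless of how many trailing 0s both baskets share
        System.out.println(similarity(t1, t2));
    }
}
```

Note the contrast with the Euclidean distance: appending a thousand products that neither customer bought leaves the Jaccard similarity unchanged, which is exactly why it is preferred when the 0s carry little information.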

The only way to 'know' which measure is good is to look at the results. In data mining, it is common to have some preclassified datasets so that you can train your algorithms and have some way to validate their quality. (Or, in plain words: a model is likely to be valid if it is at least able to represent the current reality.)


11-15-2006, 02:40 PM

The latest version of Weka (available from CVS) includes a new XML-based data representation. This representation is richer than the standard ARFF one and includes the ability to specify attribute weights (previously this was only possible programmatically, in the Attribute class). As Thomas has pointed out, only trivial changes are needed to SimpleKMeans to take advantage of attribute weights.

Unfortunately, automatic attribute selection/weighting for k-means (and EM-based clustering) is somewhat more problematic. Both the squared error (Euclidean distance) and the log-likelihood are biased towards lower dimensions.