K means algorithm on mixed data



eliza.mmatei
03-26-2007, 05:13 AM
Hello everyone!

I'm a student from Romania and I'm writing a C# application for my diploma that includes the k-means algorithm.
I'm stuck on something and I would very much appreciate it if you could help me.
I want to apply clustering to surveys in order to classify people or products. But the variables I want to run k-means on are nominal (like education or favourite color), ordinal and binary (sex: M/F).
In this case, which distance formula can I use? Or should I transform the data into numeric data, and if so, how do I do that? All the examples and comments I found about the k-means algorithm were about numeric data, and I understand how they work, but what about mixed data?

Thank you very much,
Eliza :)

Christo
03-26-2007, 06:34 PM
Hi Eliza,

Transform each nominal variable into a set of binary variables (one per category), and then run k-means.
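
A minimal sketch of this first suggestion, written in Python for brevity (the same logic ports to C#). The survey rows, the column names, the helper names `one_hot` and `k_means`, and the choice of k = 2 are all made up for illustration. The k-means here is a bare Lloyd loop seeded with the first k rows, which is too naive for real data but keeps the example deterministic.

```python
# Sketch only: dummy-code nominal answers, then cluster the 0/1 vectors.
# All data and names below are hypothetical.

def one_hot(rows, columns):
    """Turn each nominal column into one 0/1 variable per observed category."""
    categories = {c: sorted({row[c] for row in rows}) for c in columns}
    encoded = [
        [1.0 if row[c] == cat else 0.0 for c in columns for cat in categories[c]]
        for row in rows
    ]
    names = [f"{c}={cat}" for c in columns for cat in categories[c]]
    return encoded, names


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, iterations=20):
    """Plain Lloyd iterations; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster goes empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


survey = [
    {"education": "high school", "color": "red"},
    {"education": "university", "color": "blue"},
    {"education": "high school", "color": "red"},
    {"education": "university", "color": "blue"},
]
encoded, names = one_hot(survey, ["education", "color"])
labels, centroids = k_means(encoded, k=2)
```

Each nominal variable with m categories becomes m binary columns, so a variable with many categories contributes more columns to the Euclidean distance than one with few.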

But the best way would be: transform your nominal data into binary variables, then use PCA (principal component analysis) or maybe correspondence analysis, and then run k-means on the new data (for example, on the subset of principal components that together account for 80% of the total inertia).
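
The "80% of the total inertia" rule can be sketched as follows, assuming the PCA step has already produced one eigenvalue (variance) per component. The function name `components_for_inertia` and the eigenvalues are hypothetical; the PCA itself is not shown.

```python
# Sketch only: pick how many leading principal components to keep.
# Assumes eigenvalues come from a PCA computed elsewhere.

def components_for_inertia(eigenvalues, threshold=0.80):
    """Smallest number of leading components whose share of inertia reaches threshold."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    if total <= 0:
        raise ValueError("eigenvalues must have a positive sum")
    running = 0.0
    for count, value in enumerate(ordered, start=1):
        running += value
        if running / total >= threshold:
            return count
    return len(ordered)


eigenvalues = [4.0, 3.0, 2.0, 1.0]            # made-up values
keep = components_for_inertia(eigenvalues)    # cumulative shares: 0.4, 0.7, 0.9
```

With these made-up eigenvalues the first three components are kept, and k-means would then run on the survey rows projected onto those three components.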

Christo

eliza.mmatei
03-27-2007, 02:27 AM
Hi!

Thank you for answering so quickly.
So you're saying I should transform all the data, nominal and ordinal, into binary.
I have two questions:
1. Should I now use the Jaccard distance for the binary data?
2. I understand you can use a weighted formula for mixed data, but where I read about it there was only the idea, without an example. Do you know anything about it?
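
For reference, both distances mentioned in these two questions can be sketched in a few lines of Python. This is an illustration under assumptions, not the formula from whatever source was read: `mixed_distance` is a weighted average of per-variable dissimilarities in the spirit of Gower's coefficient, and the `schema` layout, the variable names, the ordinal levels and the weights are all hypothetical.

```python
# Sketch only: Jaccard distance for 0/1 vectors and a weighted
# mixed-type distance. Names, levels and weights are made up.

def jaccard_distance(a, b):
    """1 - (positions where both are 1) / (positions where at least one is 1)."""
    both = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    either = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
    return 0.0 if either == 0 else 1.0 - both / either


def mixed_distance(x, y, schema, weights=None):
    """Weighted average of per-variable dissimilarities, each in [0, 1].

    schema maps a variable name to ("nominal",), ("binary",)
    or ("ordinal", ordered_levels).
    """
    weights = weights or {name: 1.0 for name in schema}
    total = 0.0
    for name, spec in schema.items():
        if spec[0] == "ordinal":
            # Rank difference scaled by the largest possible difference.
            levels = spec[1]
            d = abs(levels.index(x[name]) - levels.index(y[name])) / (len(levels) - 1)
        else:
            # Nominal and binary: simple mismatch.
            d = 0.0 if x[name] == y[name] else 1.0
        total += weights[name] * d
    return total / sum(weights[name] for name in schema)


schema = {
    "sex": ("binary",),
    "color": ("nominal",),
    "education": ("ordinal", ["primary", "high school", "university"]),
}
a = {"sex": "F", "color": "red", "education": "high school"}
b = {"sex": "F", "color": "blue", "education": "university"}

d_equal = mixed_distance(a, b, schema)                       # (0 + 1 + 0.5) / 3
d_weighted = mixed_distance(a, b, schema,
                            {"sex": 2.0, "color": 1.0, "education": 1.0})
d_jaccard = jaccard_distance([1, 0, 0, 1], [1, 1, 0, 0])     # 1 - 1/3
```

Jaccard ignores positions where both vectors are 0, which suits "attribute present/absent" data. For a symmetric binary variable such as sex, a simple mismatch (as in `mixed_distance`) is the more usual choice. Note also that averaging points into a centroid assumes Euclidean geometry, so a custom distance like this is normally paired with a variant that uses an actual data point as the cluster centre (k-medoids) rather than plain k-means.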

Eliza