Document clustering using weka k-means



estella
06-06-2008, 12:26 PM
Hello!

I've been using Weka in my diploma thesis for quite a while, in order to cluster documents according to synonyms etc. I've worked quite a lot on StringToWordVector for the preprocessing, and now I want to use the k-means algorithm for the clustering. However, the results are not very satisfactory, and I was thinking that it would be important to take weights such as tf-idf into account. Do you have any idea how I could implement this in Weka's k-means? And as for the distance, should I keep using the Euclidean distance, or could I change it to something else? Any suggestions would be more than valuable! :)

Mark
06-06-2008, 10:08 PM
Hi,

Are you talking about attribute or instance weights? For nearest neighbor-type distance-based weighting you could take a look at the code in IBk and the classes it uses for computing distances (in weka.core). SimpleKMeans really needs to be updated to use pluggable distance measures like IBk does.

As for distance measures for text problems, have you looked into the cosine distance/similarity for information retrieval? Here is a pretty accessible tutorial:

http://www.miislita.com/information-retrieval-tutorial/cosine-similarity-tutorial.html

Cheers,
Mark.
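
To make the cosine suggestion concrete, here is a minimal sketch in plain Python (not Weka code). The function names and the toy word-count vectors are made up for illustration, and returning 0 similarity for an all-zero vector is just one possible convention.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length term-weight vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumed convention for documents with no terms
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    """Distance form: 0 for the same direction, 1 for no shared terms."""
    return 1.0 - cosine_similarity(a, b)

# Hypothetical word-count vectors over a three-word vocabulary.
doc_short = [1, 2, 0]
doc_long = [10, 20, 0]   # same word proportions, ten times longer
doc_other = [0, 0, 5]    # shares no words with the other two

print(cosine_distance(doc_short, doc_long))
print(cosine_distance(doc_short, doc_other))
```

Unlike the Euclidean distance, the cosine distance ignores document length: doc_short and doc_long point in the same direction, so their distance is (up to rounding) zero, while documents with no words in common are at distance 1.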

estella
06-07-2008, 05:29 AM
Thanks a lot for replying so fast! Well, basically, I'm talking about weighted attributes, meaning that for example words that appear more/less frequently in documents are more/less important for the clustering, something like that...

Hmm, as I'm not very familiar with the whole of Weka yet, could you please let me know what IBk is? As for cosine similarity, I've already come across it in papers and I think it might help. I've just started reading the tutorial you pointed me to, thank you very much!
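
The kind of weighting described above can be sketched as follows, in plain Python rather than Weka code. The weight used is raw term count times log(N / document frequency), which is only one of several common tf-idf variants; the function name and the three toy documents are made up for illustration. Inside Weka, it is worth checking the TF and IDF transform options of StringToWordVector in your version before writing anything by hand.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns (vocabulary, list of weight vectors).

    Weight = raw term count * log(N / df), an assumed tf-idf variant.
    """
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each word occurs.
    df = Counter(tok for doc in docs for tok in set(doc))
    idf = {tok: math.log(n_docs / df[tok]) for tok in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[tok] * idf[tok] for tok in vocab])
    return vocab, vectors

# Hypothetical toy corpus, already tokenized.
docs = [
    "the cat sat".split(),
    "the dog sat".split(),
    "the cat ran".split(),
]
vocab, vectors = tfidf_vectors(docs)
for doc, vec in zip(docs, vectors):
    print(doc, [round(w, 3) for w in vec])
```

A word that occurs in every document ("the" here) gets weight 0 and stops influencing the distance, while a word found in only one document gets the largest weight.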

Mark
06-07-2008, 08:53 PM
Hi,

IBk is Weka's implementation of the k nearest neighbors algorithm for classification/regression. You can find it in the weka.classifiers.lazy package.

Cheers,
Mark.

ebiosca
06-09-2008, 04:57 AM
Hi estella,
I use Weka for classifying and preprocessing data sets.
Weka has the XMeans algorithm for generating clusters. By default you can use it with 3 distances: Euclidean, Manhattan and Chebyshev (select the algorithm and change the configuration parameters).
It's also possible to create clusters with XMeans and then apply nearest neighbors.

I hope these tips are correct and useful. With Mark's permission :)
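
For reference, the three distances named above differ only in how they combine the per-attribute differences. A minimal sketch in plain Python (not the Weka implementation; the function names and the two example points are made up for illustration):

```python
import math

def euclidean(a, b):
    """Square root of the sum of squared per-attribute differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute per-attribute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def chebyshev(a, b):
    """Largest absolute per-attribute difference."""
    return max(abs(x - y) for x, y in zip(a, b))

# Hypothetical two-attribute instances.
p = [0, 0]
q = [3, 4]

print(euclidean(p, q))
print(manhattan(p, q))
print(chebyshev(p, q))
```

For sparse word vectors, all three are sensitive to document length, which is one reason the cosine measure is often preferred for text.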