Hitachi Vantara Pentaho Community Forums

Thread: k-means|| clustering implementation with distributedWekaHadoop??

  1. #1
    Join Date
    Nov 2015

    Question k-means|| clustering implementation with distributedWekaHadoop??

    Dear all...

    Could anyone please give an example of k-means|| clustering with distributedWekaHadoop? I have already read Mark Hall's blog post [1]; it looks like the common k-means algorithm, but it is not clear enough how to configure, run, and evaluate a clustering on Hadoop. Compared to the traditional k-means algorithm [2] and another k-means enhancement [3], is k-means|| reasonably accurate when clustering large datasets? Does k-means|| guarantee a better or optimal result, or just better performance, i.e. faster computation, than the other k-means algorithms? I'm sorry for my newbie question; I really appreciate any help you can provide. Thank you.


  2. #2
    Join Date
    Aug 2006


    The Weka implementation is essentially the k-means|| algorithm described in your second reference. The only difference is that, in the k-means++ initialisation phase, it collapses the two passes required for each ++ initialisation iteration into one pass. The paper describes a process where, in each ++ iteration, the distances of all training points to the current "sketch" (i.e. the candidate start points) are computed, and then new candidates are selected to add to the sketch with probability proportional to their distance from the sketch. Using weighted reservoir sampling, these two phases can be done simultaneously in one pass over the data.

    I believe that the k-means|| algorithm gives results as good as sequential k-means++ given enough iterations (in fact, a number of iterations logarithmic in the initial cost). How this compares to the mini-batch stochastic variant in your third reference I don't know. The goal of k-means++ is to give a better (closer to the optimum) solution, and similarly for k-means|| (with some trade-offs for speed).

    Comparisons between Weka's Hadoop/Spark implementation and the desktop sequential k-means++ implementation are difficult, due to how the data is split in the distributed environments combined with differences in the random aspects of both. I think the best you can do (given a dataset that is computationally feasible in the sequential case) is to run both approaches multiple times and compare the best/average within-cluster sum of squared errors.
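    To illustrate the one-pass trick mentioned above, here is a rough Python sketch (not Weka's actual code; the function name and toy data are my own). It uses A-Res weighted reservoir sampling, where each point with weight w = d²(point, sketch) gets the key u^(1/w) for a uniform random u, and the k largest keys are kept, so distance computation and proportional sampling happen in a single sweep over the data:

    ```python
    import heapq
    import random

    def kmeans_parallel_step(data, sketch, k, rng=random):
        """One k-means|| candidate-selection iteration in a single pass.

        Instead of (1) computing every point's squared distance to the
        current sketch and then (2) sampling new candidates with
        probability proportional to that distance, weighted reservoir
        sampling does both at once: each point gets the key u**(1/w)
        with w = squared distance to the sketch, and the k largest
        keys win.
        """
        heap = []  # min-heap of (key, point); holds the current sample
        for x in data:
            # squared distance to the nearest point already in the sketch
            w = min(sum((a - b) ** 2 for a, b in zip(x, c)) for c in sketch)
            if w <= 0.0:
                continue  # point coincides with a sketch member
            key = rng.random() ** (1.0 / w)
            if len(heap) < k:
                heapq.heappush(heap, (key, x))
            elif key > heap[0][0]:
                heapq.heapreplace(heap, (key, x))
        return [x for _, x in heap]

    # Toy usage: grow a sketch from a single start point.
    rng = random.Random(42)
    data = [(rng.uniform(0, 10), rng.uniform(0, 10)) for _ in range(200)]
    sketch = [data[0]]
    sketch += kmeans_parallel_step(data, sketch, k=5, rng=rng)
    ```

    In the distributed setting, each mapper would run this loop over its data split and the candidate samples would be merged in the reducer, which is what makes the single-pass formulation attractive on Hadoop.
    
    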



Copyright © 2005 - 2019 Hitachi Vantara Corporation. All Rights Reserved.