Hitachi Vantara Pentaho Community Forums
Thread: Clustering probabilities for test instances

  1. #1
    Join Date
    Jul 2016
    Posts
    4

    Hey guys,

    I am exploring a fuzzy clustering approach, and to do it properly I am looking for a way to get the probability of each cluster for the test instances.

    I know that I can save the ARFF file from classification and get the prediction probabilities, but I want the clustering output.

    When I save the ARFF file from the visualization GUI, I get the following:

    0,ID12101,48,FEMALE,INNER_CITY,17546,NO,1,NO,NO,NO,NO,YES,cluster2

    This is presumably the cluster prediction, but it contains no probability.

    How can I get, for each test item, the probability of belonging to each cluster?

  2. #2
    Join Date
    Aug 2006
    Posts
    1,741

    Which clusterer are you using? To get cluster membership probabilities you need to use a DensityBasedClusterer such as EM. Any clusterer can be made into a DensityBasedClusterer by wrapping it in the MakeDensityBasedClusterer meta clusterer. This method fits normal distributions and discrete distributions to the attribute values within each cluster. These estimators are independent of one another given the cluster, so it is very much like the naive Bayes classifier.

    Given a DensityBasedClusterer you can append predicted probabilities to your data by using either the ClusterMembership filter or the Knowledge Flow (via a PredictionAppender step).
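    To illustrate the idea of independent per-attribute estimators within each cluster, here is a minimal sketch in plain Java. It is not Weka code and does not reproduce Weka's implementation: the class name MembershipSketch, its methods, and the example parameters are all hypothetical, and it handles numeric attributes only (normal distributions, no discrete ones). It sums log densities per cluster and normalizes them into membership probabilities.

```java
import java.util.Arrays;

// Hypothetical illustration only; not part of Weka.
final class MembershipSketch {

    /** Log density of a univariate normal distribution at x. */
    static double logNormal(double x, double mean, double stdDev) {
        double z = (x - mean) / stdDev;
        return -0.5 * z * z - Math.log(stdDev) - 0.5 * Math.log(2.0 * Math.PI);
    }

    /**
     * Membership probabilities of one instance for each cluster.
     * Attributes are treated as independent given the cluster, so the
     * per-attribute log densities are simply added (as in naive Bayes).
     * Indexing: priors[cluster], means[cluster][attribute], stdDevs[cluster][attribute].
     */
    static double[] membership(double[] x, double[] priors,
                               double[][] means, double[][] stdDevs) {
        int numClusters = priors.length;
        double[] logJoint = new double[numClusters];
        double max = Double.NEGATIVE_INFINITY;
        for (int c = 0; c < numClusters; c++) {
            double sum = Math.log(priors[c]);
            for (int j = 0; j < x.length; j++) {
                sum += logNormal(x[j], means[c][j], stdDevs[c][j]);
            }
            logJoint[c] = sum;
            max = Math.max(max, sum);
        }
        // Subtract the maximum before exponentiating to avoid underflow,
        // then normalize so the probabilities sum to 1.
        double[] probs = new double[numClusters];
        double total = 0.0;
        for (int c = 0; c < numClusters; c++) {
            probs[c] = Math.exp(logJoint[c] - max);
            total += probs[c];
        }
        for (int c = 0; c < numClusters; c++) {
            probs[c] /= total;
        }
        return probs;
    }

    public static void main(String[] args) {
        double[] priors = {0.5, 0.5};
        double[][] means = {{0.0, 0.0}, {2.0, 2.0}};
        double[][] stdDevs = {{1.0, 1.0}, {1.0, 1.0}};
        // A point halfway between the two cluster means.
        double[] x = {1.0, 1.0};
        System.out.println(Arrays.toString(membership(x, priors, means, stdDevs)));
    }
}
```

    With the made-up parameters in main, the point is equidistant from both cluster means, so each cluster gets probability 0.5. Working in log space is a design choice of this sketch that keeps the normalization stable when many attributes are multiplied together.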

    Cheers,
    Mark.

  3. #3
    Join Date
    Jul 2016
    Posts
    4

    Thanks for your response, Mark!
    My dataset has two fields: an id and a string. I would like to cluster the entities with EM based on the strings (using the StringToWordVector filter, of course).
    I have tried both the WEKA GUI and Java code, and had issues with both.
    In the WEKA GUI I configured ClusterMembership with "MakeDensityBasedClusterer", then "FilteredClusterer", where the clusterer is "EM" and the filter is StringToWordVector (with IDF, lowercasing, normalization of all data, SnowballStemmer and Rainbow stopwords). When I run it, I get "Can't normalize array. Sum is NaN".
    Then I wrote code that defines the filter with the same properties as in the WEKA GUI, clusters with EM, and prints the distributions using clusterer.distributionForInstance(filteredData...). It runs and prints output, but for each entry it assigns 1 to one of the clusters and 0 to all the others.
    Could you tell me what the problem might be? When I used another dataset as input, one that doesn't require the TF/IDF transformation, I got reasonable distributions rather than just 0 and 1.

    Thanks!
    Slava

    Update: I just checked, and the same code works perfectly fine if I replace clustering with classification.
    Last edited by slavenya; 07-10-2016 at 06:29 PM.

  4. #4
    Join Date
    Aug 2006
    Posts
    1,741

    I'm not too sure what is going on in your case, but I'd try using the MultiFilter. All you'd need is the StringToWordVector first in the list, followed by the ClusterMembership filter using EM. There is no need to wrap EM in MakeDensityBasedClusterer, as EM is already a density estimator. This works fine for me (at least with default settings in the StringToWordVector filter) on a Reuters news dataset.

    Cheers,
    Mark.

  5. #5
    Join Date
    Jul 2016
    Posts
    4

    When I use MultiFilter as you described on ReutersCorn-train.arff, I click Edit after the run finishes and see that, with 6 clusters, each entry is almost always assigned to a single cluster (with probability 1 or 0.999999, and ~0 for the others). That doesn't look like EM to me. Is it supposed to work this way? I changed the number of clusters to 3 and to 10 and got the same results.

    (Configuration is weka.filters.MultiFilter -F "weka.filters.unsupervised.attribute.StringToWordVector -R first-last -W 2500 -prune-rate -1.0 -I -N 1 -L -stemmer \"weka.core.stemmers.SnowballStemmer -S porter\" -stopwords-handler weka.core.stopwords.Rainbow -M 1 -O -tokenizer \"weka.core.tokenizers.WordTokenizer -delimiters \\\" \\\\r\\\\n\\\\t.,;:\\\\\\\'\\\\\\\"()?!\\\"\"" -F "weka.filters.unsupervised.attribute.ClusterMembership -W weka.clusterers.EM -- -I 100 -N 6 -X 10 -max -1 -ll-cv 1.0E-6 -ll-iter 1.0E-6 -M 1.0E-6 -K 10 -num-slots 1 -S 50").
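    One possible explanation for near-0/1 memberships, not confirmed anywhere in this thread, is the number of attributes. If the density model treats attributes as independent given the cluster (as described earlier for MakeDensityBasedClusterer), every attribute adds its own term to the log-likelihood ratio between two clusters, so with thousands of word attributes even small per-attribute differences add up to an overwhelming ratio. The sketch below is plain Java, not Weka code; the class SaturationSketch, its method, and the values of delta and the attribute counts are hypothetical and chosen only for illustration.

```java
// Hypothetical illustration only; not part of Weka.
final class SaturationSketch {

    /**
     * Two equally likely clusters, each modelling every attribute as an
     * independent unit-variance normal. The cluster means differ by delta
     * in every attribute, and the instance sits exactly on the first
     * cluster's mean. Returns the membership probability of that cluster.
     */
    static double posteriorOfNearerCluster(int numAttributes, double delta) {
        // Each attribute contributes delta^2 / 2 to the log-likelihood ratio,
        // so the ratio grows linearly with the number of attributes.
        double logRatio = numAttributes * 0.5 * delta * delta;
        return 1.0 / (1.0 + Math.exp(-logRatio));
    }

    public static void main(String[] args) {
        double delta = 0.5; // small per-attribute separation (made-up value)
        for (int d : new int[] {1, 10, 100, 1000}) {
            System.out.println(d + " attributes -> "
                    + posteriorOfNearerCluster(d, delta));
        }
    }
}
```

    In this toy setting the membership probability is close to 0.5 with a single attribute and becomes indistinguishable from 1 once there are hundreds of attributes. Whether this is what happens with the StringToWordVector output above would need to be checked, for example by reducing the number of words kept and comparing the resulting probabilities.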

  6. #6
    Join Date
    Aug 2006
    Posts
    1,741

    Try removing the class attribute from ReutersCorn and see what the clusters look like.

    Cheers,
    Mark.

  7. #7
    Join Date
    Jul 2016
    Posts
    4

    Hi Mark,

    I have done that, and the probabilities are still 1 or approximately 1 for one cluster and ~0 for the others.

    Thanks,
    Slava
