Thread: Attribute selection in clustering?

  1. #1
    Join Date
    Apr 2012
    Posts
    20

    Attribute selection in clustering?

    Hi,

    I want to use an attribute selection technique such as information gain (IG) or chi-square for document clustering. How do I apply feature selection in clustering, the way AttributeSelectedClassifier does it for classification?

  2. #2
    Join Date
    Aug 2006
    Posts
    1,070

    Attribute selection for clustering is not as straightforward as attribute selection for classification because the clusterer "generates" the label for each instance, whereas in classification the label is "ground truth" and fixed. There aren't any good general-purpose methods that I'm aware of. A number of density-based methods have been proposed, but this sort of approach is problematic. One major issue is that the likelihood of the fit of the clusterer is not directly comparable between attribute sets of different sizes, i.e., you find that the likelihood improves monotonically as more attributes are added (even under a cross-validation evaluation). Various corrections have to be applied for this, none of which are particularly satisfactory. However, I haven't looked into this area for quite a while now, so there could very well be some new approaches that I'm unaware of.
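
    To make the comparability issue concrete, here is a rough sketch in plain Python (not Weka code). It assumes a deliberately simplified density model, a single diagonal Gaussian fitted by maximum likelihood, and made-up attributes that are pure small-scale noise with no cluster structure at all. The function name and the data are hypothetical.

```python
import math
import random


def avg_log_likelihood(rows):
    """Average per-instance log-likelihood of a diagonal Gaussian
    fitted by maximum likelihood to the given rows."""
    n = len(rows)
    total = 0.0
    for column in zip(*rows):
        mean = sum(column) / n
        var = sum((x - mean) ** 2 for x in column) / n
        # At the maximum-likelihood fit, the average log density of one
        # attribute reduces to this closed form.
        total += -0.5 * math.log(2 * math.pi * var) - 0.5
    return total


random.seed(0)
# Assumption: 200 instances, 5 attributes, each pure noise with std 0.1.
rows = [[random.gauss(0.0, 0.1) for _ in range(5)] for _ in range(200)]

# Score the first k attributes for k = 1..5.
scores = [avg_log_likelihood([r[:k] for r in rows]) for k in range(1, 6)]
for k, score in enumerate(scores, start=1):
    print(k, round(score, 3))
```

    In this toy setup the score goes up with every attribute added, even though none of them is informative. With attributes on a larger scale the score would instead drop with every attribute, so either way the raw number mostly reflects how many attributes there are and how they are scaled.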

    Cheers,
    Mark.

  3. #3
    Join Date
    Apr 2012
    Posts
    20

    Hi,

    Why can't we apply the same procedure that we use for classification, i.e. first rank the attributes on the dataset that you want to cluster? Assuming you have selected the top 1000 features using IG or chi-square (for example), the instances to be clustered would then be represented by those 1000 selected features (the same way feature selection is used for classification).

    I'm just wondering why you can't apply the same procedure; maybe I'm wrong.

    Thanks

  4. #4
    Join Date
    Aug 2006
    Posts
    1,070

    This would work if you already had cluster labels assigned to the instances, in which case it would basically be the same as attribute selection for classification. The problem is that clustering is unsupervised; there are no labels. The labels are what you want to "learn" with the clusterer. What metric do you propose to use to evaluate an attribute so that a ranking can be produced? The goodness of the cluster assignments, or the likelihood of the data given the model, has to be used somehow. As you change the number of attributes available to the clusterer, both the clustering and its goodness will change. In classification, the class labels don't change; only the predictions assigned by the model do. So we can always evaluate the "goodness" of the classifier with respect to ground truth.
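
    As a small illustration of why a label is needed, here is a sketch of information gain in plain Python (not Weka's implementation). The two binary "term present" attributes and the two cluster assignments are made up for the example, and the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def info_gain(feature, labels):
    """Information gain of a discrete feature with respect to the labels."""
    n = len(labels)
    remainder = 0.0
    for value in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder


# Hypothetical data: presence (1) or absence (0) of two terms in six documents.
term_a = [1, 1, 1, 0, 0, 0]
term_b = [1, 0, 1, 0, 1, 0]

# Two different cluster assignments for the same six documents.
clustering_1 = [0, 0, 0, 1, 1, 1]
clustering_2 = [0, 1, 0, 1, 0, 1]

for name, labels in (("clustering_1", clustering_1), ("clustering_2", clustering_2)):
    print(name,
          "term_a:", round(info_gain(term_a, labels), 3),
          "term_b:", round(info_gain(term_b, labels), 3))
```

    The score cannot be computed at all without the second argument, and in this toy case the ranking of the two terms flips depending on which clustering supplies the labels.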

    Cheers,
    Mark.

  5. #5
    Join Date
    Apr 2012
    Posts
    20

    Yes, I got your point.
    Thank you
