Hitachi Vantara Pentaho Community Forums

Thread: Attribute Selection

  1. #1
    Join Date
    Jun 2016
    Posts
    2

    Default Attribute Selection

    Hello,

    I have several different datasets, each composed of around 1 million numerical attributes, a numerical class, and about 200 instances.
    My objective is to filter out irrelevant or redundant attributes from each set, so as to obtain a smaller number of attributes correlated with the class, from which I can afterwards build causal graphs/trees.
    For the first step (the filtering out) I wanted to use the correlation-based feature selection (CFS) algorithm. And as I guess the memory will not be enough, I'm using the Linux command line.

    I'm running the following command at the moment:
    java -Xmx1024M -cp /software/weka-3.6.12/weka.jar weka.attributeSelection.CfsSubsetEval -s "weka.attributeSelection.BestFirst -D 1 -N 5 -Z " -i <the arff input file>

    However, I have three questions:

    1. I would like to obtain a correlation coefficient with the class for the attributes that remain after the filtering. That's why I included the -Z option in the command line. However, I'm not sure the output I obtain is the one I want: I get some Group:... and Merit:... lines, and I don't really understand what either of them means. What do they mean, and how could I calculate the correlation coefficient I need from that output?

    2. I'm starting to be a bit confused, because I just read that there is also an attribute selection filter. So I tried both commands, but I'm not sure what the difference between them is, and which one I should use for my problem:

    the same as at the beginning of this message:
    java -Xmx1024M -cp /software/weka-3.6.12/weka.jar weka.attributeSelection.CfsSubsetEval -s "weka.attributeSelection.BestFirst -D 1 -N 5 -Z " -i <the arff input file>

    or this one?
    java -Xmx1024M -cp /software/weka-3.6.12/weka.jar weka.filters.supervised.attribute.AttributeSelection -S "weka.attributeSelection.BestFirst -D 1 -N 5 -Z " -E "weka.attributeSelection.CfsSubsetEval" -i <the arff input file> -o <the arff output file>

    3. With small ARFF input files, like the ones given as examples with Weka, I manage to obtain some results. However, my input files are around 2 GB in size, so I cannot keep the job from crashing: I get "Exception in thread "main" java.lang.OutOfMemoryError: Java heap space". I already tried to increase the memory with -Xms1024m and -Xmx1024m, but it still crashes, even with a smaller input file of 700 MB. I checked my machine, and the initial heap size is 756 MB and the maximum heap size is 12100 MB. Even adapting to these values, the job crashes...
    How could I run the attribute selection with such big files?

    Thank you so much in advance for your answer,
    Ana Castro

  2. #2
    Join Date
    Aug 2006
    Posts
    1,741

    Default

    First of all, you need to get the data loaded. Is it sparse - i.e. are there lots of zeros in the data? If so, you can save memory by using Weka's sparse ARFF format for your files:

    http://weka.wikispaces.com/ARFF+%28d...sion%29#Sparse ARFF files
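
    For example, a dense row such as "0, 4.2, 0, 1.0" becomes "{1 4.2, 3 1.0}" in sparse format. A minimal sketch of a sparse ARFF file (made-up attribute names; note that indices are 0-based and that omitted values are treated as 0, not as missing):

    @relation example
    @attribute a1 numeric
    @attribute a2 numeric
    @attribute a3 numeric
    @attribute class numeric
    @data
    {1 4.2, 3 1.0}
    {0 0.5, 2 7.3, 3 2.0}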

    1 GB of heap space is probably not enough. Are you using a 64-bit Java VM? If so, try setting the heap size to something like 80% of your available RAM.
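
    For example, if you have around 12 GB of RAM, something along these lines (your original command, unchanged except for the heap setting; 10g is just an illustration, adjust it so some RAM is left for the OS):

    java -Xmx10g -cp /software/weka-3.6.12/weka.jar weka.attributeSelection.CfsSubsetEval -s "weka.attributeSelection.BestFirst -D 1 -N 5 -Z " -i <the arff input file>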

    Next, and probably more importantly, even if you manage to load your file successfully it is unlikely that you will get results from the CFS method. Computing the correlation matrix (even lazily) that is required by CFS has runtime that is quadratic in the number of input attributes, and searches such as forward selection and best first are also quadratic. Your only option with that many attributes is something with linear runtime: the attribute evaluators (those that measure the goodness of each attribute individually), such as CorrelationAttributeEval, InfoGainAttributeEval etc., have runtime that is linear in the number of attributes when combined with the Ranker search method.
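
    A rough command-line sketch of that approach, following the same pattern as your CFS command (note: I believe CorrelationAttributeEval, which handles a numeric class, only ships with the newer 3.7 developer releases, so you would need a newer weka.jar than 3.6.12; Ranker's -N option limits how many top-ranked attributes are reported, -1 meaning all):

    java -Xmx10g -cp <path to a recent weka.jar> weka.attributeSelection.CorrelationAttributeEval -s "weka.attributeSelection.Ranker -N -1" -i <the arff input file>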

    Cheers,
    Mark.

  3. #3
    Join Date
    Jun 2016
    Posts
    2

    Default

    Thank you so much for the answer.

    I managed to open the file by simply removing attributes that I thought were not so important.
    I then ran ReliefFAttributeEval with the Ranker in order to reduce the number of attributes to 100,000.
    Now I think the CFS method is feasible.

    But how can I get the correlation coefficients for each attribute?

    Thank you,
    Ana

  4. #4

    Default

    Ana,

    With ReliefF you might try removing some variables (repeatedly) with RandomSubset (a preprocessing filter) and see if the ReliefF variable importance weights change (see the command sketch below). I don't think ReliefF does a variable-by-variable weighting against the output (Mark, step in and correct me if necessary).

    If not, it would be nice to see a switch option in future versions of ReliefF to choose whether to use one variable at a time, to get a (perhaps) more accurate picture of variable importance.

    You can always throw in some random inputs as probes, plus a shuffled copy of the output (used as a junk input), to see whether your real variables get higher variable importance rankings than the junk or not.
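
    Something along these lines, to keep a random half of the attributes (a sketch only; I believe the RandomSubset filter is only in the newer 3.7 releases; -N is the fraction of attributes to keep when below 1, -S is the random seed, and you may need -c last so the class attribute is kept):

    java -cp <path to a recent weka.jar> weka.filters.unsupervised.attribute.RandomSubset -N 0.5 -S 1 -c last -i <the arff input file> -o <the arff output file>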
    Last edited by Mike1961; 06-30-2016 at 06:45 PM.

  5. #5
    Join Date
    Aug 2006
    Posts
    1,741

    Default

    ReliefF does assign a relevance score/weight to each attribute. The nice thing about the Relief algorithm is that the importance scores are computed taking into account interactions with other attributes, unlike the other attribute evaluators such as InfoGain etc. The downside of ReliefF is that it is a nearest-neighbour-based algorithm, so it doesn't scale well as the number of instances in the training data increases.
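
    For the record, running ReliefF with the Ranker from the command line looks something like this (a sketch only; -K sets the number of nearest neighbours, and the Ranker's -N 100000 just mirrors the cutoff mentioned earlier in the thread):

    java -Xmx10g -cp /software/weka-3.6.12/weka.jar weka.attributeSelection.ReliefFAttributeEval -K 10 -s "weka.attributeSelection.Ranker -N 100000" -i <the arff input file>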

    CFS is a subset evaluator, so there aren't merit scores for each attribute. If the GreedyStepwise search method is used with CFS then you can turn on the option to produce a ranked list. This will output a score for each attribute, but it can't be interpreted as a correlation between a given attribute and the class.
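
    A sketch of that combination (the -R flag asks GreedyStepwise to produce the ranked list; again, the reported scores are not correlations with the class):

    java -Xmx10g -cp /software/weka-3.6.12/weka.jar weka.attributeSelection.CfsSubsetEval -s "weka.attributeSelection.GreedyStepwise -R" -i <the arff input file>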

    Cheers,
    Mark.
