Hitachi Vantara Pentaho Community Forums

Thread: RandomForest does not abide by the -K parameter (the number of features to randomly select)

  1. #1
    Join Date
    Apr 2017
    Posts
    3

    As the title says. Tested with Weka 3.8.1. RandomSubSpace works flawlessly in this regard; RandomForest, however, does not.

    Am I missing something?

    UPDATE: I inspected its code and it looks fine.
    Last edited by cibic; 04-13-2017 at 12:46 PM. Reason: Inspected the code

  2. #2
    Join Date
    Aug 2006
    Posts
    1,741

    What makes you say it doesn't honor the -K parameter? Do you have a test case that shows the problem?

    Cheers,
    Mark.

  3. #3
    Join Date
    Apr 2017
    Posts
    3

    Hi Mark,


    It's an honour to receive a reply from you. I can reproduce the issue every time with my dataset (numeric).
    The output with all forests and trees is attached.


    This is the output of RandomForest with K=0. Out of 11 features in total it should select only 3 or 4:
    === Run information ===




    Scheme: weka.classifiers.trees.RandomForest -P 100 -print -I 9 -num-slots 8 -K 0 -M 1.0 -V 0.001 -S 1
    Relation: AllDays 8in train-weka.filters.unsupervised.attribute.Remove-R1,12,14-19
    Instances: 180
    Attributes: 11
    LogP1
    LogP2
    LogP3
    LogP4
    LogP5
    MolMass1
    MolVol1
    pKa
    D3
    D7
    CountsNP
    Test mode: evaluate on training data




    === Classifier model (full training set) ===




    RandomForest




    Bagging with 9 iterations and base learner




    weka.classifiers.trees.RandomTree -K 0 -M 1.0 -V 0.001 -S 1 -do-not-check-capabilities

    All the base classifiers:

    RandomTree
    ==========




    D3 < 0.5
    | MolMass1 < 405058.55
    | | MolVol1 < 103.34
    | | | LogP5 < -0.6
    | | | | LogP3 < -0.16 : 4.16 (9/0.39)
    | | | | LogP3 >= -0.16 : 3.07 (12/0.38)
    | | | LogP5 >= -0.6
    | | | | LogP3 < 0.17 : 8.51 (10/3.17)
    | | | | LogP3 >= 0.17
    | | | | | LogP3 < 1.8
    | | | | | | LogP2 < 0.98 : 6.33 (8/5.47)
    | | | | | | LogP2 >= 0.98 : 4.12 (9/1.79)
    | | | | | LogP3 >= 1.8 : 6.16 (11/0.88)
    | | MolVol1 >= 103.34
    | | | MolVol1 < 104.41 : 2.13 (9/0.36)
    | | | MolVol1 >= 104.41
    | | | | LogP1 < 1.48 : 2.56 (14/0.5)
    | | | | LogP1 >= 1.48 : 2.9 (5/0.92)
    | MolMass1 >= 405058.55 : 9.36 (19/25)
    D3 >= 0.5
    | LogP4 < 1.54
    | | LogP4 < 0.58
    | | | LogP5 < -0.6
    | | | | LogP1 < 1.48
    | | | | | LogP3 < -1.38 : 7.19 (9/7)
    | | | | | LogP3 >= -1.38
    | | | | | | MolVol1 < 99.53
    | | | | | | | LogP5 < -0.69 : 8.06 (9/2.16)
    | | | | | | | LogP5 >= -0.69 : 8.07 (6/0.44)
    | | | | | | MolVol1 >= 99.53 : 7.9 (11/4.99)
    | | | | LogP1 >= 1.48 : 9.88 (6/4.6)
    | | | LogP5 >= -0.6
    | | | | MolMass1 < 100.11 : 7.89 (10/7.01)
    | | | | MolMass1 >= 100.11 : 12.6 (10/18.42)
    | | LogP4 >= 0.58 : 5.47 (8/4.97)
    | LogP4 >= 1.54 : 15.99 (5/6.74)




    Size of the tree : 37
    ...
    === Summary ===




    Correlation coefficient 0.763
    Mean absolute error 1.8949
    Root mean squared error 2.5891
    Relative absolute error 58.9434 %
    Root relative squared error 64.6856 %
    Total Number of Instances 180
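
A note on the K=0 run above: the RandomTree option help describes values below 1 as falling back to int(log_2(#predictors)+1). The sketch below only reproduces that arithmetic. The function name `default_k` is made up for illustration, and treating one of the 11 listed attributes as the class (leaving 10 predictors) is an assumption.

```python
import math


def default_k(num_predictors):
    """Default number of candidate attributes when -K is below 1.

    Follows the formula quoted in the RandomTree option help:
    int(log_2(#predictors) + 1). The helper name is hypothetical.
    """
    return int(math.log2(num_predictors) + 1)


if __name__ == "__main__":
    # Assumption: 11 attributes in the header, one of them the class.
    print(default_k(10))
```

Under that assumption the default works out to 4, which matches the "3 or 4" expected in the post. It is a per-split candidate count, not a per-tree feature count.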



    This is the output of RF with K=2:
    === Run information ===




    Scheme: weka.classifiers.trees.RandomForest -P 100 -print -I 9 -num-slots 8 -K 2 -M 1.0 -V 0.001 -S 1
    Relation: AllDays 8in train-weka.filters.unsupervised.attribute.Remove-R1,12,14-19
    Instances: 180
    Attributes: 11
    LogP1
    LogP2
    LogP3
    LogP4
    LogP5
    MolMass1
    MolVol1
    pKa
    D3
    D7
    CountsNP
    Test mode: evaluate on training data




    === Classifier model (full training set) ===




    RandomForest




    Bagging with 9 iterations and base learner




    weka.classifiers.trees.RandomTree -K 2 -M 1.0 -V 0.001 -S 1 -do-not-check-capabilities

    All the base classifiers:

    RandomTree
    ==========




    D3 < 0.5
    | MolMass1 < 405058.55
    | | MolVol1 < 103.34
    | | | LogP3 < 0.33
    | | | | pKa < 7.04
    | | | | | MolMass1 < 74.57 : 4.16 (9/0.39)
    | | | | | MolMass1 >= 74.57 : 6.33 (8/5.47)
    | | | | pKa >= 7.04 : 8.51 (10/3.17)
    | | | LogP3 >= 0.33
    | | | | pKa < 10.8
    | | | | | LogP4 < 0.56 : 3.07 (12/0.38)
    | | | | | LogP4 >= 0.56 : 4.12 (9/1.79)
    | | | | pKa >= 10.8 : 6.16 (11/0.88)
    | | MolVol1 >= 103.34
    | | | MolMass1 < 87.66 : 2.56 (14/0.5)
    | | | MolMass1 >= 87.66
    | | | | LogP2 < -0.08 : 2.13 (9/0.36)
    | | | | LogP2 >= -0.08 : 2.9 (5/0.92)
    | MolMass1 >= 405058.55 : 9.36 (19/25)
    D3 >= 0.5
    | LogP1 < -1.07 : 12.6 (10/18.42)
    | LogP1 >= -1.07
    | | MolMass1 < 31.05 : 15.99 (5/6.74)
    | | MolMass1 >= 31.05
    | | | LogP4 < 0.58
    | | | | LogP2 < 0.96
    | | | | | MolVol1 < 99.68
    | | | | | | MolMass1 < 78.13
    | | | | | | | MolVol1 < 63.95 : 8.06 (9/2.16)
    | | | | | | | MolVol1 >= 63.95 : 8.07 (6/0.44)
    | | | | | | MolMass1 >= 78.13 : 7.89 (10/7.01)
    | | | | | MolVol1 >= 99.68
    | | | | | | LogP1 < 0.42 : 7.19 (9/7)
    | | | | | | LogP1 >= 0.42 : 7.9 (11/4.99)
    | | | | LogP2 >= 0.96 : 9.88 (6/4.6)
    | | | LogP4 >= 0.58 : 5.47 (8/4.97)




    Size of the tree : 37
    ...
    === Summary ===




    Correlation coefficient 0.763
    Mean absolute error 1.8949
    Root mean squared error 2.5891
    Relative absolute error 58.9434 %
    Root relative squared error 64.6856 %
    Total Number of Instances 180



    This is RF with K=8:
    === Run information ===




    Scheme: weka.classifiers.trees.RandomForest -P 100 -print -I 9 -num-slots 8 -K 8 -M 1.0 -V 0.001 -S 1
    Relation: AllDays 8in train-weka.filters.unsupervised.attribute.Remove-R1,12,14-19
    Instances: 180
    Attributes: 11
    LogP1
    LogP2
    LogP3
    LogP4
    LogP5
    MolMass1
    MolVol1
    pKa
    D3
    D7
    CountsNP
    Test mode: evaluate on training data




    === Classifier model (full training set) ===




    RandomForest




    Bagging with 9 iterations and base learner




    weka.classifiers.trees.RandomTree -K 8 -M 1.0 -V 0.001 -S 1 -do-not-check-capabilities

    All the base classifiers:

    RandomTree
    ==========




    D7 < 0.5
    | LogP3 < 1.8
    | | LogP1 < -1.07 : 12.6 (10/18.42)
    | | LogP1 >= -1.07
    | | | LogP5 < 0.23
    | | | | LogP4 < -0.11
    | | | | | LogP3 < -1.38 : 7.19 (9/7)
    | | | | | LogP3 >= -1.38
    | | | | | | LogP1 < 1.04
    | | | | | | | LogP4 < -0.42 : 8.06 (9/2.16)
    | | | | | | | LogP4 >= -0.42 : 8.07 (6/0.44)
    | | | | | | LogP1 >= 1.04
    | | | | | | | LogP4 < -0.29 : 7.89 (10/7.01)
    | | | | | | | LogP4 >= -0.29 : 7.9 (11/4.99)
    | | | | LogP4 >= -0.11 : 9.88 (6/4.6)
    | | | LogP5 >= 0.23 : 5.47 (8/4.97)
    | LogP3 >= 1.8 : 15.99 (5/6.74)
    D7 >= 0.5
    | LogP4 < -0.21
    | | pKa < 6.4
    | | | LogP2 < -0.53 : 4.16 (9/0.39)
    | | | LogP2 >= -0.53 : 6.33 (8/5.47)
    | | pKa >= 6.4
    | | | LogP3 < -567.6 : 9.36 (19/25)
    | | | LogP3 >= -567.6 : 8.51 (10/3.17)
    | LogP4 >= -0.21
    | | LogP3 < 1.8
    | | | LogP4 < 0.58
    | | | | LogP1 < 0.16 : 2.13 (9/0.36)
    | | | | LogP1 >= 0.16
    | | | | | MolVol1 < 106.67
    | | | | | | LogP2 < 0.96 : 3.07 (12/0.38)
    | | | | | | LogP2 >= 0.96 : 2.9 (5/0.92)
    | | | | | MolVol1 >= 106.67 : 2.56 (14/0.5)
    | | | LogP4 >= 0.58 : 4.12 (9/1.79)
    | | LogP3 >= 1.8 : 6.16 (11/0.88)




    Size of the tree : 37
    ...
    === Summary ===




    Correlation coefficient 0.763
    Mean absolute error 1.8949
    Root mean squared error 2.5891
    Relative absolute error 58.9434 %
    Root relative squared error 64.6856 %
    Total Number of Instances 180



    It seems that the tree sizes and the model's performance do not change, and almost all trees use almost all features regardless of K.
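
One way to check that observation on the printed trees is to count the distinct attribute names that appear in the split lines. This is a rough sketch, not part of Weka: the helper `attributes_used` is hypothetical, and it assumes the text layout shown in the outputs above (`|` depth markers, then an attribute name, then `<` or `>=`).

```python
import re

# Optional "| " depth markers, an attribute name, then the comparison operator.
SPLIT_LINE = re.compile(r"^[|\s]*([A-Za-z_]\w*)\s*(<|>=)")


def attributes_used(tree_text):
    """Return the set of attribute names that occur in split lines."""
    used = set()
    for line in tree_text.splitlines():
        match = SPLIT_LINE.match(line)
        if match:
            used.add(match.group(1))
    return used


if __name__ == "__main__":
    # First lines of the K=2 tree from the post, copied verbatim.
    sample = "\n".join([
        "D3 < 0.5",
        "| MolMass1 < 405058.55",
        "| | MolVol1 < 103.34",
        "| | | LogP3 < 0.33",
        "| | | | pKa < 7.04",
        "| | | | | MolMass1 < 74.57 : 4.16 (9/0.39)",
    ])
    print(sorted(attributes_used(sample)))
```

Running it over each full tree for K=0, K=2 and K=8 would give a per-tree attribute count to compare, rather than judging by eye.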
    This is the output of RandomSubSpace with P=2 (for 2 features) wrapping RF with K=8, for testing purposes:
    === Run information ===




    Scheme: weka.classifiers.meta.RandomSubSpace -P 2.0 -S 1 -num-slots 8 -I 10 -W weka.classifiers.trees.RandomForest -- -P 100 -print -I 9 -num-slots 8 -K 8 -M 1.0 -V 0.001 -S 1
    Relation: AllDays 8in train-weka.filters.unsupervised.attribute.Remove-R1,12,14-19
    Instances: 180
    Attributes: 11
    LogP1
    LogP2
    LogP3
    LogP4
    LogP5
    MolMass1
    MolVol1
    pKa
    D3
    D7
    CountsNP
    Test mode: evaluate on training data




    === Classifier model (full training set) ===




    All the base classifiers:




    FilteredClassifier using weka.classifiers.trees.RandomForest -P 100 -print -I 9 -num-slots 8 -K 8 -M 1.0 -V 0.001 -S 1553595926 on data filtered through weka.filters.unsupervised.attribute.Remove -V -R 10,5,11




    Filtered Header
    @relation 'AllDays 8in train-weka.filters.unsupervised.attribute.Remove-R1,12,14-19-weka.filters.unsupervised.attribute.Remove-V-R10,5,11'




    @attribute D7 numeric
    @attribute LogP5 numeric
    @attribute CountsNP numeric




    @data








    Classifier Model
    RandomForest




    Bagging with 9 iterations and base learner




    weka.classifiers.trees.RandomTree -K 8 -M 1.0 -V 0.001 -S 1553595926 -do-not-check-capabilities

    All the base classifiers:

    RandomTree
    ==========




    D7 < 0.5
    | LogP5 < 1.27
    | | LogP5 < 0.23
    | | | LogP5 < -0.6
    | | | | LogP5 < -0.69 : 7.5 (16/2.27)
    | | | | LogP5 >= -0.69 : 7.53 (29/6.52)
    | | | LogP5 >= -0.6
    | | | | LogP5 < -0.4 : 10.3 (6/13.66)
    | | | | LogP5 >= -0.4 : 8.58 (4/5.43)
    | | LogP5 >= 0.23 : 5.77 (10/5.19)
    | LogP5 >= 1.27 : 12.97 (7/17.48)
    D7 >= 0.5
    | LogP5 < -0.6
    | | LogP5 < -568 : 8.24 (11/17.59)
    | | LogP5 >= -568
    | | | LogP5 < -0.69 : 4.08 (14/0.64)
    | | | LogP5 >= -0.69 : 2.6 (38/0.72)
    | LogP5 >= -0.6
    | | LogP5 < 0.23
    | | | LogP5 < -0.4 : 6.54 (9/3.94)
    | | | LogP5 >= -0.4 : 9.71 (18/6.36)
    | | LogP5 >= 0.23
    | | | LogP5 < 1.27 : 3.25 (10/1.18)
    | | | LogP5 >= 1.27 : 6.14 (8/0.26)




    Size of the tree : 25
    ...
    === Summary ===




    Correlation coefficient 0.7146
    Mean absolute error 2.2298
    Root mean squared error 2.8731
    Relative absolute error 69.3614 %
    Root relative squared error 71.7802 %
    Total Number of Instances 180
    Last edited by cibic; 04-14-2017 at 06:22 AM. Reason: Managed to get the output without getting blocked.

  4. #4
    Join Date
    Aug 2006
    Posts
    1,741

    I think you misunderstand the meaning of the -K parameter in RandomForest. It controls how many attributes (out of the complete set) are investigated each time a node is split in the RandomTree. E.g., if K is set to two, then at each node it will randomly select two attributes as splitting candidates; the one with the best information gain becomes the split. With a small number of input attributes and a deep enough tree, it is quite likely that each attribute will be used somewhere in the tree.
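
To put a rough number on that last point, here is a simplified sketch rather than Weka's actual code. It assumes 10 predictors, 18 splits per tree (a binary tree of size 37 has 18 internal nodes), and that every predictor stays eligible at every node. Both function names are made up for illustration. It only models which attributes are offered as candidates; an offered attribute still has to win the split to appear in the tree.

```python
import random


def prob_offered_at_least_once(num_predictors, k, num_splits):
    """Chance that one given attribute is among the K candidates at some split."""
    return 1.0 - (1.0 - k / num_predictors) ** num_splits


def simulate_candidates(num_predictors, k, num_splits, seed=1):
    """Draw K distinct candidates per split; return every attribute ever offered."""
    rng = random.Random(seed)
    offered = set()
    for _ in range(num_splits):
        # A fresh random subset is drawn from the FULL predictor set at each node.
        offered.update(rng.sample(range(num_predictors), k))
    return offered


if __name__ == "__main__":
    # Assumed figures: 10 predictors, K=2, 18 splits.
    print(round(prob_offered_at_least_once(10, 2, 18), 3))  # 0.982
    print(len(simulate_candidates(10, 2, 18)), "of 10 predictors offered")
```

Under these assumptions, each attribute has roughly a 98% chance of being a candidate at least once in a single tree even with K=2, which is consistent with most attributes showing up across a forest. RandomSubSpace behaves differently because it removes attributes before the base learner ever sees them.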

    Cheers,
    Mark.

  5. #5
    Join Date
    Apr 2017
    Posts
    3

    This clears it up.

    Thank you Mark!
