Hitachi Vantara Pentaho Community Forums

Thread: How can I get the prediction probability values for every leaf

  1. #1
    Join Date
    Aug 2015
    Posts
    7

    Question: How can I get the prediction probability values for every leaf

    Hi all,
    In the Weka API, I found a function that computes the estimated errors for a leaf (weka.classifiers.trees.j48.C45PruneableClassifierTree.getEstimatedErrorsForDistribution(Distribution theDistribution)),
    but how can I compute the prediction probability for every leaf node?
    Do you know the relationship between estimated errors and prediction probability?
    Thanks for reading, and please help!
    Last edited by vn.ngphuc; 11-04-2015 at 03:46 AM.

  2. #2
    Join Date
    Aug 2006
    Posts
    1,741

    Default

    The frequency counts of the training class values reaching each leaf are stored in the leaf. You can compute the probability distribution (and thus the predicted probability for the most frequent class) from those.
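    As a concrete illustration (this is only a sketch, not part of the Weka API; the class and method names are made up), normalizing the class-frequency counts stored at a leaf into a probability distribution looks like this:

    public class LeafProbs {

      // Normalize class-frequency counts into probabilities.
      public static double[] probsFromCounts(double[] counts) {
        double total = 0;
        for (double c : counts) {
          total += c;
        }
        double[] probs = new double[counts.length];
        for (int i = 0; i < counts.length; i++) {
          // Fall back to a uniform distribution for an empty leaf
          probs[i] = (total > 0) ? counts[i] / total : 1.0 / counts.length;
        }
        return probs;
      }

      public static void main(String[] args) {
        // e.g. a leaf reached by 7 "yes" and 3 "no" training instances
        double[] probs = probsFromCounts(new double[] {7, 3});
        // Prints P(yes)=0.70, P(no)=0.30; 0.70 is the predicted
        // probability for the most frequent class at this leaf.
        System.out.printf("P(yes)=%.2f, P(no)=%.2f%n", probs[0], probs[1]);
      }
    }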

    Cheers,
    Mark.

  3. #3
    Join Date
    Aug 2015
    Posts
    7

    Default

    Quote Originally Posted by Mark View Post
    The frequency counts of the training class values reaching each leaf are stored in the leaf. You can compute the probability distribution (and thus the predicted probability for the most frequent class) from those.

    Cheers,
    Mark.
    Hi Mark,

    Can you tell me more about your idea? If the frequency counts of the training class values reaching each leaf are stored in the leaf, then where can I get those values (which function in the Weka API)?
    How can I compute the probability? Is it this function: public double[] distributionForInstance(Instance instance)? But that function works on an instance, not on a leaf!?

    I just read through the C4.5 documentation, and the parts below are interesting. I guess the "e" mentioned there is what we are looking for?

    [Attachment: 2015-11-12_10h17_26.jpg (screenshot of the relevant error-estimation passage from the C4.5 text)]
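    For reference, the quantity in question is C4.5's upper confidence limit on a leaf's error rate, as implemented in Weka's Stats.addErrs (quoted in full in post #5 below), which is presumably what the screenshot showed. With $f = (e + 0.5)/N$ and $z = \Phi^{-1}(1 - CF)$:

    $$U_{CF} = \frac{f + \frac{z^2}{2N} + z\sqrt{\frac{f}{N} - \frac{f^2}{N} + \frac{z^2}{4N^2}}}{1 + \frac{z^2}{N}}$$

    addErrs then returns $U_{CF} \cdot N - e$, the estimated extra errors beyond those observed.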

    Thanks Mark
    Last edited by vn.ngphuc; 11-11-2015 at 11:21 PM.

  4. #4
    Join Date
    Aug 2006
    Posts
    1,741

    Default

    You will need to dig into the J48 classes, and possibly modify the code as many instance variables have protected or private access. The class to look at is weka.classifiers.trees.j48.ClassifierTree. This class has a flag that indicates whether the node is a leaf or not. It also has a member variable, of type ClassifierSplitModel, called m_localModel - this can be queried to get the probability for each class value (see the private getProbs() method).
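    As a sketch of that (assuming the helper class is placed in the weka.classifiers.trees.j48 package so it can read those protected fields, and that you have already obtained the root ClassifierTree from a trained J48 by subclassing it or modifying the code, since J48's m_root is protected too):

    package weka.classifiers.trees.j48;

    public class LeafProbPrinter {

      // Recursively visit every node; at each leaf, print the class
      // distribution held by its local split model.
      public static void printLeafProbs(ClassifierTree node, String path) {
        if (node.m_isLeaf) {
          Distribution dist = node.m_localModel.distribution();
          StringBuilder sb = new StringBuilder(path + " -> leaf:");
          for (int c = 0; c < dist.numClasses(); c++) {
            // prob(c) is the relative frequency of class c among the
            // training instances that reached this leaf
            sb.append(String.format(" P(class %d)=%.3f", c, dist.prob(c)));
          }
          System.out.println(sb);
        } else {
          for (int i = 0; i < node.m_sons.length; i++) {
            printLeafProbs(node.m_sons[i], path + "/" + i);
          }
        }
      }
    }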

    Cheers,
    Mark.

  5. #5
    Join Date
    Aug 2015
    Posts
    7

    Default

    Quote Originally Posted by Mark View Post
    You will need to dig into the J48 classes, and possibly modify the code as many instance variables have protected or private access. The class to look at is weka.classifiers.trees.j48.ClassifierTree. This class has a flag that indicates whether the node is a leaf or not. It also has a member variable, of type ClassifierSplitModel, called m_localModel - this can be queried to get the probability for each class value (see the private getProbs() method).

    Cheers,
    Mark.

    Hi Mark,
    Currently, to compute the prediction probability for every leaf node I use Stats.addErrs from the package weka.classifiers.trees.j48. Please see the details below:

    Stats.addErrs(m_distribution.perBag(index), m_distribution.numIncorrect(index), CF);

    /**
     * Computes estimated extra error for given total number of instances
     * and error using normal approximation to binomial distribution
     * (and continuity correction).
     *
     * @param N number of instances
     * @param e observed error
     * @param CF confidence value
     */
    public static double addErrs(double N, double e, float CF) {

      // Ignore stupid values for CF
      if (CF > 0.5) {
        System.err.println("WARNING: confidence value for pruning "
          + "too high. Error estimate not modified.");
        return 0;
      }

      // Check for extreme cases at the low end because the
      // normal approximation won't work
      if (e < 1) {

        // Base case (i.e. e == 0) from Documenta Geigy Scientific
        // Tables, 6th edition, page 185
        double base = N * (1 - Math.pow(CF, 1 / N));
        if (e == 0) {
          return base;
        }

        // Use linear interpolation between 0 and 1 like C4.5 does
        return base + e * (addErrs(N, 1, CF) - base);
      }

      // Use linear interpolation at the high end (i.e. between N - 0.5
      // and N) because of the continuity correction
      if (e + 0.5 >= N) {

        // Make sure that we never return anything smaller than zero
        return Math.max(N - e, 0);
      }

      // Get z-score corresponding to CF
      double z = Statistics.normalInverse(1 - CF);

      // Compute upper limit of confidence interval
      double f = (e + 0.5) / N;
      double r = (f + (z * z) / (2 * N)
        + z * Math.sqrt((f / N) - (f * f / N) + (z * z / (4 * N * N))))
        / (1 + (z * z) / N);

      return (r * N) - e;
    }
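
    To connect this back to my original question: for a leaf reached by N training instances of which e are misclassified, the raw training-frequency probability of the predicted class is (N - e) / N, while the C4.5 pessimistic view treats (e + addErrs(N, e, CF)) / N as the error rate. A small sketch (the numbers are made-up examples):

    import weka.classifiers.trees.j48.Stats;

    public class LeafEstimate {
      public static void main(String[] args) {
        double N = 14;    // instances reaching the leaf (example value)
        double e = 2;     // of which misclassified (example value)
        float CF = 0.25f; // J48's default pruning confidence

        // Raw training-frequency probability of the predicted class
        // (what Mark describes above):
        double rawProb = (N - e) / N;

        // C4.5's pessimistic error rate: observed errors plus the
        // estimated extra errors from addErrs, as a fraction of N:
        double pessimisticErrorRate = (e + Stats.addErrs(N, e, CF)) / N;
        double pessimisticProb = 1.0 - pessimisticErrorRate;

        System.out.printf("raw P(pred) = %.3f%n", rawProb);
        System.out.printf("pessimistic P(pred) = %.3f%n", pessimisticProb);
      }
    }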
    What do you think about this solution?
