Hitachi Vantara Pentaho Community Forums

Thread: JRIP on unbalanced data result interpretation

  1. #1
    Join Date
    Mar 2014
    Posts
    12

    Hello all,

    I have an unbalanced dataset, so I down-sample the majority class and train JRip on the balanced set. The result is a set of rules, each with the number of instances it covers and the number it misclassifies. But the number of misclassified instances comes from the balanced dataset, so it is not realistic: the balanced training set contains far fewer majority-class instances than the original data. What should I do? Can I use the rules to describe the dataset and interpret their "accuracy" only from the balanced test set?

  2. #2
    Join Date
    Aug 2006
    Posts
    1,741

    Hi,

    I assume you're using the SpreadSubsample or Resample filter to balance the class distribution? If so, you can wrap this filter together with JRip in a FilteredClassifier. The filter is then applied only to the training data, so any evaluation you do, such as cross-validation or a percentage split, will have the original class distribution in the test set/folds.

    Cheers,
    Mark.
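
The idea behind that advice can be illustrated outside Weka. The following is a minimal plain-Python sketch, not Weka's FilteredClassifier API: it balances only the training split and leaves the test split at its original class distribution. The function names, the 900/100 toy data, and the 30% test fraction are all assumptions made for the illustration.

```python
import random
from collections import Counter


def downsample_majority(rows, label_of, rng):
    """Randomly drop rows so every class has as many rows as the rarest class."""
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    target = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, target))
    rng.shuffle(balanced)
    return balanced


def split_then_balance(rows, label_of, test_fraction=0.3, seed=0):
    """Split first, then balance the training part only."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    test = shuffled[:n_test]  # untouched: keeps the original class distribution
    train = downsample_majority(shuffled[n_test:], label_of, rng)
    return train, test


# Hypothetical toy data: 900 majority ("neg") and 100 minority ("pos") rows.
data = [(i, "neg") for i in range(900)] + [(i, "pos") for i in range(900, 1000)]
train, test = split_then_balance(data, label_of=lambda row: row[1])

train_counts = Counter(label for _, label in train)
test_counts = Counter(label for _, label in test)
print("train:", dict(train_counts))  # equal counts per class
print("test: ", dict(test_counts))   # still dominated by "neg"
```

Because the split happens before the down-sampling, any accuracy measured on `test` reflects the original imbalance, which is the property the wrapped filter is meant to give you.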

  3. #3
    Join Date
    Mar 2014
    Posts
    12

    Hi Mark,
    thank you very much for your response. Yes, I am using the Resample filter, and I use it exactly that way: I test on a test set with the original distribution. Maybe I put it wrong. I am interested in the two numbers in brackets that follow each rule (the number of instances covered by the rule and the number misclassified by the rule). These numbers come from the balanced training set, am I right? If so, my question is whether they are of any value when they come from a balanced distribution.

    Thank you very much.

    Matus

  4. #4
    Join Date
    Aug 2006
    Posts
    1,741

    Oh, I see. I guess the leaf statistics are of less use when they are computed from the balanced training data. However, they still give you a feeling for how well the rules perform under the "ideal conditions".

    Cheers,
    Mark.
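
To get a rough feel for how far the bracketed counts drift from the original distribution, they can be rescaled by hand. The sketch below is a back-of-the-envelope estimate under loudly stated assumptions, not something JRip reports: it assumes the rule predicts the minority class (so every misclassified instance belongs to the majority class) and that only the majority class was down-sampled, keeping a fraction `keep_fraction` of it uniformly at random. The function name and the example counts (50 covered, 5 misclassified, 10% kept) are hypothetical.

```python
def rescale_rule_counts(covered, misclassified, keep_fraction):
    """Estimate a minority-class rule's counts on the original training data.

    Assumes the rule's errors are all majority-class instances and that only
    the majority class was down-sampled, keeping `keep_fraction` of it.
    """
    if not 0 < keep_fraction <= 1:
        raise ValueError("keep_fraction must be in (0, 1]")
    correct = covered - misclassified          # minority instances: unchanged
    est_errors = misclassified / keep_fraction  # majority instances: scaled back up
    est_covered = correct + est_errors
    precision = correct / est_covered if est_covered else float("nan")
    return est_covered, est_errors, precision


# Hypothetical rule printed as (50.0/5.0) on a set that kept 10% of the majority class.
balanced_precision = (50 - 5) / 50
est_covered, est_errors, est_precision = rescale_rule_counts(50, 5, keep_fraction=0.1)

print(f"precision on balanced data:      {balanced_precision:.2f}")
print(f"estimated precision on original: {est_precision:.2f}")
```

In this made-up case a rule that looks 90% precise on the balanced data would be expected to cover about 95 instances and be right only about 47% of the time on the original data. Re-counting the rule's coverage directly on the unbalanced data is more reliable than this estimate whenever that data is available.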

  5. #5
    Join Date
    Mar 2014
    Posts
    12

    Hi Mark,
    thank you. I just have one more question, if you don't mind. When I check "output predictions" in "more options", I get probabilities for each instance. I would like to know how these probabilities are calculated or, more simply, whether they are calculated from the balanced training set. If so, are they of any use? And is this the same for trees or any other classifier trained on balanced data?
    Thank you.

    Matus
    Last edited by makak; 03-07-2014 at 02:36 AM.
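
The thread does not answer this last question. If the probabilities are estimated from the balanced training data, one common way to make them usable again is to re-weight them by the ratio of original to training class priors. The sketch below shows that standard prior-shift correction for the two-class case; it assumes the sampling changed only the class proportions and that the classifier's probabilities are reasonably calibrated on the balanced data, neither of which is confirmed here. The function name and the example priors (0.5 in training, 0.1 originally) are hypothetical.

```python
def correct_for_original_prior(p_balanced, train_prior, original_prior):
    """Re-weight a positive-class probability from the training prior to the original prior.

    p_balanced:     P(pos | x) reported by a model trained on the re-sampled data
    train_prior:    share of the positive class in the re-sampled training data
    original_prior: share of the positive class in the original data
    """
    if not 0 <= p_balanced <= 1:
        raise ValueError("p_balanced must be in [0, 1]")
    for name, prior in (("train_prior", train_prior), ("original_prior", original_prior)):
        if not 0 < prior < 1:
            raise ValueError(f"{name} must be strictly between 0 and 1")
    pos = p_balanced * original_prior / train_prior
    neg = (1 - p_balanced) * (1 - original_prior) / (1 - train_prior)
    return pos / (pos + neg)


# Hypothetical case: balanced training (prior 0.5), 10% positives in the original data.
for p in (0.5, 0.9):
    corrected = correct_for_original_prior(p, train_prior=0.5, original_prior=0.1)
    print(f"balanced p = {p:.2f} -> corrected p = {corrected:.2f}")
```

Under these assumptions a reported probability of 0.9 drops to 0.5 and a reported 0.5 drops to 0.1, so uncorrected probabilities from a balanced model are best read as a ranking score rather than as class frequencies in the original data.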
