Hitachi Vantara Pentaho Community Forums

Thread: 10-fold-cross-validation in WEKA GUI => Bias because of sequence folding/vectorizing?

  1. #1
    Join Date
    Dec 2016


    Hi all!
    I have a question where I am unsure if the implementation of WEKA is misleading or I am mistaken.

    Let’s take a simple example: I have 10k documents, 2k labeled positive for the target value and 8k labeled negative. I preprocess the text with a separate tool and generate an input file for WEKA (.arff). If I load this file in the WEKA GUI and run 10-fold cross-validation with, e.g., BayesNet, I get quite good precision and recall (P: 71, R: 47). But I suspect this is because the process in the GUI is

    Preprocessing (previously done by me) ==> Vectorizing (previously done by me) ==> Folding (done by WEKA [GUI]) ==> Training & Testing (x10) ==> Results [averaged]

    when it should be in theory

    Preprocessing ==> Folding ==> [Vectorizing ==> Training & Testing] (x10) ==> Results [averaged]

    Otherwise my training set might contain information it should not have seen, because features derived from the test set are already part of the “vector” (and vice versa).
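
To make the second ordering concrete, here is a minimal plain-Java sketch (no WEKA involved) in which the vocabulary is rebuilt from the training documents of each fold, so words that occur only in the test fold never become features. The class name `FoldVocabulary`, the whitespace tokenization, the round-robin fold assignment and the toy documents are all assumptions made for illustration; they are not taken from the actual preprocessing described above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

class FoldVocabulary {

    // Build the vocabulary (feature space) from the given documents only.
    static List<String> buildVocabulary(List<String> docs) {
        TreeSet<String> vocab = new TreeSet<>();
        for (String doc : docs) {
            for (String token : doc.toLowerCase().split("\\s+")) {
                if (!token.isEmpty()) {
                    vocab.add(token);
                }
            }
        }
        return new ArrayList<>(vocab);
    }

    // Term-frequency vector over a fixed vocabulary; words outside it are dropped.
    static int[] vectorize(String doc, List<String> vocab) {
        int[] counts = new int[vocab.size()];
        for (String token : doc.toLowerCase().split("\\s+")) {
            int index = vocab.indexOf(token);
            if (index >= 0) {
                counts[index]++;
            }
        }
        return counts;
    }

    // Round-robin fold assignment (unstratified, for illustration only).
    static List<Integer> testIndices(int numDocs, int numFolds, int fold) {
        List<Integer> indices = new ArrayList<>();
        for (int i = fold; i < numDocs; i += numFolds) {
            indices.add(i);
        }
        return indices;
    }

    public static void main(String[] args) {
        // Toy documents, invented for this sketch.
        List<String> docs = Arrays.asList("good movie", "bad movie", "great plot", "awful plot");
        int numFolds = 2;
        for (int fold = 0; fold < numFolds; fold++) {
            List<Integer> test = testIndices(docs.size(), numFolds, fold);
            List<String> train = new ArrayList<>();
            for (int i = 0; i < docs.size(); i++) {
                if (!test.contains(i)) {
                    train.add(docs.get(i));
                }
            }
            // Folding happens first; the vocabulary is fitted on the training fold only.
            List<String> vocab = buildVocabulary(train);
            for (int i : test) {
                int[] vector = vectorize(docs.get(i), vocab);
                System.out.println("fold " + fold + ", test doc " + i + ": " + Arrays.toString(vector));
            }
            // A classifier would be trained on the vectorized training docs and evaluated here.
        }
    }
}
```

In the first ordering, `buildVocabulary` would be called once on all documents before folding, which is exactly where test-set words end up in the feature space.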

    If I use my own implementation with its own cross-validation and only call the WEKA classifiers via the API, my results are far worse (P: 0.60, R: 0.22). I think this is due to the order of the Preprocessing, Vectorizing and Folding steps. Is my understanding of the process correct, and is the WEKA GUI misleading in suggesting that it performs unbiased 10-fold cross-validation? Or is WEKA even smart enough to recognize this and “remove” the affected features during training?

    Thanks, I really appreciate your comments!
    Last edited by nkukit; 12-13-2016 at 09:04 AM.

  2. #2
    Join Date
    Aug 2006


    You are correct. In most cases (where a learning algorithm will be applied) all preprocessing should be fitted on the training data only, and the result then applied to the test data. Otherwise information from the test data is available at training time, which leads to overly optimistic results. The way to avoid this in Weka is to use the FilteredClassifier, configured with one or more filters for preprocessing and a base learner to apply to the filtered data. That way the preprocessing only ever "learns" from the training folds. You would need to implement your preprocessing & vectorization as one or more Weka filters. If you happen to be using Python, you might be able to use the WekaPyscript package to adapt your Python code into a Weka filter.
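
As a rough sketch of that setup through the Java API: the example below wraps StringToWordVector and BayesNet in a FilteredClassifier and passes it to Evaluation.crossValidateModel, so the filter is rebuilt on the training portion of every fold. It assumes an ARFF file that still contains the raw text as a string attribute and has the nominal class as its last attribute; the class name `FilteredCrossValidation`, the file path taken from `args[0]`, the word-count option and the random seed are illustrative choices, not anything stated in this thread.

```java
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.bayes.BayesNet;
import weka.classifiers.meta.FilteredClassifier;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.unsupervised.attribute.StringToWordVector;

class FilteredCrossValidation {

    // Vectorizer and base learner bundled together, so the filter is
    // fitted only on the data handed to buildClassifier (the training fold).
    static FilteredClassifier buildClassifier() {
        StringToWordVector vectorizer = new StringToWordVector();
        vectorizer.setOutputWordCounts(true); // term counts instead of 0/1 presence

        FilteredClassifier classifier = new FilteredClassifier();
        classifier.setFilter(vectorizer);
        classifier.setClassifier(new BayesNet());
        return classifier;
    }

    public static void main(String[] args) throws Exception {
        // Assumption: args[0] is an ARFF file with unvectorized text and the class last.
        Instances data = DataSource.read(args[0]);
        data.setClassIndex(data.numAttributes() - 1);

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(buildClassifier(), data, 10, new Random(1));
        System.out.println(eval.toClassDetailsString());
    }
}
```

The same combination can be set up in the Explorer by choosing FilteredClassifier under the meta classifiers and selecting the filter and base learner in its options.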

    Is your vectorization converting text data into term-frequencies? If so, have you looked at what is available in the StringToWordVector filter?


  3. #3
    Join Date
    Dec 2016


    Hi Mark,

    thank you for your quick response and help. We are currently mostly using Weka in our Java code, so we were able to solve this issue by implementing the folding ourselves. Unfortunately our preprocessing is a bit more complex, as it involves synsets from a German WordNet, so we abandoned StringToWordVector a while ago.

    The problem mainly arises when we try to validate results from our Java code in the Weka Explorer. I think the only way to solve it in that case is to concatenate the features of each text (words & synsets) into a string and then use StringToWordVector with a FilteredClassifier. (Or see what WekaPyscript has to offer.)
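
A minimal sketch of that concatenation idea, in plain Java: words and synset identifiers are joined into one whitespace-separated string that could be written to a string attribute in the ARFF file. The class name `FeatureStringBuilder`, the `SYN_` prefix and the sample identifiers are hypothetical; the prefix is only there so that a synset id cannot collide with a surface word, and it should avoid any character the chosen tokenizer treats as a delimiter.

```java
import java.util.Arrays;
import java.util.List;
import java.util.StringJoiner;

class FeatureStringBuilder {

    // Join lower-cased words and prefixed synset ids into a single feature string.
    static String toFeatureString(List<String> words, List<String> synsetIds) {
        StringJoiner joined = new StringJoiner(" ");
        for (String word : words) {
            joined.add(word.toLowerCase());
        }
        for (String synsetId : synsetIds) {
            joined.add("SYN_" + synsetId); // hypothetical prefix, see lead-in
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        // Invented example values, not real WordNet ids.
        List<String> words = Arrays.asList("Haus", "Bank");
        List<String> synsets = Arrays.asList("s123", "s456");
        System.out.println(toFeatureString(words, synsets));
    }
}
```

Each resulting string would then be tokenized again inside the cross-validation loop, so the word and synset vocabulary is built from the training folds only.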

    It might be worth warning about this explicitly in the documentation, as it is a mistake easily made.



Copyright © 2005 - 2019 Hitachi Vantara Corporation. All Rights Reserved.