Hitachi Vantara Pentaho Community Forums

Thread: WEKA Naive Bayes implementation vs Scikit learn implementation

  1. #1

    WEKA Naive Bayes implementation vs Scikit learn implementation

    I have a dataset of about 3 million records with one high-cardinality categorical column (~5000 unique categories) and two numeric columns. Weka's NB seems to handle this fine, but scikit-learn's NB requires one-hot encoding, so it runs into memory errors unless we process the data in batches. Can anyone explain how Weka's NB is implemented and how it differs from scikit-learn's version? I am new to Java, but any kind of explanation is appreciated. Thanks.

  2. #2

    I think you have covered the main difference between the two implementations already. All of scikit-learn's methods work on numeric NumPy arrays, so categorical variables have to be encoded as numbers. One-hot encoding massively increases the number of input columns when high-arity nominal attributes are present, and most of the new values will be zeros. Perhaps there is a sparse representation in scikit-learn that might help?
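
    To put rough numbers on the sparse-versus-dense question, here is a back-of-the-envelope sketch for the dataset described in the first post. It assumes 8-byte float values and 4-byte indices in a compressed sparse row (CSR) layout; these sizes are my assumptions, not something stated in the thread. As far as I know, scikit-learn's OneHotEncoder returns a SciPy sparse matrix by default, so the sparse figure is the relevant one if the matrix is never densified.

```python
# Rough memory estimate for one-hot encoding a single categorical column.
N_ROWS = 3_000_000      # records, from the original question
N_CATEGORIES = 5_000    # unique categories, from the original question
FLOAT_BYTES = 8         # assumption: float64 values
INDEX_BYTES = 4         # assumption: 32-bit indices

# Dense layout: every cell of the rows x categories matrix is stored.
dense_bytes = N_ROWS * N_CATEGORIES * FLOAT_BYTES

# CSR layout: each row has exactly one non-zero, so we store one value and
# one column index per row, plus a row-pointer array of length rows + 1.
nonzeros = N_ROWS
sparse_bytes = nonzeros * (FLOAT_BYTES + INDEX_BYTES) + (N_ROWS + 1) * INDEX_BYTES

print(f"dense:  {dense_bytes / 1e9:.0f} GB")
print(f"sparse: {sparse_bytes / 1e6:.0f} MB")
```

    Under these assumptions the dense matrix needs about 120 GB, while the sparse one needs about 48 MB.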

    Without checking the docs, I'm not sure how scikit-learn's NB implementation treats the numeric input columns. Weka uses class-conditional Gaussian estimators for numeric attributes, unless the option for supervised discretisation is turned on.
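
    A minimal Python sketch of the idea, not Weka's actual code: the nominal attribute is handled with a per-class frequency table (so no one-hot encoding is needed), and each numeric attribute gets a class-conditional Gaussian. The function names, the Laplace smoothing, and the variance floor are my own assumptions for illustration; Weka's details may differ.

```python
import math
from collections import Counter, defaultdict

def fit(rows, labels):
    """rows: list of (category, x1, x2, ...); labels: list of class labels."""
    class_counts = Counter(labels)
    cat_counts = defaultdict(Counter)   # class -> category -> count
    numeric = defaultdict(list)         # (class, column index) -> values
    for (cat, *nums), y in zip(rows, labels):
        cat_counts[y][cat] += 1
        for j, v in enumerate(nums):
            numeric[(y, j)].append(v)
    gauss = {}
    for key, vals in numeric.items():
        mean = sum(vals) / len(vals)
        var = sum((v - mean) ** 2 for v in vals) / len(vals)
        gauss[key] = (mean, max(var, 1e-9))  # floor avoids division by zero
    n_categories = len({cat for cat, *_ in rows})
    return class_counts, cat_counts, gauss, n_categories

def log_gaussian(x, mean, var):
    """Log density of a normal distribution with the given mean and variance."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def predict(model, row):
    class_counts, cat_counts, gauss, n_categories = model
    total = sum(class_counts.values())
    cat, *nums = row
    scores = {}
    for y, n_y in class_counts.items():
        score = math.log(n_y / total)  # class prior
        # Laplace-smoothed relative frequency of the category within class y
        score += math.log((cat_counts[y][cat] + 1) / (n_y + n_categories))
        for j, v in enumerate(nums):
            score += log_gaussian(v, *gauss[(y, j)])
        scores[y] = score
    return max(scores, key=scores.get)

# Toy data: one nominal column followed by two numeric columns.
rows = [("a", 1.0, 10.0), ("a", 1.2, 11.0), ("b", 5.0, 20.0), ("b", 5.5, 21.0)]
labels = ["neg", "neg", "pos", "pos"]
model = fit(rows, labels)
prediction = predict(model, ("b", 5.2, 20.5))
print(prediction)
```

    The memory used by the frequency table grows with the number of classes times the number of distinct categories, not with the number of rows.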

    Oh, one other difference is that Weka's implementation can be trained incrementally (i.e. only one instance need be held in main memory at any point in time during training). scikit-learn's schemes are mostly batch trained, although some estimators, including its naive Bayes classes, expose a partial_fit method for training on one chunk of data at a time. If you use Weka's command line or the Knowledge Flow you can train "Updateable" Weka classifiers incrementally.
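
    To illustrate what incremental training means for the Gaussian estimators, here is a small sketch that updates a per-class mean and variance one instance at a time using Welford's algorithm. This is an illustration of the general technique, not Weka's code; the class name RunningGaussian and the stream() generator are hypothetical.

```python
from collections import defaultdict

class RunningGaussian:
    """Running mean and variance, updated one value at a time (Welford)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Population variance; 0.0 until at least one value has been seen.
        return self.m2 / self.n if self.n else 0.0

def stream():
    """Hypothetical stand-in for reading (label, value) pairs from disk."""
    yield "neg", 1.0
    yield "pos", 5.0
    yield "neg", 1.2
    yield "pos", 5.5
    yield "pos", 6.0

stats = defaultdict(RunningGaussian)  # class label -> running estimator
for label, value in stream():
    # Only the current instance is held in memory; the estimators keep
    # three numbers per class regardless of how many rows are seen.
    stats[label].update(value)

for label, est in sorted(stats.items()):
    print(label, est.n, est.mean, est.variance)
```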

    Last edited by Mark; 11-25-2016 at 10:17 PM.
