Hi,
I have a dataset of about 3 million records with one high-cardinality categorical column (~5000 unique categories) and 2 numeric columns. Weka's NB seems to handle this fine, but scikit-learn's NB requires one-hot encoding, so it runs into memory errors unless we do batch processing. Can anyone explain how Weka's NB is implemented and how it differs from scikit-learn's version? I am new to Java, so any kind of explanation is appreciated. Thanks.
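
To make the question concrete, here is a rough sketch of what I imagine an NB that avoids one-hot encoding might do: keep a per-class count table for the categorical column and per-class mean/variance for the numeric columns. This is only my guess at the general idea, not Weka's actual code; the class name `CountingNB`, the `alpha` smoothing constant and the `(category, numerics, label)` row format are all made up for illustration.

```python
import math
from collections import Counter, defaultdict


class CountingNB:
    """Toy naive Bayes: categorical column as count tables, numeric columns as Gaussians."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                      # Laplace smoothing constant (assumed)
        self.class_counts = Counter()           # label -> number of rows
        self.cat_counts = defaultdict(Counter)  # label -> category -> count
        self.categories = set()                 # all categories seen so far
        self.num_sums = {}                      # label -> running sum per numeric column
        self.num_sq_sums = {}                   # label -> running sum of squares per column

    def partial_fit(self, rows):
        """rows: iterable of (category, [numeric values], label); can be called per batch."""
        for cat, nums, label in rows:
            self.class_counts[label] += 1
            self.cat_counts[label][cat] += 1
            self.categories.add(cat)
            sums = self.num_sums.setdefault(label, [0.0] * len(nums))
            sqs = self.num_sq_sums.setdefault(label, [0.0] * len(nums))
            for i, x in enumerate(nums):
                sums[i] += x
                sqs[i] += x * x

    @staticmethod
    def _log_gauss(x, mean, var):
        # Log density of a normal distribution with the given mean and variance.
        return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

    def predict(self, cat, nums):
        total = sum(self.class_counts.values())
        k = len(self.categories)
        best, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            # Log prior plus smoothed log likelihood of the category.
            score = math.log(n / total)
            score += math.log((self.cat_counts[label][cat] + self.alpha) / (n + self.alpha * k))
            for i, x in enumerate(nums):
                mean = self.num_sums[label][i] / n
                # Small floor keeps the variance positive for constant columns.
                var = max(self.num_sq_sums[label][i] / n - mean * mean, 1e-9)
                score += self._log_gauss(x, mean, var)
            if score > best_score:
                best, best_score = label, score
        return best
```

With this layout the model only stores (number of classes) x (number of categories) counts, so ~5000 categories should stay small no matter how many rows are streamed through `partial_fit`. Is this close to what Weka does internally?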