Large Datasets



bt.johnson
02-01-2007, 04:37 PM
Does all of the data have to be loaded into memory?

I am planning to analyze very large datasets (1M+ records), and even my test CSV file hits the 1 GB memory limit on my computer.

Is there a way to use Weka to analyze large datasets?

Thanks.

Mark
02-12-2007, 11:50 PM
Yes, it is possible to analyze large datasets with Weka.

First you will need to convert your CSV file to ARFF manually, since the CSV-to-ARFF loader needs to read all of the data into memory in order to correctly determine the attribute types. This entails writing the ARFF header yourself and prepending it to your CSV file, as sketched below.
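For example, if your CSV had columns named age, income, and class, the header you prepend might look like the following (the attribute names and types here are made up for illustration, so substitute your own):

@relation mydata
@attribute age numeric
@attribute income numeric
@attribute class {yes,no}
@data

Your CSV rows then follow directly after the @data line. Note that a nominal attribute requires you to list every distinct value it can take, which is exactly the information the loader would otherwise have to scan the whole file to discover.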

Once you have an ARFF file, you can apply one of the learners that can train incrementally (i.e., process one instance at a time instead of loading the whole dataset). NaiveBayesUpdateable is one such algorithm; RacedIncrementalLogitBoost is another. To do this, you will have to either run the classifier from the command line or use the KnowledgeFlow user interface (the Explorer is not set up to process data incrementally).
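If you are calling Weka from your own Java code, the incremental route looks roughly like the sketch below. This is a minimal example, not a drop-in solution: the file name big.arff and the assumption that the class is the last attribute are placeholders, and in older Weka releases getNextInstance() takes no argument.

import java.io.File;
import weka.classifiers.bayes.NaiveBayesUpdateable;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ArffLoader;

public class IncrementalTrain {
  public static void main(String[] args) throws Exception {
    // Read only the ARFF header; no instances are loaded yet
    ArffLoader loader = new ArffLoader();
    loader.setFile(new File("big.arff")); // placeholder file name
    Instances structure = loader.getStructure();
    structure.setClassIndex(structure.numAttributes() - 1); // assumes class is last

    // Initialize the classifier from the header alone
    NaiveBayesUpdateable nb = new NaiveBayesUpdateable();
    nb.buildClassifier(structure);

    // Stream instances one at a time, so memory use stays flat
    Instance current;
    while ((current = loader.getNextInstance(structure)) != null) {
      nb.updateClassifier(current);
    }
  }
}

Because only the header is loaded up front, memory use is bounded by a single instance rather than the whole file. The KnowledgeFlow interface does essentially the same thing when you connect a loader to an updateable classifier via an instance connection.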

Cheers,
Mark.