weka grid status request
Hi weka folks,
For a rather large data mining assignment I want to make use of my two main servers:
A) amd64 FreeBSD server with 4 GB running weka-3.4.13
B) dual-core amd64 FreeBSD server with 4 GB running weka-3.4.13
As I have these two servers, I would really want to use them simultaneously.
I came across http://cssa.ucd.ie/xin/weka/gweka-howto.html, which looks like exactly the tool I need. However, it seems that this project is not actively maintained anymore and so I don't know if I can use it with the version of weka that is installed on my servers.
Does anyone know what the status of this project is, and whether Pentaho has plans to adopt it as well?
My requirements are:
-it must support the CLI, because I have a few scripts which process big datasets repeatedly
-it should be easy to configure
-training and cross validation should be parallelized
I hope someone here knows about this or can point me in the right direction.
Last edited by rgilaard; 10-13-2008 at 06:18 PM.
I can't really comment on the status of GridWeka, I'm afraid. The best you can do is download it and see if it's still compatible with recent versions of Weka.
There isn't really anything else that I'm aware of that meets all your requirements. Weka has a built-in facility for distributing experiments (via the Experimenter or the command line) to multiple machines. It uses Java's RMI mechanism. weka.experiment.RemoteEngine is, in fact, a general-purpose compute engine, so you could write implementations of weka.experiment.Task that achieve your specific goals.
As far as I know, parallelizing the training of machine learning algorithms is a non-trivial research problem (depending on the type of learning algorithm). Methods such as naive Bayes are quite easy to parallelize, and there is some work on learning the structure of a decision tree in a parallel fashion. Of course, certain ensemble learning techniques also lend themselves naturally to parallel implementation.
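To illustrate why naive Bayes is one of the easy cases: for nominal attributes, training boils down to counting, and counts from separate chunks of the data can simply be added together. The following is only a minimal sketch of that idea in plain Java using threads on one machine. It does not use Weka's classes or GridWeka, and the class and method names (ParallelNaiveBayesCounts, Counts, train) are made up for this example. It also assumes all attributes are nominal and share the same number of values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical example class, not part of Weka.
public class ParallelNaiveBayesCounts {

    /** Sufficient statistics for naive Bayes over nominal attributes. */
    public static final class Counts {
        public final int[] classCounts;      // classCounts[c]
        public final int[][][] valueCounts;  // valueCounts[c][attribute][value]

        Counts(int numClasses, int numAttrs, int numValues) {
            classCounts = new int[numClasses];
            valueCounts = new int[numClasses][numAttrs][numValues];
        }

        /** Adds one training instance to the counts. */
        void add(int[] row, int label) {
            classCounts[label]++;
            for (int a = 0; a < row.length; a++) {
                valueCounts[label][a][row[a]]++;
            }
        }

        /** Counts are additive, so partial results merge by summation. */
        void merge(Counts other) {
            for (int c = 0; c < classCounts.length; c++) {
                classCounts[c] += other.classCounts[c];
                for (int a = 0; a < valueCounts[c].length; a++) {
                    for (int v = 0; v < valueCounts[c][a].length; v++) {
                        valueCounts[c][a][v] += other.valueCounts[c][a][v];
                    }
                }
            }
        }
    }

    /** Splits the data into chunks, counts each chunk in its own task, then merges. */
    public static Counts train(int[][] rows, int[] labels, int numClasses,
                               int numValues, int numWorkers)
            throws InterruptedException, ExecutionException {
        int numAttrs = rows.length == 0 ? 0 : rows[0].length;
        ExecutorService pool = Executors.newFixedThreadPool(numWorkers);
        try {
            List<Future<Counts>> parts = new ArrayList<>();
            int chunk = (rows.length + numWorkers - 1) / numWorkers; // ceiling division
            for (int start = 0; start < rows.length; start += chunk) {
                final int from = start;
                final int to = Math.min(rows.length, start + chunk);
                parts.add(pool.submit(() -> {
                    // Each task writes only to its own local counts: no shared state.
                    Counts local = new Counts(numClasses, numAttrs, numValues);
                    for (int i = from; i < to; i++) {
                        local.add(rows[i], labels[i]);
                    }
                    return local;
                }));
            }
            Counts total = new Counts(numClasses, numAttrs, numValues);
            for (Future<Counts> part : parts) {
                total.merge(part.get());
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Toy data: two binary attributes, two classes.
        int[][] rows = {{0, 1}, {1, 1}, {0, 0}, {1, 0}, {0, 1}};
        int[] labels = {0, 0, 1, 1, 0};
        Counts counts = train(rows, labels, 2, 2, 2);
        System.out.println("class 0: " + counts.classCounts[0]
                + ", class 1: " + counts.classCounts[1]);
    }
}
```

The same merge step would work if the partial counts came from different machines instead of threads, for example as the result of a remote task. Cross-validation folds are independent of each other and could be farmed out in a similar way, whereas most other learners do not decompose this cleanly.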