PREDICTIVE J48 - how??



AJPRDG
09-21-2007, 10:18 AM
Hi there,
I am an ecology student who is using WEKA, on the recommendation of a colleague, to produce a decision tree from presence/absence data.

I have two data files:
one with ecological data on sites sampled for the species in question;
the other with the same variables on unsampled sites. Both are saved as .CSV files.

I want to use the former to provide TRAINING data for the latter's TEST data.

I can get a decision tree out of the Training data easily enough, but I was wondering whether there is an easy way to apply this tree to the new Test data so that it predicts presence or absence. My current thoughts would involve coding Excel logic functions, but that's not very clever...

I'm not a programmer so keeping the method simple for me would be much appreciated!

Thanks in advance,

Al P

Mark
09-23-2007, 08:28 PM
Hi,

Classifiers learned with Weka can be applied to new data to produce predictions. In the Weka Explorer's "Classify" panel, click on the "More options" button to pop up the more options window and then select "Output predictions". Now when you apply your classifier to a separate test set you will see a prediction for each instance in the test data.

Cheers,
Mark.

AJPRDG
09-24-2007, 07:55 AM
Thanks! I knew it'd be something obvious....

A

AJPRDG
09-24-2007, 08:55 AM
Actually, I said that it was obvious, and have immediately hit another problem.

So I train the J48 on the TRAINING data. I save the model as D:\model.

I then load that model.
I then specify the TEST set using the dialog at the top left.
Then I right-click on the results section and select "re-evaluate....".

However, the process stops, telling me "train and test set not compatible".

Both files are set up identically, though, with the 1st column, "occupied", being the one I want to predict in the test data. The other 32 columns are variables in the same order. The only differences are that the test data is larger (525 rows) and that the occupied column for the test data has been left blank.

Sorry to be clogging this up - but what is my fix here?

Alan

Mark
09-24-2007, 09:46 PM
The problem most likely stems from the fact that you have two separate CSV files, especially if there are nominal-valued attributes involved. When Weka reads a CSV file it begins by assuming that all attributes are numeric. If, during reading, it discovers a value that can't be parsed as a number, it converts the attribute in question to nominal (all previously seen values become nominal values that the attribute can take on, in the order seen).

So, with two separate CSV files, you probably end up with ARFF headers that are incompatible because the nominal attributes have different numbers or orders of values. You can confirm this by loading each CSV file into the Explorer, saving each as an ARFF file and then looking at the two headers in a text editor. For example, if you have an attribute with possible values A,B,C,D, in one ARFF file they might be declared C,D,A,B (because that was the order in which the values were encountered when the data was read) and in the second B,A,C,D. More seriously, one of the CSV files may contain a value that does not occur in the other.
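To make the order-of-appearance effect concrete, here is a small Python sketch that imitates the behaviour described above. It is not Weka's loader code; the function name `infer_column` and the treatment of blanks and `?` as missing values are assumptions made for illustration only.

```python
def infer_column(values):
    """Imitate the described rule: a column is numeric until a value
    that is not a number turns up; it then becomes nominal, with every
    value seen so far as a label, in order of first appearance."""
    labels = []
    numeric = True
    for raw in values:
        v = raw.strip()
        if v == "" or v == "?":
            # Assumption of this sketch: blanks and '?' count as missing.
            continue
        try:
            float(v)
        except ValueError:
            numeric = False
        if v not in labels:
            labels.append(v)
    return "numeric" if numeric else "{" + ",".join(labels) + "}"


# The same attribute, read from two files whose rows come in a different order.
train_column = ["C", "D", "A", "B", "C"]
test_column = ["B", "A", "C", "D"]

print(infer_column(train_column))  # {C,D,A,B}
print(infer_column(test_column))   # {B,A,C,D}
```

The two declarations list the same four labels but in a different order, which is enough for the headers to count as different.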

The fix for this is to join both CSV files before reading, in order to obtain one header that is compatible with both. This header can then replace the header obtained by reading each file separately. Unfortunately, this requires some cutting and pasting in a text editor.
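For anyone who would rather script the cutting and pasting, the following is a minimal Python sketch of the same idea: look at both CSV files together, build one set of attribute declarations, and write two ARFF files that share it. All function names, file names and the `force_nominal` parameter are hypothetical. The sketch assumes both CSV files have identical column names in the same order, that names and values contain no spaces, commas or quotes (ARFF would need quoting for those), and that blank cells are missing values, written as `?`.

```python
import csv


def read_csv(path):
    """Return (column names, data rows), skipping completely empty lines."""
    with open(path, newline="") as f:
        rows = [r for r in csv.reader(f) if r]
    return rows[0], rows[1:]


def is_number(text):
    try:
        float(text)
        return True
    except ValueError:
        return False


def shared_attributes(names, row_sets, force_nominal=()):
    """Build one ARFF declaration per column from all files together."""
    decls = []
    for i, name in enumerate(names):
        seen = []
        for rows in row_sets:
            for row in rows:
                v = row[i].strip()
                if v and v != "?" and v not in seen:
                    seen.append(v)
        if name not in force_nominal and all(is_number(v) for v in seen):
            decls.append("@attribute " + name + " numeric")
        else:
            # Sorted labels give the same order whichever file is read first.
            decls.append("@attribute " + name + " {" + ",".join(sorted(seen)) + "}")
    return decls


def write_arff(path, relation, decls, rows):
    with open(path, "w", newline="") as f:
        f.write("@relation " + relation + "\n\n")
        f.write("\n".join(decls) + "\n\n@data\n")
        for row in rows:
            # Blank cells (e.g. the unknown class in the test file) become '?'.
            f.write(",".join(v.strip() or "?" for v in row) + "\n")


def convert(train_csv, test_csv, train_arff, test_arff, force_nominal=()):
    names, train_rows = read_csv(train_csv)
    test_names, test_rows = read_csv(test_csv)
    if names != test_names:
        raise ValueError("column names differ between the two CSV files")
    decls = shared_attributes(names, [train_rows, test_rows], force_nominal)
    write_arff(train_arff, "sites", decls, train_rows)
    write_arff(test_arff, "sites", decls, test_rows)
```

A call might look like `convert("train.csv", "test.csv", "train.arff", "test.arff", force_nominal={"occupied"})`. Forcing the class column to nominal matters if it is coded as 0/1, because the sketch would otherwise declare it numeric.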

Cheers,
Mark.

AJPRDG
09-25-2007, 09:04 AM
Thanks... will have a go at that.

The variables are numeric apart from the dependent, which is nominal. I'll have a crack at the ideas you suggest and hopefully that'll solve it.

Al

Updated:

You're a star. Headers were indeed the issue: not only with the nominal class, I'd also managed a spelling mistake in one of the variable names.
Thanks for your help on what turned out to be an embarrassingly easy problem; it's frustrating when you can't figure these things out for yourself, though!
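A misspelled column name like the one mentioned above can be spotted without opening either file in an editor. The Python sketch below compares the header rows of two CSV files position by position; the function names and file names are hypothetical, and it only checks names, not attribute types or nominal values.

```python
import csv


def header_of(path):
    """Return the first row of a CSV file as a list of column names."""
    with open(path, newline="") as f:
        return next(csv.reader(f))


def compare_headers(train_names, test_names):
    """List every position where the two sets of column names disagree."""
    problems = []
    if len(train_names) != len(test_names):
        problems.append(
            f"column count differs: {len(train_names)} vs {len(test_names)}"
        )
    for i, (a, b) in enumerate(zip(train_names, test_names), start=1):
        if a != b:
            problems.append(f"column {i}: {a!r} vs {b!r}")
    return problems
```

With two files on disk, `compare_headers(header_of("train.csv"), header_of("test.csv"))` returns an empty list when the names match and one message per mismatch otherwise.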