
variable attribute size



Santiago
04-09-2009, 10:39 AM
I'm just starting out with Weka, and trying out different datasets.

I'm trying to assign one or more classes to each instance, based on natural-language-style input.

My attributes are actually sentences, and each word becomes an attribute.
The problem is that the sentence can have a variable number of words.

Also, as the example below shows, each query can have a variable number of classes associated with it. I can assume that the query is always the first attribute and that the classes are all the attributes after it.

In both the query and the classes sections, the order does not matter.

Sample training input:
@attribute query string
@attribute class {mammals, reptiles, amphibians, birds, fish, walk, fly, swim, gray, blue, green, black, white, brown, yellow}

@data
a gray goose, bird, gray, flying
african elephant, mammal, brown, grey, walk
human beings, mammal, walk, black, white, brown
dolphin, mammal, swim, gray
tiger, mammal, walk, yellow
large tiger, mammal, walk, yellow
small tiger, mammal, walk, yellow, white


One idea I had was to assume that I won't have more than 5 words in a sentence, and to preformat the file to include ? for the "missing" values in any sentence with fewer than 5 words.
So:
a gray goose, bird, gray, flying
- becomes
a,gray,goose,?,?,bird,gray,flying

I could do the same for classes as well.
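
The padding idea above can be sketched as a small preprocessing script. This is only an illustration under assumptions: the limits of 5 words and 5 classes, and the helper names pad and to_fixed_row, are hypothetical and not part of Weka.

```python
MISSING = "?"  # ARFF's marker for a missing value


def pad(tokens, width):
    """Pad a list of tokens to exactly `width` slots with the missing marker."""
    if len(tokens) > width:
        raise ValueError(f"{len(tokens)} tokens do not fit in {width} slots")
    return tokens + [MISSING] * (width - len(tokens))


def to_fixed_row(line, max_words=5, max_classes=5):
    """Turn 'query, class, class, ...' into one fixed-width comma-separated row."""
    # The first field is the query; every remaining field is a class.
    query, *classes = [field.strip() for field in line.split(",")]
    return ",".join(pad(query.split(), max_words) + pad(classes, max_classes))


row = to_fixed_row("a gray goose, bird, gray, flying")
print(row)
```

One caveat with this layout: each word ends up in a numbered slot, so position becomes part of the data even though the order is not supposed to matter.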

Any suggestions would be greatly appreciated.
Thanks in advance, Santiago

Mark
04-14-2009, 11:48 PM
Hi Santiago,

Have you looked at Weka's StringToWordVector filter?

http://wiki.pentaho.com/display/DATAMINING/StringToWordVector

This will help you with automatic tokenization, stemming and conversion to the "Bag of Words" format, to which you can then apply standard machine learning algorithms.
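
To show why a bag of words removes the variable-length problem, here is a minimal sketch in plain Python. It is not Weka's StringToWordVector code and does no stemming; the function names tokenize and bag_of_words are made up for the illustration. Every sentence, whatever its length, becomes a vector with one slot per vocabulary word.

```python
import re


def tokenize(sentence):
    """Lower-case the sentence and split it into alphabetic words."""
    return re.findall(r"[a-z]+", sentence.lower())


def bag_of_words(sentences):
    """Return (vocabulary, vectors): one word-count vector per sentence."""
    vocabulary = sorted({word for s in sentences for word in tokenize(s)})
    index = {word: i for i, word in enumerate(vocabulary)}
    vectors = []
    for s in sentences:
        counts = [0] * len(vocabulary)
        for word in tokenize(s):
            counts[index[word]] += 1
        vectors.append(counts)
    return vocabulary, vectors


queries = ["a gray goose", "african elephant", "large tiger", "small tiger"]
vocabulary, vectors = bag_of_words(queries)
for query, vector in zip(queries, vectors):
    print(query, vector)
```

Word order is discarded in this representation, which matches the requirement that order does not matter in the query.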

I see that you have a multi-label classification problem. Weka isn't really set up for multi-label prediction, although you can reformulate the problem as multiple binary learning problems and learn one model per label (i.e. one label against the rest). You then have to decide on a heuristic for assigning multiple labels to a test example given the outputs of the individual models.
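
A rough sketch of that reformulation, in plain Python rather than Weka code: binary_relevance builds one yes/no dataset per label, and assign_labels is one possible heuristic for combining the per-label outputs. Both function names, the 0.5 threshold and the scores in the usage lines are assumptions made for illustration.

```python
def binary_relevance(rows, labels):
    """Build one binary dataset per label: a list of (query, has_label) pairs."""
    return {
        label: [(query, label in row_labels) for query, row_labels in rows]
        for label in labels
    }


def assign_labels(scores, threshold=0.5):
    """Keep every label whose score reaches the threshold.

    If no label qualifies, fall back to the single best-scoring label so
    that each example receives at least one label.
    """
    chosen = sorted(label for label, score in scores.items() if score >= threshold)
    return chosen or [max(scores, key=scores.get)]


rows = [
    ("african elephant", {"mammal", "brown", "walk"}),
    ("dolphin", {"mammal", "swim", "gray"}),
    ("a gray goose", {"bird", "gray"}),
]
datasets = binary_relevance(rows, ["mammal", "gray"])
print(datasets["mammal"])

# Hypothetical per-label scores, as the individual binary models might output.
print(assign_labels({"mammal": 0.9, "walk": 0.7, "swim": 0.1}))
```

Each per-label dataset can then be fed to any ordinary binary classifier. The threshold trades missed labels against spurious ones and would need tuning on held-out data.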

Mulan is a project that is built on Weka and extends it to the multi-label setting. Take a look at:

http://www.ohloh.net/p/10085

Cheers,
Mark.