Find statistically significant words in text?
I have a database of text (product descriptions) associated with various quantities (price, popularity, rating, etc.).
What is the easiest way to mine this text data to look for words that are statistically significantly correlated with any of the quantities?
I think I should use a "regression" function of Weka. I'm a total newbie so any and all tips are appreciated.
Yes, you'll want to use a regression scheme. I'd suggest using the StringToWordVector filter to build a dictionary and vectorize your text files. This filter also has options that let you control the size of the dictionary (e.g. limit it to the x most frequently occurring words). As the number of words is likely to be quite large, it would probably be best to apply support vector machines for regression (SMOreg), since SVMs work well in high-dimensional spaces. Using the default options will produce a linear model with one weight per attribute, i.e. per word (rather than one weight per support vector), so you can read off which words push the predicted quantity up or down.
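To make the vectorization step more concrete, here is a minimal sketch in plain Python of the general idea behind a word-vector filter: build a dictionary limited to the most frequent words, then turn each description into a vector of word counts. This is not Weka's implementation; the function names (`tokenize`, `build_dictionary`, `vectorize`), the tokenization rule, and the sample descriptions are all made up for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens (assumed rule)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_dictionary(docs, max_words):
    """Return the max_words most frequent words across all documents."""
    counts = Counter()
    for doc in docs:
        counts.update(tokenize(doc))
    # Sort by descending frequency; break ties alphabetically so the
    # result is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:max_words]]


def vectorize(doc, dictionary):
    """Map one document to a list of counts, one entry per dictionary word."""
    counts = Counter(tokenize(doc))
    return [counts[word] for word in dictionary]


# Hypothetical product descriptions, only for illustration.
descriptions = [
    "Soft cotton shirt",
    "Cotton socks, soft and warm",
    "Leather wallet",
]

dictionary = build_dictionary(descriptions, max_words=3)
vectors = [vectorize(d, dictionary) for d in descriptions]

print(dictionary)
for vec in vectors:
    print(vec)
```

Each position in a vector corresponds to one dictionary word, so a linear regression model trained on such vectors ends up with one weight per word. Note that the third description maps to an all-zero vector because none of its words made it into the size-limited dictionary, which is the trade-off of capping the dictionary size.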