NEWBIE! missing values
Hello all! I am new to WEKA and statistics in general. I have a background in NLP and being a code monkey. I have a new task that involves data mining some medical data. That said, patients rarely have consistent data. Sometimes you have regular vital reports, and sometimes you don't. Sometimes the temperature might be missing, and other times the blood pressure might be missing. So, the question I have...
It looks like I should use a ? for missing values when using the J48 tree. I have read that this means that piece of data is not used in the gain or entropy calculations when training. What happens when you go to use the tree with new data? What happens if the attribute is tested at a decision node and the value is missing?
That's correct. J48/C4.5 ignores missing values when computing information gain/gain ratio for splitting decisions. At training or testing time, an instance that has a missing value for the attribute tested at a node gets split into fractional parts, each with a weight proportional to the number of training instances that went down the corresponding branch. These "fractional" instances pass down into all subtrees, and at prediction time the class distributions at the leaves they reach are combined using those weights.
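To make the fractional-instance idea concrete, here is a minimal Python sketch of prediction with a missing value. It is not WEKA's code: the dictionary-based tree, the helper names (leaf, node, total, distribution), the "fever" attribute and the class counts are all made up for illustration, and None stands in for ARFF's ? marker.

```python
from collections import Counter

# Hypothetical toy tree (not WEKA's data structures): an internal node tests
# one nominal attribute; a leaf stores the training class counts it received.
def leaf(counts):
    return {"counts": counts}

def node(attr, branches):
    return {"attr": attr, "branches": branches}

def total(tree):
    """Total training weight that reached this subtree."""
    if "counts" in tree:
        return sum(tree["counts"].values())
    return sum(total(child) for child in tree["branches"].values())

def distribution(tree, instance, weight=1.0):
    """Return {class: weight} for one instance; None marks a missing value."""
    if "counts" in tree:
        n = sum(tree["counts"].values())
        if n == 0:
            return {}  # empty leaf: nothing to contribute in this sketch
        return {c: weight * k / n for c, k in tree["counts"].items()}
    value = instance.get(tree["attr"])
    if value is not None:
        # Known value: the whole instance follows a single branch.
        return distribution(tree["branches"][value], instance, weight)
    # Missing value: send a fraction of the instance down every branch,
    # weighted by the share of training instances under that branch.
    result = Counter()
    n = total(tree)
    for child in tree["branches"].values():
        part = distribution(child, instance, weight * total(child) / n)
        for c, w in part.items():
            result[c] += w
    return dict(result)

# Made-up example: 8 training instances had fever=yes, 4 had fever=no.
tree = node("fever", {
    "yes": leaf({"sick": 6, "healthy": 2}),
    "no": leaf({"sick": 1, "healthy": 3}),
})

known = distribution(tree, {"fever": "yes"})
missing = distribution(tree, {"fever": None})
```

With a known value the instance simply lands in one leaf. With the value missing, 8/12 of the instance goes down the "yes" branch and 4/12 down the "no" branch, and the leaf distributions are blended with those weights; the predicted class is the one with the largest combined weight.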