changing certain value to missing



mcbob
09-17-2008, 03:40 AM
hi,
is there a way to change a certain value of an attribute to a missing value, e.g. -1 to "?", for all instances?
and what about a range, e.g. all negative numbers to "?"?
I have tried NumericCleaner, but I cannot set the CloseToDefault value to "?".
thanks.

Mark
09-17-2008, 05:39 AM
Hi,

Although I've never tried this before myself, it turns out that it is possible with NumericCleaner. Without some deep knowledge of Weka, you'd probably never work it out though :-) Weka uses Double.NaN to represent missing values internally. Here is an example command line that sets all values less than 6.0 to missing for the first attribute in the iris data set:

java weka.filters.unsupervised.attribute.NumericCleaner -min 6.0 -min-default NaN -R first -i iris.arff

Cheers,
Mark.
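
For readers who want to see the idea outside of Weka, here is a minimal sketch in plain Java of the kind of rule discussed above: values below a threshold, or equal to one particular value, are replaced by Double.NaN, which is how Weka is said above to represent missing values. It does not call Weka at all; the class and method names are hypothetical and chosen only for illustration.

```java
import java.util.Arrays;

// Illustrative only: plain Java, no Weka classes involved.
// Class and method names are hypothetical.
class MissingValueSketch {

    // Stand-in for "missing", following the Double.NaN convention mentioned above.
    static final double MISSING = Double.NaN;

    // Returns a copy in which every value strictly below minThreshold is MISSING.
    // Values that are already NaN stay NaN, because NaN < x is always false.
    static double[] belowToMissing(double[] values, double minThreshold) {
        double[] out = values.clone();
        for (int i = 0; i < out.length; i++) {
            if (out[i] < minThreshold) {
                out[i] = MISSING;
            }
        }
        return out;
    }

    // Returns a copy in which every value exactly equal to target is MISSING.
    static double[] valueToMissing(double[] values, double target) {
        double[] out = values.clone();
        for (int i = 0; i < out.length; i++) {
            if (out[i] == target) {
                out[i] = MISSING;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[] column = {5.1, 6.3, -1.0, 7.0};
        // All negative numbers become missing (threshold 0.0).
        System.out.println(Arrays.toString(belowToMissing(column, 0.0)));
        // Only the sentinel value -1 becomes missing.
        System.out.println(Arrays.toString(valueToMissing(column, -1.0)));
    }
}
```

Note that a missing value has to be tested with Double.isNaN(x), since NaN == NaN is false in Java.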

mcbob
09-17-2008, 11:23 AM
perfect.
coming from SPSS, I was looking for the functionality of the SPSS recode command. :)
thanks a lot.

wirefree
05-06-2009, 12:31 PM
I would greatly appreciate it if you could address a more fundamental concern of mine, i.e.:

By observing what set of properties of an attribute can I ascertain that it requires "cleaning"? Further, does Weka provide tools to ascertain the same?

I look forward to your advice.

Best,
wirefree

Mark
05-06-2009, 10:17 PM
Hi,

Data cleaning typically involves domain-specific knowledge that you can apply in order to filter out or correct values that are considered illegal or exceed allowable ranges etc. So, such "rules" will differ depending on the application.

Outlier detection is closely related and involves automatically detecting data points that are out of the norm when compared with the overall distribution of the data. There is a body of research dedicated to this but I'm afraid it's an area that I'm not familiar with.

One very simple approach is to take a classifier that performs well on your data and remove any instances that are misclassified (the assumption being that they are outliers). There is a filter in Weka that can be used for this:

weka.filters.unsupervised.instance.RemoveMisclassified

Cheers,
Mark.

wirefree
05-06-2009, 10:37 PM
Appreciate the response, Mark.

With regard to your suggestion, I have investigated the efficacy of 'RemoveMisclassified' and have found it appealing when applied in the classification stage. However, at present I am addressing the issue of outliers at the pre-processing stage by evaluating each attribute individually based on simpler, domain-independent metrics such as distribution, mean, standard deviation, quartile ranges, etc.

To this effect, I have been advised to consider 'InterquartileRange' in Filters>Unsupervised>Attributes. Applying the filter to, for example, all attributes yields a new column per attribute that identifies which values are outliers. This is all very well. However, my concern is how to proceed from this point. I can formulate the following three alternatives:

1) Delete the instance: In this case, deleting the entire row would lose the information for all other attributes whose values for that row were not deemed outliers.
2) Replace with null value: In this case, what could one say on the efficacy of the resulting dataset which now contains missing values in place of outliers?
3) Deleting an instance where a minimum number of attributes contained outliers: In this case, such instances are far too few.

Would appreciate suggestions & comments.

Best,
wirefree
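
To make alternative 2 above concrete, here is a minimal sketch in plain Java that flags values lying outside an interquartile-range fence and replaces them with Double.NaN as a stand-in for a missing value. It does not use Weka's InterquartileRange filter and is not meant to reproduce its exact rules. The quartile convention (linear interpolation between order statistics), the fence factor passed in by the caller, and all class and method names are assumptions made for illustration.

```java
import java.util.Arrays;

// Illustrative only: plain Java, no Weka classes involved.
// Class and method names are hypothetical.
class IqrOutlierSketch {

    // Quantile of an ascending-sorted, non-empty array using linear
    // interpolation between neighbouring order statistics.
    // This is one of several common conventions.
    static double quantile(double[] sorted, double p) {
        double pos = p * (sorted.length - 1);
        int lo = (int) Math.floor(pos);
        int hi = (int) Math.ceil(pos);
        return sorted[lo] + (pos - lo) * (sorted[hi] - sorted[lo]);
    }

    // Returns a copy of values in which everything outside
    // [Q1 - factor * IQR, Q3 + factor * IQR] is replaced by NaN.
    // Values that are already NaN are ignored when computing the quartiles.
    static double[] outliersToMissing(double[] values, double factor) {
        double[] sorted = Arrays.stream(values)
                .filter(v -> !Double.isNaN(v))
                .sorted()
                .toArray();
        double[] out = values.clone();
        if (sorted.length == 0) {
            return out; // nothing to compute quartiles from
        }
        double q1 = quantile(sorted, 0.25);
        double q3 = quantile(sorted, 0.75);
        double iqr = q3 - q1;
        double lower = q1 - factor * iqr;
        double upper = q3 + factor * iqr;
        for (int i = 0; i < out.length; i++) {
            if (out[i] < lower || out[i] > upper) {
                out[i] = Double.NaN;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[] column = {1.0, 2.0, 3.0, 4.0, 100.0};
        System.out.println(Arrays.toString(outliersToMissing(column, 1.5)));
    }
}
```

Because only the offending cell is blanked, the other attribute values in the same row are kept, which is the trade-off between alternatives 1 and 2.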

Mark
05-13-2009, 01:00 AM
Hi Wirefree,

In the absence of sensible default values to replace outliers with, the problem does indeed become somewhat tricky (and there are no magic solutions).

Assuming that the goal is to produce a predictive model for your data... if you replace outliers with missing values then quite a few methods in Weka will simply impute these with means or modes computed from the training data. This is probably not too bad a solution, given that there aren't many good alternatives. Some methods, such as J48, handle missing values internally at training and prediction time by sending fractional instances down the branches of the tree when encountering a value that is missing for the test at a given node.

The RemoveMisclassified filter can be used as a general pre-processing step. It has been shown to work quite well for decision tree learners.

Cheers,
Mark.
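
As a rough illustration of the mean imputation mentioned above, here is a minimal sketch in plain Java for a single numeric attribute: the mean is computed from the non-missing training values only and then used to fill in missing values (represented as Double.NaN) in any data set. It does not call Weka and is not a copy of how any Weka classifier or filter does it internally; the class and method names are hypothetical.

```java
import java.util.Arrays;

// Illustrative only: plain Java, no Weka classes involved.
// Class and method names are hypothetical.
class MeanImputationSketch {

    // Mean of the non-missing (non-NaN) training values.
    // Returns NaN if every training value is missing.
    static double meanIgnoringMissing(double[] train) {
        double sum = 0.0;
        int count = 0;
        for (double v : train) {
            if (!Double.isNaN(v)) {
                sum += v;
                count++;
            }
        }
        return count == 0 ? Double.NaN : sum / count;
    }

    // Returns a copy of values with every NaN replaced by the given value.
    static double[] impute(double[] values, double replacement) {
        double[] out = values.clone();
        for (int i = 0; i < out.length; i++) {
            if (Double.isNaN(out[i])) {
                out[i] = replacement;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[] train = {1.0, Double.NaN, 3.0};
        double[] test = {Double.NaN, 5.0};
        // The statistic comes from the training data and is reused for the test data.
        double trainMean = meanIgnoringMissing(train);
        System.out.println(Arrays.toString(impute(train, trainMean)));
        System.out.println(Arrays.toString(impute(test, trainMean)));
    }
}
```

The point of keeping the two steps separate is that the replacement value is estimated once from the training data and then applied unchanged at prediction time.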