I have a data set of about 9000 patient records with 34 attributes. I am trying to predict the 34th attribute, Readmission (Yes or No), from the first 33 attributes of each patient, so this is a supervised classification problem. I am currently using SPSS and WEKA. Here are my questions:


  1. I have preprocessed the data and binned all numeric attributes to make them nominal. Is this the right approach?
  2. My data is imbalanced (13% Yes, the rest No). Do I need to balance it? If so, should I use oversampling or undersampling?
  3. Should I extract 75% of the dataset as training data and balance only that, or should I balance all 9000 records?
  4. Do I need to do feature selection? If so, should it be done before or after balancing?
  5. How should I build the final predictive model? Which classifier should I use, e.g. logistic regression, a decision tree, and so on?
  6. What should the evaluation criterion for my model be: accuracy, ROC, or F-measure?
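
To make questions 2 and 3 concrete, the option of splitting first and then balancing only the training portion could be sketched roughly as below. This is plain Python rather than SPSS or WEKA, the label list is a synthetic stand-in built from the 9000-record and 13% figures above (not my real data), and the helper names `stratified_split` and `oversample_minority` are hypothetical. It only illustrates one of the options I am asking about, not a recommended procedure.

```python
import random
from collections import Counter


def stratified_split(labels, train_fraction, seed=0):
    """Return (train_idx, test_idx), keeping class proportions in both parts."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        cut = int(len(idx) * train_fraction)  # truncate to a whole record count
        train_idx.extend(idx[:cut])
        test_idx.extend(idx[cut:])
    return train_idx, test_idx


def oversample_minority(indices, labels, seed=0):
    """Randomly duplicate smaller-class rows until all classes match the largest."""
    rng = random.Random(seed)
    by_class = {}
    for i in indices:
        by_class.setdefault(labels[i], []).append(i)
    target = max(len(v) for v in by_class.values())
    balanced = []
    for idx in by_class.values():
        balanced.extend(idx)
        # Sample with replacement to fill the gap (k is 0 for the largest class).
        balanced.extend(rng.choices(idx, k=target - len(idx)))
    return balanced


# Synthetic stand-in for the Readmission column: 9000 labels, 13% "Yes".
labels = ["Yes"] * 1170 + ["No"] * 7830

# Split first, then balance the training indices only; the test part is untouched.
train_idx, test_idx = stratified_split(labels, train_fraction=0.75)
balanced_train_idx = oversample_minority(train_idx, labels)

train_counts = Counter(labels[i] for i in train_idx)
balanced_counts = Counter(labels[i] for i in balanced_train_idx)
test_counts = Counter(labels[i] for i in test_idx)

print("train:", dict(train_counts))
print("balanced train:", dict(balanced_counts))
print("test:", dict(test_counts))
```

In this sketch the held-out 25% keeps the original 13% Yes rate, while only the training indices are balanced. The alternative in question 3 would be to call `oversample_minority` on all 9000 records before splitting.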


I would really appreciate your help with this.

Regards
Reena