Use the REGDIAG macro.

1. Issue the following commands to SAS (with appropriate substitutions for "U:\").

    LIBNAME GFLib 'U:\dm\sasdata' ;
    LIBNAME RCLib 'U:\dm\RC' ;

2. Read your Excel file into SAS with the EXCELSAS macro.

3. In practice, apply the UNIVAR and FREQ macros to perform exploratory data analysis and (if necessary) take action to correct mistakes, handle extreme outliers, or deal with severe departures from normality.

4. Apply the RANSPLIT macro to produce training, validation, and test data sets (or training and test data sets). For the illustration in Lecture 4, I am splitting the permanent SAS dataset {fev.sas7bdat} into training, validation, and test data sets. Note that you do not really need to specify all of the variables in your data set when you run the RANSPLIT macro; indeed, you may not have space to do so. However, the variables you specify are the ones for which SAS will generate the box plots.

5. If desired, rename the training, validation, and test data sets (or training and test data sets). For the illustration in Lecture 4, I used the following code.

    TITLE ' ';
    DATA RCLib.FEVTrain; SET RCLib.Train_; RUN;
    DATA RCLib.FEVValid; SET RCLib.Valid_; RUN;
    DATA RCLib.FEVTest; SET RCLib.Test_; RUN;

6. Apply the REGDIAG macro to the training data. Detailed instructions on how to fill in the blanks are given on pages 172-177 of the textbook. I will make some comments below. I used the following for the illustration in Lecture 4.

    Field 1: RCLib.fevtrain
    Field 2: fev
    Field 3: [blank]
    Field 4: 0.05
    Field 5: age hgt sex smoke
    Field 6: age hgt sex smoke
    Field 7: influence
    Field 8: id
    Field 9: [blank]
    Field 10: [blank]
    Field 11: word
    Field 12: U:\dm\RC\
    Field 13: 27
    Field 14: yes

Comments: Fernandez intends for you to include continuous explanatory variables in the fifth field and, if there are any, categorical explanatory variables in the third field. However, if you include categorical explanatory variables in the third field, the macro calls PROC GLM instead of PROC REG.
In my opinion, the output from PROC REG is much more helpful in data mining than the output from PROC GLM, especially when there is no validation data set. So, my preference is to include dichotomous explanatory variables in the fifth field with the continuous explanatory variables.

For a non-dichotomous categorical explanatory variable, you have two options. One, if it is ordinal with artificial numerical designations for the categories, you can treat it as if it were continuous; such treatment is often viewed as less objectionable for an explanatory variable than for a response variable. Two, if the first option is inapplicable or unpalatable, you can define dichotomous indicator variables for all but one category and use these indicator variables as explanatory variables instead of the original categorical variable.

The syntax below illustrates how to define such indicator variables. In this case, the categorical variable CAT has three possible values (1, 2, 3). We define CAT1 to equal 1 for those with CAT=1 and 0 otherwise, and we define CAT2 to equal 1 for those with CAT=2 and 0 otherwise. The category CAT=3 has no indicator of its own, so it serves as the reference category.

    DATA Enhanced;
      SET Original;
      /* CAT1 flags category 1 */
      if CAT = 1 then CAT1 = 1;
      if CAT = 2 then CAT1 = 0;
      if CAT = 3 then CAT1 = 0;
      /* CAT2 flags category 2 */
      if CAT = 1 then CAT2 = 0;
      if CAT = 2 then CAT2 = 1;
      if CAT = 3 then CAT2 = 0;
    RUN;

I have left blank the tenth field, which asks for a validation data set. What Fernandez has the macro do with the validation data set is not how a validation data set is typically used in data mining.

7. If you wish, you can modify your data set based on the results and rerun the REGDIAG macro with the modified data set. Here is syntax that illustrates such modifications.

    DATA Enhanced;
      SET Original;
      /* squared terms and an interaction for the continuous variables */
      AgeSquared = Age*Age;
      HgtSquared = Hgt*Hgt;
      AgexHgt = Age*Hgt;
      /* natural log of the response */
      logfev = log(FEV);
      /* interactions of smoking status with height and age */
      SmokexHgt = Smoke*Hgt;
      SmokexAge = Smoke*Age;
    RUN;
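
Returning to the indicator variables in the comments on step 6: the same coding can be written more compactly, because a SAS comparison such as (CAT = 1) evaluates to 1 when true and 0 when false. The sketch below is only an illustration. It assumes CAT is numeric and reuses the placeholder names Original, Enhanced, CAT, CAT1, and CAT2 from the example above. The PROC FREQ step is an optional check I have added, not part of the textbook macros.

```sas
/* Sketch only: Original, Enhanced, CAT, CAT1, and CAT2 are placeholder names. */
/* Assumes CAT is a numeric variable taking the values 1, 2, or 3.             */
DATA Enhanced;
  SET Original;
  /* The guard leaves CAT1 and CAT2 missing when CAT is missing or is not  */
  /* 1, 2, or 3, which matches the behavior of the separate IF statements. */
  IF CAT IN (1, 2, 3) THEN DO;
    CAT1 = (CAT = 1);   /* 1 when CAT is 1, otherwise 0 */
    CAT2 = (CAT = 2);   /* 1 when CAT is 2, otherwise 0 */
  END;
RUN;

/* Optional check: list each combination of CAT, CAT1, and CAT2, */
/* including missing values, to confirm the coding.              */
PROC FREQ DATA=Enhanced;
  TABLES CAT*CAT1*CAT2 / LIST MISSING;
RUN;
```

Without the guard, (CAT = 1) would evaluate to 0 for a missing CAT, so observations with missing category information would silently be assigned to the reference category.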