Use the REGDIAG macro.

1. Submit the following LIBNAME statements in SAS, replacing
   "U:\" with the drive or folder where your files reside.

LIBNAME GFLib 'U:\dm\sasdata' ; 

LIBNAME RCLib 'U:\dm\RC' ;

2. Read your Excel file into SAS with the EXCELSAS macro.
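	If the EXCELSAS macro is not at hand, PROC IMPORT is one
   alternative.  The sketch below is not the macro's own code.
   The workbook name, the sheet name, and the output data set
   GFLib.fev are assumptions made for illustration, and
   DBMS=XLSX assumes that SAS/ACCESS Interface to PC Files is
   licensed at your site.

PROC IMPORT DATAFILE='U:\dm\fev.xlsx'  /* hypothetical workbook   */
            OUT=GFLib.fev              /* assumed output data set */
            DBMS=XLSX
            REPLACE;
  SHEET='Sheet1';                      /* hypothetical sheet name */
  GETNAMES=YES;            /* first row supplies variable names   */
RUN;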

3. In practice, apply the UNIVAR and FREQ macros to perform 
   exploratory data analysis and (if necessary) take action
   to correct mistakes, handle extreme outliers, or deal
   with severe departures from normality.
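	As a rough stand-in for the UNIVAR and FREQ macros, the
   base procedures PROC UNIVARIATE and PROC FREQ can be called
   directly.  This sketch assumes that the FEV data are stored
   as GFLib.fev, that fev, age, and hgt are continuous, and that
   sex and smoke are categorical; adjust the names to your data.

/* Continuous variables: moments, extreme values, normality tests */
PROC UNIVARIATE DATA=GFLib.fev NORMAL PLOTS;
  VAR fev age hgt;
  ID id;            /* label extreme observations by id (assumed) */
RUN;

/* Categorical variables: frequency tables, missing values shown  */
PROC FREQ DATA=GFLib.fev;
  TABLES sex smoke / MISSING;
RUN;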

4. Apply the RANSPLIT macro to produce training, validation,
   and test data sets (or training and test data sets).
	For the illustration in Lecture 4, I am splitting 
   the permanent SAS dataset {fev.sas7bdat} into training,
   validation, and test data sets.  Note that you do not 
   really need to specify all of the variables in your data 
   set when you run the RANSPLIT macro; indeed, you may not 
   have space to do so.  However, the variables you specify 
   are the ones for which SAS will generate the box plots.
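	To make the idea of a random split concrete, here is a
   minimal DATA step sketch.  It is not the RANSPLIT macro and
   produces no box plots.  The input name GFLib.fev, the seed,
   and the 40/30/30 proportions are hypothetical choices.

DATA Train Valid Test;
SET GFLib.fev;
  IF _N_ = 1 THEN CALL STREAMINIT(12345);  /* hypothetical seed   */
  u = RAND('UNIFORM');             /* one uniform draw per record */
  IF u < 0.40 THEN OUTPUT Train;           /* about 40 percent    */
  ELSE IF u < 0.70 THEN OUTPUT Valid;      /* about 30 percent    */
  ELSE OUTPUT Test;                        /* remaining records   */
  DROP u;
RUN;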

5. If desired, rename the training, validation, and test
   data sets (or training and test data sets).
	For the illustration in Lecture 4, I used the 
   following code.

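TITLE ' ';   /* blank out any title currently in effect */

/* Copy each data set to a more descriptive name in RCLib. */
DATA RCLib.FEVTrain;
  SET RCLib.Train_;
RUN;
DATA RCLib.FEVValid;
  SET RCLib.Valid_;
RUN;
DATA RCLib.FEVTest;
  SET RCLib.Test_;
RUN;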
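
	The DATA steps above copy each data set, so the originals
   remain in RCLib.  If a true rename is preferred, the CHANGE
   statement of PROC DATASETS is one option.  This is a sketch
   of an alternative; it assumes that the new names are not
   already in use in RCLib, and afterwards Train_, Valid_, and
   Test_ no longer exist under their old names.

PROC DATASETS LIBRARY=RCLib NOLIST;
  CHANGE Train_ = FEVTrain    /* old name = new name */
         Valid_ = FEVValid
         Test_  = FEVTest;
QUIT;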

6. Apply the REGDIAG macro to the training data.  Detailed 
   instructions on how to fill in the blanks are given on 
   pages 172-177 of the textbook.  I will make some comments 
   below.  I used the following for the illustration in 
   Lecture 4.
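	Field  1: RCLib.fevtrain	Field  2: fev
	Field  3: [blank]		Field  4: 0.05
	Field  5: age hgt sex smoke
	Field  6: age hgt sex smoke
	Field  7: influence		Field  8: id
	Field  9: [blank]		Field 10: [blank]
	Field 11: word			Field 12: U:\dm\RC\
	Field 13: 27			Field 14: yes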
   Comments:  Fernandez intends for you to include continuous
   explanatory variables in the fifth field and, if there are
   any, categorical explanatory variables in the third field.
   However, if you include categorical explanatory variables
   in the third field, the macro calls PROC GLM instead of 
   PROC REG.  In my opinion, the output from PROC REG is much
   more helpful in data mining than the output from PROC GLM,
   especially when there is no validation data set.  So, my
   preference is to include dichotomous explanatory variables
   in the fifth field with the continuous explanatory
   variables.
	For a non-dichotomous categorical explanatory variable,
   you have two options.  One, if it is ordinal with artificial  
   numerical designations for the categories, you can treat it 
   as if it were continuous; such treatment is often viewed as 
   less objectionable for an explanatory variable than for a 
   response variable.  Two, if the first option is inapplicable  
   or unpalatable, you can define dichotomous indicator variables 
   for all but one category and use these indicator variables as 
   explanatory variables instead of the original categorical 
   variable.  The syntax below illustrates how to define such 
   indicator variables.  In this case, the categorical variable 
   CAT has three possible values (1, 2, 3).  We define CAT1 to 
   equal 1 for those with CAT=1 and 0 for the other two
   categories, and we define CAT2 to equal 1 for those with
   CAT=2 and 0 for the other two categories.  CAT=3 therefore
   serves as the reference category.

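DATA Enhanced;
SET Original;
  /* CAT = 3 is the reference category (both indicators 0).   */
  /* Any other value of CAT, including missing, leaves CAT1   */
  /* and CAT2 missing.                                        */
  IF CAT IN (1, 2, 3) THEN DO;
    CAT1 = (CAT = 1);   /* 1 if CAT = 1, otherwise 0 */
    CAT2 = (CAT = 2);   /* 1 if CAT = 2, otherwise 0 */
  END;
RUN;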
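
	A quick way to confirm the coding is to cross-tabulate
   CAT against the two indicators.  This sketch assumes the
   data set Enhanced created above; each value of CAT should
   appear with exactly one combination of CAT1 and CAT2.

PROC FREQ DATA=Enhanced;
  TABLES CAT*CAT1*CAT2 / LIST MISSING;  /* one row per combination */
RUN;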
 
	I have left blank the tenth field, which asks for a
   validation data set.  What Fernandez has the macro do with
   the validation data set is not how a validation data set is
   typically used in data mining.
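	As a rough sketch of the kind of PROC REG call that
   corresponds to the entries above, with sex and smoke
   entered alongside age and hgt, consider the following.
   This is not the REGDIAG macro's actual code, which is not
   shown in this handout.  It assumes that id is an
   identifier variable in the training data and that the VIF
   and INFLUENCE options are the diagnostics of interest.

PROC REG DATA=RCLib.FEVTrain;
  MODEL fev = age hgt sex smoke / VIF INFLUENCE;
  ID id;       /* label observations in the diagnostics output */
RUN;
QUIT;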

7. If you wish, you can modify your data set based on the 
   results and rerun the REGDIAG macro with the modified data
   set.  Here is syntax that illustrates such modifications.

DATA Enhanced;
SET Original;
  /* Quadratic terms */
  AgeSquared = Age*Age;
  HgtSquared = Hgt*Hgt;
  /* Interaction terms */
  AgexHgt   = Age*Hgt;
  SmokexHgt = Smoke*Hgt;
  SmokexAge = Smoke*Age;
  /* Natural log of the response */
  logfev = log(FEV);
RUN;
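
	Before rerunning the macro, it can help to check that the
   new variables were created as intended.  The sketch below
   assumes the data set Enhanced from the step above.  A
   nonzero NMISS for logfev would indicate missing or
   nonpositive values of FEV, since LOG returns a missing
   value for those.

PROC MEANS DATA=Enhanced N NMISS MIN MAX;
  VAR AgeSquared HgtSquared AgexHgt logfev SmokexHgt SmokexAge;
RUN;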