Use Enterprise Miner to fit a regression tree.

1. Issue the following commands to SAS (with appropriate substitutions for "U:\").

    LIBNAME GFLib 'U:\dm\sasdata' ;
    LIBNAME RCLib 'U:\dm\RC' ;

2. If necessary, apply macros to convert Excel data to SAS (EXCELSAS, see MacroInstr.txt), to divide the data into training/test or training/validation/test subsets (RANSPLIT, see MacroInstr.txt), or to explore the data (UNIVAR and FREQ/FREQUENCY, see MacroInstr2.txt).

3. If necessary, assign numerical designations to categories and shorten variable names. Illustrative code may be found in MacroInstr5.txt.

4. In SAS, go to the "Solutions" menu. Go to "Analysis" and then select "Enterprise Miner". Do not invoke the tutorial (unless you want to do so to satisfy your own curiosity). From the "File" menu choose "New" and then "Project". A box will appear with "Name" and "Location" fields as well as with "Create", "Cancel", and "Browse" buttons. Press the "Browse" button and select a subdirectory such as 'U:\dm\RC'. In the "Name" field, type a name like 'RegTree'. Then press "Create".

5. In the left panel of the SAS Enterprise Miner window, you will see a diagram with 'RegTree' and, immediately below it, 'Untitled'. You can right-click on 'Untitled' to assign a name such as 'FEV'.

6. Near the top of the SAS Enterprise Miner window, click the "Input Data Source" icon (far left) and, while holding the mouse button down, drag it into the right panel. Assuming that you have training, validation, and test data, repeat this process twice so that you have three "Input Data Source" icons in the right panel.

7. Double-click on one of the "Input Data Source" icons in the right panel. Press the "Select" button and then choose an appropriate library such as 'RCLib'. Choose a training data set like 'FEVTRAIN'. In the "Role" field, change "RAW" to "TRAIN". Then click on the "Variables" tab near the top of the "Input Data Source" box.
You can alter entries in the "Model Role" column by right-clicking and then selecting "Set Model Role". You want the response variable to be identified as "target", any (candidate) explanatory variables to be identified as "input", and the ID variable (if any) to be identified as "ID". Any variables that you know you will not be using at all may be identified as "rejected". Also, make any necessary adjustments in the "Measurement" column. When finished, click the white-on-red X in the upper right corner of the "Input Data Source" box and confirm the changes. Assuming that you have validation and test data, repeat this process twice (except that "RAW" will be changed to "VALIDATE" and "TEST").

8. Near the top of the SAS Enterprise Miner window, click the "Tree" icon (middle) and drag it into the right panel. By holding the left mouse button down, 'draw' arrows from the data set icons to the "Tree" icon. Also, drag the "Reporter" icon (right) into the right panel, then 'draw' an arrow from the "Tree" icon to the "Reporter" icon.

9. Double-click on the "Tree" icon. Click the "Basic" tab. Choose "Variance Reduction" for the splitting criterion, deselect "Treat missing as an acceptable value", and change the number of "Surrogate rules saved in each node" to 3. When finished, close the "Tree" box and assign a model name such as 'FEVEx'.

10. Right-click the "Tree" icon and choose "Run". You will be asked if you want to view the results, which include average squared error figures for the training and validation data sets as more leaves are added to the regression tree. When finished, click the white-on-red X. You can view a schematic of the regression tree by right-clicking the "Tree" icon, choosing "Interactive", and selecting "Start".

11. Right-click on the "Reporter" icon and choose "Run". You can "Open" the report now or simply note to which subdirectory it has been saved.
Some important items in the report not found in the results you already examined are the average squared error for the test data set and the "English rules" describing the regression tree.

12. Suppose that you want to see the tree's predictions for specific individuals in the test data set (or, actually, for any data set that has the same explanatory variables as the training and validation data sets). You can do so by imitating the approach shown below. The bracketed line is an instruction, not SAS code: replace it with the score code itself.

    DATA RCLib.FEVtestPred;
    SET RCLib.FEVtest;
    [Copy and paste here the contents of {http://www.richardcharnigo.net/CPH636S09/FEVreport/em_report_822392082.txt}, which is accessible from the "Datastep Score Code" link at {http://www.richardcharnigo.net/CPH636S09/FEVreport/em_report.html}.]
    RUN;
    PROC PRINT DATA=RCLib.FEVtestPred;
    VAR P_FEV;
    RUN;


Use Enterprise Miner to fit a classification tree.

Steps 1 through 8 are essentially the same as those presented above for fitting a regression tree.

9. Double-click on the "Tree" icon. Click the "Basic" tab. Choose "Gini Reduction" for the splitting criterion, deselect "Treat missing as an acceptable value", and change the number of "Surrogate rules saved in each node" to 3. When finished, close the "Tree" box and assign a model name such as 'SAEx'.

10. Right-click the "Tree" icon and choose "Run". You will be asked if you want to view the results, which include misclassification rates for the training and validation data sets as more leaves are added to the classification tree. When finished, click the white-on-red X. You can view a schematic of the classification tree by right-clicking the "Tree" icon, choosing "Interactive", and selecting "Start".

11. Right-click on the "Reporter" icon and choose "Run". You can "Open" the report now or simply note to which subdirectory it has been saved. Some important items in the report not found in the results you already examined are the misclassification rate for the test data set and the "English rules" describing the classification tree.

12.
Suppose that you want to see the tree's estimated probabilities for specific individuals in the test data set (or, actually, for any data set that has the same explanatory variables as the training and validation data sets). You can do so by imitating the approach shown below. The bracketed line is an instruction, not SAS code: replace it with the score code itself.

    DATA RCLib.SAtest2Pred;
    SET RCLib.SAtest2;
    [Copy and paste here the contents of {http://www.richardcharnigo.net/CPH636S09/SAreport/em_report_822554338.txt}, which is accessible from the "Datastep Score Code" link at {http://www.richardcharnigo.net/CPH636S09/SAreport/em_report.html}.]
    RUN;
    PROC PRINT DATA=RCLib.SAtest2Pred;
    VAR P_CHD1;
    RUN;

13. Suppose that you want to obtain correct classification rates on the test data set (or, actually, any data set that has the same explanatory variables as the training and validation data sets) with thresholds other than 0.50. You can do so by imitating the approach shown below, after having performed step 12. An individual is predicted to have CHD = 1 when the estimated risk is at least as large as the threshold. If you insert WHERE CHD = 1; immediately after the line starting with PROC MEANS, the code will give you sensitivities instead of correct classification rates. If you insert WHERE CHD = 0; immediately after the line starting with PROC MEANS, the code will give you specificities instead of correct classification rates.

    DATA RCLib.SAtest2Pred;
    SET RCLib.SAtest2Pred;
    ESTRISK = P_CHD1;
    PREDWITHCUTOFF05 = 1 - (ESTRISK < 0.05);
    PREDWITHCUTOFF10 = 1 - (ESTRISK < 0.10);
    PREDWITHCUTOFF15 = 1 - (ESTRISK < 0.15);
    PREDWITHCUTOFF20 = 1 - (ESTRISK < 0.20);
    PREDWITHCUTOFF25 = 1 - (ESTRISK < 0.25);
    PREDWITHCUTOFF30 = 1 - (ESTRISK < 0.30);
    PREDWITHCUTOFF35 = 1 - (ESTRISK < 0.35);
    PREDWITHCUTOFF40 = 1 - (ESTRISK < 0.40);
    PREDWITHCUTOFF45 = 1 - (ESTRISK < 0.45);
    PREDWITHCUTOFF50 = 1 - (ESTRISK < 0.50);
    PREDWITHCUTOFF55 = 1 - (ESTRISK < 0.55);
    PREDWITHCUTOFF60 = 1 - (ESTRISK < 0.60);
    PREDWITHCUTOFF65 = 1 - (ESTRISK < 0.65);
    PREDWITHCUTOFF70 = 1 - (ESTRISK < 0.70);
    PREDWITHCUTOFF75 = 1 - (ESTRISK < 0.75);
    PREDWITHCUTOFF80 = 1 - (ESTRISK < 0.80);
    PREDWITHCUTOFF85 = 1 - (ESTRISK < 0.85);
    PREDWITHCUTOFF90 = 1 - (ESTRISK < 0.90);
    PREDWITHCUTOFF95 = 1 - (ESTRISK < 0.95);
    CORRECT05 = (PREDWITHCUTOFF05 = CHD);
    CORRECT10 = (PREDWITHCUTOFF10 = CHD);
    CORRECT15 = (PREDWITHCUTOFF15 = CHD);
    CORRECT20 = (PREDWITHCUTOFF20 = CHD);
    CORRECT25 = (PREDWITHCUTOFF25 = CHD);
    CORRECT30 = (PREDWITHCUTOFF30 = CHD);
    CORRECT35 = (PREDWITHCUTOFF35 = CHD);
    CORRECT40 = (PREDWITHCUTOFF40 = CHD);
    CORRECT45 = (PREDWITHCUTOFF45 = CHD);
    CORRECT50 = (PREDWITHCUTOFF50 = CHD);
    CORRECT55 = (PREDWITHCUTOFF55 = CHD);
    CORRECT60 = (PREDWITHCUTOFF60 = CHD);
    CORRECT65 = (PREDWITHCUTOFF65 = CHD);
    CORRECT70 = (PREDWITHCUTOFF70 = CHD);
    CORRECT75 = (PREDWITHCUTOFF75 = CHD);
    CORRECT80 = (PREDWITHCUTOFF80 = CHD);
    CORRECT85 = (PREDWITHCUTOFF85 = CHD);
    CORRECT90 = (PREDWITHCUTOFF90 = CHD);
    CORRECT95 = (PREDWITHCUTOFF95 = CHD);
    RUN;

    TITLE ' ';
    PROC MEANS DATA=RCLib.SAtest2Pred MEAN;
    VAR CORRECT05 CORRECT10 CORRECT15 CORRECT20 CORRECT25 CORRECT30 CORRECT35
        CORRECT40 CORRECT45 CORRECT50 CORRECT55 CORRECT60 CORRECT65 CORRECT70
        CORRECT75 CORRECT80 CORRECT85 CORRECT90 CORRECT95;
    RUN;
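
The threshold calculations in step 13 can also be written more compactly with a loop. The code below is only a sketch of one possible alternative and is not part of the original course materials. It assumes that step 12 has been performed, so that RCLib.SAtest2Pred contains P_CHD1 and a 0/1 variable CHD. The data set name WORK.SAcutoffs and the variable names K, CUTOFF, PRED, and CORRECT are hypothetical and were chosen for illustration. The sketch writes one record per individual per threshold, then lets PROC MEANS summarize by threshold.

```sas
/* Sketch only: loop-based version of the step 13 calculations.       */
/* WORK.SAcutoffs, K, CUTOFF, PRED, and CORRECT are hypothetical      */
/* names; CHD and P_CHD1 are assumed to exist from step 12.           */
DATA WORK.SAcutoffs;
  SET RCLib.SAtest2Pred;
  DO K = 1 TO 19;
    CUTOFF  = K / 20;                 /* 0.05, 0.10, ..., 0.95         */
    PRED    = 1 - (P_CHD1 < CUTOFF);  /* same rule as PREDWITHCUTOFFxx */
    CORRECT = (PRED = CHD);           /* 1 if prediction matches CHD   */
    OUTPUT;                           /* one record per threshold      */
  END;
  KEEP CHD CUTOFF CORRECT;
RUN;

/* Correct classification rate at each threshold. */
TITLE 'Correct classification rate by threshold';
PROC MEANS DATA=WORK.SAcutoffs MEAN;
  CLASS CUTOFF;
  VAR CORRECT;
RUN;

/* Within each threshold, the mean of CORRECT among CHD = 1 is the    */
/* sensitivity and the mean among CHD = 0 is the specificity.         */
TITLE 'Sensitivity (CHD = 1) and specificity (CHD = 0) by threshold';
PROC MEANS DATA=WORK.SAcutoffs MEAN;
  CLASS CUTOFF CHD;
  VAR CORRECT;
RUN;
TITLE;
```

Because the results are written to a separate data set, RCLib.SAtest2Pred is left unchanged. The second PROC MEANS call plays the role of the two WHERE statements described in step 13, reporting sensitivities and specificities in a single table.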