My previous post described how to use the “missing response trick” to score a regression model. As I said in that article, there are other ways to score a regression model. This article describes using the SCORE procedure, a SCORE statement, the relatively new PLM procedure, and the CODE statement.
The following DATA step defines a small set of data. The goal of the analysis is to fit various regression models to Y as a function of X, and then evaluate each regression model on a second data set, which contains 200 evenly spaced X values.
/* the original data; fit model to these values */
data A;
input x y @@;
datalines;
1 4  2 9  3 20  4 25  5 1  6 5  7 -4  8 12
;

/* the scoring data; evaluate model on these values */
%let NumPts = 200;
data ScoreX(keep=x);
min=1; max=8;
do i = 0 to &NumPts-1;
   x = min + i*(max-min)/(&NumPts-1);     /* evenly spaced values */
   output;                                /* no Y variable; only X */
end;
run;
The SCORE procedure
Some SAS/STAT procedures can output parameter estimates for a model to a SAS data set. The SCORE procedure can read those parameter estimates and use them to evaluate the model on new values of the explanatory variables. (For a regression model, the SCORE procedure performs matrix multiplication: you supply the scoring data X and the parameter estimates b and the procedure computes the predicted values p = Xb.)
The canonical example is fitting a linear regression by using PROC REG. You can use the OUTEST= option to write the parameter estimates to a data set. That data set, which is named RegOut in this example, becomes one of the two input data sets for PROC SCORE, as follows:
proc reg data=A outest=RegOut noprint;
   YHat: model y = x;    /* name of model is used by PROC SCORE */
quit;

proc score data=ScoreX score=RegOut type=parms predict out=Pred;
   var x;
run;
PROC SCORE uses the label on the MODEL statement in PROC REG to name the predicted variable. In this example, the YHat variable in the Pred data set contains the predicted values. If you do not specify a label on the MODEL statement, then a default name such as MODEL1 is used. For more information, see the documentation for the SCORE procedure.
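To make the computation p = Xb concrete, here is a minimal sketch that reproduces the PROC SCORE result with a DATA step. It assumes that the RegOut data set from the previous step contains one row whose Intercept and x variables hold the two parameter estimates. The names PredManual and b1 are made up for this illustration.

```sas
/* Sketch: score the linear model by hand.                       */
/* Assumes RegOut has one row with variables Intercept and x.    */
data PredManual(keep=x YHat);
   /* read the estimates once; values from SET are retained      */
   if _n_ = 1 then set RegOut(keep=Intercept x rename=(x=b1));
   set ScoreX;                  /* the 200 scoring values of x    */
   YHat = Intercept + b1*x;     /* p = Xb for a one-variable model */
run;
```

The rename is needed because the slope estimate in RegOut has the same name (x) as the explanatory variable in ScoreX. For models with many regressors, PROC SCORE does this bookkeeping for you.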
The SCORE statement
Nonparametric regression procedures cannot output parameter estimates because…um…because they are nonparametric! Nonparametric regression procedures support a SCORE statement, which enables you to specify the scoring data set. The following example shows the syntax of the SCORE statement for the TPSPLINE procedure, which fits a thin-plate spline to the data:
proc tpspline data=A;
   model y = (x);
   score data=ScoreX out=Pred;
run;
The STORE statement and the PLM procedure
Although the STORE statement and the PLM procedure were introduced in SAS/STAT 9.22 (way back in 2010), some SAS programmers are still not aware of these features. Briefly, the idea is that sometimes a scoring data set is not available when a model is fit, so the STORE statement saves all of the information needed to recreate and evaluate the model. The saved information can be read by the PLM procedure, which includes a SCORE statement, as well as many other capabilities. A good introduction to the PLM procedure is Tobias and Cai (2010), “Introducing PROC PLM and Postfitting Analysis for Very General Linear Models.”
For this example, the GLM procedure is used to fit the data. Because of the shape of the previous thin-plate spline curve, a cubic model is fit. The STORE statement is used to save the model information in an item store named WORK.ScoreExample. (I’ve used the WORK libref, but use a permanent libref if you want the item store to persist across SAS sessions.) Many hours or days later, you can use the PLM procedure to evaluate the model on a new set of data, as shown in the following statements:
proc glm data=A;
   model y = x | x | x;
   store work.ScoreExample;     /* store the model */
quit;

proc plm restore=work.ScoreExample;
   score data=ScoreX out=Pred;  /* evaluate the model on new data */
run;
The STORE statement is supported by many SAS/STAT regression procedures, including the GENMOD, GLIMMIX, GLM, GLMSELECT, LIFEREG, LOGISTIC, MIXED, ORTHOREG, PHREG, PROBIT, SURVEYLOGISTIC, SURVEYPHREG, and SURVEYREG procedures. It also applies to the RELIABILITY procedure in SAS/QC software.
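Because the item store holds more than the parameter estimates, the SCORE statement in PROC PLM can also write other statistics. The following sketch assumes the WORK.ScoreExample item store from the PROC GLM step still exists. It requests the predicted values along with confidence limits for the predicted mean. The output names PredCL, Pred, Lower, and Upper are arbitrary choices for this illustration.

```sas
/* Sketch: score with confidence limits for the mean.            */
/* Assumes the item store WORK.ScoreExample was created earlier. */
proc plm restore=work.ScoreExample;
   score data=ScoreX out=PredCL
         predicted=Pred         /* predicted values               */
         lclm=Lower             /* lower limit for the mean       */
         uclm=Upper             /* upper limit for the mean       */
         / alpha=0.05;          /* 95% confidence limits          */
run;
```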
The CODE statement
In SAS/STAT 12.1 the CODE statement was added to several SAS/STAT regression procedures. It is also part of the PLM procedure. The CODE statement offers yet another option for scoring data: it writes DATA step statements into a text file, and you can then use the %INCLUDE statement to insert those statements into a DATA step. In the following example, DATA step statements are written to the file glmScore.sas. You can include that file in a DATA step to evaluate the model on the ScoreX data:
proc glm data=A noprint;
   model y = x | x | x;
   code file='glmScore.sas';
quit;

data Pred;
   set ScoreX;
   %include 'glmScore.sas';
run;
For this example, the predicted values are in a variable called P_y in the Pred data set. The CODE statement is supported by many predictive modeling procedures, such as the GENMOD, GLIMMIX, GLM, GLMSELECT, LOGISTIC, MIXED, PLM, and REG procedures in SAS/STAT software. In addition, the CODE statement is supported by the HPLOGISTIC and HPREG procedures in SAS High-Performance Analytics software.
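A quick way to check the scored values is to overlay them on the original data. The following is a sketch that assumes the Pred data set from the previous step contains the variables x and P_y. The names Combined, xGrid, and yPred are invented for this example.

```sas
/* Sketch: overlay the scored curve on the original data.        */
/* Assumes Pred contains the variables x and P_y.                */
data Combined;
   set A                                    /* original x and y   */
       Pred(rename=(x=xGrid P_y=yPred));    /* scored values      */
run;

proc sgplot data=Combined;
   scatter x=x y=y;                         /* observed data      */
   series  x=xGrid y=yPred;                 /* fitted cubic model */
run;
```

Renaming the scored variables keeps the two sets of coordinates in separate columns, so the SCATTER and SERIES statements each plot only their own observations.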
In summary, there are many ways to score SAS regression models. For PROC REG and linear models with an explicit design matrix, use the SCORE procedure. For nonparametric models, use the SCORE statement. For scoring data sets long after a model is fit, use the STORE statement and the PLM procedure. For scoring inside the DATA step, use the CODE statement. For regression procedures that do not support these options (such as PROC TRANSREG), use the missing response trick from my previous post.