# SAS的基本统计功能

• §4.1 一些单变量检验问题
• 4.1.1 正态性检验
• 4.1.2 两独立样本的均值检验
• 4.1.3 成对总体均值检验
• §4.2 回归分析
• 4.2.1 用SAS/INSIGHT进行曲线拟合
• 4.2.2 用SAS/INSIGHT进行线性回归分析
• 4.2.3 用SAS/INSIGHT拟合广义线性模型
• 4.2.4 用REG过程进行回归分析
• §4.3 方差分析入门
• 4.3.1 用ANOVA过程进行单因素方差分析
• 4.3.2 用NPAR1WAY进行非参数单因素方差分析
• 4.3.3 多重比较
• 4.3.4 多因素方差分析
• §4.4 列联表分析
• 4.4.1 列联表的输入与制表
• 4.4.2 列联表独立性检验
• 4.4.3 属性变量关联度计算

## 一些单变量检验问题

### 正态性检验

```proc univariate data=sasuser.gpa normal;
var gpa;
run;

Univariate Procedure
Moments
…………
W:Normal   0.951556  Pr<W         0.0001
…………

```

### 两独立样本的均值检验

```proc ttest data=sasuser.gpa;
class sex;
var satm;
run;

TTEST PROCEDURE
Variable: SATM         Math SAT Score
SEX          N                 Mean              Std Dev            Std Error
-----------------------------------------------------------------------------
Female     145         611.77241379          84.02056171           6.97752786
Male        79         565.02531646          82.92937599           9.33028376

Variances        T       DF    Prob>|T|
---------------------------------------
Unequal     4.0124    162.2      0.0001
Equal       3.9969    222.0      0.0001

For H0: Variances are equal, F' = 1.03    DF = (144,78)    Prob>F' = 0.9114

```

```proc npar1way data=sasuser.gpa wilcoxon;
class sex;
var gpa;
run;
```

```                      N P A R 1 W A Y  P R O C E D U R E
Wilcoxon Scores (Rank Sums) for Variable GPA
Classified by Variable SEX
Sum of      Expected       Std Dev          Mean
SEX           N        Scores      Under H0      Under H0         Score
Female      145    16067.5000    16312.5000    463.429146    110.810345
Male         79     9132.5000     8887.5000    463.429146    115.601266
Average Scores Were Used for Ties

Wilcoxon 2-Sample Test (Normal Approximation)
(with Continuity Correction of .5)
S =  9132.50    Z = 0.527589    Prob > |Z| = 0.5978

T-Test Approx. Significance = 0.5983

Kruskal-Wallis Test (Chi-Square Approximation)
CHISQ = 0.27949    DF =  1    Prob > CHISQ = 0.5970

```

SAS/INSIGHT中未提供两独立样本检验的功能。

### 成对总体均值检验

```data new;
set sasuser.gpa;
dmv = satm - satv;
keep dmv;
run;
proc univariate data=new;
var dmv;
run;
```

```                             Univariate Procedure
Variable=DMV
Moments

N               224  Sum Wgts        224
Mean       90.73661  Sum           20325
Std Dev    92.82931  Variance    8617.28
Skewness   -0.10367  Kurtosis   -0.34625
USS         3765875  CSS         1921653
CV         102.3063  Std Mean   6.202419
T:Mean=0   14.62923  Pr>|T|       0.0001
Num ^= 0        215  Num > 0         181
M(Sign)        73.5  Pr>=|M|      0.0001
Sgn Rank     9757.5  Pr>=|S|      0.0001

```

## 回归分析

### 用SAS/INSIGHT进行线性回归分析

• 因变量向量
• 矩阵，一般第一列元素全是1，代表截距项
• 未知参数向量
• 随机误差向量，元素独立且方差为相等的 （未知）。
• 正常情况下，系数的估计为
• 拟合值（或称预报值）为
• 其中 空间内向 的列张成的线性空间 投影的投影算子矩阵，叫做“帽子”矩阵。
• 拟合残差为
• 残差平方和为
• 误差项方差的估计为（要求设计阵 满秩）均方误差（MSE）
• 在线性模型的假设下，若设计阵 满秩， 分别是 的无偏估计，系数估计的方差阵
• 判断回归结果优劣的一个重要指标为复相关系数平方（决定系数） （其中 ），它代表在因变量的变差中用模型能够解释的部分的比例，所以 越大说明模型越好。

```                  WEIGHT    =    HEIGHT      AGE
Response Distribution:  Normal
```

```                              Model Equation
WEIGHT  =   -  141.2238   +    3.5970  HEIGHT    +      1.2784  AGE
```

```                                Summary of Fit
Mean of Response          100.0263  R-Square        0.7729
Root MSE                   11.5111  Adj R-Sq        0.7445
```

```                            Analysis of Variance
Source              DF  Sum of Squares  Mean Square      F Stat    Prob > F
Model                2       7215.6371    3607.8186     27.2275      0.0001
Error               16       2120.0997     132.5062           .           .
C Total             18       9335.7368            .           .           .

```

```                               Type III Tests
Source              DF  Sum of Squares  Mean Square      F Stat    Prob > F
HEIGHT               1       2091.1460    2091.1460     15.7815      0.0011
AGE                  1         22.3880      22.3880      0.1690      0.6865

```

```                             Parameter Estimates
Variable            DF    Estimate   Std Error      T Stat   Prob >|T|
INTERCEPT            1   -141.2238     33.3831     -4.2304      0.0006
HEIGHT               1      3.5970      0.9055      3.9726      0.0011
AGE                  1      1.2784      3.1101      0.4110      0.6865
Parameter Estimates
Tolerance  Var Inflation
.         0.0000
0.3416         2.9276
0.3416         2.9276
```

### 用SAS/INSIGHT拟合广义线性模型

```	各联系函数定义如下：
Identity		恒等变换
Log			自然对数

```

Logit

Probit ，其中 为标准正态分布函数

Comp. Log-Log

Power 在对话框的Power输入框指定。

### 用REG过程进行回归分析

SAS/STAT中提供了几个回归分析过程，包括REG（回归）、RSREG（二次响应面回归）、ORTHOREG（病态数据回归）、NLIN（非线性回归）、TRANSREG（变换回归）、CALIS（线性结构方程和路径分析）、GLM（一般线性模型）、GENMOD（广义线性模型），等等。我们这里只介绍REG过程，其它过程的使用请参考《SAS系统――SAS/STAT软件使用手册》。

REG过程的基本用法为：

```	PROC REG DATA=输入数据集 选项;
VAR  可参与建模的变量列表;
MODEL 因变量＝自变量表 / 选项;
PRINT  输出结果;
PLOT  诊断图形;
RUN;
```

REG过程是交互式过程，在使用了RUN语句提交了若干个过程步语句后可以继续写其它的REG 过程步语句，提交运行，直到提交QUIT语句或开始其它过程步或数据步才终止。

```proc reg data=sasuser.class;
var weight height age;
model weight=height age;
run;

```

```Model: MODEL1
Dependent Variable: WEIGHT     Weight in pounds

Analysis of Variance

Sum of         Mean
Source          DF      Squares       Square      F Value       Prob>F

Model            2   7215.63710   3607.81855       27.228       0.0001
Error           16   2120.09974    132.50623
C Total         18   9335.73684

Root MSE      11.51114     R-square       0.7729
Dep Mean     100.02632     Adj R-sq       0.7445
C.V.          11.50811

Parameter Estimates

Parameter      Standard    T for H0:
Variable  DF      Estimate         Error   Parameter=0    Prob > |T|

INTERCEP   1   -141.223763   33.38309350        -4.230        0.0006
HEIGHT     1      3.597027    0.90546072         3.973        0.0011
AGE        1      1.278393    3.11010374         0.411        0.6865

Variable
Variable  DF     Label

INTERCEP   1  Intercept
HEIGHT     1  Height in inches
AGE        1  Age in years

```

```  model weight=height;
run;

Model: MODEL2
Dependent Variable: WEIGHT     Weight in pounds
……………

```

```  model weight=height age / selection=stepwise;
run;
```

```              Stepwise Procedure for Dependent Variable WEIGHT

Step 1   Variable HEIGHT Entered    R-square = 0.77050684   C(p) =  1.16895797

DF         Sum of Squares      Mean Square          F   Prob>F

Regression       1          7193.24911864    7193.24911864      57.08   0.0001
Error           17          2142.48772347     126.02868962
Total           18          9335.73684211

Parameter        Standard          Type II
Variable         Estimate           Error   Sum of Squares          F   Prob>F

INTERCEP    -143.02691844     32.27459130    2475.04717580      19.64   0.0004
HEIGHT         3.89903027      0.51609395    7193.24911864      57.08   0.0001

Bounds on condition number:            1,            1
------------------------------------------------------------------------------

All variables left in the model are significant at the 0.1500 level.
No other variable met the 0.1500 significance level for entry into the model.

Summary of Stepwise Procedure for Dependent Variable WEIGHT

Variable        Number   Partial    Model

Step   Entered Removed     In      R**2     R**2      C(p)          F   Prob>F
Label

1   HEIGHT               1    0.7705   0.7705    1.1690    57.0763   0.0001
Height in inches
```

REG过程给出的缺省结果比较少。如果要输出高分辨率诊断图形的话需要在PROC REG 过程语句中加上GRAPHICS选项，用PRINT语句和PLOT语句显示额外的结果。为了显示模型的预测值（拟合值）和95％预测界限，使用语句

```print cli;
run;
```

```                  Dep Var   Predict   Std Err  Lower95%  Upper95%
Obs   WEIGHT      Value   Predict   Predict   Predict  Residual

1   84.0000   77.2683     3.963   52.1503     102.4    6.7317
2   98.0000     111.6     2.995   87.0659     136.1  -13.5798
3   90.0000     107.7     2.768   83.2863     132.1  -17.6807
4   77.0000   76.4885     4.042   51.3145     101.7    0.5115
5   84.5000   90.1351     2.889   65.6780     114.6   -5.6351
6     112.0     116.3     3.354   91.5388     141.0   -4.2586
7   50.5000   56.9933     6.251   29.8835   84.1032   -6.4933
8     112.5     100.7     2.577   76.3612     125.0   11.8375
9     102.5     101.8     2.587   77.5263     126.1    0.6678
10     112.5     126.0     4.296     100.6     151.4  -13.5062
11     102.5     104.6     2.645   80.2279     128.9   -2.0615
12     133.0     118.2     3.525   93.3827     143.0   14.7919
13   83.0000   80.3875     3.659   55.4757     105.3    2.6125
14   84.0000     100.7     2.577   76.3612     125.0  -16.6625
15   99.5000   87.0159     3.098   62.4451     111.6   12.4841
16     150.0     137.7     5.613     111.2     164.2   12.2967
17     128.0     109.6     2.872   85.1821     134.1   18.3698
18   85.0000   81.1673     3.587   56.3025     106.0    3.8327
19     112.0     116.3     3.354   91.5388     141.0   -4.2586

Sum of Residuals                      0
Sum of Squared Residuals      2142.4877
Predicted Resid SS (Press)    2651.3521
```

```print clm;
```

plot weight * height / conf95;

```plot residual. * predicted.;
```

```plot rstudent. * obs.;
```

## 方差分析入门

### 用ANOVA过程进行单因素方差分析

```		PROC ANOVA DATA＝数据集;
CLASS  因素;
MODEL  指标＝因素;
RUN;

```

```proc anova data=sasuser.veneer;
class brand;
model wear=brand;
run;
```

```                           Analysis of Variance Procedure
Class Level Information
Class    Levels    Values
BRAND         5    ACME AJAX CHAMP TUFFY XTRA
Number of observations in data set = 20

Dependent Variable: WEAR   Amount of material worn away

Source                  DF      Sum of Squares        Mean Square  F Value    Pr > F
Model                    4          0.61700000         0.15425000     7.40    0.0017
Error                   15          0.31250000         0.02083333
Corrected Total         19          0.92950000

R-Square                C.V.           Root MSE          WEAR Mean
0.663798            6.155120         0.14433757         2.34500000

Source                  DF            Anova SS        Mean Square  F Value    Pr > F
BRAND                    4          0.61700000         0.15425000     7.40    0.0017
```

### 用NPAR1WAY进行非参数单因素方差分析

NPAR1WAY过程的调用与ANOVA过程不同，因为它是单因素方差分析过程，所以只要用CLASS 语句给出分类变量（因素），用VAR语句给出指标就可以了，一般格式为：

```			PROC NPAR1WAY DATA=数据集 WILCOXON;
CLASS  因素;
VAR  指标;
RUN;

```

```proc npar1way data=sasuser.veneer wilcoxon;
class brand;
var wear;
run;
```

```                         N P A R 1 W A Y  P R O C E D U R E
Wilcoxon Scores (Rank Sums) for Variable WEAR
Classified by Variable BRAND

Sum of       Expected        Std Dev           Mean
BRAND          N         Scores       Under H0       Under H0          Score

ACME           4           40.0           42.0     10.4830691     10.0000000
CHAMP          4           44.0           42.0     10.4830691     11.0000000
AJAX           4           12.0           42.0     10.4830691      3.0000000
TUFFY          4           69.0           42.0     10.4830691     17.2500000
XTRA           4           45.0           42.0     10.4830691     11.2500000

Average Scores Were Used for Ties

Kruskal-Wallis Test (Chi-Square Approximation)
CHISQ =  11.982      DF =  4      Prob > CHISQ = 0.0175

```

### 多重比较

```		MEANS  因素 / 选项;
```

```proc anova data=sasuser.veneer;
class brand;
model wear=brand;
means brand;
run;
```

```                    Level of       -------------WEAR------------
BRAND      N       Mean              SD

ACME       4     2.32500000       0.17078251
AJAX       4     2.05000000       0.12909944
CHAMP      4     2.37500000       0.17078251
TUFFY      4     2.60000000       0.14142136
XTRA       4     2.37500000       0.09574271
```

```  means brand / t;
run;
```

```                          T tests (LSD) for variable: WEAR

NOTE: This test controls the type I comparisonwise error rate not the
experimentwise error rate.

Alpha= 0.05  df= 15  MSE= 0.020833
Critical Value of T= 2.13
Least Significant Difference= 0.2175

Means with the same letter are not significantly different.

T Grouping              Mean      N  BRAND

A            2.6000      4  TUFFY

B            2.3750      4  XTRA
B
B            2.3750      4  CHAMP
B
B            2.3250      4  ACME

C            2.0500      4  AJAX

```

Bonferroni t检验通过把每次比较的错误率取得很小来控制总误差率。比如，共有10次比较时，把每次比较的错误率控制在0.005就可以保证总错误率不超过0.05，但是，这样得到的实际总第一类错误率可能要比预定的水平小得多。在MEANS语句中使用BON语句可以执行Bonferroni t检验，缺省总错误率控制水平为0.05。对上面的胶合板数据增加如下语句：

```  means brand / bon;
run;
```

```                    Bonferroni (Dunn) T tests for variable: WEAR

NOTE: This test controls the type I experimentwise error rate, but
generally has a higher type II error rate than REGWQ.

Alpha= 0.05  df= 15  MSE= 0.020833
Critical Value of T= 3.29
Minimum Significant Difference= 0.3354

Means with the same letter are not significantly different.

Bon Grouping              Mean      N  BRAND

A            2.6000      4  TUFFY
A
B       A            2.3750      4  XTRA
B       A
B       A            2.3750      4  CHAMP
B       A
B       A            2.3250      4  ACME
B
B                    2.0500      4  AJAX
```

```  means brand/ regwq;
run;
```

```          Ryan-Einot-Gabriel-Welsch Multiple Range Test for variable: WEAR

NOTE: This test controls the type I experimentwise error rate.

Alpha= 0.05  df= 15  MSE= 0.020833

Number of Means         2         3         4         5
Critical Range   0.264828 0.2917322 0.2941581   0.31516

Means with the same letter are not significantly different.

REGWQ Grouping              Mean      N  BRAND

A            2.6000      4  TUFFY
A
A            2.3750      4  XTRA
A
A            2.3750      4  CHAMP
A
A            2.3250      4  ACME

B            2.0500      4  AJAX

```

MEANS语句的选项可以同时使用。在MEANS语句中可以用ALPHA=水平值来指定检验的水平。ANOVA过程中还提供了其它的多重比较方法，请自己参考有关资料。

### 多因素方差分析

SAS提供了若干个方差分析过程，可以考虑多个因素、有交互作用、有嵌套等情况的方差分析。用GLM过程还可以用一般线性模型来处理方差分析问题。在这里我们只介绍如何用ANOVA 过程进行均衡设计的多因素方差分析。

 B:氧化锌 A:促进剂 1 2 3 4 1 31, 33 34, 36 35, 36 39, 38 2 33, 34 36, 37 37, 39 38, 41 3 35, 37 37, 38 39, 40 42, 44

```data rubber;
input A B STREN;
cards;
1 1 31
1 1 33
1 2 34
1 2 36
1 3 35
1 3 36
1 4 39
1 4 38
2 1 33
……………
;
run;
```

```data rubber;
do  a=1 to 3;
do  b=1 to 4;
do  r=1 to 2;
input stren @@;
output;
end;
end;
end;
cards;
31 33  34 36  35 36  39 38
33 34  36 37  37 39  38 41
35 37  37 38  39 40  42 44
;
run;
```

```proc anova data=rubber;
class a b;
model  stren = a b a*b;
run;
```

MODEL语句中中A表示因素A的主效应，B表示因素B的主效应，A*B表示A和B的交互作用。结果如下：

```                         Analysis of Variance Procedure
Class Level Information
Class    Levels    Values
A             3    1 2 3
B             4    1 2 3 4
Number of observations in data set = 24

Dependent Variable: STREN
Sum of            Mean

Source                  DF          Squares          Square   F Value     Pr > F
Model                   11     193.45833333     17.58712121     12.06     0.0001
Error                   12      17.50000000      1.45833333
Corrected Total         23     210.95833333
R-Square             C.V.        Root MSE           STREN Mean
0.917045         3.260152       1.2076147            37.041667

Source                  DF         Anova SS     Mean Square   F Value     Pr > F
A                        2      56.58333333     28.29166667     19.40     0.0002
B                        3     132.12500000     44.04166667     30.20     0.0001
A*B                      6       4.75000000      0.79166667      0.54     0.7665
```

```proc anova data=rubber;
class a b;
model  stren = a b / t;
run;

```

```  means a b;
run;
```

ANOVA也可以用来分析正交设计的结果。例如，为了提高某种试剂产品的收率（指标），考虑如下几个因素对其影响：

 A：反应温度 1 (50℃) 2 (70℃) B：反应时间 1 (1小时) 2 (2小时) C：硫酸浓度 1 (17%) 2 (27%) D：硫酸产地 1 (天津) 2 (上海) E：操作方式 1 (搅拌) 2 (不搅拌)

```data  exp;
input temp time conc manu mix  prod;
cards;
1 1 1 1 1  65
1 1 1 2 2  74
1 2 2 1 2  71
1 2 2 2 1  73
2 1 2 1 2  70
2 1 2 2 1  73
2 2 1 1 1  62
2 2 1 2 2  69
;
run;
proc anova data=exp;
class temp time conc manu mix;
model prod = temp--mix;
means temp--mix / t;
run;
```

```Source                  DF         Anova SS     Mean Square   F Value     Pr > F
TEMP                     1      10.12500000     10.12500000     16.20     0.0565
TIME                     1       6.12500000      6.12500000      9.80     0.0887
CONC                     1      36.12500000     36.12500000     57.80     0.0169
MANU                     1      55.12500000     55.12500000     88.20     0.0111
MIX                      1      15.12500000     15.12500000     24.20     0.0389
```

## 列联表分析

### 列联表的输入与制表

 男生 女生 本地 4 6 外地 14 7

```data class;
input sno sex \$ from \$;
label sex='性别' from='来源';
cards;
1  男  本地
2  女  外地
3  男  外地
…………/* 所有学生的记录 */
;
run;
```

```proc freq data=class;
tables from * sex;
run;
```

```data classt;
input from \$ sex \$ numcell;
label sex='性别' from='来源';
cards;

;
run;

```

```proc freq data=classt;
tables from * sex;
weight numcell;
run;
```

```proc freq data=classt;
tables from * sex / nopct norow nocol;
weight numcell;
run;
```

### 列联表独立性检验

```例如，为了探讨吸烟与慢性支气管炎有无关系，调查了339人，情况如下：

```
 患慢性支气管炎 未患慢性支气管炎 吸烟 43 162 不吸烟 13 121

: X与Y相互独立

```data bron;
input smoke \$ bron \$ numcell;
label smoke='吸烟'  bron='慢性支气管炎';
cards;

;
proc freq data=bron;
tables smoke*bron / nopct norow nocol  chisq expected;
weight numcell;
run;
```

```                            TABLE OF SMOKE BY BRON

SMOKE(吸烟)     BRON(慢性支气管炎)

Frequency|
Expected |患病    |未患    |  Total
---------+--------+--------+
不吸烟   |     13 |    121 |    134
| 22.136 | 111.86 |
---------+--------+--------+
吸烟     |     43 |    162 |    205
| 33.864 | 171.14 |
---------+--------+--------+
Total          56      283      339

STATISTICS FOR TABLE OF SMOKE BY BRON

Statistic                     DF     Value        Prob
------------------------------------------------------
Chi-Square                     1     7.469       0.006
Likelihood Ratio Chi-Square    1     7.925       0.005
Continuity Adj. Chi-Square     1     6.674       0.010
Mantel-Haenszel Chi-Square     1     7.447       0.006
Fisher's Exact Test (Left)                    4.09E-03
(Right)                      0.998
(2-Tail)                  6.86E-03
Phi Coefficient                     -0.148
Contingency Coefficient              0.147
Cramer's V                          -0.148

Sample Size = 339

```

### 属性变量关联度计算

```data cows;
input herdsize  disease  numcell;
label herdsize='牛群大小'  disease='患病程度';
cards;
1 0 9
1 1 5
1 2 9
2 0 18
2 1 4
2 2 19
3 0 11
3 1 88
3 2 136

```

;

```proc freq data=cows;
tables herdsize*disease / measures expected nopercent norow nocol;
weight numcell;
title '奶牛疾病数据分析';
run;
```

```                 HERDSIZE(牛群大小)     DISEASE(患病程度)

Frequency|
Expected |       0|       1|       2|  Total
---------+--------+--------+--------+
1 |      9 |      5 |      9 |     23
| 2.9231 | 7.4615 | 12.615 |
---------+--------+--------+--------+
2 |     18 |      4 |     19 |     41
| 5.2107 | 13.301 | 22.488 |
---------+--------+--------+--------+
3 |     11 |     88 |    136 |    235
| 29.866 | 76.237 |  128.9 |
---------+--------+--------+--------+
Total          38       97      164      299

STATISTICS FOR TABLE OF HERDSIZE BY DISEASE

Statistic                            Value        ASE
------------------------------------------------------
Gamma                                0.411       0.101
Kendall's Tau-b                      0.217       0.061
Stuart's Tau-c                       0.148       0.044

Somers' D C|R                        0.276       0.078
Somers' D R|C                        0.171       0.048

Pearson Correlation                  0.282       0.066
Spearman Correlation                 0.233       0.066

Lambda Asymmetric C|R                0.000       0.000
Lambda Asymmetric R|C                0.109       0.079
Lambda Symmetric                     0.035       0.026

Uncertainty Coefficient C|R          0.099       0.026
Uncertainty Coefficient R|C          0.144       0.037
Uncertainty Coefficient Symmetric    0.117       0.030
Sample Size = 299
```