# Association Rules: A Comparison of R and SAS

| Algorithm | R/arules | SAS/EM |
|-----------|----------|--------|
| Apriori   | Yes      | Yes    |
| ECLAT     | Yes      | No     |
| FP-Growth | No       | No     |

The R code comes mainly from "R and Data Mining"; I only added the code that downloads the data and the explanatory notes on the code.

1) Download the Titanic data

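setInternet2(TRUE) # only relevant on older Windows builds of R; drop this line if your R version no longer provides it
con <- url("http://www.rdatamining.com/data/titanic.raw.rdata")
load(con) # loads the data frame titanic.raw into the workspace
close(con) # url() always opens the connection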
str(titanic.raw)

2) Association analysis

library(arules)
# find association rules with default settings
rules <- apriori(titanic.raw)
inspect(rules)

3) Keep only the rules whose right-hand side contains the survival variable

# rules with rhs containing "Survived" only
rules <- apriori(titanic.raw,
                 parameter = list(minlen=2, supp=0.005, conf=0.8),
                 appearance = list(rhs=c("Survived=No", "Survived=Yes"), default="lhs"),
                 control = list(verbose=FALSE))
rules.sorted <- sort(rules, by="lift")
inspect(rules.sorted)

In total, R generates 12 rules related to survival:
   lhs                                    rhs            support     confidence lift
1  {Class=2nd, Age=Child}              => {Survived=Yes} 0.010904134 1.0000000  3.095640
2  {Class=2nd, Sex=Female, Age=Child}  => {Survived=Yes} 0.005906406 1.0000000  3.095640
3  {Class=1st, Sex=Female}             => {Survived=Yes} 0.064061790 0.9724138  3.010243
4  {Class=1st, Sex=Female, Age=Adult}  => {Survived=Yes} 0.063607451 0.9722222  3.009650
5  {Class=2nd, Sex=Male, Age=Adult}    => {Survived=No}  0.069968196 0.9166667  1.354083
6  {Class=2nd, Sex=Female}             => {Survived=Yes} 0.042253521 0.8773585  2.715986
7  {Class=Crew, Sex=Female}            => {Survived=Yes} 0.009086779 0.8695652  2.691861
8  {Class=Crew, Sex=Female, Age=Adult} => {Survived=Yes} 0.009086779 0.8695652  2.691861
9  {Class=2nd, Sex=Male}               => {Survived=No}  0.069968196 0.8603352  1.270871
10 {Class=2nd, Sex=Female, Age=Adult}  => {Survived=Yes} 0.036347115 0.8602151  2.662916
11 {Class=3rd, Sex=Male, Age=Adult}    => {Survived=No}  0.175829169 0.8376623  1.237379
12 {Class=3rd, Sex=Male}               => {Survived=No}  0.191731031 0.8274510  1.222295
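
The three measures in this output can be checked by hand. The following is a minimal sketch for the rule {Class=2nd, Age=Child} => {Survived=Yes}. It assumes that R's built-in `Titanic` contingency table describes the same 2201 people as `titanic.raw`; the variable names (`counts`, `in_lhs`, and so on) are illustrative and not taken from the book.

```r
# Built-in 4-way contingency table: Class x Sex x Age x Survived
counts <- as.data.frame(Titanic)   # columns: Class, Sex, Age, Survived, Freq
n <- sum(counts$Freq)              # total number of transactions (people)

# Cells matching the left-hand side and the right-hand side of the rule
in_lhs <- counts$Class == "2nd" & counts$Age == "Child"
in_rhs <- counts$Survived == "Yes"

n_lhs  <- sum(counts$Freq[in_lhs])           # people matching the lhs
n_rhs  <- sum(counts$Freq[in_rhs])           # people matching the rhs
n_both <- sum(counts$Freq[in_lhs & in_rhs])  # people matching both

support    <- n_both / n                # share of all records covered by the rule
confidence <- n_both / n_lhs            # share of lhs records that also match the rhs
lift       <- confidence / (n_rhs / n)  # confidence relative to the base survival rate

print(c(support = support, confidence = confidence, lift = lift))
```

A lift above 1 means the left-hand side makes survival more likely than the overall survival rate; a lift below 1 means the opposite.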

4) Remove redundant rules

# find redundant rules
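# sparse = FALSE asks for a dense logical matrix so that NA can be assigned
# (recent arules versions return a sparse matrix by default)
subset.matrix <- is.subset(rules.sorted, rules.sorted, sparse = FALSE)
subset.matrix[lower.tri(subset.matrix, diag=TRUE)] <- NA
redundant <- colSums(subset.matrix, na.rm=TRUE) >= 1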
which(redundant)

# remove redundant rules
rules.pruned <- rules.sorted[!redundant]
inspect(rules.pruned)

  lhs                                 rhs            support     confidence lift
1 {Class=2nd, Age=Child}           => {Survived=Yes} 0.010904134 1.0000000  3.095640
2 {Class=1st, Sex=Female}          => {Survived=Yes} 0.064061790 0.9724138  3.010243
3 {Class=2nd, Sex=Female}          => {Survived=Yes} 0.042253521 0.8773585  2.715986
4 {Class=Crew, Sex=Female}         => {Survived=Yes} 0.009086779 0.8695652  2.691861
5 {Class=2nd, Sex=Male, Age=Adult} => {Survived=No}  0.069968196 0.9166667  1.354083
6 {Class=2nd, Sex=Male}            => {Survived=No}  0.069968196 0.8603352  1.270871
7 {Class=3rd, Sex=Male, Age=Adult} => {Survived=No}  0.175829169 0.8376623  1.237379
8 {Class=3rd, Sex=Male}            => {Survived=No}  0.191731031 0.8274510  1.222295
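
The pruning step treats a rule as redundant when a rule ranked above it (here: with higher lift) uses a subset of its items. Below is a minimal base-R sketch of that upper-triangle logic on four of the rules from step 3. Plain character vectors stand in for arules rule objects, and the `items` list and `is_sub` helper are illustrative only.

```r
# Four rules in lift order, each written as the union of its lhs and rhs items
items <- list(
  c("Class=2nd", "Age=Child", "Survived=Yes"),
  c("Class=2nd", "Sex=Female", "Age=Child", "Survived=Yes"),
  c("Class=1st", "Sex=Female", "Survived=Yes"),
  c("Class=1st", "Sex=Female", "Age=Adult", "Survived=Yes")
)
n <- length(items)

# is_sub(i, j) is TRUE when every item of rule i also appears in rule j
is_sub <- Vectorize(function(i, j) all(items[[i]] %in% items[[j]]))
subset.matrix <- outer(seq_len(n), seq_len(n), is_sub)

# Keep only the entries above the diagonal: comparisons of a rule (column)
# with the rules ranked before it (rows)
subset.matrix[lower.tri(subset.matrix, diag = TRUE)] <- NA

# A rule is redundant if at least one higher-ranked rule is a subset of it
redundant <- colSums(subset.matrix, na.rm = TRUE) >= 1
print(redundant)
```

In this toy input the second and fourth rules are flagged, because each only adds an item to the rule directly above it without improving the lift.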

5) Interpreting the results

1 {Class=2nd, Age=Child}              => {Survived=Yes} 0.010904134  1.0000000 3.095640
2 {Class=1st, Sex=Female}           => {Survived=Yes} 0.064061790  0.9724138 3.010243
3 {Class=2nd, Sex=Female}          => {Survived=Yes} 0.042253521  0.8773585 2.715986
4 {Class=Crew, Sex=Female}       => {Survived=Yes} 0.009086779  0.8695652 2.691861

rules <- apriori(titanic.raw,
                 parameter = list(minlen=3, supp=0.002, conf=0.2),
                 appearance = list(rhs=c("Survived=Yes"),
                                   lhs=c("Class=1st", "Class=2nd", "Class=3rd", "Age=Child", "Age=Adult"),
                                   default="none"),
                 control = list(verbose=FALSE))
rules.sorted <- sort(rules, by="confidence")
inspect(rules.sorted)

lhs                        rhs           support     confidence lift
1 {Class=2nd, Age=Child} => {Survived=Yes} 0.010904134 1.0000000 3.0956399
2 {Class=1st, Age=Child} => {Survived=Yes} 0.002726034 1.0000000 3.0956399
3 {Class=1st, Age=Adult} => {Survived=Yes} 0.089504771 0.6175549 1.9117275
4 {Class=2nd, Age=Adult} => {Survived=Yes} 0.042707860 0.3601533 1.1149048
5 {Class=3rd, Age=Child} => {Survived=Yes} 0.012267151 0.3417722 1.0580035
6 {Class=3rd, Age=Adult} => {Survived=Yes} 0.068605179 0.2408293 0.7455209

6) Visualization

# visualize rules
library(arulesViz)
plot(rules)
plot(rules, method="graph", control=list(type="items"))
plot(rules, method="paracoord", control=list(reorder=TRUE))

1) Download the Titanic data (SAS)

proc iml;
submit /R;
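setInternet2(TRUE) # only relevant on older Windows builds of R; drop this line if your R version no longer provides it
con <- url("http://www.rdatamining.com/data/titanic.raw.rdata")
load(con) # loads the data frame titanic.raw into the R session
close(con) # url() always opens the connection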
endsubmit;

call ImportDataSetFromR(“Work.titanic”, “titanic.raw”);
run;quit;

2) Convert the data to the format required by SAS/EM

data items2;
set titanic;
length tid 8;
length item $8;
tid = _n_;
item = class;
output;
item = sex;
output;
item = age;
output;
item = survived;
output;
keep tid item;
run;
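
For comparison, the same wide-to-long reshaping (one row per transaction ID and item) can be sketched in base R. The two-row `wide` data frame below is made up for illustration, and the names `wide` and `long` are not part of the original code.

```r
# Two illustrative passengers in the wide layout used by titanic.raw
wide <- data.frame(
  Class    = c("1st", "3rd"),
  Sex      = c("Female", "Male"),
  Age      = c("Adult", "Child"),
  Survived = c("Yes", "No"),
  stringsAsFactors = FALSE
)

# One output row per (transaction, item) pair, like the SAS data step:
# t() puts each passenger in a column, so as.vector() lists all items of
# passenger 1 first, then all items of passenger 2
long <- data.frame(
  tid  = rep(seq_len(nrow(wide)), each = ncol(wide)),
  item = as.vector(t(as.matrix(wide))),
  stringsAsFactors = FALSE
)
print(long)
```

Each passenger contributes four rows, which matches the four `output` statements in the data step.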

3) Association analysis

proc dmdb data=items2 dmdbcat=dbcat;
class tid item;
run; quit;

proc assoc data=items2 dmdbcat=dbcat pctsup=0.5 out=frequentItems;
id tid;
target item;
run;

proc rulegen in=frequentItems dmdbcat=dbcat out=rules minconf=80;
run ;

proc sort data=rules;
by descending conf;
run ;

4) Keep only the rules whose right-hand side contains the survival variable

data surviverules;
set rules(where=(set_size>1 and (_rhand='Yes' or _rhand='No')));
run;

proc print data=surviverules;
var conf support lift rule ;
run ;

SAS results:

OBS CONF SUPPORT LIFT RULE
1 100.00 1.09 3.10 2nd & Child ==> Yes
2 100.00 0.59 3.10 2nd & Child & Female ==> Yes
3 100.00 0.50 3.10 2nd & Child & Male ==> Yes
4 97.24 6.41 3.01 1st & Female ==> Yes
5 97.22 6.36 3.01 1st & Adult & Female ==> Yes
6 91.67 7.00 1.35 2nd & Adult & Male ==> No
7 87.74 4.23 2.72 2nd & Female ==> Yes
8 86.96 0.91 2.69 Crew & Female ==> Yes
9 86.96 0.91 2.69 Adult & Crew & Female ==> Yes
10 86.03 7.00 1.27 2nd & Male ==> No
11 86.02 3.63 2.66 2nd & Adult & Female ==> Yes
12 83.77 17.58 1.24 3rd & Adult & Male ==> No
13 82.75 19.17 1.22 3rd & Male ==> No

http://support.sas.com/documentation/onlinedoc/miner/em43/assoc.pdf
http://support.sas.com/documentation/onlinedoc/miner/em43/sequence.pdf
http://support.sas.com/documentation/onlinedoc/miner/em43/rulegen.pdf

mbscore (prediction on market basket data; a procedure newly introduced with EM 6.1/SAS 9.2 that supports hierarchical association)

 3 100 0.5 3.1 2nd & Child & Male ==> Yes

data min_support;
set frequentItems;
if count=int(2201*0.005);
run;

proc print data=min_support;
run;quit;

OBS SET_SIZE COUNT ITEM1 ITEM2 ITEM3 ITEM4 ITEM5 ITEM6
1 3 11 2nd Child Male
2 4 11 2nd Child Male Yes

1) How can a PMML file be imported into R to create a rules object?
2) How can PMML be used in SAS EMM?