# Algorithm 3: The Naive Bayes Algorithm

P(B|A) = P(AB) / P(A)

P(A) is the prior (or marginal) probability of A. It is called a "prior" because it does not take any information about B into account.

P(A|B) is the conditional probability of A given that B has occurred (plainly: B first, then A). Because it is derived from the value of B, it is also called the posterior probability of A.

P(B|A) is the conditional probability of B given that A has occurred (plainly: A first, then B). Because it is derived from the value of A, it is also called the posterior probability of B.

P(B) is the prior (or marginal) probability of B; it also serves as a normalizing constant.

Naive Bayes additionally assumes that the components of B = (b1, b2, …, bn) are conditionally independent given A, so that:

P(B|A) = P(b1|A) * P(b2|A) * … * P(bn|A)
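As a quick numeric illustration of the two formulas above (all numbers are made up for the example):

```r
# Bayes' rule: P(B|A) = P(AB) / P(A)
p_a  <- 0.4          # assumed prior P(A)
p_ab <- 0.1          # assumed joint probability P(AB)
p_b_given_a <- p_ab / p_a
p_b_given_a          # 0.25

# The naive independence assumption factorizes P(B|A) over the
# components of B: P(B|A) = P(b1|A) * P(b2|A) * P(b3|A)
p_components <- c(0.5, 0.2, 0.25)   # assumed P(b1|A), P(b2|A), P(b3|A)
prod(p_components)                  # 0.025
```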


data <- data[, -1]  # drop the date column, which has no value as a predictor

prior.yes <- sum(data[, 5] == "Yes") / length(data[, 5])
prior.no  <- sum(data[, 5] == "No")  / length(data[, 5])

bayespre <- function(condition) {
  post.yes <-
    sum((data[, 1] == condition[1]) & (data[, 5] == "Yes")) / sum(data[, 5] == "Yes") *
    sum((data[, 2] == condition[2]) & (data[, 5] == "Yes")) / sum(data[, 5] == "Yes") *
    sum((data[, 3] == condition[3]) & (data[, 5] == "Yes")) / sum(data[, 5] == "Yes") *
    sum((data[, 4] == condition[4]) & (data[, 5] == "Yes")) / sum(data[, 5] == "Yes") *
    prior.yes
  post.no <-
    sum((data[, 1] == condition[1]) & (data[, 5] == "No")) / sum(data[, 5] == "No") *
    sum((data[, 2] == condition[2]) & (data[, 5] == "No")) / sum(data[, 5] == "No") *
    sum((data[, 3] == condition[3]) & (data[, 5] == "No")) / sum(data[, 5] == "No") *
    sum((data[, 4] == condition[4]) & (data[, 5] == "No")) / sum(data[, 5] == "No") *
    prior.no
  return(list(prob.yes = post.yes,
              prob.no = post.no,
              prediction = ifelse(post.yes >= post.no, "Yes", "No")))
}

bayespre(c("Rain", "Hot", "High", "Strong"))
bayespre(c("Sunny", "Mild", "Normal", "Weak"))
bayespre(c("Overcast", "Mild", "Normal", "Weak"))

> bayespre(c("Rain", "Hot", "High", "Strong"))
$prob.yes
[1] 0.005291005
$prob.no
[1] 0.02742857
$prediction
[1] "No"

> bayespre(c("Sunny", "Mild", "Normal", "Weak"))
$prob.yes
[1] 0.02821869
$prob.no
[1] 0.006857143
$prediction
[1] "Yes"

> bayespre(c("Overcast", "Mild", "Normal", "Weak"))
$prob.yes
[1] 0.05643739
$prob.no
[1] 0
$prediction
[1] "Yes"

The same calculation extends to more than two classes. The calls below come from a multi-class variant of bayespre applied to an animals data set (its code follows the same pattern and is not shown):

> bayespre(animals, c("no", "yes", "no", "sometimes", "yes"))
$prob.mammals
[1] 0
$prob.amphibians
[1] 0.1
$prob.fishes
[1] 0
$prob.reptiles
[1] 0.0375
$prediction
[1] amphibians
Levels: amphibians birds fishes mammals reptiles

> bayespre(animals, c("no", "yes", "no", "yes", "no"))
$prob.mammals
[1] 0.0004997918
$prob.amphibians
[1] 0
$prob.fishes
[1] 0.06666667
$prob.reptiles
[1] 0
$prediction
[1] fishes
Levels: amphibians birds fishes mammals reptiles

> bayespre(animals, c("yes", "no", "no", "yes", "no"))
$prob.mammals
[1] 0.0179925
$prob.amphibians
[1] 0
$prob.fishes
[1] 0.01666667
$prob.reptiles
[1] 0
$prediction
[1] mammals
Levels: amphibians birds fishes mammals reptiles

> bayespre(c("foggy", "Hot", "High", "Strong"))
$prob.yes
[1] 0
$prob.no
[1] 0
$prediction
[1] "Yes"

Because the value "foggy" never appears in the training data, both posteriors collapse to zero and the ifelse tie-break arbitrarily reports "Yes". This zero-frequency problem is what the m-estimate addresses:

P(xi|yj) = (nc + m*p) / (n + m)

Here n is the total number of training samples in class yj, nc is the number of those samples that take the value xi, m is a parameter called the equivalent sample size, and p is a user-specified parameter. With no training data (n = 0), P(xi|yj) = p, so p can be viewed as the prior probability of observing the attribute value xi among samples of class yj. The equivalent sample size m controls the balance between this prior and the observed frequency nc/n, making the estimate more robust.
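A minimal sketch of the estimator (the function and argument names are illustrative, not part of the original code):

```r
# m-estimate of a conditional probability: (nc + m*p) / (n + m)
# nc: count of value xi within class yj, n: class sample size,
# m: equivalent sample size, p: prior for xi (e.g. 1/k for k possible values)
mest <- function(nc, n, m, p) (nc + m * p) / (n + m)

# With no training data the estimate falls back to the prior p:
mest(0, 0, 3, 1/3)   # 1/3

# With data, it shrinks the observed frequency nc/n toward p:
mest(3, 8, 3, 1/3)   # (3 + 1) / (8 + 3) = 4/11
```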

The Naive Bayes implementation in R

The naiveBayes function in R's e1071 package provides a concrete implementation of naive Bayes; its usage is:

## S3 method for class 'formula'
naiveBayes(formula, data, laplace = 0, ..., subset, na.action = na.pass)

## Default S3 method:
naiveBayes(x, y, laplace = 0, ...)

data(Titanic)

m <- naiveBayes(Survived ~ ., data = Titanic)

m
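Assuming the e1071 package is installed, the fitted model m can then be used with predict(), mirroring the example in the package documentation; type = "raw" returns the posterior probabilities instead of the class labels:

```r
library(e1071)

data(Titanic)
m <- naiveBayes(Survived ~ ., data = Titanic)

# Predicted classes for the first few attribute combinations
newdata <- as.data.frame(Titanic)[1:4, 1:3]
predict(m, newdata)

# Posterior probabilities rather than class labels
predict(m, newdata, type = "raw")
```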

Text-processing tools in R

tm_map(x, FUN, ..., useMeta = FALSE, lazy = FALSE)

The Dictionary() function is commonly used when presenting relevant terms in text mining. When a dictionary is passed to DocumentTermMatrix(), the resulting matrix counts, for each document, the frequency of the terms listed in the dictionary. (There is an example later, so no more on it here.)

strsplit(x, split, fixed = FALSE, perl = FALSE, useBytes = FALSE)

x: a character vector; each element is split separately.

split: a character vector giving the positions to split at; by default it is interpreted as a regular expression (fixed = FALSE), while fixed = TRUE requests exact (literal) text matching instead.

perl: whether to use Perl-compatible regular expression syntax.
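A brief base-R illustration of the split behaviour (the example strings are made up); the first pattern is the same punctuation regex used later when tokenizing e-mails:

```r
# Regex split on comma, question mark, semicolon, or exclamation mark
unlist(strsplit("Hello, world! How are you?", ",|\\?|;|!"))

# Literal (non-regex) matching with fixed = TRUE; without it, "." would
# be interpreted as the regex "any character"
unlist(strsplit("a.b.c", ".", fixed = TRUE))   # "a" "b" "c"
```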

for each training document:
    for each class:
        if a term appears in the document: increment that term's count and the total term count
for each class:
    for each term:
        prob = term count / total term count
return prob
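The pseudocode above can be sketched in R on a toy corpus (train_nb, docs, and labels are illustrative names, not part of the original code); Laplace smoothing stands in for the bare frequency in the final step:

```r
# docs: a list of tokenized documents; labels: each document's class
docs <- list(c("control", "chart", "run"),
             c("run", "length"),
             c("main", "effect", "factor"))
labels <- c("spc", "spc", "doe")

train_nb <- function(docs, labels) {
  vocab <- unique(unlist(docs))
  probs <- list()
  for (cls in unique(labels)) {
    words <- unlist(docs[labels == cls])            # all terms of this class
    counts <- table(factor(words, levels = vocab))  # per-term counts
    # Laplace-smoothed term probabilities: (nc + 1) / (n + |vocab|)
    probs[[cls]] <- (counts + 1) / (length(words) + length(vocab))
  }
  probs
}

p <- train_nb(docs, labels)
p$spc[["chart"]]   # smoothed P("chart" | spc) = (1 + 1) / (5 + 7)
```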

| docId | key words | class |
| --- | --- | --- |
| 1 | "Adaptive weighting", "run length", "control chart" | spc |
| 2 | "run length", "control chart" | spc |
| 3 | "control chart", "EWMA", "run length" | spc |

(The doe training documents used in the computation below are not shown.)

P("control chart" | spc) = (3+1)/(8+7) = 4/15

P("main effect" | spc) = (0+1)/(8+7) = 1/15

P("control chart" | doe) = (0+1)/(7+3) = 0.1

P(spc | d) = 4/15 * 4/15 * 1/15 * 8/11 ≈ 0.003447811

P(doe | d) = 0.1 * 0.1 * 0.2 * 0.1 * 3/11 ≈ 5.454545e-05
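These two products can be checked directly in R:

```r
# Unnormalized posteriors from the worked example above
p_spc <- 4/15 * 4/15 * 1/15 * 8/11
p_doe <- 0.1 * 0.1 * 0.2 * 0.1 * 3/11

p_spc            # ≈ 0.003447811
p_doe            # ≈ 5.454545e-05
p_spc > p_doe    # TRUE: document d is classified as spc
```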

R code:

1. Build the bag of words:

library(tm)

txt1 <- "D:/R/data/email/ham"
txtham <- Corpus(DirSource(txt1))  # read the ham corpus from disk
txtham <- tm_map(txtham, stripWhitespace)
txtham <- tm_map(txtham, tolower)
txtham <- tm_map(txtham, removeWords, stopwords("english"))
txtham <- tm_map(txtham, stemDocument)

txt2 <- "D:/R/data/email/spam"
txtspam <- Corpus(DirSource(txt2))  # read the spam corpus from disk
txtspam <- tm_map(txtspam, stripWhitespace)
txtspam <- tm_map(txtspam, tolower)
txtspam <- tm_map(txtspam, removeWords, stopwords("english"))
txtspam <- tm_map(txtspam, stemDocument)

2. Word counts (number of distinct terms and total word count per class):

dtm1 <- DocumentTermMatrix(txtham)
n1 <- length(findFreqTerms(dtm1, 1))   # number of distinct ham terms
dtm2 <- DocumentTermMatrix(txtspam)
n2 <- length(findFreqTerms(dtm2, 1))   # number of distinct spam terms

setwd("D:/R/data/email/spam")
name <- list.files(txt2)
data1 <- paste("spam", 1:23)
lenspam <- 0
for (i in 1:length(name)) {
  assign(data1[i], scan(name[i], "character"))
  lenspam <- lenspam + length(get(data1[i]))   # total spam word count
}

setwd("D:/R/data/email/ham")
names <- list.files(txt1)
data <- paste("ham", 1:23)
lenham <- 0
for (i in 1:length(names)) {
  assign(data[i], scan(names[i], "character"))
  lenham <- lenham + length(get(data[i]))      # total ham word count
}

3. Build the naive Bayes model (using the m-estimate with p = 1/m, where m is the vocabulary size):

prob <- function(char, corp, len, n) {
  d <- Dictionary(char)
  re <- as.matrix(DocumentTermMatrix(corp, list(dictionary = d)))
  # m-estimate with m = n and p = 1/n: (nc + 1) / (n + len)
  prob <- (sum(re[, 1]) + 1) / (n + len)
  return(prob)
}

testingNB <- function(sentences) {
  pro1 <- 0.5   # prior P(ham)
  pro2 <- 0.5   # prior P(spam)
  for (i in 1:length(sentences)) {
    pro1 <- pro1 * prob(sentences[i], txtham, lenham, n1)
  }
  for (i in 1:length(sentences)) {
    pro2 <- pro2 * prob(sentences[i], txtspam, lenspam, n2)
  }
  return(list(prob.ham = pro1,
              prob.spam = pro2,
              prediction = ifelse(pro1 >= pro2 / 10, "ham", "spam")))
}

4. Testing (with the 4 e-mails in the test folder; only ham2.txt and spam1.txt are shown):

# Read the document, then tokenize and stem it
email <- scan("D:/R/data/email/test/ham2.txt", "character")
sentences <- unlist(strsplit(email, ",|\\?|\\;|\\!"))   # tokenize
library(Snowball)   # stemming
a <- tolower(SnowballStemmer(sentences))   # stem and drop case
# Run the classifier
testingNB(a)

Result for ham2.txt:

$prob.ham
[1] 3.537766e-51
$prob.spam
[1] 4.464304e-51
$prediction
[1] "ham"

Result for spam1.txt:

$prob.ham
[1] 5.181995e-95
$prob.spam
[1] 1.630172e-84
$prediction
[1] "spam"