Outlier Detection
This page shows an example of outlier detection with the LOF (Local Outlier Factor) algorithm.
The LOF algorithm
LOF (Local Outlier Factor) is an algorithm for identifying density-based local outliers [Breunig et al., 2000]. With LOF, the local density of a point is compared with that of its neighbors. If the former is significantly lower than the latter (giving an LOF value greater than one), the point lies in a sparser region than its neighbors, which suggests that it is an outlier.
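To make the density comparison concrete, here is a minimal base-R sketch of the LOF computation. It is an illustration only, not the code used by any package: the function name lof_sketch and the toy data are made up for this example, and ties at the k-distance are ignored for brevity.

```r
# Base-R sketch of LOF (hypothetical helper, not a package function).
# Ties at the k-distance are ignored for brevity.
lof_sketch <- function(x, k) {
  d <- as.matrix(dist(x))      # pairwise Euclidean distances
  diag(d) <- Inf               # a point is not its own neighbor
  n <- nrow(d)

  # indices of the k nearest neighbors of each point (one row per point)
  nn <- matrix(0L, n, k)
  for (i in seq_len(n)) nn[i, ] <- order(d[i, ])[seq_len(k)]

  # k-distance: distance from each point to its k-th nearest neighbor
  kdist <- d[cbind(seq_len(n), nn[, k])]

  # local reachability density: inverse of the mean reachability distance,
  # where reach-dist(p, o) = max(k-distance(o), d(p, o))
  lrd <- numeric(n)
  for (i in seq_len(n)) {
    reach <- pmax(kdist[nn[i, ]], d[i, nn[i, ]])
    lrd[i] <- 1 / mean(reach)
  }

  # LOF: mean density of the neighbors divided by the point's own density
  lof <- numeric(n)
  for (i in seq_len(n)) lof[i] <- mean(lrd[nn[i, ]]) / lrd[i]
  lof
}

# Toy data (made up): a small cluster plus one distant point in row 6
toy <- rbind(c(0, 0), c(0, 1), c(1, 0), c(1, 1), c(0.5, 0.5), c(5, 5))
scores <- lof_sketch(toy, k = 3)
round(scores, 2)
which.max(scores)   # row with the largest LOF value
```

In this toy data the clustered points get scores close to one, while the distant point gets a score well above one, which matches the interpretation given above.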
The function lofactor(data, k) in the DMwR and dprep packages calculates local outlier factors with the LOF algorithm, where k is the number of neighbors used in the calculation.
Calculate Outlier Scores
> library(DMwR)
> # remove "Species", which is a categorical column
> iris2 <- iris[,1:4]
> outlier.scores <- lofactor(iris2, k=5)
> # pick top 5 as outliers
> outliers <- order(outlier.scores, decreasing=TRUE)[1:5]
> # who are outliers
> print(outliers)
[1]  42 107  23 110  63
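Picking a fixed number of top-scoring points is one way to turn scores into outliers; a score cutoff is another. The sketch below contrasts the two on a made-up score vector. The scores and the cutoff of 1.5 are assumptions chosen for illustration, not values from the iris example.

```r
# Made-up LOF scores for seven observations (illustration only)
scores <- c(1.02, 0.98, 2.40, 1.10, 1.75, 0.95, 1.01)

# Top-n selection, as with order() above: indices of the 2 largest scores
top2 <- order(scores, decreasing = TRUE)[1:2]

# Threshold-based alternative; the cutoff 1.5 is an arbitrary assumption
above <- which(scores > 1.5)

top2
above
```

Top-n selection always returns the same number of points, even when none of them is unusual, whereas a cutoff can return no points at all. Which behavior is preferable depends on the application.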
Visualize Outliers with Plots
Next, we show outliers with a biplot of the first two principal components.
> n <- nrow(iris2)
> labels <- 1:n
> labels[-outliers] <- "."
> biplot(prcomp(iris2), cex=.8, xlabs=labels)
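A biplot shows only the first two principal components, so it can help to check how much of the variance they capture. The sketch below does that and draws a plain scatter plot of the same two components. It assumes the outlier indices printed earlier on this page; the variable names var.share and is.out are introduced here for illustration.

```r
iris2 <- iris[, 1:4]
outliers <- c(42, 107, 23, 110, 63)   # indices printed earlier on this page

pc <- prcomp(iris2)

# share of the total variance captured by each principal component
var.share <- pc$sdev^2 / sum(pc$sdev^2)
round(var.share, 3)

# scatter plot of PC1 against PC2, with the outliers marked in red
is.out <- seq_len(nrow(iris2)) %in% outliers
plot(pc$x[, 1:2],
     pch = ifelse(is.out, "+", "."),
     col = ifelse(is.out, "red", "black"))
text(pc$x[outliers, 1:2], labels = outliers, pos = 3, cex = 0.8)
```

If the first two components capture most of the variance, points that look isolated in the plot are likely to be isolated in the full four-dimensional data as well.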
We can also show outliers with a pairs plot as below, where outliers are labeled with “+” in red.
> pch <- rep(".", n)
> pch[outliers] <- "+"
> col <- rep("black", n)
> col[outliers] <- "red"
> pairs(iris2, pch=pch, col=col)
Parallel Computation of LOF Scores
The Rlof package provides the function lof(), a parallel implementation of the LOF algorithm. Its usage is similar to that of lofactor() above, but lof() adds two features: it accepts multiple values of k, and it offers several choices of distance metric. Below is an example of lof().
> library(Rlof)
> outlier.scores <- lof(iris2, k=5)
> # try with different number of neighbors (k = 5,6,7,8,9 and 10)
> outlier.scores <- lof(iris2, k=5:10)
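With several values of k there are several scores per observation, and they need to be combined before ranking. The sketch below assumes the scores are arranged as a matrix with one row per observation and one column per value of k (check ?lof for the exact return format in your installed version). The matrix here is made up for illustration. Taking the maximum score over a range of neighborhood sizes is one heuristic suggested by Breunig et al.; the row mean is shown as an alternative.

```r
# Hypothetical score matrix: 4 observations (rows) x 3 values of k (columns)
score.matrix <- cbind(k5 = c(1.0, 1.1, 2.3, 0.9),
                      k6 = c(1.0, 1.2, 2.0, 1.0),
                      k7 = c(1.1, 1.0, 2.6, 1.0))

# combine across k: maximum score per observation
max.score <- apply(score.matrix, 1, max)

# alternative: average score per observation
mean.score <- rowMeans(score.matrix)

# rank observations by the combined score, most outlying first
ranking <- order(max.score, decreasing = TRUE)
ranking
```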