一起大数据-技术文章心得

Path Analysis Using R

Posted on June 3, 2021
Path analysis is a statistical method for investigating the direct and indirect relationships among a set of exogenous (independent, predictor, input) and endogenous (dependent, output) variables. It can be viewed as a generalization of regression and mediation analysis in which multiple inputs, mediators, and outputs can be used. The purpose of path analysis is to study relationships among a set of observed variables, e.g., to estimate and test direct and indirect effects in a system of regression equations, and to estimate and test theories about the absence of relationships.

Path diagrams

Path analysis is often conducted based on path diagrams. A path diagram represents a model using shapes and paths. For example, the diagram below portrays the multiple regression model $Y = \beta_0 + \beta_X X + \beta_W W + \beta_Z Z + e$.

In a path diagram, different shapes and paths have different meanings:

  • Squares or rectangular boxes: observed or manifest variables
  • Circles or ovals: errors, factors, latent variables
  • Single-headed arrows: a linear relationship between two variables. The arrow starts from an independent variable and ends at a dependent variable.
  • Double-headed arrows: variance of a variable or covariance between two variables
  • Triangle: a constant variable, usually a vector of ones

A simplified path diagram is often used in practice, in which the intercept term is omitted and the residual variances are placed directly on the outcome variables. For the regression example above, the simplified path diagram is shown below.

In R, path analysis can be conducted using the R package lavaan. We now show how to conduct path analysis using several examples.

Example 1. Mediation analysis: testing the direct and indirect effects

The NLSY data include three variables – mother’s education (ME), home environment (HE), and child’s math score. Assume we want to test whether home environment is a mediator between mother’s education and child’s math score. The path diagram for the mediation model is:

To estimate the paths in the model, we use the R package lavaan. To specify the mediation model, we follow the rules below. First, a model is enclosed in a pair of quotation marks. Second, to specify a regression relationship, we use the symbol ~. The variable on the left is the outcome and the ones on the right are predictors or covariates. Third, parameter names such as a, b and cp can be attached to paths in the model specification. Fourth, we can define new parameters using the notation :=. On the left is the name of the new parameter and on the right is the formula that defines it, such as a*b, which defines the mediation effect, and a*b + cp, which defines the total effect.
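The defined parameters a*b and a*b + cp can be illustrated without lavaan. The sketch below is not the NLSY analysis: it simulates a hypothetical data set (the coefficients 0.15, 0.5 and 0.4 and the sample size are arbitrary assumptions) and uses ordinary least squares via lm() as a stand-in for the path model. It shows that the total effect a*b + cp equals the slope obtained when math is regressed on ME alone.

```r
# Simulated stand-in data; all generating values are assumptions for illustration
set.seed(123)
n    <- 500
ME   <- rnorm(n, mean = 12, sd = 2)
HE   <- 0.15 * ME + rnorm(n)
math <- 0.5 * HE + 0.4 * ME + rnorm(n, sd = 4)
sim  <- data.frame(ME, HE, math)

# The two regression equations of the mediation model
fit_m <- lm(HE ~ ME, data = sim)          # path a
fit_y <- lm(math ~ HE + ME, data = sim)   # paths b and cp

a  <- unname(coef(fit_m)["ME"])
b  <- unname(coef(fit_y)["HE"])
cp <- unname(coef(fit_y)["ME"])

ab    <- a * b        # mediation (indirect) effect
total <- ab + cp      # total effect

# Slope of math on ME alone, ignoring the mediator
total_direct <- unname(coef(lm(math ~ ME, data = sim))["ME"])

print(c(ab = ab, cp = cp, total = total, total_direct = total_direct))
```

For least-squares fits, total and total_direct agree up to floating-point error, which is why the total effect can be defined as a*b + cp.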

To estimate the model, the sem() function from lavaan can be used. To view the results, the summary() function is used. The input and output for the mediation example are given below, followed by the main observations from the output.

> library(lavaan)
This is lavaan 0.5-23.1097
lavaan is BETA software! Please report any bugs.
> usedata('nlsy')
> 
> mediation<-'
+ math ~ b*HE + cp*ME
+ HE ~ a*ME
+ ab := a*b
+ total := a*b + cp
+ '
> 
> mediation.res<-sem(mediation, data=nlsy)
> summary(mediation.res)
lavaan (0.5-23.1097) converged normally after  21 iterations

  Number of observations                           371

  Estimator                                         ML
  Minimum Function Test Statistic                0.000
  Degrees of freedom                                 0
  Minimum Function Value               0.0000000000000

Parameter Estimates:

  Information                                 Expected
  Standard Errors                             Standard

Regressions:
                   Estimate  Std.Err  z-value  P(>|z|)
  math ~                                              
    HE         (b)    0.465    0.143    3.252    0.001
    ME        (cp)    0.463    0.120    3.869    0.000
  HE ~                                                
    ME         (a)    0.139    0.043    3.249    0.001

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)
   .math             20.621    1.514   13.620    0.000
   .HE                2.724    0.200   13.620    0.000

Defined Parameters:
                   Estimate  Std.Err  z-value  P(>|z|)
    ab                0.065    0.028    2.298    0.022
    total             0.528    0.120    4.410    0.000

> 
  • An individual path can be tested. For example, the coefficient from ME to HE is 0.139, which is significant based on the z-test (z = 3.249, p = 0.001).
  • The residual variance parameters are also automatically estimated.
  • The mediation effect is estimated and tested using the defined parameter. For example, the mediation effect here is 0.065 with a standard error of 0.028. It is significant based on a z-test (Sobel test). Note that the result is the same as in the earlier mediation analysis.
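As a rough check on the defined parameters, the Sobel test can be reproduced by hand from the rounded estimates printed in the output. The sketch below assumes the first-order (delta method) standard error sqrt(a^2*se_b^2 + b^2*se_a^2). Because it starts from rounded values, it only approximates the figures reported by lavaan.

```r
# Rounded estimates and standard errors copied from the output above
a  <- 0.139; se_a <- 0.043   # ME -> HE
b  <- 0.465; se_b <- 0.143   # HE -> math
cp <- 0.463                  # ME -> math (direct effect)

ab    <- a * b                                 # mediation effect
se_ab <- sqrt(a^2 * se_b^2 + b^2 * se_a^2)     # first-order Sobel standard error
z     <- ab / se_ab
p_val <- 2 * pnorm(abs(z), lower.tail = FALSE) # two-sided p-value
total <- ab + cp                               # total effect

print(round(c(ab = ab, se = se_ab, z = z, p = p_val, total = total), 3))
```

The results should be close to the Defined Parameters table (0.065, 0.028, 2.298, 0.022 and 0.528), with small differences due to rounding of the inputs.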

Example 2. Testing a theory of no direct effect

Assume we hypothesize that there is no direct effect from ME to math. To test the hypothesis, we can fit the model illustrated below.

The input and output of the analysis are given below. To evaluate the hypothesis, we can check the model fit. The null hypothesis is “$H_0$: The model fits the data well, or the model is supported”. The alternative hypothesis is “$H_1$: The model does not fit the data, or the model is rejected”. The model with the direct effect is saturated (0 degrees of freedom) and therefore fits the data perfectly. If the current model also fits the data well, we fail to reject the null hypothesis; otherwise, we reject it. The model can be tested with a chi-squared test. From the output, the chi-square statistic is 14.676 with 1 degree of freedom, and the p-value is less than 0.001 (printed as 0.000). Therefore, the null hypothesis is rejected, which indicates that the model without the direct effect is not a good model.

> library(lavaan)
This is lavaan 0.5-23.1097
lavaan is BETA software! Please report any bugs.
> usedata('nlsy')
> 
> model2<-'
+ math ~ b*HE
+ HE ~ a*ME
+ '
> 
> model2.res<-sem(model2, data=nlsy)
> summary(model2.res)
lavaan (0.5-23.1097) converged normally after  17 iterations

  Number of observations                           371

  Estimator                                         ML
  Minimum Function Test Statistic               14.676
  Degrees of freedom                                 1
  P-value (Chi-square)                           0.000

Parameter Estimates:

  Information                                 Expected
  Standard Errors                             Standard

Regressions:
                   Estimate  Std.Err  z-value  P(>|z|)
  math ~                                              
    HE         (b)    0.556    0.144    3.873    0.000
  HE ~                                                
    ME         (a)    0.139    0.043    3.249    0.001

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)
   .math             21.453    1.575   13.620    0.000
   .HE                2.724    0.200   13.620    0.000

> 
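The degrees of freedom and the p-value of the chi-squared test can be reproduced by hand. The sketch below is an illustration only: it counts the variance of ME as a free parameter, which is one common bookkeeping convention and an assumption here, and it takes the test statistic directly from the output.

```r
# Observed variables in the model: math, HE, ME
p         <- 3
n_moments <- p * (p + 1) / 2   # unique variances and covariances

# Free parameters: paths a and b, two residual variances, variance of ME
n_free <- 2 + 2 + 1
df     <- n_moments - n_free

chisq   <- 14.676              # test statistic reported in the output
p_value <- pchisq(chisq, df = df, lower.tail = FALSE)

print(c(df = df, p_value = p_value))
```

The p-value is well below 0.001, which is why summary() prints it as 0.000.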

Example 3. A more complex path model

Path analysis can be used to test more complex theories. In this example, we look at how age and education influence EPT using the ACTIVE data. Both age and education may influence EPT directly or through memory and reasoning ability. Therefore, we can fit a model shown below.

Suppose we want to test the total effect of age on EPT and its indirect effects. Using the path labels in the model below, the direct effect is the path from age to ept1, denoted by p5. The first indirect path goes through hvltt1, that is, p1*p7. The second indirect effect, through ws1, is p2*p8. The third, through ls1, is p3*p9. The last, through lt1, is p4*p10. The total indirect effect is p1*p7+p2*p8+p3*p9+p4*p10. The total effect is the sum of the direct and indirect effects, p5+p1*p7+p2*p8+p3*p9+p4*p10.

The output from such a model is given below. From it, we can see that the indirect effect ind1 = p1*p7 (through hvltt1) is significant. The total indirect effect (indirect) from age to EPT is also significant, as is the total effect (total). The direct effect of age (p5), in contrast, is not significant (estimate 0.014, p = 0.519).

> library(lavaan)
This is lavaan 0.5-23.1097
lavaan is BETA software! Please report any bugs.
> usedata('active.full')
> 
> active.model<-'
+ hvltt1 ~ p1*age + edu
+ ws1 ~ p2*age + edu
+ ls1 ~ p3*age + edu
+ lt1 ~ p4*age + edu
+ ept1 ~ p5*age + p6*edu + p7*hvltt1 + p8*ws1 + p9*ls1 + p10*lt1
+ ws1~~ls1
+ ws1~~lt1
+ ls1~~lt1
+ hvltt1~~ls1
+ hvltt1~~ws1
+ hvltt1~~lt1
+ ind1 := p1*p7
+ total := p5 + p1*p7 + p2*p8 + p3*p9 + p4*p10
+ indirect := p1*p7 + p2*p8 + p3*p9 + p4*p10
+ ' 
> 
> active.res<-sem(active.model, data=active.full)
> summary(active.res)
lavaan (0.5-23.1097) converged normally after  79 iterations

  Number of observations                          1114

  Estimator                                         ML
  Minimum Function Test Statistic                0.000
  Degrees of freedom                                 0

Parameter Estimates:

  Information                                 Expected
  Standard Errors                             Standard

Regressions:
                   Estimate  Std.Err  z-value  P(>|z|)
  hvltt1 ~                                            
    age       (p1)   -0.161    0.027   -6.074    0.000
    edu               0.429    0.052    8.177    0.000
  ws1 ~                                               
    age       (p2)   -0.226    0.026   -8.737    0.000
    edu               0.704    0.051   13.772    0.000
  ls1 ~                                               
    age       (p3)   -0.276    0.029   -9.658    0.000
    edu               0.877    0.057   15.486    0.000
  lt1 ~                                               
    age       (p4)   -0.085    0.015   -5.894    0.000
    edu               0.394    0.029   13.723    0.000
  ept1 ~                                              
    age       (p5)    0.014    0.021    0.644    0.519
    edu       (p6)    0.448    0.045    9.913    0.000
    hvltt1    (p7)    0.202    0.025    8.045    0.000
    ws1       (p8)    0.196    0.038    5.179    0.000
    ls1       (p9)    0.246    0.035    7.090    0.000
    lt1      (p10)    0.151    0.051    2.953    0.003

Covariances:
                   Estimate  Std.Err  z-value  P(>|z|)
 .ws1 ~~                                              
   .ls1              16.606    0.819   20.287    0.000
   .lt1               5.714    0.371   15.390    0.000
 .ls1 ~~                                              
   .lt1               6.573    0.415   15.852    0.000
 .hvltt1 ~~                                           
   .ls1               8.444    0.713   11.838    0.000
   .ws1               7.572    0.643   11.769    0.000
   .lt1               2.856    0.349    8.191    0.000

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)
   .hvltt1           20.618    0.874   23.601    0.000
   .ws1              19.588    0.830   23.601    0.000
   .ls1              24.030    1.018   23.601    0.000
   .lt1               6.174    0.262   23.601    0.000
   .ept1             12.177    0.516   23.601    0.000

Defined Parameters:
                   Estimate  Std.Err  z-value  P(>|z|)
    ind1             -0.033    0.007   -4.847    0.000
    total            -0.144    0.026   -5.594    0.000
    indirect         -0.158    0.017   -9.340    0.000

> 
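The defined parameters can be checked against the individual path estimates. The sketch below copies the rounded coefficients from the Regressions table and recomputes the specific indirect effects, the total indirect effect and the total effect. Standard errors are not reproduced, since they also depend on the covariances among the estimates, which are not shown in the output.

```r
# Rounded path estimates copied from the Regressions table above
est <- c(p1 = -0.161, p2 = -0.226, p3 = -0.276, p4 = -0.085,
         p5 =  0.014, p7 =  0.202, p8 =  0.196, p9 =  0.246, p10 = 0.151)

# Specific indirect effects of age on ept1, one per mediator
specific <- c(
  hvltt1 = est[["p1"]] * est[["p7"]],
  ws1    = est[["p2"]] * est[["p8"]],
  ls1    = est[["p3"]] * est[["p9"]],
  lt1    = est[["p4"]] * est[["p10"]]
)

indirect <- sum(specific)          # total indirect effect
total    <- est[["p5"]] + indirect # direct effect plus total indirect effect

print(round(specific, 3))
print(round(c(indirect = indirect, total = total), 3))
```

Up to rounding of the inputs, these values match ind1 (-0.033), indirect (-0.158) and total (-0.144) in the Defined Parameters table.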
