
一起大数据: Technical Articles and Notes

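Causal ML in Detail: Uplift Modeling and Causal Inference with Python

Posted on February 8, 2023

Summary: This article is a quick start for uplift modeling and causal inference in Python. It covers how to install the CausalML package and includes Python code. The original English text comes from https://github.com/uber/causalml; see also Causal ML.

Causal ML: A Python Package for Uplift Modeling and Causal Inference with Machine Learning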

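CausalML is a Python package that provides a suite of uplift modeling and causal inference methods using machine learning algorithms based on recent research [1]. It provides a standard interface that allows users to estimate the Conditional Average Treatment Effect (CATE) or the Individual Treatment Effect (ITE) from experimental or observational data. Essentially, it estimates the causal impact of an intervention T on an outcome Y for users with observed features X, without strong assumptions on the model form. Typical use cases include: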

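  • Campaign targeting optimization: An important lever for increasing ROI in an advertising campaign is to target the ad at the customers who are responsive to it, that is, those who will respond favorably on a given KPI such as engagement or sales. CATE identifies these customers by estimating the effect of ad exposure on the KPI at the individual level, using A/B experiment data or historical observational data.
  • Personalized engagement: A company has multiple ways to interact with its customers, such as push notifications or SMS messages. CATE can be used to estimate the heterogeneous treatment effect for each combination of customer and treatment option, which supports an optimal personalized recommendation system.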
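
To make the targeting use case concrete, here is a minimal sketch of segment-level CATE estimation from a randomized experiment, using only the standard library. The data, segment names, and effect sizes are all simulated assumptions for illustration; none of it comes from CausalML or from a real campaign.

```python
import random
from collections import defaultdict

def simulate_ab_test(n=20000, seed=0):
    """Simulate a randomized experiment with one categorical feature (hypothetical data)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        segment = rng.choice(["new_user", "returning_user"])
        treated = rng.random() < 0.5              # randomized 50/50 assignment
        base_rate = 0.10
        # Assumed ground truth: the ad only moves new users.
        lift = 0.08 if segment == "new_user" else 0.0
        p = base_rate + (lift if treated else 0.0)
        converted = 1 if rng.random() < p else 0
        rows.append((segment, treated, converted))
    return rows

def cate_by_segment(rows):
    """Estimate E[Y | T=1, X=x] - E[Y | T=0, X=x] for each segment x."""
    # Per segment: [treated conversions, treated count, control conversions, control count]
    sums = defaultdict(lambda: [0, 0, 0, 0])
    for segment, treated, y in rows:
        s = sums[segment]
        if treated:
            s[0] += y
            s[1] += 1
        else:
            s[2] += y
            s[3] += 1
    return {seg: s[0] / s[1] - s[2] / s[3] for seg, s in sums.items()}

estimates = cate_by_segment(simulate_ab_test())
for seg, cate in sorted(estimates.items()):
    print(f"{seg}: estimated CATE = {cate:.3f}")
```

With a single categorical feature, the CATE reduces to a per-segment difference in conversion rates. The estimators listed below generalize this idea to many features, where simple grouping is no longer practical.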

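The package currently supports the following methods:

  • Tree-based algorithms
    • Uplift tree/random forests based on KL divergence, Euclidean distance, and chi-square [2]
    • Uplift tree/random forests based on Contextual Treatment Selection [3]
    • Causal Tree [4] (work in progress)
  • Meta-learner algorithms
    • S-learner [5]
    • T-learner [5]
    • X-learner [5]
    • R-learner [6]
    • Doubly Robust (DR) learner [7]
    • TMLE learner [8]
  • Instrumental variable algorithms
    • Two-stage least squares (2SLS)
    • Doubly Robust (DR) IV [9]
  • Neural-network-based algorithms
    • CEVAE [10]
    • DragonNet [11] (requires the causalml[tf] installation; see Installation)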
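
As an illustration of the KL-divergence criterion named in the tree-based methods above, the following sketch computes an unnormalized split gain for a binary outcome: the size-weighted divergence between treatment and control outcome distributions in the child nodes, minus the same divergence in the parent. This is one common formulation, written here under the assumption of a single treatment and binary conversion; it is not CausalML's implementation, and the node counts are made up.

```python
import math

def kl_divergence(p_treat, p_ctrl, eps=1e-6):
    """KL divergence between two Bernoulli outcome distributions."""
    p = min(max(p_treat, eps), 1 - eps)   # clip to avoid log(0)
    q = min(max(p_ctrl, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def node_divergence(node):
    """node = (treated_conversions, treated_n, control_conversions, control_n)."""
    yt, nt, yc, nc = node
    return kl_divergence(yt / nt, yc / nc)

def kl_split_gain(left, right):
    """Size-weighted child divergence minus parent divergence (unnormalized)."""
    parent = tuple(l + r for l, r in zip(left, right))
    n_left = left[1] + left[3]
    n_right = right[1] + right[3]
    n = n_left + n_right
    after = (n_left / n) * node_divergence(left) + (n_right / n) * node_divergence(right)
    return after - node_divergence(parent)

# Hypothetical candidate split that separates responsive from unresponsive users.
gain_good = kl_split_gain(left=(180, 1000, 100, 1000), right=(100, 1000, 100, 1000))
# Hypothetical candidate split where both children show the same uplift.
gain_null = kl_split_gain(left=(140, 1000, 100, 1000), right=(140, 1000, 100, 1000))
print(f"gain (heterogeneous split): {gain_good:.5f}")
print(f"gain (uninformative split): {gain_null:.5f}")
```

A split is attractive when it makes treatment and control outcomes diverge more inside the children than they did in the parent, which is what distinguishes an uplift tree from an ordinary classification tree.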

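Installation

Installation with conda is recommended. conda environment files for Python 3.6, 3.7, 3.8 and 3.9 are available in the repository. To use the modules under inference.tf (e.g. DragonNet), tensorflow is required as an additional dependency. See below for details.

Install using conda

This will create a new conda virtual environment named causalml-[tf-]py3x, where x is in [6, 7, 8, 9], for example causalml-py37 or causalml-tf-py38. To change the name of the environment, update the relevant YAML file in envs/.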

$ git clone https://github.com/uber/causalml.git
$ cd causalml/envs/
$ conda env create -f environment-py38.yml	# for the virtual environment with Python 3.8 and CausalML
$ conda activate causalml-py38
(causalml-py38)

Install causalml with tensorflow

$ cd causalml/envs/
$ conda env create -f environment-tf-py38.yml	# for the virtual environment with Python 3.8, tensorflow and CausalML
$ conda activate causalml-tf-py38
(causalml-tf-py38) pip install -U numpy			# this step is necessary to fix [#338](https://github.com/uber/causalml/issues/338)

Install using pip

$ git clone https://github.com/uber/causalml.git
$ cd causalml
$ pip install -r requirements.txt
$ pip install causalml

Install causalml with tensorflow

$ git clone https://github.com/uber/causalml.git
$ cd causalml
$ pip install -r requirements-tf.txt
$ pip install causalml[tf]
$ pip install -U numpy							# this step is necessary to fix [#338](https://github.com/uber/causalml/issues/338)

Install from source

$ git clone https://github.com/uber/causalml.git
$ cd causalml
$ pip install -r requirements.txt
$ python setup.py build_ext --inplace
$ python setup.py install
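
Whichever route is used, a quick check that the packages are visible to the active interpreter can save time later. The sketch below uses only the standard library (importlib.metadata requires Python 3.8 or newer); the helper name installed_version is hypothetical, and it assumes the import name and the distribution name are the same, which holds for causalml and tensorflow.

```python
import importlib.util
from importlib import metadata

def installed_version(package_name):
    """Return the installed version of a top-level package, or None if it is missing."""
    if importlib.util.find_spec(package_name) is None:
        return None
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        # Importable, but no distribution metadata under this name (e.g. stdlib modules).
        return "unknown"

for name in ("causalml", "tensorflow"):
    version = installed_version(name)
    if version is None:
        print(f"{name}: not installed in this environment")
    else:
        print(f"{name}: version {version}")
```

If tensorflow is reported as missing, the modules under inference.tf (such as DragonNet) will not be usable, while the rest of the package is unaffected.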

Quick Start

Average Treatment Effect Estimation with S, T, X, and R Learners

from causalml.inference.meta import LRSRegressor
from causalml.inference.meta import XGBTRegressor, MLPTRegressor
from causalml.inference.meta import BaseXRegressor
from causalml.inference.meta import BaseRRegressor
from xgboost import XGBRegressor
from causalml.dataset import synthetic_data

y, X, treatment, _, _, e = synthetic_data(mode=1, n=1000, p=5, sigma=1.0)

lr = LRSRegressor()
te, lb, ub = lr.estimate_ate(X, treatment, y)
print('Average Treatment Effect (Linear Regression): {:.2f} ({:.2f}, {:.2f})'.format(te[0], lb[0], ub[0]))

xg = XGBTRegressor(random_state=42)
te, lb, ub = xg.estimate_ate(X, treatment, y)
print('Average Treatment Effect (XGBoost): {:.2f} ({:.2f}, {:.2f})'.format(te[0], lb[0], ub[0]))

nn = MLPTRegressor(hidden_layer_sizes=(10, 10),
                 learning_rate_init=.1,
                 early_stopping=True,
                 random_state=42)
te, lb, ub = nn.estimate_ate(X, treatment, y)
print('Average Treatment Effect (Neural Network (MLP)): {:.2f} ({:.2f}, {:.2f})'.format(te[0], lb[0], ub[0]))

xl = BaseXRegressor(learner=XGBRegressor(random_state=42))
te, lb, ub = xl.estimate_ate(X, treatment, y, e)
print('Average Treatment Effect (BaseXRegressor using XGBoost): {:.2f} ({:.2f}, {:.2f})'.format(te[0], lb[0], ub[0]))

rl = BaseRRegressor(learner=XGBRegressor(random_state=42))
te, lb, ub =  rl.estimate_ate(X=X, p=e, treatment=treatment, y=y)
print('Average Treatment Effect (BaseRRegressor using XGBoost): {:.2f} ({:.2f}, {:.2f})'.format(te[0], lb[0], ub[0]))

See the Meta-learner example notebook for details.
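
To show what a T-learner computes underneath the API above, here is a minimal standard-library sketch: it fits one outcome model per treatment arm and takes the difference of their predictions as the CATE. The one-feature least-squares model, the helper names fit_line and t_learner, and the simulated data with true effect 1 + 2x are all assumptions for illustration, not CausalML code.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def t_learner(xs, treatment, ys):
    """Fit one outcome model per arm and return a function estimating the CATE at x."""
    a1, b1 = fit_line([x for x, t in zip(xs, treatment) if t == 1],
                      [y for y, t in zip(ys, treatment) if t == 1])
    a0, b0 = fit_line([x for x, t in zip(xs, treatment) if t == 0],
                      [y for y, t in zip(ys, treatment) if t == 0])
    return lambda x: (a1 + b1 * x) - (a0 + b0 * x)

rng = random.Random(42)
xs = [rng.uniform(0, 1) for _ in range(5000)]
treatment = [rng.randint(0, 1) for _ in xs]     # randomized assignment
# Assumed ground truth: baseline outcome x, treatment effect tau(x) = 1 + 2x.
ys = [x + t * (1 + 2 * x) + rng.gauss(0, 0.5) for x, t in zip(xs, treatment)]

cate = t_learner(xs, treatment, ys)
ate = sum(cate(x) for x in xs) / len(xs)        # average the CATE to get the ATE
print(f"estimated ATE: {ate:.2f}")
print(f"estimated CATE at x=0: {cate(0.0):.2f}, at x=1: {cate(1.0):.2f}")
```

An S-learner would instead fit a single model with the treatment indicator as an extra feature, while the X- and R-learners add further stages that use the propensity score, which is why the snippets above pass e to them.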

Interpretable Causal ML

Causal ML provides methods to interpret the trained treatment effect models.

Meta-Learner Feature Importances
import numpy as np
from lightgbm import LGBMRegressor
from sklearn.ensemble import RandomForestRegressor
from causalml.inference.meta import BaseSRegressor, BaseTRegressor, BaseXRegressor, BaseRRegressor
from causalml.dataset.regression import synthetic_data

# Load synthetic data
y, X, treatment, tau, b, e = synthetic_data(mode=1, n=10000, p=25, sigma=0.5)
w_multi = np.array(['treatment_A' if x==1 else 'control' for x in treatment]) # customize treatment/control names
# Placeholder feature names (an assumption: the original snippet leaves feature_names undefined)
feature_names = ['feature_{}'.format(i) for i in range(X.shape[1])]

slearner = BaseSRegressor(LGBMRegressor(), control_name='control')
slearner.estimate_ate(X, w_multi, y)
slearner_tau = slearner.fit_predict(X, w_multi, y)

model_tau_feature = RandomForestRegressor()  # specify model for model_tau_feature

slearner.get_importance(X=X, tau=slearner_tau, model_tau_feature=model_tau_feature,
                        normalize=True, method='auto', features=feature_names)

# Using the feature_importances_ method in the base learner (LGBMRegressor() in this example)
slearner.plot_importance(X=X, tau=slearner_tau, normalize=True, method='auto')

# Using eli5's PermutationImportance
slearner.plot_importance(X=X, tau=slearner_tau, normalize=True, method='permutation')

# Using SHAP
shap_slearner = slearner.get_shap_values(X=X, tau=slearner_tau)

# Plot shap values without specifying shap_dict
slearner.plot_shap_values(X=X, tau=slearner_tau)

# Plot shap values WITH specifying shap_dict
slearner.plot_shap_values(X=X, shap_dict=shap_slearner)

# interaction_idx set to 'auto' (searches for feature with greatest approximate interaction)
slearner.plot_shap_dependence(treatment_group='treatment_A',
                              feature_idx=1,
                              X=X,
                              tau=slearner_tau,
                              interaction_idx='auto')
[Figure: Meta Learner Feature Importances]

See the feature interpretations example notebook for details.

Uplift Tree Visualization
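
The 'permutation' option above relies on a simple idea that can be sketched without any library: shuffle one feature column at a time and measure how much worse the predictions of the treatment effect become. The code below is a generic illustration under stated assumptions (a hypothetical, already fitted CATE function tau_model and simulated features); it does not reproduce the internals of eli5 or CausalML.

```python
import random

def permutation_importance(predict, X, target, n_features, seed=0):
    """Increase in mean squared error when each feature column is shuffled in turn."""
    rng = random.Random(seed)

    def mse(rows):
        return sum((predict(r) - t) ** 2 for r, t in zip(rows, target)) / len(rows)

    baseline = mse(X)
    importances = []
    for j in range(n_features):
        column = [row[j] for row in X]
        rng.shuffle(column)                      # break the link between feature j and the target
        shuffled = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
        importances.append(mse(shuffled) - baseline)
    return importances

rng = random.Random(1)
X = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(2000)]

def tau_model(row):
    """Hypothetical fitted CATE model: strong on feature 0, weak on feature 1, ignores feature 2."""
    return 2.0 * row[0] + 0.5 * row[1]

tau_hat = [tau_model(r) for r in X]              # predicted treatment effects used as the target
importances = permutation_importance(tau_model, X, tau_hat, n_features=3)
for j, imp in enumerate(importances):
    print(f"feature {j}: importance = {imp:.3f}")
```

Features that drive the estimated treatment effect show a large increase in error when shuffled, while features the model ignores show none. This ranks features by their influence on the uplift rather than on the raw outcome.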

from IPython.display import Image
from causalml.inference.tree import UpliftTreeClassifier, UpliftRandomForestClassifier
from causalml.inference.tree import uplift_tree_string, uplift_tree_plot

uplift_model = UpliftTreeClassifier(max_depth=5, min_samples_leaf=200, min_samples_treatment=50,
                                    n_reg=100, evaluationFunction='KL', control_name='control')

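# df (a pandas DataFrame) and features (a list of its feature column names) are assumed
# to be defined beforehand; this snippet does not create them.
uplift_model.fit(df[features].values,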
                 treatment=df['treatment_group_key'].values,
                 y=df['conversion'].values)

graph = uplift_tree_plot(uplift_model.fitted_uplift_tree, features)
Image(graph.create_png())
[Figure: Uplift Tree Visualization]

See the Uplift Tree visualization example notebook for details.

References

Documentation

  • Causal ML API documentation

Conference Talks and Publications by the CausalML Team

  • (Talk) Introduction to CausalML at Causal Data Science Meeting 2021
  • (Talk) Introduction to CausalML at 2021 Conference on Digital Experimentation @ MIT (CODE@MIT)
  • (Talk) Causal Inference and Machine Learning in Practice with EconML and CausalML: Industrial Use Cases at Microsoft, TripAdvisor, Uber at KDD 2021 Tutorials (website and slide links)
  • (Publication) CausalML White Paper Causalml: Python package for causal machine learning
  • (Publication) Uplift Modeling for Multiple Treatments with Cost Optimization at 2019 IEEE International Conference on Data Science and Advanced Analytics (DSAA)
  • (Publication) Feature Selection Methods for Uplift Modeling

Citation

To cite CausalML in publications, you can refer to the following sources:

Whitepaper: CausalML: Python Package for Causal Machine Learning

Bibtex:

@misc{chen2020causalml,
    title={CausalML: Python Package for Causal Machine Learning},
    author={Huigang Chen and Totte Harinen and Jeong-Yoon Lee and Mike Yung and Zhenyu Zhao},
    year={2020},
    eprint={2002.11631},
    archivePrefix={arXiv},
    primaryClass={cs.CY}
}

Literature

  1. Chen, Huigang, Totte Harinen, Jeong-Yoon Lee, Mike Yung, and Zhenyu Zhao. “Causalml: Python package for causal machine learning.” arXiv preprint arXiv:2002.11631 (2020).
  2. Radcliffe, Nicholas J., and Patrick D. Surry. “Real-world uplift modelling with significance-based uplift trees.” White Paper TR-2011-1, Stochastic Solutions (2011): 1-33.
  3. Zhao, Yan, Xiao Fang, and David Simchi-Levi. “Uplift modeling with multiple treatments and general response types.” Proceedings of the 2017 SIAM International Conference on Data Mining. Society for Industrial and Applied Mathematics, 2017.
  4. Athey, Susan, and Guido Imbens. “Recursive partitioning for heterogeneous causal effects.” Proceedings of the National Academy of Sciences 113.27 (2016): 7353-7360.
  5. Künzel, Sören R., et al. “Metalearners for estimating heterogeneous treatment effects using machine learning.” Proceedings of the National Academy of Sciences 116.10 (2019): 4156-4165.
  6. Nie, Xinkun, and Stefan Wager. “Quasi-oracle estimation of heterogeneous treatment effects.” arXiv preprint arXiv:1712.04912 (2017).
  7. Bang, Heejung, and James M. Robins. “Doubly robust estimation in missing data and causal inference models.” Biometrics 61.4 (2005): 962-973.
  8. Van Der Laan, Mark J., and Daniel Rubin. “Targeted maximum likelihood learning.” The International Journal of Biostatistics 2.1 (2006).
  9. Kennedy, Edward H. “Optimal doubly robust estimation of heterogeneous causal effects.” arXiv preprint arXiv:2004.14497 (2020).
  10. Louizos, Christos, et al. “Causal effect inference with deep latent-variable models.” arXiv preprint arXiv:1705.08821 (2017).
  11. Shi, Claudia, David M. Blei, and Victor Veitch. “Adapting neural networks for the estimation of treatment effects.” 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), 2019.
  12. Zhao, Zhenyu, Yumin Zhang, Totte Harinen, and Mike Yung. “Feature Selection Methods for Uplift Modeling.” arXiv preprint arXiv:2005.03447 (2020).
  13. Zhao, Zhenyu, and Totte Harinen. “Uplift modeling for multiple treatments with cost optimization.” In 2019 IEEE International Conference on Data Science and Advanced Analytics (DSAA), pp. 422-431. IEEE, 2019.

Related projects

  • uplift: uplift models in R
  • grf: generalized random forests that include heterogeneous treatment effect estimation in R
  • rlearner: A R package that implements R-Learner
  • DoWhy: Causal inference in Python based on Judea Pearl’s do-calculus
  • EconML: A Python package that implements heterogeneous treatment effect estimators from econometrics and machine learning methods
