Vector Research Blog: Causal Effect Estimation Using Machine Learning

January 19, 2024

Insights Research

By Elham Dolatabadi, Maia Norman, Farnaz Kohankhaki, Rahul G. Krishnan, Deval Pandya, George-Kirollos Saad, Wen Xu, and Joanna Yu

Access the Causal Inference Laboratory GitHub repository here.

There is a growing interest in various domains, such as healthcare, retail, finance, and education, to answer causal rather than associative questions. This shift in focus reflects a recognition of the significance of understanding the underlying causal mechanisms that drive observed relationships. By developing a better understanding of causality, one can move beyond surface-level correlations and uncover the factors and interventions that truly impact outcomes. Causal analysis empowers decision-makers to make more data-driven and informed choices and ultimately drive meaningful change in their respective domains. We aim to make causal effect estimation frameworks easily accessible to developers, researchers, and innovators by providing an overview of the various components in a causal effect estimation workflow, and an explanation of how to implement state-of-the-art machine learning (ML) algorithms and toolkits for solving a wide variety of causal problems.

This material has been presented as a hands-on Causal Inference Laboratory organized by Vector, to enhance working knowledge of causal techniques and applications, and foster interdisciplinary collaboration among subject matter experts across diverse industries and sectors. The goal of the Laboratory was to facilitate the practical applications of causal effect estimation to diverse challenges, including estimation of treatment effect with real-world data in precision medicine, maximizing the efficiency of algorithmic trading across capital markets, churn prediction, supply chain optimization, and dynamic pricing. A core feature of this interactive Laboratory was the implementation of a causal effect estimation workflow. The workflow encompasses various aspects of causal effect estimation, such as model estimation and selecting appropriate estimation methods using techniques from recent research conducted by leading groups in the field.

From “correlation does not imply causation” to causal analysis

Correlation indicates an association, not causation. Two variables can be statistically correlated without being causally linked. A common reason is confounding: both correlated variables are associated with a third variable, which is the causal variable and tends to co-occur with the data that we're measuring.1 For instance, suppose there exists a positive correlation between ice cream sales and the number of shark attacks at the beach. One might mistakenly infer that buying more ice cream somehow contributes to a higher risk of shark attacks, or vice versa. In reality, the correlation is likely due to the fact that both ice cream sales and shark attacks tend to increase during the summer months. Seasonality, with more people spending time at the beach during warm weather, is the common factor influencing both variables: the more people are at the beach (contributing to higher ice cream sales), the greater the chance of a shark attack occurring.
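
The ice cream and shark attack example can be reproduced with a small simulation. The sketch below is illustrative only: the variable names, coefficients, and sample size are assumptions, and the data are synthetic. Warm weather drives both variables, neither affects the other, and the correlation largely disappears once the common cause is regressed out.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical data-generating process: warm weather drives both variables,
# and neither variable has any effect on the other.
temperature = rng.normal(size=n)
ice_cream_sales = 2.0 * temperature + rng.normal(size=n)
shark_attacks = 1.5 * temperature + rng.normal(size=n)


def residualize(values, confounder):
    """Remove the part of `values` that is linearly explained by `confounder`."""
    design = np.column_stack([np.ones_like(confounder), confounder])
    coef, *_ = np.linalg.lstsq(design, values, rcond=None)
    return values - design @ coef


# Raw association between the two variables.
raw_corr = np.corrcoef(ice_cream_sales, shark_attacks)[0, 1]

# Association that remains after adjusting both variables for temperature.
adjusted_corr = np.corrcoef(
    residualize(ice_cream_sales, temperature),
    residualize(shark_attacks, temperature),
)[0, 1]

print(f"raw correlation:      {raw_corr:.2f}")
print(f"adjusted correlation: {adjusted_corr:.2f}")
```

The raw correlation is strongly positive even though no causal link exists between the two simulated variables, while the adjusted correlation is close to zero.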

Association, therefore, is not necessarily causation, and measuring causation is not as simple as measuring association. Traditional statistics and ML excel at identifying associations among variables and estimating probabilities linked to past or future events. However, they fall short when it comes to understanding the actual cause-and-effect relationships between variables under dynamic and changing conditions. Causal analysis goes beyond inferring probabilities and associations to ask how those probabilities change when conditions shift. This could include changes induced by external interventions, treatments, or any other factors that influence the system being analyzed.2

How to turn a causal question into an estimation problem?

As we learned above, there is a distinction between causal analysis and statistical analysis. In the context of causal analysis, the term “identification” refers to the process of moving from a causal analysis to an equivalent statistical analysis. Building and estimating a causal effect, as shown in Figure 1, is a multi-step process: establishing the causal model, identifying it (which includes making the assumptions needed to transform the causal model into a statistical model), and applying estimation methods that align with the chosen identification strategy. There has, accordingly, been a remarkable surge of interest in developing methodologies for constructing causal models and effectively identifying causal effects from observational data. Recent advances in ML and deep learning have provided researchers with powerful tools to navigate the complexities related to estimating causal effects.3-7 These techniques offer flexibility, automation, and scalability, enabling more accurate and insightful causal effect estimations from observational data.

Figure 1: The causal effect estimation workflow contains three modules for identifying, estimating, and evaluating causal models from real-world observational data.

It is important to note that multidisciplinary teamwork, including subject matter experts, is critical to the process of building a robust causal effect estimation model. Collaboration between domain knowledge experts and ML engineers ensures that the research questions are well-defined, the study design is appropriate, the analysis accounts for relevant factors, and the results are correctly interpreted in the context of the subject matter.

Causal Workflow Details and Guidance

Our Causal Effect Estimation Workflow guide, below, provides a detailed overview of the technical modules included in our causal effect estimation workflow summarized in Figure 1, including leveraging estimators such as conditional outcome modeling, representation learning, and double machine learning to identify, estimate and evaluate causal models from real-world observational data, demonstrated on three real-world datasets.


In this work, we focus on the problem of causal effect estimation under the Rubin-Neyman potential outcomes framework with conditional unconfoundedness8. The observed data consists of samples of random variables (𝑌, 𝑋, 𝑊, 𝑇) from an unknown distribution, where 𝑋 represents all of the observed covariates, 𝑇 represents the treatment assignment, 𝑌 represents the outcome of interest, and 𝑊 represents the confounders satisfying the backdoor criterion relative to 𝑇 and 𝑌 (where 𝑊 blocks all backdoor paths from 𝑇 to 𝑌 and does not contain any descendants of 𝑇).9 We consider only binary treatments 𝑇 ∈ {0, 1}. Hence, we have two potential outcomes 𝑌 (1) = 𝑌 (do(𝑇 = 1)) and 𝑌 (0) = 𝑌(do(𝑇 = 0)). This study focuses on estimating the Treatment Effect, which is a prevalent causal effect problem encompassing the impact of treatment on the individual or population level.

We denote the individual treatment effect (ITE) with 𝜏i which is mathematically defined as:

                $$\tau_i \triangleq Y_i(1) - Y_i(0)$$

Due to the fundamental problem of causal inference, we can’t access individual treatment effects (we cannot observe both Yi(1) and Yi(0)). But we can estimate the average treatment effect (ATE), 𝜏, which is measured by taking the average over ITE:

                $$\tau \triangleq \mathbb{E}[Y_i(1) - Y_i(0)] = \mathbb{E}[Y_i(\mathrm{do}(T = 1)) - Y_i(\mathrm{do}(T = 0))]$$

In order to make the treatment effects identifiable,3, 10, 11 we posit three assumptions: Consistency (𝑌 = 𝑇𝑌(1) + (1 − 𝑇)𝑌(0)), Conditional Ignorability ((𝑌(0), 𝑌(1)) ⫫ 𝑇 | 𝑊), and Overlap (0 < π(𝑤) < 1, ∀ 𝑤 ∈ 𝑊). Under these assumptions, the ATE is identified as:

                $$\tau = \mathbb{E}_w\big[\mathbb{E}[Y \mid T = 1, W = w] - \mathbb{E}[Y \mid T = 0, W = w]\big]$$
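
The adjustment formula above can be checked on simulated data. The following is a minimal sketch, assuming a single binary confounder and a known true ATE of 1.0. The data-generating process and the helper `adjusted_ate` are hypothetical and are not part of the Laboratory code. It contrasts the naive difference in group means with the stratified estimate that averages within-stratum contrasts over the distribution of 𝑊.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20000

# Hypothetical setup: W is a binary confounder that raises both the chance of
# treatment and the outcome. The true ATE is 1.0 by construction.
w = rng.binomial(1, 0.5, size=n)
t = rng.binomial(1, np.where(w == 1, 0.8, 0.2))
y = 1.0 * t + 2.0 * w + rng.normal(size=n)

# Naive contrast E[Y | T=1] - E[Y | T=0]: biased because W differs across groups.
naive = y[t == 1].mean() - y[t == 0].mean()


def adjusted_ate(y, t, w):
    """Average the within-stratum contrasts over the marginal distribution of W."""
    ate = 0.0
    for level in np.unique(w):
        stratum = w == level
        contrast = y[stratum & (t == 1)].mean() - y[stratum & (t == 0)].mean()
        ate += stratum.mean() * contrast
    return ate


adjusted = adjusted_ate(y, t, w)
print(f"naive: {naive:.2f}, adjusted: {adjusted:.2f}")
```

Stratification only works this directly when 𝑊 is low-dimensional and discrete; with many or continuous confounders, the inner conditional expectations are replaced by fitted models, which is what the estimators in the next section do.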

We denote μ(𝑤) as an expected potential outcome and π(𝑤) as a propensity score as follows:
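
These two quantities are conventionally written as below. The two-argument form 𝜇(𝘵, 𝑤) matches the conditional expectation 𝔼[𝑌 | 𝑇 = 𝘵, 𝑊 = 𝑤] that the COM estimator is fit to; the exact typesetting is an assumption, while the propensity score definition is the standard one.

```latex
\mu(t, w) \triangleq \mathbb{E}[Y \mid T = t, W = w], \qquad
\pi(w) \triangleq P(T = 1 \mid W = w)
```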

Causal Model Estimation

The model estimation within our workflow supports building multiple groups of estimators for the identified causal estimands. The first two groups of estimators, Conditional Outcome Modeling (COM) and Grouped Conditional Outcome Modeling (GCOM), leverage a range of linear and non-linear ML models, including ordinary least squares, random forests, and feed-forward neural networks. Furthermore, our workflow includes implementations of representation learning based estimators such as TARNet5 and DragonNet6 based on deep learning techniques. Additionally, we provide other data efficient models using the Double ML framework,4 all of which are accompanied by a dedicated pipeline explicitly designed for hyperparameter tuning. 

Conditional Outcome Modeling 

COM estimators,12 which are also called “S-learners” or “G-computation estimators” in the literature, involve fitting a single estimator that models the outcome variable, 𝑌, as a function of the concatenated treatment assignment and other relevant covariates. Using COM, the goal is to fit an estimator, 𝜇(𝘵, 𝑤), which could be a statistical model or an ML model, to the conditional expectation 𝔼[𝑌 | 𝑇 = 𝘵, 𝑊 = 𝑤]. The ATE is then estimated as the empirical mean, over the n observed data points, of the difference 𝜇(1, 𝑤i) − 𝜇(0, 𝑤i).
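
A COM estimator can be sketched in a few lines. The example below is a minimal illustration, not the Laboratory implementation: it assumes synthetic data with a true ATE of 2.0, uses ordinary least squares as the single model, and the helpers `fit_ols` and `predict_ols` are hypothetical names introduced here. The model is fit once on the concatenation of 𝑇 and 𝑊, then queried with 𝑇 set to 1 and to 0 for every unit.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000

# Hypothetical observational data; the true ATE is 2.0 by construction.
w = rng.normal(size=(n, 3))                       # confounders W
propensity = 1.0 / (1.0 + np.exp(-w[:, 0]))       # P(T = 1 | W)
t = rng.binomial(1, propensity)                   # treatment T
y = 2.0 * t + w @ np.array([1.0, -0.5, 0.3]) + rng.normal(size=n)


def fit_ols(features, target):
    """Ordinary least squares with an intercept; returns the coefficient vector."""
    design = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef


def predict_ols(coef, features):
    """Predictions of a model returned by fit_ols."""
    design = np.column_stack([np.ones(len(features)), features])
    return design @ coef


# COM / S-learner: a single model mu(t, w) fit on the concatenation [T, W].
coef = fit_ols(np.column_stack([t, w]), y)

# Predict both potential outcomes for every unit, then average the difference.
mu_1 = predict_ols(coef, np.column_stack([np.ones(n), w]))
mu_0 = predict_ols(coef, np.column_stack([np.zeros(n), w]))
ate_com = float(np.mean(mu_1 - mu_0))
print(f"COM estimate of the ATE: {ate_com:.2f}")
```

With a linear model the estimate is simply the coefficient on 𝑇; a random forest or neural network could be swapped in for `fit_ols` without changing the last three steps.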

One significant drawback of COM estimators is their potential to overlook the treatment variable, especially in scenarios where the confounding variables are high-dimensional. This can lead to biased COM estimation and an ATE of zero.13

Grouped Conditional Outcome Modeling

GCOM13 estimators, which are also called “T-learners”, involve building two separate models, 𝜇1(𝑤) and 𝜇2(𝑤), predicting 𝑌 from 𝑊 in group one, where 𝑇 = 1, and in group two, where 𝑇 = 0, respectively. For a binary treatment, the ATE is estimated as the empirical mean, over the n observed data points, of the difference between the two models' predictions, 𝜇1(𝑤i) − 𝜇2(𝑤i).

GCOM appears to address the issue of zero ATE; however, it introduces another downside: because each model is trained on only one treatment group, neither model utilizes all of the available data.
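
A GCOM estimator can be sketched along the same lines. This is an illustration under stated assumptions rather than the Laboratory implementation: the data are synthetic with a heterogeneous effect of 2 + 𝑤 on the first covariate (so the true ATE is 2.0), both group models are ordinary least squares, and `fit_ols` and `predict_ols` are hypothetical helpers.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5000

# Hypothetical data with a heterogeneous effect 2 + w[:, 0]; true ATE is 2.0.
w = rng.normal(size=(n, 3))
t = rng.binomial(1, 1.0 / (1.0 + np.exp(-w[:, 0])))
y = (2.0 + w[:, 0]) * t + w @ np.array([1.0, -0.5, 0.3]) + rng.normal(size=n)


def fit_ols(features, target):
    """Ordinary least squares with an intercept; returns the coefficient vector."""
    design = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef


def predict_ols(coef, features):
    """Predictions of a model returned by fit_ols."""
    return np.column_stack([np.ones(len(features)), features]) @ coef


# GCOM / T-learner: one model per treatment group, each fit on its own subset.
coef_treated = fit_ols(w[t == 1], y[t == 1])
coef_control = fit_ols(w[t == 0], y[t == 0])

# Both models are evaluated on every unit to estimate individual effects.
ite_gcom = predict_ols(coef_treated, w) - predict_ols(coef_control, w)
ate_gcom = float(ite_gcom.mean())
print(f"GCOM estimate of the ATE: {ate_gcom:.2f}")
```

Because the two models have separate coefficients, the treatment can never be ignored, but each model sees only part of the sample, which is the data-efficiency cost noted in the text.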

Representation Learning

Like other domains, the field of causal estimation also benefits from advancements in deep learning. Causal estimation models rooted in deep neural networks are designed to address these shortcomings by learning complex nonlinear representations of observational data. Intuitively, these models trade off two objectives: improving the prediction accuracy of factual and counterfactual outcomes while minimizing the distance between the distribution of the treatment population and that of the control population. Two notable approaches are TARNet5 and DragonNet.6 TARNet is a two-headed neural network architecture consisting of a single deep neural network that is followed by two distinct sub-networks, each dedicated to a specific treatment group. The single network is a COM estimator, 𝜇(𝘵, 𝑤), leveraging the entire observation to learn a treatment-agnostic representation. Conversely, the sub-networks utilize the relevant subset of the representation specific to each treatment group to predict the outcome variable, 𝑌. DragonNet, akin to TARNet, is a deep neural network model that incorporates an additional head for estimating the propensity score, π(𝑤), in addition to the COM estimation, 𝜇(𝘵, 𝑤), accomplished by the other two heads. The third head in the network acts like a regularizer and trades off prediction quality to achieve a better representation of the propensity score.
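
The data flow through these architectures can be sketched with a forward pass in NumPy. This is a deliberately simplified illustration and not the published implementations: the layer sizes, the random untrained weights, and the function names `forward` and `dragonnet_style_loss` are assumptions; there is no training loop; and both the distribution-distance term mentioned above and DragonNet's targeted regularization are omitted.

```python
import numpy as np

rng = np.random.default_rng(4)
n, n_features, n_hidden = 8, 5, 16

w = rng.normal(size=(n, n_features))        # covariates W
t = rng.integers(0, 2, size=n)              # binary treatment T
y = rng.normal(size=n)                      # observed (factual) outcome Y

# Randomly initialised, untrained weights: this sketch shows the data flow only.
shared = rng.normal(scale=0.3, size=(n_features, n_hidden))
head_control = rng.normal(scale=0.3, size=n_hidden)
head_treated = rng.normal(scale=0.3, size=n_hidden)
head_propensity = rng.normal(scale=0.3, size=n_hidden)


def forward(w):
    """Shared representation followed by two outcome heads and a propensity head."""
    phi = np.maximum(w @ shared, 0.0)       # shared representation (ReLU)
    y0_hat = phi @ head_control             # outcome head for T = 0
    y1_hat = phi @ head_treated             # outcome head for T = 1
    # Third head (DragonNet only): propensity score from the same representation.
    pi_hat = 1.0 / (1.0 + np.exp(-(phi @ head_propensity)))
    return y0_hat, y1_hat, pi_hat


def dragonnet_style_loss(w, t, y, alpha=1.0):
    """Factual squared error plus alpha times the propensity cross-entropy."""
    y0_hat, y1_hat, pi_hat = forward(w)
    factual = np.where(t == 1, y1_hat, y0_hat)   # only the observed arm is scored
    outcome_loss = np.mean((factual - y) ** 2)
    propensity_loss = -np.mean(t * np.log(pi_hat) + (1 - t) * np.log(1 - pi_hat))
    return outcome_loss + alpha * propensity_loss


print(f"loss with propensity head:    {dragonnet_style_loss(w, t, y):.3f}")
print(f"loss without propensity head: {dragonnet_style_loss(w, t, y, alpha=0.0):.3f}")
```

Setting `alpha=0.0` leaves only the factual outcome loss over the two heads, which corresponds to the two-headed TARNet layout; a positive `alpha` adds the propensity head's contribution as the regularizer described above.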

Double Machine Learning

The concept of double/debiased machine learning, or Double ML (DML), also known as the R-learner,4 has emerged as a method for obtaining unbiased estimates of causal effects using ML models. What sets DML apart is its ability to provide confidence interval guarantees and rapid convergence rates. Unlike TARNet5 and DragonNet,6 where the backbone ML models are typically neural networks and treatments are typically binary or discrete, DML offers the flexibility to utilize diverse ML models and accommodate continuous treatments. In DML, we fit a partially linear model using the following two-stage process.

  1. We fit two estimators, one estimator, 𝜇(𝑤), to predict an expected potential outcome 𝑌 from covariates 𝑊 and another estimator, π(𝑤), to predict treatment, 𝑇, from covariates, 𝑊. For both estimations we can leverage ML and the reason it is called double ML is that we use ML twice.
  2. We partial out the covariate effects by looking at the residuals of both models, which are the differences between the predicted and actual values. In other words, this partialling out deconfounds the effect of treatment on the outcome.

Then, we fit a model to predict the residuals of the outcome values, ui, using the residuals of the treatment values, 𝓋i, to get the estimated β1 and therefore the ATE.
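
The two stages can be sketched as follows. A common way to write the partially linear model is Y = β1·T + g(W) + U with T = m(W) + V; this notation is an assumption chosen to match β1, ui, and 𝓋i above. The sketch uses synthetic data with a continuous treatment and a true β1 of 1.5, a linear model as a stand-in for the two ML learners, and two-fold cross-fitting as in Chernozhukov et al.4 The helper `fit_predict_ols` is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 4000

# Hypothetical data from a partially linear model; the true effect beta_1 is 1.5.
w = rng.normal(size=(n, 3))
t = w @ np.array([0.5, -0.2, 0.1]) + rng.normal(size=n)    # continuous treatment
y = 1.5 * t + w @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=n)


def fit_predict_ols(w_train, target_train, w_test):
    """Stand-in nuisance learner; any ML regressor could take its place."""
    design = np.column_stack([np.ones(len(w_train)), w_train])
    coef, *_ = np.linalg.lstsq(design, target_train, rcond=None)
    return np.column_stack([np.ones(len(w_test)), w_test]) @ coef


# Stage 1: predict Y and T from W, using two-fold cross-fitting so that each
# residual comes from a model that never saw that observation.
folds = rng.permutation(n) % 2
u = np.empty(n)   # outcome residuals   u_i = y_i - (prediction of Y from W)
v = np.empty(n)   # treatment residuals v_i = t_i - (prediction of T from W)
for k in (0, 1):
    train, test = folds != k, folds == k
    u[test] = y[test] - fit_predict_ols(w[train], y[train], w[test])
    v[test] = t[test] - fit_predict_ols(w[train], t[train], w[test])

# Stage 2: regress outcome residuals on treatment residuals (no intercept).
beta_1 = float(v @ u / (v @ v))
print(f"estimated beta_1: {beta_1:.2f}")
```

The final regression has a closed form because it has a single regressor; the residuals `u` and `v` carry only the variation in 𝑌 and 𝑇 that 𝑊 does not explain.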

Causal Model Selection

Model selection poses a significant challenge in causal effect estimation due to the fundamental problem of not being able to observe counterfactuals directly. This distinguishing characteristic makes the model selection task in causal effect estimation more complex than in other ML and statistical approaches: with no observed ground truth treatment effect to score predictions against, the commonly used cross-validation approach is impractical. Instead, proxy metrics based on auxiliary nuisance models and the utility of decision policy based on the heterogeneous treatment effect of estimators have been proposed in the literature.14 The model selection module within our framework enables the building of various evaluation metrics tailored to three distinct groups of datasets: semi/fully synthetic datasets, randomized controlled trials, and real-world observational datasets.

Precision Metrics

Two widely used evaluation metrics, for settings with known counterfactuals and with a ground truth ATE, respectively, are the expected precision in the estimation of heterogeneous effects (PEHE)15 and the absolute error in ATE. Expected PEHE, εPEHE, quantifies the ability of a model to capture the heterogeneity of the causal effects of treatment among individuals in a population by measuring the discrepancy between the estimated and ground truth treatment effects at the individual level.

Limited access to heterogeneous treatment effects at the individual level and the availability of a ground truth ATE in randomized controlled trials have led to the adoption of the absolute error in ATE, εATE, which compares the estimated ATE with the ground truth ATE.
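
Both precision metrics are short to compute once ground truth effects are available. The sketch below assumes the common convention in which εPEHE is the mean squared difference between true and estimated individual effects (some papers report its square root instead) and εATE is the absolute difference between the two averages. The function names and the toy values are illustrative assumptions.

```python
import numpy as np


def pehe(tau_true, tau_hat):
    """Mean squared error between true and estimated individual effects."""
    tau_true = np.asarray(tau_true, dtype=float)
    tau_hat = np.asarray(tau_hat, dtype=float)
    return float(np.mean((tau_true - tau_hat) ** 2))


def ate_error(tau_true, tau_hat):
    """Absolute difference between the true and estimated average effects."""
    return float(abs(np.mean(tau_true) - np.mean(tau_hat)))


# Toy values, for illustration only.
tau_true = [1.0, 2.0, 3.0]
tau_hat = [1.5, 2.0, 2.5]
print(f"PEHE:      {pehe(tau_true, tau_hat):.4f}")
print(f"ATE error: {ate_error(tau_true, tau_hat):.4f}")
```

In the toy example the individual errors cancel on average, so the ATE error is zero while the PEHE is not, which is why the two metrics are reported separately.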

Approximation-Based Metrics

This class of metrics, denoted as 𝑀, uses nuisance models to construct an approximate ground truth treatment effect, which stands in for the “true” but unobserved treatment effect when counterfactuals are unavailable. The approximation-based metrics quantify the disparity between the estimated heterogeneous treatment effects and the approximated values in a manner analogous to PEHE. An exception is made for estimators like Double ML, which are unable to estimate heterogeneous treatment effects: for these, the discrepancy between the estimated ATE and the approximated ATE is evaluated instead. A lower value indicates a better alignment between the estimates and the approximations:
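
By analogy with PEHE and the absolute error in ATE, one plausible way to write the two metrics is given here. The symbols are assumptions: \hat{\tau} is the effect estimated by the model under evaluation and \tilde{\tau} is the approximation built from nuisance models. The squared-error form mirrors the PEHE convention and may differ from the exact form used in the tables.

```latex
M_{\mathrm{PEHE}} = \frac{1}{n} \sum_{i=1}^{n} \left( \hat{\tau}(w_i) - \tilde{\tau}(w_i) \right)^2,
\qquad
M_{\mathrm{ATE}} = \left| \hat{\tau} - \frac{1}{n} \sum_{i=1}^{n} \tilde{\tau}(w_i) \right|
```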

where 𝑀PEHE denotes approximation-based metrics for estimation of heterogeneous effects and 𝑀ATE for estimation of average treatment effects.

There are four commonly employed approaches for approximating the ground truth treatment effect: matching,16 weighting,11 modelling,17 and influence functions.17 Matching involves finding, for each sample in the observation, the nearest neighbour (nn) from the opposite treatment group based on covariate values. The matching treatment effect is defined as the difference between the observed outcomes of nearest neighbour samples.
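
Nearest-neighbour matching can be sketched as below. The function name `matching_effects`, the use of Euclidean distance, matching with replacement, and the toy data are all assumptions made for illustration; the sign convention is always treated outcome minus control outcome.

```python
import numpy as np


def matching_effects(w, t, y):
    """Approximate each unit's effect from its nearest opposite-group neighbour."""
    w = np.asarray(w, dtype=float)
    t = np.asarray(t)
    y = np.asarray(y, dtype=float)
    treated, control = np.flatnonzero(t == 1), np.flatnonzero(t == 0)
    tau = np.empty(len(y))
    for i in range(len(y)):
        # Search only among units that received the opposite treatment.
        pool = control if t[i] == 1 else treated
        distances = np.linalg.norm(w[pool] - w[i], axis=1)
        match = pool[np.argmin(distances)]
        # Sign convention: always treated outcome minus control outcome.
        tau[i] = y[i] - y[match] if t[i] == 1 else y[match] - y[i]
    return tau


# Toy data: two pairs of units with nearly identical covariates.
w_toy = [[0.0], [0.1], [1.0], [1.1]]
t_toy = [1, 0, 1, 0]
y_toy = [3.0, 1.0, 5.0, 2.0]
tau_match = matching_effects(w_toy, t_toy, y_toy)
print(tau_match)
```

In the toy data each unit is matched to the other member of its pair, so the approximated effects are 2 for the first pair and 3 for the second.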

Another common approach to approximate the ground truth treatment effect is weighting using Inverse Propensity Weighting (IPW). Using IPW, we can construct a pseudo-population in which the distributions of covariates are balanced between the two treatment groups. The weighting treatment effect is then defined as:
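
A standard IPW form weights each observed outcome by the inverse probability of the treatment actually received, giving the pseudo-effect t·y/π(w) − (1 − t)·y/(1 − π(w)) for each unit. The sketch below assumes that form; the function name `ipw_effects`, the clipping threshold, and the synthetic data with a known propensity score and a true ATE of 1.0 are illustrative assumptions. In practice π(𝑤) would come from a fitted nuisance model.

```python
import numpy as np


def ipw_effects(t, y, propensity, clip=0.01):
    """Inverse-propensity-weighted pseudo-effects; their mean estimates the ATE."""
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    # Clipping is a practical safeguard (an addition here) against extreme weights.
    p = np.clip(np.asarray(propensity, dtype=float), clip, 1.0 - clip)
    return t * y / p - (1.0 - t) * y / (1.0 - p)


rng = np.random.default_rng(6)
n = 20000

# Hypothetical data: a binary confounder sets the propensity; true ATE is 1.0.
w = rng.binomial(1, 0.5, size=n)
propensity = np.where(w == 1, 0.8, 0.2)
t = rng.binomial(1, propensity)
y = 1.0 * t + 2.0 * w + rng.normal(size=n)

ate_ipw = float(ipw_effects(t, y, propensity).mean())
print(f"IPW estimate of the ATE: {ate_ipw:.2f}")
```

The individual pseudo-effects are very noisy and are meaningful only in aggregate, which is one reason the overlap assumption matters: propensities near 0 or 1 produce very large weights.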

Additionally, we can employ regression models as in COM (S-learner) and GCOM (T-learner) estimators, both of which are referred to as plug-in PEHE methods17 to approximate the ground truth ITE as follows:
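
Written out, the two plug-in approximations take the difference of the fitted outcome models. The tilde and hat notation is an assumption; \hat{\mu}_1 and \hat{\mu}_2 follow the GCOM convention above, fitted on the 𝑇 = 1 and 𝑇 = 0 groups respectively.

```latex
\tilde{\tau}_{\mathrm{S}}(w) = \hat{\mu}(1, w) - \hat{\mu}(0, w), \qquad
\tilde{\tau}_{\mathrm{T}}(w) = \hat{\mu}_1(w) - \hat{\mu}_2(w)
```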

According to Alaa & Van Der Schaar,17 the modelling or plug-in PEHE methods can truly demonstrate their comparative performances only when the differences between them are sufficiently small, i.e., 𝑀~0. Otherwise, when the differences are significant, they tend to exhibit a bias depending on the model being used. To overcome this limitation, influence functions were proposed to obtain unbiased estimates of PEHE and its variance. These influence functions capture the functional derivatives of the causal effect and provide more robust and generalizable metrics. In the context of the GCOM plug-in PEHE, an additional term incorporating an influence function based on a Taylor-like expansion is defined as follows:


Benchmarking Datasets

Simulated factual and counterfactual outcomes: Infant Health and Development Program (IHDP)

The original Infant Health and Development Program (IHDP) is a randomized controlled study designed to evaluate the effect of home visits from specialist doctors on the cognitive test scores of premature infants. The dataset was later transformed from a randomized design to an observational setting and emerged as a widely used benchmark for causal estimation.15 The transformation induced selection bias by removing non-random subsets of the treated individuals. Additionally, the outcomes were simulated using the original covariates and treatments. The benchmarking dataset consists of 747 subjects and 25 variables. The treatment is home visits by specialists, and the outcome of interest is the cognitive test score. The dataset includes up to 100 realizations for both factual and counterfactual outcomes. Following Hill,15 we used the noiseless outcome as the true outcome for building estimators in our experiments. We report the treatment effects averaged over 100 realizations of the factual and counterfactual outcomes with 80/20 train/test splits.

Randomized Controlled Trial: Jobs

The Jobs dataset, introduced by LaLonde,18 is a widely used benchmark in the causal effect estimation community. In this dataset, the treatment is job training, and the outcome of interest is employment status following the training. The dataset comprises 8 covariates, such as age, education, and previous earnings. Following Shalit et al.,5 we combined the LaLonde experimental sample (297 treated, 425 control) with the PSID comparison group (2490 control).

Real-World Observational Data: TWINS Dataset

The TWINS dataset, covering twin births in the USA from 1989 to 1991,19 is predominantly geared towards investigating relative weight’s impact on twins’ mortality rate. As a benchmark for causal effect estimation,11 the dataset employs an artificial binary treatment, specifically being heavier at birth. The binary outcome measures the mortality of each twin during their first year. Since the dataset provides records for both twins, we treat their outcomes as two potential outcomes based on the treatment assignment of being born heavier or not. This setup allows for causal effect estimation analyses regarding the effect of relative weight on twin mortality rates. The dataset has 23968 samples (11984 treated, 11984 control) and 46 covariates relating to the parents, the pregnancy and birth.


For each dataset, we present evaluations encompassing a range of causal estimators in Tables 1, 2, and 3 for IHDP, Jobs, and TWINS, respectively. The evaluations are the average and standard deviation of precision and approximation-based metrics over 10 runs using our framework’s model selection module. With the exception of Double ML, all approximation-based results, 𝑀PEHE, presented in the table are associated with the estimation of heterogeneous (individual) treatment effects on out-of-sample or test sets. However, the metrics for Double ML, 𝑀ATE, relate to average treatment effects, and their values are not within the same range as other estimators and cannot be compared directly.

Using the approximation-based metrics, we adopt the following procedure to identify the best estimator when ground truth treatment effects are unavailable. First, for each approximation-based metric, we select the estimator with the lowest mean value. Second, the estimators that achieve the lowest value on the majority of these metrics are chosen as the best estimators. The bolded metrics in the tables below are those that satisfy these conditions.
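
This two-step selection rule can be sketched in plain Python. The metric names, estimator names, and numbers below are hypothetical placeholders chosen only to exercise the logic; they are not results from the tables, and `select_estimators` is not a function of the Laboratory code.

```python
from collections import Counter

# Hypothetical mean metric values (lower is better); not results from this post.
mean_metrics = {
    "matching": {"estimator_a": 0.42, "estimator_b": 0.35, "estimator_c": 0.40},
    "ipw": {"estimator_a": 0.51, "estimator_b": 0.48, "estimator_c": 0.55},
    "plug_in": {"estimator_a": 0.30, "estimator_b": 0.33, "estimator_c": 0.37},
}


def select_estimators(mean_metrics):
    """Pick the estimator(s) that win the most approximation-based metrics."""
    # Step 1: for each metric, find the estimator with the lowest mean value.
    winners = [min(scores, key=scores.get) for scores in mean_metrics.values()]
    # Step 2: keep the estimator(s) that win the largest number of metrics.
    counts = Counter(winners)
    top = max(counts.values())
    return sorted(name for name, wins in counts.items() if wins == top)


print(select_estimators(mean_metrics))  # ['estimator_b']
```

Returning a list rather than a single name keeps ties visible, since two estimators can win the same number of metrics.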

An interesting observation emerges for the IHDP dataset in Table 1: the performance metrics remain consistently aligned across all estimators, with only minor fluctuations. DragonNet and Double ML stand out with superior performance compared to the other estimators, though by a marginal difference. As Table 2 shows, on the Jobs dataset, COM with ordinary least squares outperforms the other estimators based on εATE computed with the ground truth data. Nonetheless, following the model selection strategy described above, which is based on approximate metrics, COM with random forest regression demonstrates impressive performance in estimating the PEHE across all nuisance models. The TWINS dataset, as shown in Table 3, reveals a pattern similar to that of the IHDP dataset: across all estimators, the performance measures are remarkably consistent. However, COM with ordinary least squares outperformed the other estimators on both precision and approximation metrics.

Table 1: Results of the IHDP100 dataset on the test set (out of sample).
Note 1: The outcome values were normalized using a MinMaxScaler to confine them to the range 0 to 1, i.e. y' = (y − min(y)) / (max(y) − min(y)).
Note 2: As shown by 𝑀ATE, the metrics for Double ML pertain to the average treatment effects, resulting in different ranges compared to those of 𝑀PEHE.

Table 2: Results of the Jobs dataset on the test set.
Note 1: +The dataset lacks ground truth counterfactual outcomes, so εPEHE cannot be measured.

Table 3: Results of the TWINS dataset on the test set.


In conclusion, we have aimed to provide a comprehensive overview of our causal effect estimation workflow by detailing the intricacies of each component and their implementations across various experiments. We hope this will assist researchers and innovators in effectively applying and implementing causal concepts and advance real-world applications of causal effect estimation.

Contributions and Acknowledgements

We would like to recognize the valuable technical and theoretical contributions of participants in the Causal Inference Laboratory, as well as the contributions of the individuals named below:

Academic and Project Advisors

Vahid Balazadeh Meresht1, Rahul G. Krishnan1,2+ˆ@<, Deval Pandya2+@<%, Andres


Vector Project Team

Elham Dolatabadi1,2,3+∧&@∼<>, Amirmohammad Kazemeini2<>@%, Farnaz Kohankhaki2>, Maia Norman2,4+&@<>%, George Saad2∧&@>, Wen Xu2∧&@>, Joanna Yu2@<>

Laboratory Facilitators

Dami Aremu2>, Winnie Au2<>, Asic Chen1>, Michael Cooper1>, Shaaf Farooq2>, Sedef Kocak2>, Tahniat Khan2>, Umar Khan2>, Farnam Mansouri4>, Shayaan Mehdi2>, Amirmohammad Shahbandegan2>, Ian Shi1>


  1. Neal, B.: Introduction to causal inference: From a machine learning perspective. Course Lect. Notes (2020)
  2. Pearl, J.: Causal inference in statistics: An overview (2009)
  3. Battocchi, K., Dillon, E., Hei, M., Lewis, G., Oka, P., Oprescu, M., Syrgkanis, V.: Econml: A python package for ml-based heterogeneous treatment effects estimation. Version 0. x (2019)
  4. Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., Robins, J., et al.: Double/debiased machine learning for treatment and causal parameters. Tech. rep. (2017)
  5. Shalit, U., Johansson, F.D., Sontag, D.: Estimating individual treatment effect: generalization bounds and algorithms. In: International conference on machine learning. pp. 3076–3085. PMLR (2017)
  6. Shi, C., Blei, D., Veitch, V.: Adapting neural networks for the estimation of treatment effects. Advances in neural information processing systems 32 (2019)
  7. Wager, S., Athey, S.: Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association 113(523), 1228–1242 (2018)
  8. Rubin, D.B.: Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of educational Psychology 66(5), 688 (1974)
  9. Pearl, J.: Causal inference. Causality: objectives and assessment pp. 39–58 (2010)
  10. Imbens, G.W., Wooldridge, J.M.: Recent developments in the econometrics of program evaluation. Journal of economic literature 47(1), 5–86 (2009)
  11. Peters, J., Janzing, D., Schölkopf, B.: Elements of causal inference: foundations and learning algorithms. The MIT Press (2017)
  12. Hastie, T., Tibshirani, R., Friedman, J.H.: The elements of statistical learning: data mining, inference, and prediction, vol. 2. Springer (2009)
  13. Künzel, S.R., Sekhon, J.S., Bickel, P.J., Yu, B.: Metalearners for estimating heterogeneous treatment effects using machine learning. Proceedings of the national academy of sciences 116(10), 4156–4165 (2019)
  14. Mahajan, D., Mitliagkas, I., Neal, B., Syrgkanis, V.: Empirical analysis of model selection for heterogenous causal effect estimation. arXiv preprint arXiv:2211.01939 (2022)
  15. Hill, J.L.: Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics 20(1), 217–240 (2011)
  16. Rolling, C.A., Yang, Y.: Model selection for estimating treatment effects. Journal of the Royal Statistical Society Series B: Statistical Methodology 76(4), 749–769 (2014)
  17. Alaa, A., Van Der Schaar, M.: Validating causal inference models via influence functions. In: International Conference on Machine Learning. pp. 191–201. PMLR (2019)
  18. LaLonde, R.J.: Evaluating the econometric evaluations of training programs with experimental data. The American economic review pp. 604–620 (1986)
  19. Almond, D., Chay, K.Y., Lee, D.S.: The costs of low birth weight. The Quarterly Journal of Economics 120(3), 1031–1083 (2005)

