PyData Amsterdam 2024

Almost Perfect: A Benchmark on Algorithms for Quasi-Experiments
09-19, 14:40–15:15 (Europe/Amsterdam), Rembrandt

In this presentation, we will compare four algorithms that can be used for quasi-experiments, in terms of the bias and variance of the predicted treatment effects relative to the actual ones, and the confidence/credible intervals associated with the predictions:
- Difference-in-Differences,
- Synthetic Control,
- Meta-Learners,
- Graphical Causal Models.

By the end of this presentation, attendees will understand the shortcomings and benefits of the different algorithms and be better informed about which one best suits their needs.


Almost Perfect: A Benchmark on Algorithms for Quasi-Experiments

In academia and industry, researchers need to conduct experiments to validate the effects of treatments, such as the mortality reduction of a new cancer treatment or the conversion rate improvement of a new sales funnel. This effect is best measured through (double-blind) randomized experiments, where people are randomly assigned to either the group that receives the treatment or the one that does not. Because randomized experiments (a.k.a. A/B tests) remove the bias that confounding variables can introduce when estimating the treatment effect, they are considered the gold standard in research.

However, it is not always possible to conduct a randomized trial. It is often impossible to randomly assign people to treatment or control groups, or there might be some contamination between the treatment and control populations. Take, for example, a marketing campaign using billboards. Ideally, the researcher would like to know who saw an advertisement and who didn't, which would allow them to estimate the increase in sales in the group that received the treatment (i.e., saw the ad). However, with a billboard ad, it is impossible to randomly control who sees it, and there is no way to identify who saw the ad and who didn't. Even if there were, people who saw the poster could comment on its content to people who didn't, spreading the effect of the treatment to the control group. This process, known in research as contamination, results from what marketers call virality, and virality is a crucial component that any marketing team is keen to include when estimating the effect of an advertisement.

In such cases, we turn to the second-best thing after randomized experiments: quasi-experiments. In them, people are not randomly assigned to treatment or control but are assigned based on a characteristic shared among multiple people that can be used to apply the treatment. In the previous example of the billboard advertisement, one such characteristic is where people live, so we could randomly select some cities to receive the treatment and others not, and evaluate the impact of the advertisement at the city level. However, there are differences in the profiles of people who live in different cities (e.g., household income, education, job). These differences, called confounding variables, might bias the estimation of the treatment effect. To reduce the spurious effect that these variables might cause, we apply algorithms such as Synthetic Control and Difference-in-Differences.
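
To make the Difference-in-Differences logic concrete before the comparison, here is a minimal sketch on simulated city-level data (the column names, coefficients, and the statsmodels-based estimator are illustrative assumptions, not the talk's benchmark code). The coefficient on the treated-post interaction recovers the treatment effect after accounting for baseline group differences and a common time trend:

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # Simulated city-level sales: `treated` marks billboard cities,
    # `post` marks periods after the campaign started (illustrative numbers).
    rng = np.random.default_rng(0)
    n = 400
    df = pd.DataFrame({
        "treated": rng.integers(0, 2, n),
        "post": rng.integers(0, 2, n),
    })
    true_effect = 2.0
    df["sales"] = (
        10
        + 1.5 * df["treated"]                       # baseline group difference
        + 0.8 * df["post"]                          # common time trend
        + true_effect * df["treated"] * df["post"]  # the effect we want to recover
        + rng.normal(0, 1, n)
    )

    # The interaction coefficient is the Difference-in-Differences estimate.
    model = smf.ols("sales ~ treated * post", data=df).fit()
    print(model.params["treated:post"])  # close to 2.0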

But despite the frequent use of quasi-experiments and their long history, with the Difference-in-Differences algorithm dating back to 1855 [1], there has yet to be a thorough comparison of the benefits and shortcomings of the different algorithms used to estimate the intervention effect in them. In this presentation, we will compare four algorithms, and their implementations in open-source Python libraries, in terms of the bias and variance of the predicted treatment effects relative to the actual ones, and the confidence/credible intervals associated with the predictions:

  • Difference-in-Differences (CausalPy),
  • Synthetic Control (CausalPy),
  • Meta-Learners (CausalML),
  • Graphical Causal Models (DoWhy).
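
As a taste of how these libraries are driven, below is a minimal sketch of DoWhy's model/identify/estimate workflow on simulated data with a single observed confounder. The data, variable names, and effect size are illustrative assumptions, and the talk's actual benchmark setup may differ; only the CausalModel calls follow DoWhy's documented usage:

    import numpy as np
    import pandas as pd
    from dowhy import CausalModel

    # Simulated data: one observed confounder `w` drives both treatment
    # assignment and the outcome; the true treatment effect is 2.0.
    rng = np.random.default_rng(1)
    n = 5_000
    w = rng.normal(size=n)
    t = (w + rng.normal(size=n) > 0).astype(int)
    y = 2.0 * t + 1.5 * w + rng.normal(size=n)
    df = pd.DataFrame({"w": w, "t": t, "y": y})

    # DoWhy's workflow: model the graph, identify the estimand, estimate it
    # (the refutation step is omitted here for brevity).
    model = CausalModel(data=df, treatment="t", outcome="y", common_causes=["w"])
    estimand = model.identify_effect(proceed_when_unidentifiable=True)
    estimate = model.estimate_effect(estimand, method_name="backdoor.linear_regression")
    print(estimate.value)  # should land near the true effect of 2.0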

By the end of this presentation, attendees will understand the shortcomings and benefits of the different algorithms and be better informed about which one best suits their needs.

Topics

  • Presentation of the algorithms: [5 min]
    • Difference-in-Differences,
    • Synthetic Control,
    • Meta-Learners and
    • Graphical Causal Models
  • The synthetic experiment: application of a synthetic change to the target variable [5 min]
  • Comparison between algorithms [10 min]
    • Bias and Variance
    • Credible/Confidence Interval associated with predictions
  • Tuning your quasi-experiment algorithms [5 min]
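
For a rough picture of what the synthetic experiment and the bias/variance comparison could look like, here is a simplified sketch with a plain-NumPy Difference-in-Differences estimator and made-up parameters, standing in for the four benchmarked implementations: a known lift is injected into the treated series after the intervention point, the estimator runs over many simulations, and its bias and variance against the known lift are reported:

    import numpy as np

    # Inject a known synthetic lift into the treated series, estimate it
    # repeatedly, and summarize bias and variance (illustrative parameters).
    rng = np.random.default_rng(42)
    true_lift, n_sims, t0, t1 = 5.0, 500, 50, 100
    estimates = []
    for _ in range(n_sims):
        control = 100 + rng.normal(0, 3, t1)           # control-city sales
        treated = control + 10 + rng.normal(0, 3, t1)  # parallel pre-trend
        treated[t0:] += true_lift                      # the synthetic change
        # Plain Difference-in-Differences estimate of the lift
        did = (treated[t0:].mean() - treated[:t0].mean()) - (
            control[t0:].mean() - control[:t0].mean()
        )
        estimates.append(did)

    print("bias:", np.mean(estimates) - true_lift)  # near 0 for an unbiased estimator
    print("variance:", np.var(estimates))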

References

[1] Matheus Facure, Causal Inference for the Brave and True, https://matheusfacure.github.io/python-causality-handbook/landing-page.html
[2] Scott Cunningham, Causal Inference: The Mixtape
[3] Piotr Rzepakowski and Szymon Jaroszewicz, Decision trees for uplift modeling with single and multiple treatments
[4] Amit Sharma, Emre Kiciman. DoWhy: An End-to-End Library for Causal Inference. 2020. https://arxiv.org/abs/2011.04216
[5] Patrick Blöbaum, Peter Götz, Kailash Budhathoki, Atalanti A. Mastakouri, Dominik Janzing. DoWhy-GCM: An extension of DoWhy for causal inference in graphical causal models. 2022. https://arxiv.org/abs/2206.06821
[6] Walmart Store Sales Dataset, https://www.kaggle.com/datasets/yasserh/walmart-dataset/data
[7] Iowa Liquor Sales, https://www.kaggle.com/datasets/residentmario/iowa-liquor-sales
[8] Supermarket Sales, https://www.kaggle.com/datasets/aungpyaeap/supermarket-sales
[9] CausalPy, causal inference for quasi-experiments: https://causalpy.readthedocs.io/en/latest/#
[10] CausalML, https://causalml.readthedocs.io/en/latest/index.html

I'm a lead data scientist in Marketing Science at Meta (Facebook). As a lead data scientist, I lead data science projects, ensuring scalability and the implementation of best practices. I have experience in data pipelines and warehousing, machine-learning visibility for MLOps, stakeholder management, mentoring, and causal inference.
Among other tools and libraries, I have expertise in Airflow, XGBoost, CausalPy, PyMC, Stan, Looker, deep learning, and Synthetic Control (for quasi-experiments).