Causal machine learning and placebo tests for businesses

"Correlation does not mean causation." So let's calculate causation.

Doctors use techniques like clinical causal inference and placebos to find out if a medicine really works.

But, causal inference, causal machine learning, and causal AI, are increasingly being used by large enterprises like Microsoft, TripAdvisor, and Uber, to solve business problems too, as they help them go beyond simple correlation.

You've probably heard it before: "Correlation does not mean causation."

Correlation (r) is a value that ranges from 0 to 1 and shows how two or more variables ONLY move together.

But if correlation isn't enough, what is causation and how can we calculate it?

That's where causal inference modeling comes in.

Inspired by clinical studies and medicine, we can figure out what really affects our business variables, like KPIs and metrics, no matter which department we are solving the problem for. We can infer whether certain independent variables are actually causing these outcomes to go up or down (not just whether they move together) by modeling how our variables interact with each other. After knowing the causal effect size, we can so-called "refute" the results, a.k.a. create "business placebos", and see whether the results are trustworthy or not.

After reading my short article, you will learn how to use causal machine learning, so you can make scientific and trustworthy decisions in your business too.

Causal inference using Microsoft's DoWhy library.

Firstly, we will need to install DoWhy library for causal machine learning by running the following code in our terminal:

pip install dowhy

Next, we import the necessary libraries for causal inference with the code below:

from dowhy import CausalModel

import pandas as pd
import numpy as np
from graphviz import Digraph

The library for causal inference (causal machine learning) and a good starting point for causal AI is called DoWhy. The DoWhy library has been developed by Microsoft. It provides a framework for creating, estimating, and validating causal models. It's a powerful tool for understanding cause-and-effect relationships in your data.

5 variables in causal inference.

To understand Causal inference, you need to get an idea of what kinds of variables you can inlcude in the model when using Microsoft's DoWhy library.

There are 3 main ones:

1. Common Causes (w0, w1, w2...wN) = common causes, also known as confounders, are variables that influence both, the treatment and the outcome.

EXAMPLE: service_quality and promotional_offers are common causes because they affect both customer_satisfaction (the treatment) and referral_likelihood (the outcome).

2. Treatment (v0) = the treatment is the main variable of interest that we believe has an effect on the outcome. We want to understand how changes in the treatment affect the outcome.

EXAMPLE: customer_satisfaction is the treatment variable, and we want to see how it impacts referral_likelihood.

3. Outcome (y) = the outcome is the variable that we are interested in predicting or understanding. It is the effect or result that we think is influenced by the treatment.

EXAMPLE: referral_likelihood is the outcome variable, which we want to see if it's influenced by customer_satisfaction.

2 additional factors in causal inference:

However, if you think about it, there could be 2 other types of an effect.

What about the variable that you think has an effect on the treatment, but not directly on the outcome?

That type of variable exists and is called an instrument:

4. Instrument (z0) (advanced) = the instrument is a variable that influences the treatment but does not directly affect the outcome except through the treatment.

EXAMPLE: In real life, a webinar itself doesn’t directly make customers refer more customers. Instead, participating in the webinar makes customers feel satisfied, and it’s this satisfaction that leads to more referrals. Thus, the customer_webinar variable can be an instrument because it affects customer_satisfaction, but it doesn’t directly affect referral_likelihood.

Also, what about the variable that you think has an effect on the outcome and nothing else?

That variable also exists and is called a risk factor:

5. Risk Factor (r0) (advanced) = the risk factor is a variable that influences the outcome but does not directly affect the treatment. It can affect the outcome independently of the treatment.

EXAMPLE: The length of the relationship with the company is a risk factor because customers who have been with the company longer might be more likely to refer others due to their familiarity and trust in the company, irrespective of their current level of satisfaction. Thus, how long the customer has been a customer affects referral_likelihood but does not directly affect customer_satisfaction.

Dataset example for causal inference.

For the purpose of showing you how you can use causal inference, I simulate and create a made-up dataset with a causal treatment effect of 1.

The big advantage of using simulated data to study causal machine learning is that we can control how the data is created and know the "true" cause-and-effect relationships.

This will later help us benchmark against simple correlation methods such as naive linear regression.

The simulated customer referral dataset looks like this:

And you can also generate the same dataset with this code:

np.random.seed(4)

num_samples = 10000

df = pd.DataFrame({
    'w0_service_quality': np.random.normal(5, 2, num_samples), 
    'w1_promotional_offers': np.random.binomial(1, 0.5, num_samples), 
    'z0_customer_webinar': np.random.binomial(1, 0.3, num_samples),  
    'r0_length_of_relationship': np.random.normal(5, 2, num_samples) 
})

df['v0_customer_satisfaction'] = (
    0.5 * df['w0_service_quality'] + 
    2 * df['w1_promotional_offers'] + 
    3 * df['z0_customer_webinar'] +  
    np.random.uniform(1, 10, num_samples)  
)

df['y_referral_likelihood'] = (
    1 * df['v0_customer_satisfaction'] +  
    0.5 * df['w0_service_quality'] +  
    0.3 * df['w1_promotional_offers'] + 
    0.2 * df['r0_length_of_relationship'] + 
    np.random.normal(0, 1, num_samples)  
)

df

In this simulated made-up dataset, each line is an individual customer.

We are looking at how three variables affect customer satisfaction and referral likelihood:

Service quality (how good the service is) increases customer satisfaction.
Promotional offers (whether a customer got a special offer) also increase satisfaction if received.
Participating in a webinar strongly boosts satisfaction.

Then, we see how this customer satisfaction affects the likelihood of referrals and how the service quality and promotional offer also increase the referrals:

Higher satisfaction also directly increases the chances that a customer will refer others.
Additionally, better service quality and receiving promotional offers both add to the likelihood of referrals.

Furthermore, the length of the relationship with the company (how long the customer has been with the company in days) also plays a role:

Customers who have been with the company longer are more likely to refer others due to their familiarity and trust in the company, regardless of their current satisfaction level.

Think of how you can create your own dataset for your own department with your own independent variables and the main variable you are trying to predict the causality for. Get inspired by this example, but be creative and remember that causality can be found in any part of the business environment.

Think of different KPIs and metrics in your department. KPIs are usually denoted by multiple metrics as follows: KPI = f(Metric1,Metric2,…,MetricnN). If you are just starting out with causal inference, think of it this way: usually, the KPI is the main outcome, while different metrics would be the rest of variables — the common causes, instruments, risk factors, and treatments. These metrics influence your main KPI and you need to know how you can cause the KPI to go up/down because your performance as an employee depends on it.

Now that we have a dataset, we can continue to the first step of modeling causal inference.

Step 1 — Model

Firstly, you must understand your variables and how they are connected.

In our case, we are looking at how customer_satisfaction affects referral_likelihood, while also considering the impact of service_quality, promotional_offers, length_of_relationship, and customer_webinar.

We firstly create our graph that will represent all of the relationships in our model with this code:

graph = """
digraph {
    v0_customer_satisfaction -> y_referral_likelihood;
    w0_service_quality -> v0_customer_satisfaction;
    w1_promotional_offers -> v0_customer_satisfaction;
    w0_service_quality -> y_referral_likelihood;
    w1_promotional_offers -> y_referral_likelihood;
}
"""

Next, we can create the simplest causal model with this code:

model = CausalModel(data=df,
  treatment='v0_customer_satisfaction',
  outcome='y_referral_likelihood',  
  common_causes=['w0_service_quality', 'w1_promotional_offers'],
  graph=graph)

To visualize and define these relationships firstly a) model them in pairs, then b) create the diagram, c) download it to your directory, and d) display it in your jupyter notebook with this code:

dot = Digraph()

dot.edges([('v0_customer_satisfaction', 'y_referral_likelihood'), ('w0_service_quality', 'v0_customer_satisfaction'), ('w1_promotional_offers', 'v0_customer_satisfaction'), ('w0_service_quality', 'y_referral_likelihood'), ('w1_promotional_offers', 'y_referral_likelihood')])

dot.render(filename = 'graphviz_output', directory = "/Users/jerry/Downloads", format='png', cleanup=True)

dot

In this example, we firstly model the causal inference without the instrument (later I will show you causality with the instrument and risk factor included).

Using the code above, you will have a diagram that will look like this:

The diagram can be used to visualize the relationships and effects between the different variables. Additionally, you can now also share the downloaded png image with your colleagues.

Step 2 — Identify

Next, we must identify the causal effect we want to study.

We want to find out if customer_satisfaction (treatment) really affects referral_likelihood (outcome).

To do this, we check if we can measure (=if the causal algorithm can identify) the effect of customer_satisfaction while accounting for other variables like service_quality (common cause #1) and promotional_offers (common cause #2).

To do that, you can use the following single line of code:

identified_estimand = model.identify_effect()

Step 3 — Estimate

Now, we can estimate the causal effect.

We use a method called backdoor linear regression to calculate how much customer_satisfaction changes referral_likelihood.

The method backdoor.linear_regression is not the same as ordinal naive linear regression. It uses the identified estimand, which includes adjusting for confounders specified in your model. This means that the regression is conditioned on the common causes, which ensures a more accurate estimate of the causal effect when compared to a naive regression.

The causal backdoor linear regression gives us a number that shows how increasing customer satisfaction will increase or decrease the chances of customers referring others.

You can estimate the effect and print it out with this code:

estimate = model.estimate_effect(identified_estimand, method_name="backdoor.linear_regression")

print(estimate)

Depending on your causal inference model and the use case, remember to try different causal methods that you can use with the DoWhy library. Based on your study and dataset, you can change the method_name parameter in the code above to one of these other methods: backdoor.distance_matching, backdoor.propensity_score_stratification, backdoor.propensity_score_matching, backdoor.propensity_score_weighting, iv.instrumental_variable, iv.regression_discontinuity. You can read the full official documentation for DoWhy methods with code examples here. Remember, that most of these methods in DoWhy are going to work ONLY if you have a treatment as a binary variable (received treatment = 1, has not received treatment = 0), which resembles the more common RCT experiment.

We then get an output with the effect size:

The mean causal effect estimate of customer satisfaction on referrals is 1.001094228016882, which is almost exactly 1.

In real life, this would mean that customer satisfaction (treatment) has a positive effect on the referral likelihood (outcome) while simultaneously taking into account 2 other variables, service quality and promotional offer (common causes). In fact, increasing satisfaction by one unit on the scale, we can boost referral likelihood by around one unit (which translates to a 1% increase in this example).

Remember that we have simulated a made-up dataset with the causal effect of = 1.

Thus, this confirms that the model accurately captures the causal effect very well.

Step 4 — Refute

With DoWhy library, you can also statistically stress-test your results and see whether they are accurate and reliable.

To do so, you can choose from 4 main statistical refuter tests:

1. placebo_treatment_refuter = this test creates a placebo in your model by randomly shuffling the treatment values among different observations in the data to see whether the estimated effect changes.

You WOULD want to see the new effect results to be different (drop to 0) if you create a placebo in your model, as placebos should have no effect in real life.

You can run the placebo-treatment-refuter test with this code:

refutation = model.refute_estimate(identified_estimand, estimate, method_name="placebo_treatment_refuter", placebo_type = "permute")

print(refutation)

After running the code, we get these results:

The results show that the original estimated effect was almost 1.

As expected, after running the test with the placebo, the new effect changed, dropped drastically and is now basically at 0.

A high p-value (e.g., 0.98) in the placebo test suggests that the new effect (under placebo) is not significantly different from zero. This is a good sign, as we want to see the new placebo effect to drop to 0. This is because it now indicates that the original effect is likely due to the actual treatment (customer satisfaction) and not just random chance or other factors.

2. data_substet_refuter = this test randomly takes a subset of your data to see whether the results change.

You WOULD NOT want to see the new effect results to be different if you take a random subset from your dataset.

You can run the data-subset-refuter test with this code:

refute_results = model.refute_estimate(identified_estimand, estimate, method_name="data_subset_refuter")

print(refute_results)

After running the code, we get these results:

The results show that the original estimated effect is almost 1, as expected.

After using a subset of the data, the new effect is still very close to the original effect, at 1.000907031559731, and the p-value is 1.0.

This is a good sign, as it indicates that the original estimated effect is consistent and robust, even when just using a smaller and random subset of the data.

3. bootstrap_refuter = this test generates multiple new datasets by randomly sampling from the original dataset with replacement.

You WOULD NOT want to see the new effect results to be significantly different when re-estimating the effect on these new datasets.

You can run the bootstrap_refuter test and generate 100 new datasets (tables) all including the same number of rows as your original data frame with this code:

refute_results = model.refute_estimate(identified_estimand, estimate, method_name="bootstrap_refuter", num_simulations = 100, num_samples = len(df))

print(refute_results)

After running the code, we get these results:

The results show that the original estimated effect is almost 1, as expected.

After running the bootstrap refuter, the new effect remains very close to the original effect, at approximately 1.0023221163625238, with a p-value of 0.84.

This is a good sign, as it suggests that the original estimated effect is stable and not significantly influenced by random variation or sampling error. In simple terms, we tested the stability of the original estimated effect by creating 100 new datasets, each with 10,000 customers, and found that the estimated effects were consistent across these samples.

4. random_common_cause = this test creates a new independent variable as a new common cause (w3 in our case) in the model.

You WOULD NOT want to see the new effect results to be different if you add a new common cause to your model, as you believe you have captured all common causes already (omitted variable bias).

You can run the random-common-cause test with this code:

refute_results = model.refute_estimate(identified_estimand, estimate, method_name="random_common_cause")

print(refute_results)

After running the code, we get these results:

The results show that the original estimated effect is almost 1, as expected.

After adding a random common cause, the new effect remains very close to the original effect, at 1.0010961345441067, with a p-value of 0.98.

This is a good sign, as it suggests that the original estimated effect is robust and not significantly influenced by potential confounding factors that we have not included in the model.

Step 5 - Compare (my extra step)

Lastly, we want to compare the causal inference model results with another technique.

With our causal machine learning model, we used a method called "backdoor.linear.regression".

So let's actually compare the results with a naive linear regression and see whether the causal modelling captures the effect size of 1 more precisely or not.

To do so, we can run this code

import statsmodels.api as sm

X = df['v0_customer_satisfaction'].astype(float)
y = df['y_referral_likelihood'].astype(float)

X = sm.add_constant(X)

ols = sm.OLS(y, X).fit()

print(ols.summary().tables[1])

After running the code, we get these results:

Because the dataset was generated to have an effect exactly = 1, the naive regression shows a less precise result (naive regression = 1.106 VS causal inference model = 1.001). This suggests that linear regression is less precise in capturing the exact effect of the treatment, while the causal model performs better.

As mentioned before, the four common steps of causal inference with DoWhy are: Model, Identify, Estimate, and Refute.

However, I believe there should be a crucial fifth step: Compare.

Comparing your models with other similar methods helps evaluate the size of the error (such as MSE, RMSE).

Thus, to make it easier to remember all the steps, I've created an acronym: MIER-C. This will help you recall all 5 steps.

Adding the instrument to your causal machine learning model.

At the start of this article, I told you about the instrument, so let's add that to the model too.

Firstly, we create our graph parameter again, but now it wil include the instrument effect:

graph = """
digraph {
    v0_customer_satisfaction -> y_referral_likelihood;
    w0_service_quality -> v0_customer_satisfaction;
    w1_promotional_offers -> v0_customer_satisfaction;
    w0_service_quality -> y_referral_likelihood;
    w1_promotional_offers -> y_referral_likelihood;
    z0_customer_webinar -> v0_customer_satisfaction;
}
"""

Next, we create our model, but now include the "z0" as our instrument:

model2 = CausalModel(
    data=df,
    treatment='v0_customer_satisfaction',
    outcome='y_referral_likelihood',  
    common_causes=['w0_service_quality', 'w1_promotional_offers'], 
    instruments=['z0_customer_webinar'],
    graph=graph)

After that, we once again create our causal diagram to show the relationships between variables with this code below:

dot = Digraph()

dot.edges([('v0_customer_satisfaction', 'y_referral_likelihood'), ('w0_service_quality', 'v0_customer_satisfaction'), ('w1_promotional_offers', 'v0_customer_satisfaction'), ('Z0_customer_webinar', 'v0_customer_satisfaction'), ('w0_service_quality', 'y_referral_likelihood'), ('w1_promotional_offers', 'y_referral_likelihood')])

dot.render(filename = 'graphviz_output', directory = "/Users/jerry/Downloads", format='png', cleanup=True)

dot

We get our new diagram displayed and downloaded, so it can be shared with your colleagues:

We identify the effect size once again with the code below:

identified_estimand = model2.identify_effect()

After which we estimate the effect, but now we use the "iv.instrumental_variable" as our method:

iv_estimate = model2.estimate_effect(identified_estimand, method_name="iv.instrumental_variable") 

print(iv_estimate)

We get the results printed:

When using the instrumental variable method, the results show that the mean effect value is once again very close to 1, which is what we would expect with this simulated dataset.

In this example, we can see that the true estimate is slightly less precise than when the webinar is not included as an instrument, but it's still more precise than the naive regression.

Causal, without instrument = 0.00109422 error (1 - 1.001094228016882)
Naive regression = 0.1066 error (1 - 1.1066)
Causal, with instrument = 0.00654781 error (1–1.0065478105629306)

The key point is that the instrument (z0_customer_webinar) should be independent of the common causes you use while being a valid predictor of the treatment (v0_customer_satisfaction).

After the estimate step, we can once again refute the results with the placebo, subset, bootstrapping, and random common cause test.

Adding the risk factor to your causal machine learning model.

DoWhy does not have a specific parameter to include the risk factor directly in the model.

To properly account for the risk factor in DoWhy, you should once again include it in the “graph” parameter as a relationship affecting the outcome only:

graph = """
digraph {
    w0_service_quality -> v0_customer_satisfaction;
    w1_promotional_offers -> v0_customer_satisfaction;
    r0_length_of_relationship -> y_referral_likelihood;
    w0_service_quality -> y_referral_likelihood;
    w1_promotional_offers -> y_referral_likelihood;
    v0_customer_satisfaction -> y_referral_likelihood;
}
"""

We then create our model, pass on the treatment and outcome with the graph parameter:

model3 = CausalModel(
    data=df,
    treatment='v0_customer_satisfaction',
    outcome='y_referral_likelihood',
    common_causes=["w0_service_quality", "w1_promotional_offers"],
    graph=graph    
)

Remember you must always include treatment and outcome parameters when modeling with DoWhy (even when technically they are both included in your graph parameter).

Then, we visualize the relationships if we want to share the diagram with our colleagues:

dot = Digraph()

dot.edges([('w0_service_quality', 'v0_customer_satisfaction'), 
           ('w1_promotional_offers', 'v0_customer_satisfaction'), 
           ('risk_factor_length_of_relationship', 'y_referral_likelihood'), 
           ('w0_service_quality', 'y_referral_likelihood'), 
           ('w1_promotional_offers', 'y_referral_likelihood'), 
           ('v0_customer_satisfaction', 'y_referral_likelihood')])

dot.render(filename='graphviz_output', directory="/Users/jerry/Downloads", format='png', cleanup=True)

dot

We get our causal model diagram including the risk factor:

We identify the effect size once again with the code below:

identified_estimand = model3.identify_effect()

After which we estimate the effect:

estimate = model3.estimate_effect(identified_estimand, method_name="backdoor.linear_regression")
print(estimate)

We get the results printed and find the estimated effect to be equal 1, as expected:

We can once again see that when the causal model includes risk factor, it performs better than a naive linear regression.

Additionally, under the estimated effect, you can see that with the risk factor being included, you also get bins (red brackets in the screenshot above):

When you include a risk factor in your model, DoWhy can conditionally estimate the treatment effect at different levels of this risk factor. This involves dividing the range of the risk factor into intervals (bins). DoWhy calculates the average effect of the treatment (customer satisfaction) on the outcome (referral likelihood) separately for each bin (risk factor range), showing how the final effect varies across different ranges of the risk factor.

After including the risk factor in the estimate step, we can once again refute the results with the placebo, subset, bootstrapping, and random common cause test.

Remember, that in the DoWhy causal inference framework, you cannot use an instrumental variable (IV) and a risk factor (confounder) simultaneously within the same estimation method like backdoor.linear_regression. This is because the use of instrumental variables and the backdoor criterion are distinct methods for identifying causal effects and rely on different assumptions.

FAQ:

Here are answers for FAQs I got asked.

Why should you use causal inference (causal machine learning) and not just a naive multiple linear regression?

Causal inference performs better and is more precise than correlation methods like multivariate linear regression. With causal inference, we don’t just place all independent variables (Xs) into one multivariate linear regression model to predict the dependent variable (Y). We can adjust for other common confounders (common causes and instruments) and specifically model their direction of the effects to other variables in the causal model. There are other parameters you can adjust with causal machine learning that you can’t when using multiple regression. As you have seen, you can also “refute” your results to double check the validity and reliability using 4 different statistical tests which you can’t when using regression.

Is this causal inference with DoWhy library the same as traditional randomized controlled trials (RCTs)?

In this example, not really, but it can be (read the next question below on how). While traditional treatment-control experiments use randomization and distinct control groups to measure causal effects, the approach I've described here uses statistical methods to adjust for confounders and infer causality from observational historical data without needing a clear control group.

Is it possible and if so, how could you create a control-treatment group and perform randomized control trials (RCT) using DoWhy library?

Yes, it is possible to create control-treatment groups and perform RCTs using the DoWhy library. In fact, most of the methods are going to work only if you have a treatment as a binary variable (received treatment = 1, has not received treatment = 0). First, you model and randomly assign the v0_random_treatment variable, with 50% of the samples in the treatment group and 50% in the control group. The control group will be represented by v0_random_treatment = 0 (no treatment) and the treatment group by v0_random_treatment = 1 (treatment). This means the v0_random_treatment column in your data frame will now have only binary values (0s and 1s) instead of continuous values as in the the example I have shown you. You will then know how being part of the treatment group (=1) vs not being part of the treatment group (=0) influences the outcome you are studying. Next, you identify, estimate, and refute, the same way as I have shown you in this tutorial.

What's in it for you?

I hope you now have a better idea of how you can use causal inference to make smarter decisions in your business.

Here are the main takeaways for you:

Remember you can estimate causality using the DoWhy library in Python.
In business: causality > correlation.
The main 5 variables of causal machine learning are: common causes, instruments, risk factors, treatment, and outcome.
Based on your problem and department, brainstorm which business variables can be assigned to these five variables.
The 4 main steps of causal modeling are: model, identify, estimate, refute.
The extra 5th step I personally perform and would recommend you to perform is: compare.
Use my easy-to-remember acronym: MIER-C to remember all 5 steps of causal inference.
Use causal diagrams to a) show the causal relations between variables and to b) share them with your colleagues as png images.
The 4 main methods to double-check the robustness of your causal machine learning model and to stress-test the results are: placebo, subset, bootstrapping, and random common cause.

Applying causal inference and placebo tests to infer business causations (not just correlations).

Tomas Jancovic

Data @ Trustpilot (Microsoft-certified Power BI Data Analyst, Meta-certified Professional Data Scientist)

"Correlation does not mean causation." So let's calculate causation.

Causal inference using Microsoft's DoWhy library.

5 variables in causal inference.

There are 3 main ones:

2 additional factors in causal inference:

Dataset example for causal inference.

Step 1 — Model

Step 2 — Identify

Step 3 — Estimate

Step 4 — Refute

Step 5 - Compare (my extra step)

Adding the instrument to your causal machine learning model.

Adding the risk factor to your causal machine learning model.

FAQ:

Why should you use causal inference (causal machine learning) and not just a naive multiple linear regression?

Is this causal inference with DoWhy library the same as traditional randomized controlled trials (RCTs)?

Is it possible and if so, how could you create a control-treatment group and perform randomized control trials (RCT) using DoWhy library?

What's in it for you?

More articles by this author

Insights from the community

Explore topics

"Correlation does not mean causation." So let's calculate causation.

Causal inference using Microsoft's DoWhy library.

5 variables in causal inference.

There are 3 main ones:

2 additional factors in causal inference:

Dataset example for causal inference.

Step 1 — Model

Step 2 — Identify

Step 3 — Estimate

Step 4 — Refute

Step 5 - Compare (my extra step)

Adding the instrument to your causal machine learning model.

Adding the risk factor to your causal machine learning model.

FAQ:

Why should you use causal inference (causal machine learning) and not just a naive multiple linear regression?

Is this causal inference with DoWhy library the same as traditional randomized controlled trials (RCTs)?

Is it possible and if so, how could you create a control-treatment group and perform randomized control trials (RCT) using DoWhy library?

What's in it for you?

You didn't conduct an A/B test. You can still simulate one retrospectively.

Jul 30, 2024

Thinking fast and distributions.

Apr 25, 2024

Insights from the community

Explore topics