Causal inference and causal analysis II - Counterfactuals, Randomization and Confounder isolation/correction
In the previous edition on causal inference, some weaknesses of causal regularity theories were discussed, such as co-occurrence and the effects of third, confounding variables.
Also, to test for a causal association in research, it is very important to have a direct comparison between the presence and the absence of the presumed cause.
This can be achieved with a counterfactual design. An exposure vs. non-exposure comparison yields this design. There are many other terms and designs, such as case-control, treatment vs. non-treatment, modification vs. no modification and many others, which have a counterfactual aspect.
But all of these allow for confounders unless the confounding bias is corrected. To correct for this bias, here are my top 3 strategies that can be applied:
- Experimental isolation
- Randomization
- Instrumental variable introduction
These correction strategies actually empower the counterfactual design; without them, the counterfactual design would not have its stability.
Experimental isolation is the strongest of the bias-correcting instruments. It means setting up an experiment in such a way that no unwanted outside influences can enter the experimental setting.
This procedure is often not easy and can be expensive. Also, absolute isolation is often not possible in environments such as in vivo life science, most business analytics, social science, psychology and many other areas.
In fact, the majority of the data we have today does not come from isolated experiments, and neither does the majority of research data.
So, for most data we have today, there are actually two main strategies for correcting confounder bias in causal inference: randomization or instrumental variables.
- Randomization
In this procedure, each observation is assigned to one of the defined groups at random, with equal probability. These can be exposed and control groups, therapy and control groups in randomized controlled trials, or the groups of any other randomized counterfactual design. Randomization is done at the beginning of the study, and the subjects are given random, equal probabilities (they can also be unequal in some cases). With two groups and equal probabilities, random allocation corrects for possible allocation bias, since the probability of ending up in either group is 0.5 and nothing but randomness affects it.
However, the most interesting part from a biostatistical point of view is that the same principle also applies to the unknown confounders. They too are spread across the groups at random, with 0.5 probability of landing in either one. Randomization is therefore said to correct both the allocation bias and the confounder bias. In causal inference, correcting the confounder bias is a must. Randomization and the counterfactual design are the reasons why we consider RCTs (randomized controlled trials) a form of causal analysis; they have been the gold standard in clinical trials for decades and are still undisputed in that area.
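To make the point about unknown confounders concrete, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: the function name `compare_allocations`, the hidden confounder `u`, the treatment effect of 2.0 and the confounder weight of 3.0 are all invented, and the data are synthetic. It compares a naive difference in group means when subjects select themselves into treatment with the same comparison under 0.5-probability random allocation.

```python
import random
import statistics


def compare_allocations(n=20_000, true_effect=2.0, seed=0):
    """Compare self-selected vs. randomized allocation on synthetic data."""
    rng = random.Random(seed)
    outcomes = {("self", 0): [], ("self", 1): [], ("rand", 0): [], ("rand", 1): []}
    confounder_by_rand_group = {0: [], 1: []}

    for _ in range(n):
        u = rng.gauss(0, 1)      # hidden confounder (assumed, never "observed")
        noise = rng.gauss(0, 1)  # outcome noise

        # Self-selection: subjects with a high u are more likely to be treated.
        t_self = 1 if u + rng.gauss(0, 1) > 0 else 0
        # Randomization: treatment depends on nothing but a fair coin.
        t_rand = 1 if rng.random() < 0.5 else 0

        # The outcome depends on treatment AND on the confounder (weight 3.0).
        outcomes[("self", t_self)].append(true_effect * t_self + 3.0 * u + noise)
        outcomes[("rand", t_rand)].append(true_effect * t_rand + 3.0 * u + noise)
        confounder_by_rand_group[t_rand].append(u)

    def mean_difference(kind):
        return (statistics.mean(outcomes[(kind, 1)])
                - statistics.mean(outcomes[(kind, 0)]))

    return {
        "self_selected_estimate": mean_difference("self"),
        "randomized_estimate": mean_difference("rand"),
        # How different is the hidden confounder between the randomized groups?
        "confounder_gap_randomized": (
            statistics.mean(confounder_by_rand_group[1])
            - statistics.mean(confounder_by_rand_group[0])
        ),
    }


if __name__ == "__main__":
    for name, value in compare_allocations().items():
        print(f"{name}: {value:.3f}")
```

Under these assumptions, the self-selected estimate should come out well above the assumed effect of 2.0, while the randomized estimate should land close to it, because the hidden confounder ends up almost equally distributed across the two randomized groups.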
The law of large numbers states that as the sample size increases, we can expect the observed proportions to converge to these probabilities. Imagine a coin toss. The probability of heads is P(H)=0.5 and the probability of tails is P(T)=0.5. If you toss a coin 10 times, the expected result is 5 vs. 5. But in reality a small sample will often yield a different result: we may get 7-3, heads vs. tails, and in another 10 tosses 4-6. As the number of tosses increases, let's say to millions, the numbers of heads and tails become almost equal, so the tosses in reality converge to the initial probability of 0.5. The same principle applies in random allocation: if we want the randomization probabilities to be reflected in reality, the sample has to be large enough.
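The coin toss argument can be checked with a few lines of code. This is only a sketch: the helper name `head_share`, the seed and the chosen sample sizes are arbitrary assumptions, and the exact numbers will differ with another seed.

```python
import random


def head_share(n_tosses, rng):
    """Toss a fair coin n_tosses times and return the observed share of heads."""
    heads = sum(rng.random() < 0.5 for _ in range(n_tosses))
    return heads / n_tosses


rng = random.Random(42)  # fixed seed so the run is repeatable
shares = {n: head_share(n, rng) for n in (10, 100, 10_000, 1_000_000)}

for n, share in shares.items():
    # Small samples can sit far from 0.5; large ones should be very close.
    print(f"{n:>9} tosses: share of heads = {share:.4f}")
```

With only 10 tosses the share of heads can easily be 0.3 or 0.7, while with a million tosses it should sit very close to 0.5. The same logic is why small randomized groups can still be unbalanced by chance.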
A/B tests - The same counterfactual, randomized design is deployed (and very effectively) in A/B tests, especially in the business area. Why is this important for businesses? Well, they can collect data for randomized experiments much more easily than clinical trials can. The principle is very similar, except that we call the groups A and B for terminology's sake.
Randomized A/B tests are some of the most powerful tools for most analysts in the business area. In fact, there are many businesses that focus mostly on running A/B tests for other countries and have a big market there.
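As a rough illustration of how such a test can be evaluated, here is a small sketch on synthetic data. The function name `run_ab_test`, the conversion rates of 10% and 12% and the number of users are assumptions made up for the example, and the interval uses a simple normal approximation rather than any particular vendor's method.

```python
import math
import random


def run_ab_test(n_users=50_000, rate_a=0.10, rate_b=0.12, seed=7):
    """Randomly allocate users to A or B and compare conversion rates."""
    rng = random.Random(seed)
    counts = {"A": 0, "B": 0}
    conversions = {"A": 0, "B": 0}

    for _ in range(n_users):
        # Random allocation with probability 0.5, as in an RCT.
        group = "A" if rng.random() < 0.5 else "B"
        rate = rate_a if group == "A" else rate_b  # assumed true rates
        counts[group] += 1
        conversions[group] += rng.random() < rate  # True counts as 1

    p_a = conversions["A"] / counts["A"]
    p_b = conversions["B"] / counts["B"]
    diff = p_b - p_a
    # Standard error of a difference between two independent proportions.
    se = math.sqrt(p_a * (1 - p_a) / counts["A"] + p_b * (1 - p_b) / counts["B"])
    return {
        "counts": counts,
        "rate_a": p_a,
        "rate_b": p_b,
        "difference": diff,
        "ci_95": (diff - 1.96 * se, diff + 1.96 * se),
    }


if __name__ == "__main__":
    result = run_ab_test()
    print(result)
```

If the 95% interval for the difference excludes zero, the randomized design allows reading the difference as an effect of the B variant, which is exactly the counterfactual comparison described above.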
Both clinical trials and A/B testing companies require longitudinal follow-up periods to gain insights in causal inference, and both have a certain controlled counterfactual design. Well, all this is not always possible. In fact, a lot of data, let's say in bioinformatics and the biomedical industry, is not longitudinal. It is mostly observational and does not have the time aspect. This makes the use of a randomized counterfactual design, well, impossible in many cases.
So what is the golden solution in that case?
Instrumental variables. This method enables correcting for the bias, but in a very specific way. It is mostly adapted to regression problems, where the relation between the independent variable, the dependent variable, the confounders and an additional instrumental variable is considered.
If there is a variable that does not influence the outcome directly, but does have an effect on the independent variable, such a variable, an instrumental variable, can be used to correct for a significant portion of the bias. Using this principle, we are essentially introducing a source of variability that is random with respect to the dependent variable (the output) and to the confounders, but is not random with respect to the independent variable. The part of the independent variable's variation that comes from the instrument is therefore free of the confounders, which removes that portion of the bias.
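The idea can be sketched on synthetic data with a single instrument. All names and numbers below are assumptions for illustration: `z` is the instrument, `x` the independent variable, `y` the outcome, `u` an unobserved confounder, and the true effect is set to 1.5. With one instrument, the estimate reduces to the simple ratio cov(z, y) / cov(z, x), which is used here in place of a full two-stage regression.

```python
import random


def cov(a, b):
    """Sample covariance of two equally long lists."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b)) / (len(a) - 1)


def simulate_iv(n=50_000, true_effect=1.5, seed=3):
    """Return (naive regression slope, instrumental variable estimate)."""
    rng = random.Random(seed)
    z, x, y = [], [], []
    for _ in range(n):
        u = rng.gauss(0, 1)    # unobserved confounder, affects both x and y
        zi = rng.gauss(0, 1)   # instrument: independent of u, affects only x
        xi = 0.8 * zi + u + rng.gauss(0, 1)
        yi = true_effect * xi + 2.0 * u + rng.gauss(0, 1)
        z.append(zi)
        x.append(xi)
        y.append(yi)

    naive_slope = cov(x, y) / cov(x, x)  # ordinary regression of y on x, biased by u
    iv_estimate = cov(z, y) / cov(z, x)  # uses only the variation in x driven by z
    return naive_slope, iv_estimate


if __name__ == "__main__":
    naive, iv = simulate_iv()
    print(f"naive slope: {naive:.3f}")
    print(f"IV estimate: {iv:.3f}")
```

Under these assumptions the naive slope should overstate the effect, while the instrumental variable estimate should land close to the assumed 1.5. This only works because the simulated instrument really is unrelated to the confounder; with real observational data, that has to be argued rather than checked.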
It is not as powerful as a randomized controlled experiment when used on observational data, but still useful in most of today's research areas. Instrumental variable methods are considered one of the top causal inference approaches today. But there is a catch with these methods. Since they are mostly used on observational or similar data, they don't have the controlled longitudinal aspect, which is another weakness. There are situations where this longitudinal aspect can be assumed. More on that in one of the next articles in Advanced Stata and Data Science.
Thanks for reading!
Darko
Statistical Analyst