CausalInference Tutorial
In this tutorial we're going to cover techniques such as Inverse Propensity Score Weighting, Matching, the Doubly Robust Weighted Estimator, Trimming, and Stratification.
We want to measure the effect of smoking cessation on weight gain using the National Health and Nutrition Examination Survey Epidemiologic Followup Study (NHEFS) dataset. The study ran for 11 years (1971-1982). Each subject was weighed twice: first at the beginning of the study in 1971 and again at the end, in 1982.
import pandas as pd
from causalinference import CausalModel
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
data = pd.read_csv("https://cdn1.sph.harvard.edu/wp-content/uploads/sites/1268/1268/20/nhefs.csv")
data.columns
data.shape
data.info()
data.head()
data.describe()
data.isnull().sum()
We're going to remove people who did not survive until the end of the study in 1982. We can check this by testing whether the field 'wt82' (the weight measured in 1982) has a value.
data = data.dropna(subset=['wt82'])
data.shape
From https://cdn1.sph.harvard.edu/wp-content/uploads/sites/1268/1268/20/hernanrobins_v2.17.21.pdf
Our goal is to estimate the average causal effect of smoking cessation (the treatment) A on weight gain (the outcome) Y. To do so, we will use data from 1566 cigarette smokers aged 25-74 years who, as part of the NHEFS, had a baseline visit and a follow-up visit about 10 years later.
- The treatment variable (A) is qsmk (quit smoking). Individuals were classified as treated (A=1) if they reported having quit smoking before the follow-up visit, and as untreated (A=0) otherwise.
data['qsmk'].value_counts(ascending=True)
data.qsmk.value_counts(normalize=True)
A = data['qsmk']
- The outcome variable (Y) is wt82_71 (the individual's body weight at the follow-up visit in 1982 minus the individual's body weight at the baseline visit in 1971).
data['wt82_71'].value_counts()
Y = data['wt82_71']
- The confounding variables (X) (variables that influence both the treatment and the outcome, causing spurious correlations) will be:
cf = ["active", # Measure of daily activity in 1971.
      "age", # Age in 1971.
      "education", # Amount of education in 1971: from 8th grade to college level.
      "exercise", # Measure of recreational exercise.
      "race", # Caucasian or not.
      "sex", # Female or male.
      "smokeintensity", # Number of cigarettes smoked per day in 1971.
      "smokeyrs", # Years of smoking.
      "wt71" # Weight in kilograms in 1971.
     ]
X = data[cf]
print(X.shape)
X.head()
X['active'].value_counts()
X['exercise'].value_counts()
X['education'].value_counts()
- Distribution of the outcome in relation to the control and treatment groups
data[data.qsmk==0].wt82_71.count()
data[data.qsmk==1].wt82_71.count()
We have a very imbalanced dataset, with 1163 samples in the control group and 403 in the treatment group.
sns.kdeplot(data[data.qsmk==0].wt82_71, label='control')
sns.kdeplot(data[data.qsmk==1].wt82_71, label='treatment')
plt.legend()
sns.histplot(data[data.qsmk==1]['wt82_71'], bins=20)
sns.histplot(data[data.qsmk==0]['wt82_71'], bins=20)
data.groupby('qsmk')['wt82_71'].plot(kind='hist', bins=20, alpha=0.8, legend=True)
data.groupby('qsmk')['wt82_71'].agg(['mean', 'median'])
The mean weight gain for the treatment group is greater than for the control group. Could we conclude that the treatment had a positive impact (more weight gain for subjects who quit smoking)? We have to remember that the dataset is imbalanced and that these are unadjusted means.
- Distribution of several covariates in relation to the control and treatment groups
data.groupby('qsmk')['age'].plot(kind='hist', bins=20, alpha=0.8, legend=True)
data.groupby('qsmk')['education'].plot(kind='hist', bins=20, alpha=0.8, legend=True)
data.groupby('qsmk')['active'].plot(kind='hist', bins=20, alpha=0.5, legend=True)
data.groupby('qsmk')['exercise'].plot(kind='hist', bins=20, alpha=0.5, legend=True)
Under the consistency assumption, the observed outcome equals the potential outcome for the treatment actually received:
$$Y = Y(0) \text{ when } A = 0, \qquad Y = Y(1) \text{ when } A = 1$$
We want to estimate the ATE. The real ATE, as measured in a Randomized Controlled Trial (interventional distribution), is defined as:
$$\Delta_{real} = E[Y(1) - Y(0)]$$
But if we compute the analogous quantity from observational data:
$$\Delta_{obs} = E[Y|A=1] - E[Y|A=0] = E[Y(1)|A=1] - E[Y(0)|A=0] \neq \Delta_{real}$$
because, in general,
$$E[Y(i)|A=i] \neq E[Y(i)]$$
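We can compute this naive Δobs directly from the data; it is just the unadjusted difference in group means, not the causal ATE:
data[data.qsmk==1].wt82_71.mean() - data[data.qsmk==0].wt82_71.mean()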
Ignorability assumption: conditional on the covariates X, the potential outcomes are independent of the treatment assignment A:
$$Y(0), Y(1) \perp A \mid X$$
$$P(A, Y(0), Y(1) \mid X) = P(A \mid X)\, P(Y(0), Y(1) \mid X)$$
The treatment is assumed to be unconfounded in the sense that the dependence between the treatment assignment and the outcomes is only through something we observe, namely the covariates X.
We're going to assume that:
$$Y(0) = \alpha + \beta X + \epsilon$$
where ϵ (the residual) represents noise (Y(0) does not depend on the treatment variable A), and
$$Y(1) = Y(0) + \gamma$$
Combining the two, we can estimate the ATE γ by fitting:
$$Y = \alpha + \beta X + \gamma A + \epsilon$$
model = CausalModel(Y=data.wt82_71.values, D=data.qsmk.values, X=X.values)
The parameter adj indicates how covariate adjustments are to be performed. Set adj = 0 to not include any covariates. Set adj = 1 to include treatment indicator D and covariates X separately. Set adj = 2 to additionally include interaction terms between D and X. Defaults to 2.
model.est_via_ols(adj=1)
print(model.estimates)
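Since the default is adj = 2, we can rerun the estimate including the interaction terms between D and X for comparison:
model.est_via_ols(adj=2)
print(model.estimates)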
The estimated ATE with adj=1 is 3.349 (a positive weight gain associated with quitting smoking). If our model accurately describes the confounders X, we get a 95% confidence interval: if we repeated the experiment many times, an interval computed this way would contain the true ATE about 95% of the time, though any particular interval may still miss it.
95% Confidence Interval = AVG ± 1.96 * np.sqrt(data[data.qsmk==1].wt82_71.var()/data[data.qsmk==1].shape[0] + data[data.qsmk==0].wt82_71.var()/data[data.qsmk==0].shape[0])
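A runnable version of that formula (a sketch of the large-sample interval for the unadjusted difference in means; the OLS interval above is computed after adjusting for X, so the two won't match exactly):
treated = data[data.qsmk==1].wt82_71
control = data[data.qsmk==0].wt82_71
avg = treated.mean() - control.mean()
se = np.sqrt(treated.var()/treated.shape[0] + control.var()/control.shape[0])
print('95% CI: [' + str(avg - 1.96*se) + ', ' + str(avg + 1.96*se) + ']')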
For each sample, we're going to take the most similar one from the other treatment group and estimate the ATE using these matched pairs. To define 'most similar', we choose the nearest neighbour in the covariate space X, weighting the distances by the inverse of the variance of each covariate.
model = CausalModel(Y=data.wt82_71.values, D=data.qsmk.values, X=X.values)
model.est_via_matching()
print(model.estimates)
ATC (Average Treatment Effect for the Controls) = E[Y(1) - Y(0) | A=0]
ATT (Average Treatment Effect for the Treated) = E[Y(1) - Y(0) | A=1]
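To make the matching idea concrete, here is a minimal hand-rolled sketch of one-to-one nearest-neighbour matching with inverse-variance weighted distances. It estimates only the ATT (each treated subject is matched to its closest control) and is an illustration, not the library's exact algorithm:
# Scale covariates by their standard deviations so that squared Euclidean
# distance weights each covariate by the inverse of its variance.
Xs = X.values / X.values.std(axis=0)
treated_idx = np.where(A.values == 1)[0]
control_idx = np.where(A.values == 0)[0]
diffs = []
for i in treated_idx:
    dist = ((Xs[control_idx] - Xs[i])**2).sum(axis=1)
    nearest = control_idx[dist.argmin()]
    diffs.append(Y.values[i] - Y.values[nearest])
print('Hand-rolled ATT estimate: ' + str(np.mean(diffs)))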
The parameter matches sets the number of matches to use for each subject, and the parameter bias_adj indicates whether bias adjustments should be attempted (https://www.ncbi.nlm.nih.gov/books/NBK424728/).
model.est_via_matching(weights='inv', matches=2)
print(model.estimates)
model.est_via_matching(weights='inv', matches=2, bias_adj=True)
print(model.estimates)
The propensity score is the probability of receiving the treatment given the covariates:
$$e(X) = P(A=1|X)$$
We want to know the value of the potential outcome means E[Y(i)], but from observational data we only have access to E[Y(i) | A=i]. However, under ignorability, weighting each observed outcome by the inverse of its propensity score recovers the potential outcome means:
$$E[Y(1)] = E\left[\frac{A\,Y}{e(X)}\right], \qquad E[Y(0)] = E\left[\frac{(1-A)\,Y}{1-e(X)}\right]$$
Intuitively, each treated sample is re-weighted by 1/e(X) so that it also stands in for similar subjects who were not treated, and symmetrically for the controls.
$$\Delta_{IPSW} = \frac{1}{N}\left(\sum_{i \in \text{Treatment}} \frac{y_i}{e(X_i)} - \sum_{i \in \text{Control}} \frac{y_i}{1 - e(X_i)}\right)$$
model = CausalModel(Y=data.wt82_71.values, D=data.qsmk.values, X=X.values)
The est_propensity_s() method fits a logistic regression of the treatment on the covariates, using a data-driven algorithm to select which linear and quadratic terms to include.
model.est_propensity_s()
propensity = model.propensity['fitted']
df = data.copy()  # work on a copy so we don't add columns to the original frame
df['ips'] = np.where(df.qsmk == 1, 1/propensity, 1/(1 - propensity))  # inverse propensity score weight for each sample
df['ipsw'] = df.wt82_71 * df['ips']
ipswe = (df[df.qsmk==1]['ipsw'].sum() - df[df.qsmk==0]['ipsw'].sum())/df.shape[0]
print('The estimated ATE computed using IPSWE is: ' + str(ipswe))
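Before trusting these weights, it's worth checking overlap: the two groups' propensity score distributions should share common support, since scores near 0 or 1 produce very large weights (this motivates the trimming step later):
sns.kdeplot(propensity[data.qsmk.values==1], label='treatment')
sns.kdeplot(propensity[data.qsmk.values==0], label='control')
plt.legend()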
model = CausalModel(Y=data.wt82_71.values, D=data.qsmk.values, X=X.values)
print(model.summary_stats)
Large values of the normalized differences (Nor-diff) between the two groups' covariate means indicate that simple linear adjustment methods may not be adequate for removing the biases associated with differences in covariates.
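As a sanity check, the normalized difference can be computed by hand for a single covariate (a sketch using the standard formula: the difference in group means scaled by the pooled standard deviation):
t = data[data.qsmk==1]['age']
c = data[data.qsmk==0]['age']
print('Normalized difference for age: ' + str((t.mean() - c.mean()) / np.sqrt((t.var() + c.var()) / 2)))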
We're going to estimate the ATE with the doubly robust weighted estimator: a weighted linear regression in which each sample is weighted by its inverse propensity score.
model = CausalModel(Y=data.wt82_71.values, D=data.qsmk.values, X=X.values)
model.est_propensity_s()
model.est_via_weighting()
print(model.estimates)
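For intuition, a similar weighted regression can be run by hand with statsmodels, reusing the ips column computed earlier (a sketch: causalinference's est_via_weighting is a doubly robust variant, so the numbers won't match exactly):
import statsmodels.api as sm

# Regress the outcome on a constant, the treatment indicator, and the
# covariates, weighting each sample by its inverse propensity score.
exog = sm.add_constant(np.column_stack([data.qsmk.values, X.values]))
wls = sm.WLS(data.wt82_71.values, exog, weights=df['ips'].values).fit()
print('Weighted regression ATE estimate: ' + str(wls.params[1]))  # coefficient on qsmk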
The trim_s() method selects a cutoff automatically and then removes the samples whose estimated propensity score is <= cutoff or >= (1 - cutoff).
model = CausalModel(Y=data.wt82_71.values, D=data.qsmk.values, X=X.values)
model.est_propensity_s()
propensity = model.propensity['fitted']  # fetch the scores before trimming so they stay aligned with data
model.trim_s()
cutoff = model.cutoff
model.est_via_matching()
print(model.estimates)
print(propensity)
print(cutoff)
mask = (propensity > cutoff) & (propensity < 1 - cutoff)
data[mask].plot.scatter(x='active', y='wt82_71', c='qsmk', cmap='rainbow', colorbar=True)
If we don't have prior knowledge of what 'similar samples' means, we can group the samples into strata of similar propensity scores and then compute the ATE within each stratum of subjects.
model = CausalModel(Y=data.wt82_71.values, D=data.qsmk.values, X=X.values)
model.est_propensity_s()
model.stratify_s()
print(model.strata)
We can compute the overall ATE from the different strata using the est_via_blocking method. To achieve that, this function estimates the average treatment effect within each block using regression, and then aggregates the block estimates.
model.est_via_blocking()
print(model.estimates)
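For intuition, here is a hand-rolled version of the same idea: cut the propensity scores into quintiles, estimate a difference in means within each stratum, and average the per-stratum effects weighted by stratum size (a sketch, not est_via_blocking's regression-based procedure):
ps = model.propensity['fitted']
strata = pd.qcut(ps, 5, labels=False)  # quintiles of the propensity score
effects, sizes = [], []
for s in range(5):
    sub = data[strata == s]
    effects.append(sub[sub.qsmk==1].wt82_71.mean() - sub[sub.qsmk==0].wt82_71.mean())
    sizes.append(len(sub))
print('Stratified ATE estimate: ' + str(np.average(effects, weights=sizes)))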