---
title: "Double Biased ML Model"
author: "Cheney Yu"
date: "2022-11-06"
output: pdf_document
editor_options:
chunk_output_type: inline
---
```{r setup, include=FALSE}
library(DoubleML)
```
The Bonus data set from the Pennsylvania Reemployment Bonus experiment and as a second example we simulate data from a partially linear regression model.
```{r }
# Simulate data
set.seed(3141)
n_obs = 500
n_vars = 100
theta = 3
X = matrix(rnorm(n_obs*n_vars), nrow=n_obs, ncol=n_vars)
d = X[,1:3]%*%c(5,5,5) + rnorm(n_obs)
y = theta*d + X[, 1:3]%*%c(5,5,5) + rnorm(n_obs)
```
DoubleML provides interfaces to objects of class data.table as well as R base classes data.frame and matrix.The DoubleMLData class serves as data-backend and can be initialized from a dataframe by specifying the column y_col="inuidur1" serving as outcome variable 𝑌, the column(s) d_cols = "tg" serving as treatment variable 𝐷 and the columns x_cols=c("female", "black", "othrace", "dep1", "dep2", "q2", "q3", "q4", "q5", "q6", "agelt35", "agegt54", "durable", "lusd", "husd") specifying the confounders. Alternatively a matrix interface can be used as shown below for the simulated data.
```{r }
# Specify the data and variables for the causal model
dml_data_bonus = DoubleMLData$new(df_bonus,
y_col = "inuidur1",
d_cols = "tg",
x_cols = c("female", "black", "othrace", "dep1", "dep2",
"q2", "q3", "q4", "q5", "q6", "agelt35", "agegt54",
"durable", "lusd", "husd"))
print(dml_data_bonus)
```
```{r}
# matrix interface to DoubleMLData
dml_data_sim = double_ml_data_from_matrix(X=X, y=y, d=d)
dml_data_sim
```
To estimate our partially linear regression (PLR) model with the double machine learning algorithm, we first have to specify machine learners to estimate 𝑚0 and 𝑔0. For the bonus data we use a random forest regression model and for our simulated data from a sparse partially linear model we use a Lasso regression model. The implementation of DoubleML is based on the meta-packages mlr3 for R.
```{r}
library(mlr3)
library(mlr3learners)
# surpress messages from mlr3 package during fitting
lgr::get_logger("mlr3")$set_threshold("warn")
learner = lrn("regr.ranger", num.trees=500, mtry=floor(sqrt(n_vars)), max.depth=5, min.node.size=2)
ml_g_bonus = learner$clone()
ml_m_bonus = learner$clone()
learner = lrn("regr.glmnet", lambda = sqrt(log(n_vars)/(n_obs)))
ml_g_sim = learner$clone()
ml_m_sim = learner$clone()
```
When initializing the object for PLR models DoubleMLPLR, we can further set parameters specifying the resampling:
The number of folds used for cross-fitting n_folds (defaults to n_folds = 5) as well as
the number of repetitions when applying repeated cross-fitting n_rep (defaults to n_rep = 1).
Additionally, one can choose between the algorithms "dml1" and "dml2" via dml_procedure (defaults to "dml2"). Depending on the causal model, one can further choose between different Neyman-orthogonal score / moment functions.
```{r}
set.seed(3141)
obj_dml_plr_bonus = DoubleMLPLR$new(dml_data_bonus, ml_g=ml_g_bonus, ml_m=ml_m_bonus)
obj_dml_plr_bonus$fit()
obj_dml_plr_bonus$summary()
print(obj_dml_plr_bonus)
```
```{r}
obj_dml_plr_sim = DoubleMLPLR$new(dml_data_sim, ml_g=ml_g_sim, ml_m=ml_m_sim)
```
```{r}
obj_dml_plr_sim$fit()
obj_dml_plr_sim$summary()
print(obj_dml_plr_sim)
```
## Developed by Philipp Bach, Victor Chernozhukov, Malte S. Kurz, Martin Spindler. The package in the market since 2021
hdm package has the 1991 Survey of Income and Participation data nicely wrapped up in the variable pension. Now we can map what our response, treatment and covariates are from their description in the paper and their columns in the data.
Response: net_tfa a positive or negative numerical outcome that shows the net financial assets of a person.
Covariates
age
inc: income
fsize: family size
educ: years of education
pira: participation in a IRA
hown: Home ownership status
marr: Marriage status
db: Defined benefit pension
twoearn: Two earning household.
```{r}
library(hdm)
library(ggplot2)
library(tidyverse)
library(dplyr)
treatment <- c("e401")
response <- c("net_tfa")
xActual <- c("age", "inc", "fsize", "educ", "pira", "hown", "marr", "db", "twoearn")
data(pension)
ggplot(pension, aes(x=factor(e401))) +
geom_bar() +
coord_flip() +
ggtitle("401k Eligibility")
```
```{r}
ggplot(pension, aes(x=net_tfa/1e3)) +
geom_histogram(binwidth=50) +
xlab("Net Financial Assets (thousands $)") +
ylab("Count")
```
```{r}
ggplot(pension %>%
select(age, inc, fsize, educ,
pira, hown, marr, db, twoearn) %>%
gather(Covariate, Value),
aes(x=Value)) +
geom_histogram() +
facet_wrap(~Covariate, scales="free")
```
Double machine learning is an attempt to understand the effect a treatment has on a response without being unduly influenced by the covariates.
## Step 1: Split the data
```{r}
pension %>% mutate(e401F = factor(e401, levels = c(0, 1))) -> modelData
inds <- sample.int(nrow(modelData), nrow(modelData)/2, replace=F)
dataList <- list(modelData[inds, ],
modelData[-inds, ])
```
```{r}
library(caret)
train_control <- trainControl(method="adaptive_cv",
number=10,
search = "random",
verboseIter = TRUE)
rfResponseModel <- lapply(dataList,
function(x) train(net_tfa ~ . - e401 - e401F,
method = "ranger",
tuneLength = 10,
data = x,
verbose = T,
trControl = train_control))
rfTreatmentModel <- lapply(dataList,
function(x) train(e401F ~ . - net_tfa - e401,
method="ranger",
tuneLength = 10,
data = x,
verbose = T,
trControl = train_control))
```
```{r}
calc_theta <- function(dataList, responseModel, treatmentModel){
# Predict the response in dataset 1 (2) using model 2 (1).
responsePredictions <- lapply(list(c(1,2), c(2,1)),
function(i) predict(responseModel[[i[1]]],
dataList[[i[2]]]))
# Do the same for the treatment model
treatmentPredictions <- lapply(list(c(1,2), c(2,1)),
function(i) as.numeric(predict(treatmentModel[[i[1]]],
dataList[[i[2]]])) - 1)
# Calculate the treatment residuals
treatmentResiduals <- list(dataList[[2]]$e401 - treatmentPredictions[[1]],
dataList[[1]]$e401 - treatmentPredictions[[2]])
# Calculate the response residuals
responseResiduals <- list(dataList[[2]]$net_tfa - responsePredictions[[1]],
dataList[[1]]$net_tfa - responsePredictions[[2]])
# Regress the residuals across both datasets
theta1 <- mean(treatmentResiduals[[1]] %*% responseResiduals[[1]]) / mean(treatmentResiduals[[1]] %*% dataList[[2]]$e401)
theta2 <- mean(treatmentResiduals[[2]] %*% responseResiduals[[2]]) / mean(treatmentResiduals[[2]] %*% dataList[[1]]$e401)
# Take the average as our treatment effect estimator
mean(c(theta1, theta2))
}
calc_theta(dataList, rfResponseModel, rfTreatmentModel)
```
```