MATH888

Causal Inference

I’m interested in causal inference: I used it during an internship this summer, found it useful and interesting, and I believe it could also be applied to my current research topic, fairness.

1.1 What question do you want to answer?

Could causal inference be used in fairness, and if so, how?

1.2 And why is it important to answer it?

Decision making is usually based on a candidate’s features, which may contain sensitive attributes such as gender and race. Could we use do-calculus to avoid this issue?

  1. What (observed or unobserved) random variables are needed to fully model the problem?

Sex, gender, ability score, and other related random variables.

  1. What is the causal effect that you wish to study?

(1) Use do-calculus to remove the effect of sensitive attributes; (2) use the causal graph to improve generalization.

  1. What are your working hypotheses about the relationship between variables?

Since a candidate has many features, it’s hard to decide the causal graph by hand. Thus I consider using a (currently unknown) method to generate the causal graph automatically.

HW3 – Proof

What assumptions are you making for identification, and why? In many causal inference papers, you may need some assumptions like exchangeability, stable-unit value assumption, consistency, and positivity or modifications of these assumptions. You may need to re-emphasize the assumed DAG that describes relationships between variables. Discuss whether or not your assumptions are reasonable.

We consider a pre-trained recommendation problem:

We have several datasets from several countries ($C$): America ($C=A$), England ($C=E$), and Japan ($C=J$).

In each domain, the dataset contains many items ($I$), users ($U$), and observed decisions ($D$) indicating which users bought which items.

The causal graph could be drawn as: [figure: a DAG in which $C$, $U$, and $I$ each point into $D$]

The country, the user, and the item together decide the decision – whether the user will buy this item or not. Assume we observe data (the pre-training data) from America ($C=A$) and England ($C=E$), and we want to apply the estimate to Japan ($C=J$, the zero-shot dataset). We want to show that the estimate is not applicable to Japan if we make no further assumptions about the relationships between countries $A, E, J$.

The problem can be regarded as selection bias, where the country of a sample determines the selection indicator $S$: $C\in\{A,E\}$ implies $S=1$ and $C=J$ implies $S=0$. We want to show that $E(D(u,i)\mid U=u,I=i,S=1)\neq E(D(u,i)\mid U=u,I=i,S=0)$ is possible, i.e., the quantity is not s-recoverable. We will mimic the proof flow in https://amy-cochran.gitbook.io/causal-inference/other-considerations/selection-bias. We assume consistency and SUTVA (the stable unit treatment value assumption).

What steps are needed for identification? Here, you should be providing a step-by-step proof.

Here we provide a counterexample showing $E(D(u,i)\mid U=u,I=i,S=1)\neq E(D(u,i)\mid U=u,I=i,S=0)$:

We assume $U \sim \mathrm{Bernoulli}(1/2)$ and $I \sim \mathrm{Bernoulli}(1/2)$, with the decision formula $D=(U+I+S)\bmod 2$. Then for $u=i=1$,

$1=(u+i+1)\bmod 2=E(D\mid U=u,I=i,S=1)\neq E(D\mid U=u,I=i,S=0)=(u+i)\bmod 2=0$. (Consistency gives $E(D\mid U=u,I=i,S=1)=E(D(u,i)\mid U=u,I=i,S=1)$.)
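The counterexample can be checked numerically. This is just a sketch: once we condition on $u=i=1$, the decision formula is deterministic, so the conditional means are read off directly.

```python
# Counterexample check: D = (U + I + S) mod 2.
# Conditional on U = u, I = i, the mean of D differs between S = 1 and S = 0.
def decision(u, i, s):
    return (u + i + s) % 2

u, i = 1, 1
e_s1 = decision(u, i, 1)  # E(D | U=1, I=1, S=1)
e_s0 = decision(u, i, 0)  # E(D | U=1, I=1, S=0)
print(e_s1, e_s0)  # 1 0
```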

What justifies each step in your identification proof? Here, I simply want you to justify each step of your proof. For example, if you are assuming consistency, clearly identify which step relies on this assumption.

$E(D(u,i)\mid S=1)=(u+i+1)\bmod 2$ and $E(D(u,i)\mid S=0)=(u+i)\bmod 2$ follow from the assumption $D=(U+I+S)\bmod 2$. (Consistency gives $E(D\mid U=u,I=i,S=1)=E(D(u,i)\mid U=u,I=i,S=1)$.)

How do we interpret the conclusions from your identification proof? You should make some concluding remarks about identification. For example, you could point out what variables are needed to identify the causal effect or you may relate the final quantity to some other quantity you have seen before from class, a paper, or a textbook.

If we want the pre-training dataset to yield useful estimates on the zero-shot dataset, we need at least some extra assumptions about the relationship between the pre-training datasets and the zero-shot dataset. Otherwise we could reach wrong estimates.

HW4 – Implementation

What algorithms are you using for estimation/implementation? (Some examples could be outcome regression, matching, or inverse probability weighting. You may need to re-emphasize the assumptions you are making for identification.)

I use outcome regression and a ‘simple’ inverse probability weighting for samples from different domains.

What steps are needed for estimation/implementation? (You should be detailing an algorithm step-by-step in words or with pseudo-code.)

Outcome regression: (1) an item $i$ is represented by a vector $v_i$, which can be extracted from the item’s description via a language model; (2) a user $j$ in country $k$ is represented by $u_j+d_k$, where $u_j$ is extracted from the user’s purchasing history and $d_k$ is an unknown country vector, assumed to follow $\mathcal{N}(0,\frac{1}{\lambda}I)$; (3) how much a user $u_j$ in country $k$ likes (will purchase) an item $v_i$ is represented by $\langle v_i, u_j+d_k \rangle$.

‘Simple’ inverse probability weighting for samples from different domains: (1) if a domain $k$ has $N_k$ samples, then every sample in that domain is weighted by $\frac{1}{N_k}$. A purchasing sample is represented by $(v_{m_k}, u_{m_k}, d_k)$, meaning that a user $u_{m_k}$ in country $k$ purchased an item $v_{m_k}$.
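The per-domain weighting above can be sketched in a few lines (the domain labels and counts here are made up for illustration):

```python
from collections import Counter

# Each sample carries a domain label; a sample from domain k gets weight 1 / N_k.
domains = ["A", "A", "A", "E", "E", "J"]   # hypothetical domain labels
counts = Counter(domains)                   # N_k for each domain
weights = [1.0 / counts[d] for d in domains]

# Every domain now contributes total weight 1, regardless of its sample count,
# so large domains no longer dominate the objective.
print(weights)
```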

If causal:

the optimization objective is: $\min \sum_k \frac{1}{N_k}\sum_{m_k\in\{1,2,\dots,N_k\}} \mathcal{L}(v_{m_k}, u_{m_k}, d_k) + \frac{\lambda}{2}\sum_k\|d_k\|^2$, where $\mathcal{L}(v_{m_k}, u_{m_k}, d_k)$ is the InfoNCE loss with positive pair $(v_{m_k}, u_{m_k}+d_k)$ and negative pairs $(\text{items other than } v_{m_k},\; u_{m_k}+d_k)$. Optimized via AdamW.
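A sketch of one term of this objective in NumPy (not the actual training code; the embedding dimension, random vectors, and the use of all other items as negatives are illustrative assumptions):

```python
import numpy as np

def info_nce(item_vecs, user_vec, pos_idx):
    """InfoNCE loss: score the positive item against all other items as negatives."""
    scores = item_vecs @ user_vec                     # <v, u + d_k> for every item
    log_probs = scores - np.logaddexp.reduce(scores)  # log softmax over items
    return -log_probs[pos_idx]

rng = np.random.default_rng(0)
dim, n_items = 8, 50
items = rng.normal(size=(n_items, dim))   # v_i (e.g., extracted by a language model)
u_j = rng.normal(size=dim)                # user vector from purchase history
d_k = rng.normal(size=dim)                # country vector, assumed ~ N(0, I/lambda)

lam = 1.0
loss = info_nce(items, u_j + d_k, pos_idx=3)
objective = loss + lam / 2 * np.dot(d_k, d_k)  # InfoNCE term + (lambda/2) ||d_k||^2
print(float(loss), float(objective))
```

In the full objective, these per-sample terms are averaged with the $\frac{1}{N_k}$ weights and summed over countries, then minimized with AdamW.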

When estimating user–item interest in a new country, we use $\mathbb{E}_{d\sim \mathcal{N}(0,\frac{1}{\lambda}I)}\left[\langle v_i, u_j+d \rangle\right]$ for how much a user will like an item. (Note: this formula can be used for zero-shot.)
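This expectation can be approximated by Monte Carlo sampling over the country prior. A sketch, with made-up vectors; note that because the score is linear in $d$ and the prior has mean zero, the average converges to $\langle v_i, u_j\rangle$:

```python
import numpy as np

rng = np.random.default_rng(1)
dim, lam = 8, 4.0
v_i = rng.normal(size=dim)   # item vector
u_j = rng.normal(size=dim)   # user vector

# Monte Carlo estimate of E_{d ~ N(0, I/lambda)} <v_i, u_j + d>.
d_samples = rng.normal(scale=1.0 / np.sqrt(lam), size=(100_000, dim))
mc_score = np.mean((u_j + d_samples) @ v_i)

# By linearity and the zero-mean prior, this approaches <v_i, u_j>.
print(mc_score, float(v_i @ u_j))
```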

If no causal:

the optimization objective is: $\min \sum_k \frac{1}{N_k}\sum_{m_k\in\{1,2,\dots,N_k\}} \mathcal{L}(v_{m_k}, u_{m_k})$, where $\mathcal{L}(v_{m_k}, u_{m_k})$ is the InfoNCE loss with positive pair $(v_{m_k}, u_{m_k})$ and negative pairs $(\text{items other than } v_{m_k},\; u_{m_k})$. Optimized via AdamW.

When estimating user–item interest in a new country, we use $\langle v_i, u_j\rangle$ for how much a user will like an item. (Note: this formula can be used for zero-shot.)

How exactly does data factor into estimation/implementation? (You might detail the (ideal) dataset that would be used by the proposed algorithm, including its variables, setting, sample size, etc.)

The data is the Cross-Market Recommendation (XMRec) dataset. It includes 52.5 million purchasing samples from different countries, with item descriptions and user histories.

What can you say about the results/performance of the algorithm?

The model is pretrained on India, Spain, and Canada and evaluated zero-shot on Australia. Without the causal mechanism, the Recall@K% is 0.0583; with it, the Recall@K% is 0.0610. The causal method slightly improves the zero-shot performance.

HW5 – Sensitivity Analysis: Add to your public webpage how your previous conclusions or inferences are sensitive to assumptions or models.

What assumptions have you made that are important to your inferences/conclusions? (Some examples could be no unmeasured confounding or that a conditional mean is linear in its arguments.)

The assumption is that different countries’ $d_k$ are randomly sampled from the same Gaussian distribution with small enough variance, so that the distributions of $u_j+d_k$ (users) in different countries overlap sufficiently. Then, with enough training countries, we can obtain a good estimation on the zero-shot countries.

What alternatives to estimation/implementation/identification are available to investigate sensitivity of your conclusions/inferences to your assumptions? (Provide alternative strategies, e.g., controlling for an additional variables, using IPW instead of outcome regression, calculating E-values, adding a nonlinear term to a regression model. Clearly state what assumption you are investigating with each alternative strategy.)

The assumption being investigated: different countries’ $d_k$ are randomly sampled from the same Gaussian distribution.

Using IPS. Noting that different countries have different numbers of user–item interaction samples, when training on samples from multiple countries it is not ideal to use the data as-is, since countries with larger sample counts will dominate the training procedure and obtain lower $\|d_k\|^2$ values. Thus I apply IPS to a country’s samples based on how many samples we have from that country.

How would your inferences/conclusions change (if at all) if your assumptions were violated?

If the $d_k$ are NOT randomly sampled from the same Gaussian distribution, then there is no guarantee that the pre-trained model can be applied to a new country.

What assumptions (if at all) cannot be investigated in any principled way?

SUTVA (the stable unit treatment value assumption).

HW6 – Please provide some concluding remarks (~1 paragraph) about your project. These remarks might include:

A brief reminder of what you set out to do (“In summary, we did ….”)

In summary, I pretrained a model using datasets from several domains, applied the model to a new domain, and used a causal mechanism to debias.

Your main conclusion(s) / what you learned (“Our major finding was that …”)

I found that the causal mechanism can improve the performance by a small margin.

Any major limitation/issue that remains (“Unfortunately, we were unable to address …”)

Unfortunately, we were unable to improve by a large margin, since compared with normal pretraining, the only extra information we provide is the sample’s domain index.

Possible areas of future work (“Possible avenues for future work include …”)

Consider a more comprehensive causal debiasing mechanism.

How someone might use your work (“This work is expected to contribute to …”)

This work is expected to contribute to pretraining with datasets from different domains.