Out of Distribution Generalization

We start with the question: what does it mean for data to be out of distribution (OOD)? In traditional statistical inference we concern ourselves with the generalization gap, defined as the difference between the expected error of our algorithm on the training set and on the test set. In classical approaches the test set is assumed to be sampled from the same distribution as the training set, that is, the true data distribution \(p(x)\).
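One way to write the generalization gap down, as a sketch under assumed notation (the loss \(\ell\), the training sample \((x_i, y_i)_{i=1}^{n}\), the joint distribution \(p(x, y)\) and the symbol \(\Delta\) are introduced here for illustration only):

\[\begin{equation} \Delta(f) = \mathbb{E}_{(x, y) \sim p}\left[\ell(f(x), y)\right] - \frac{1}{n} \sum_{i=1}^{n} \ell(f(x_i), y_i) \end{equation}\]

The first term is the expected error on unseen data from the same distribution, the second is the average error on the training sample.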

If we consider generalization in the OOD setting, things get a bit more complicated. Imagine that your test set is not sampled from the same data generating distribution as your training set. This introduces a whole new set of challenges that we need to address to describe generalization in the context of out of distribution data.

Alas, we cannot just go about as machine learning researchers without making any assumptions about the nature of the data generating distribution. In order to provide results for OOD generalization, some sort of assumptions are necessary.

When talking about OOD generalization, we are going to talk about environments a lot. In each environment, the data generating distribution manifests itself in a particular form, shifting the expected error (risk) accordingly. In fact, the set of environments can be (and most often is) specified as the distribution family considered for analysis purposes.

The concept of distributional robustness quantifies the maximum OOD risk that is obtained across a large (sometimes infinite) set of environments. In mathematical terms, we define the OOD risk as the following:

\[\begin{equation} R^{OOD}(f) = \max_{e \in \mathcal{E}} R^e(f) \end{equation}\]

where \(f\) is our hypothesis, i.e. the predictor/model. Quite intuitively, the goal is to minimize the worst-case error across different environments (test data distributions). One important thing to notice right away: the assumptions needed to derive any reasonable result for OOD generalization will concern the types of distributions from which the test data (the environment data) is drawn.
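For a finite set of environments, the worst-case objective can be estimated directly from samples. The following is a minimal sketch, assuming squared loss and toy environments that differ only in their noise scale; the function names `empirical_risk` and `ood_risk` and the data generating process are illustrative choices, not part of the original definition.

```python
import numpy as np

def empirical_risk(f, x, y):
    """Mean squared error of predictor f on the samples of one environment."""
    return float(np.mean((f(x) - y) ** 2))

def ood_risk(f, environments):
    """Worst-case empirical risk over a finite collection of environments."""
    return max(empirical_risk(f, x, y) for x, y in environments)

# Toy environments (assumption): y = x + noise, noise scale depends on e.
rng = np.random.default_rng(0)
environments = []
for noise_scale in (0.1, 1.0, 2.0):
    x = rng.normal(size=1000)
    y = x + noise_scale * rng.normal(size=1000)
    environments.append((x, y))

identity = lambda x: x
per_env = [empirical_risk(identity, x, y) for x, y in environments]
worst = ood_risk(identity, environments)
print(per_env, worst)
```

In this toy setup the noisiest environment determines the OOD risk, regardless of how well the predictor does elsewhere.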

Why should we use a max and not a prior over environments?

A caveat with regard to this objective is that it is dominated by the worst-case environments, so it cannot distinguish between two predictors that share the same worst-case risk but perform differently across the remaining environments.

An interesting area of research is to find Pareto-optimal predictors that minimize this objective.
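The caveat about worst-case dominance can be made concrete with a small numerical sketch. The per-environment risks below are hypothetical numbers chosen for illustration: both predictors have the same maximum risk, yet one of them is never worse and sometimes strictly better, which is the kind of difference a Pareto criterion would pick up.

```python
import numpy as np

# Hypothetical per-environment risks of two predictors (one entry per environment).
risks_a = np.array([0.9, 0.2, 0.1])
risks_b = np.array([0.9, 0.8, 0.7])

# The max objective sees no difference between the two predictors.
worst_a, worst_b = risks_a.max(), risks_b.max()

# The average risk does see a difference.
mean_a, mean_b = risks_a.mean(), risks_b.mean()

# Pareto dominance: A is never worse than B and strictly better somewhere.
a_dominates_b = bool(np.all(risks_a <= risks_b) and np.any(risks_a < risks_b))

print(worst_a, worst_b, mean_a, mean_b, a_dominates_b)
```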

A common assumption is that we have access to data from multiple training distributions (environments).

An interesting example is a linear Gaussian structural equation model. If the coefficient depends on the environment, we obtain an infinite family of distributions describing \(p(x, y)\). The only predictor with finite OOD risk in this case is \(f(x) = x\), which is also the causal explanation of \(y\).

The basic line of thought is that the statistical relationship between \(x_2\) and \(y\) varies strongly across environments. One of the main ideas behind finding predictors that generalize across environments is to look for invariant relationships. This is relatively straightforward when the concept of OOD is tied directly to causality, since the structural equations that describe the causal relationships are exactly these types of predictors.
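The idea of a relationship that varies across environments versus one that stays invariant can be simulated. The sketch below assumes a specific two-variable linear Gaussian SEM, which the text does not spell out: \(x_1 \sim \mathcal{N}(0, \sigma_e^2)\), \(y = x_1 + \mathcal{N}(0, \sigma_e^2)\), \(x_2 = y + \mathcal{N}(0, 1)\), with \(\sigma_e\) changing per environment. The helper names are hypothetical.

```python
import numpy as np

def sample_environment(sigma, n, rng):
    """Assumed SEM: x1 causes y, and x2 is an effect of y."""
    x1 = rng.normal(0.0, sigma, size=n)
    y = x1 + rng.normal(0.0, sigma, size=n)
    x2 = y + rng.normal(0.0, 1.0, size=n)
    return np.column_stack([x1, x2]), y

def least_squares(features, y):
    """Ordinary least squares without an intercept."""
    coef, *_ = np.linalg.lstsq(features, y, rcond=None)
    return coef

rng = np.random.default_rng(0)
coefs = {}
for sigma in (0.1, 1.0, 3.0):
    X, y = sample_environment(sigma, 50_000, rng)
    coefs[sigma] = {
        "both": least_squares(X, y),            # regress y on (x1, x2)
        "x1_only": least_squares(X[:, :1], y),  # regress y on the cause only
    }

for sigma, c in coefs.items():
    print(sigma, c["both"], c["x1_only"])
```

Under these assumptions the regression on \(x_1\) alone recovers a coefficient close to 1 in every environment, while the weight placed on \(x_2\) moves from near 0 to near 1 as \(\sigma_e\) grows.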

Why can't we use empirical risk minimization (ERM) for OOD generalization? Suppose we obtain a single dataset whose samples are drawn as \(e \sim p(e)\), \((x, y) \sim p(x, y \mid e)\). ERM would yield a predictor that is optimal in the sense of average performance across environments. However, we can construct \(p(e)\) so that it places low density on environments where performance is very bad. These environments contribute little to the expected risk, but when one of them is sampled the predictions are disastrous.
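A minimal simulation of this failure mode, under assumptions chosen purely for illustration: two environments, a rare one drawn with probability 0.01, and a linear mechanism whose slope flips sign in the rare environment. ERM here is one-dimensional least squares on the pooled data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Assumed p(e): the rare environment makes up about 1% of the pooled data.
rare = rng.random(n) < 0.01
x = rng.normal(size=n)

# Assumed mechanisms: slope +1 in the common environment, -1 in the rare one.
slope = np.where(rare, -1.0, 1.0)
y = slope * x + 0.1 * rng.normal(size=n)

# ERM with squared loss on the pooled data (closed-form 1-D least squares).
w = float(x @ y / (x @ x))

sq_err = (w * x - y) ** 2
pooled_risk = float(sq_err.mean())
common_risk = float(sq_err[~rare].mean())
rare_risk = float(sq_err[rare].mean())
print(w, pooled_risk, common_risk, rare_risk)
```

The pooled risk stays small because the rare environment carries little weight, while the risk measured inside that environment is orders of magnitude larger.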

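When is ERM the best thing that we can do in this case? When the data is meta-i.i.d., i.e. the test environments are drawn from the same \(p(e)\) as the training environments, and we have no further knowledge about the structure of the data.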

Robust Optimization

Robust optimization attempts to minimize the objective

\[\begin{equation} R^{rob}(f) = \max_{e \in \mathcal{E}_{tr}} \left( R^e(f) - r_e \right) \end{equation}\]

where \(\mathcal{E}_{tr}\) is the set of training environments and \(r_e\) is a baseline that we choose specifically for each environment. Intuitively, we want \(r_e\) to be large for a very stochastic environment in which no predictor can perform well. One possible option is to set \(r_e\) to the variance of the targets in that environment, while setting \(r_e = 0\) recovers the plain worst-case risk. This is one way to achieve distributional robustness, i.e. to optimize the worst-case objective mentioned previously.
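The effect of the term \(r_e\) can be sketched numerically. The example assumes squared loss, a finite list of training environments, and toy data where only the noise level changes; `robust_risk` is a hypothetical helper name.

```python
import numpy as np

def robust_risk(f, environments, baselines):
    """Maximum over environments of (empirical squared-error risk - r_e)."""
    return max(
        float(np.mean((f(x) - y) ** 2)) - r_e
        for (x, y), r_e in zip(environments, baselines)
    )

# Toy training environments (assumption): y = x + noise with growing noise.
rng = np.random.default_rng(0)
environments = []
for noise_scale in (0.1, 1.0, 2.0):
    x = rng.normal(size=5000)
    y = x + noise_scale * rng.normal(size=5000)
    environments.append((x, y))

identity = lambda x: x

# r_e = 0: plain worst-case risk, dominated by the noisiest environment.
with_zero = robust_risk(identity, environments, [0.0, 0.0, 0.0])

# r_e = variance of the targets: noisy environments are no longer penalized
# for noise that no predictor could remove.
target_variances = [float(np.var(y)) for _, y in environments]
with_variance = robust_risk(identity, environments, target_variances)
print(with_zero, with_variance)
```

With zero baselines the value is close to the noise variance of the noisiest environment. With variance baselines each term is roughly the negative explained variance, which is about the same in all three environments for this predictor.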

It has been shown that this is equivalent to minimizing a weighted sum over training environment risks.
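Written out, the weighted sum takes the following form. This is a sketch in assumed notation: \(\mathcal{E}_{tr}\) denotes the set of training environments and \(\lambda_e\) the non-negative weights, neither of which is named in the text above.

\[\begin{equation} \sum_{e \in \mathcal{E}_{tr}} \lambda_e R^e(f), \qquad \lambda_e \geq 0 \end{equation}\]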

Methods based on robust optimization do not deal well with large spurious correlations. Because the solution is a linear mix of environment risks, it assigns large weights to the environments in which these correlations appear.

Invariant prediction, in contrast, means finding a feature mapping \(\Phi(x)\) such that the predictive density \(p(y \mid \Phi(x))\) is invariant across environments.

Connections to Causality

Causality has often been defined as having the central property of invariance under intervention. Another way of looking at it is through the recurring assumptions of independence and invariance of mechanisms.

The concept of environments that is often used in the OOD generalization literature is intimately connected to the concept of intervention in causality. By intervening, we set a random variable (a factor that varies across environments) to a certain value and thereby change the data generating distribution. This changes the structural equation model: we obtain the interventional structural equation model, which encodes the interventional distribution.
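The link between interventions and environments can be illustrated with a small simulation. The SEM below (\(x_1 \to y \to x_2\) with unit Gaussian noise) and the intervention \(do(x_2 := 5)\) are assumptions made for this sketch; `sample_sem` and its `do_x2` parameter are hypothetical names.

```python
import numpy as np

def sample_sem(n, rng, do_x2=None):
    """Assumed SEM x1 -> y -> x2; do_x2 replaces the mechanism of x2."""
    x1 = rng.normal(size=n)
    y = x1 + rng.normal(size=n)
    if do_x2 is None:
        x2 = y + rng.normal(size=n)    # observational mechanism for x2
    else:
        x2 = np.full(n, float(do_x2))  # intervention: x2 is set by hand
    return x1, x2, y

rng = np.random.default_rng(0)

# Fit a least-squares predictor on observational data using both variables.
x1, x2, y = sample_sem(50_000, rng)
w, *_ = np.linalg.lstsq(np.column_stack([x1, x2]), y, rcond=None)

# New environment: the interventional distribution under do(x2 := 5).
x1_i, x2_i, y_i = sample_sem(50_000, rng, do_x2=5.0)
risk_fitted = float(np.mean((w[0] * x1_i + w[1] * x2_i - y_i) ** 2))
risk_causal = float(np.mean((x1_i - y_i) ** 2))
print(w, risk_fitted, risk_causal)
```

The mechanism \(y = x_1 + \text{noise}\) is untouched by the intervention, so the predictor \(f(x) = x_1\) keeps the same risk, whereas the predictor fitted on observational data relies on \(x_2\) and degrades sharply in the new environment.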

In certain circumstances, invariance, causality and OOD generalization are equivalent. Martin Arjovsky hints in his thesis at the need to redefine what we want: although a causal model might be out of our reach (it is hard to define a graph over raw data), thinking about invariance might be more useful for the goal of having models that perform well.