COMP34312
Bias Variance Decomposition
- Noise: $\mathbb{E}\big[(y - \bar{y}(x))^2\big]$, where $\bar{y}(x) = \mathbb{E}[y \mid x]$; the irreducible error in the data
- Bias: $\big(\mathbb{E}_D[\hat{f}(x)] - \bar{y}(x)\big)^2$; how far the average fitted model is from the target
- Variance: $\mathbb{E}_D\big[(\hat{f}(x) - \mathbb{E}_D[\hat{f}(x)])^2\big]$; how much the fitted model varies across training sets $D$
- Expected squared loss $=$ noise $+$ bias$^2$ $+$ variance
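A minimal sketch (illustrative, not from the notes) that estimates the three terms by Monte Carlo; the $\sin$ target, noise level, sample size and straight-line model are all assumptions made for the demo.

```python
# Illustrative Monte-Carlo estimate of the decomposition
# E[(y - f_hat(x))^2] = noise + bias^2 + variance at a single test point.
import numpy as np

rng = np.random.default_rng(0)
f_true = lambda x: np.sin(2 * np.pi * x)     # assumed target E[y|x]
sigma, x0 = 0.3, 0.25                        # assumed noise level, test point

preds = []
for _ in range(2000):                        # resample the training set D
    x = rng.uniform(0, 1, 30)
    y = f_true(x) + sigma * rng.normal(size=30)
    w1, w0 = np.polyfit(x, y, deg=1)         # fit a straight line (the model)
    preds.append(w1 * x0 + w0)

preds = np.array(preds)
noise = sigma ** 2                           # irreducible error
bias2 = (preds.mean() - f_true(x0)) ** 2     # (avg model - target)^2
variance = preds.var()                       # spread of the model across D
print(f"noise={noise:.4f} bias^2={bias2:.4f} variance={variance:.4f}")
```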
- Double Descent: neural nets typically exhibit low bias and high variance. If over-trained, bias $+$ variance increase over time; however, in the over-parameterised regime ($p > n$, more parameters than training samples) both bias and variance start decreasing again
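The shape is easy to reproduce with random-feature regression. The sketch below is illustrative (the data, feature map and sizes are all assumptions); `np.linalg.lstsq` returns the minimum 2-norm solution in the under-determined case, and the test error typically peaks near the interpolation threshold $p \approx n$ before falling again.

```python
# Illustrative double-descent demo with fixed random ReLU features.
# lstsq returns the minimum 2-norm solution when p > n (over-parameterised),
# mimicking the interpolant that GD from zero initialisation converges to.
import numpy as np

rng = np.random.default_rng(0)
n, n_test, d_in = 40, 500, 5                    # assumed sizes
w_true = rng.normal(size=d_in)
X = rng.normal(size=(n, d_in))
y = X @ w_true + 0.1 * rng.normal(size=n)
X_test = rng.normal(size=(n_test, d_in))
y_test = X_test @ w_true

def feats(A, W):
    return np.maximum(A @ W, 0.0)               # random ReLU feature map

for p in [5, 10, 20, 35, 40, 45, 80, 200, 1000]:
    W = rng.normal(size=(d_in, p)) / np.sqrt(d_in)
    beta, *_ = np.linalg.lstsq(feats(X, W), y, rcond=None)
    mse = np.mean((feats(X_test, W) @ beta - y_test) ** 2)
    print(f"p={p:5d}  test MSE={mse:.3f}")      # peak expected near p = n = 40
```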
Bias Variance Decomposition for Ensembles
Ambiguity of the Ensemble
- ambiguity: $a(x) = \frac{1}{M}\sum_{i=1}^{M}\big(f_i(x) - \bar{f}(x)\big)^2$, where $\bar{f}(x) = \frac{1}{M}\sum_{i=1}^{M} f_i(x)$ is the ensemble prediction
- $(y - \bar{f})^2 = \frac{1}{M}\sum_{i=1}^{M}(y - f_i)^2 - a(x)$: the loss of the ensemble is guaranteed to be less than or equal to the average member loss, i.e. the ensemble is at least as good as the average member
- for independent members, the overall ensemble variance is reduced by a factor of $M$ relative to the average member variance
- expected ensemble loss $=$ avg bias $+$ avg variance $-$ expected ambiguity (diversity)
- the improvement over the average member is entirely determined by the diversity
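A quick numerical check of the ambiguity decomposition; the target and member predictions below are made up, but the identity holds for any values.

```python
# Numerical check of the ambiguity decomposition for squared loss:
# (y - fbar)^2 = avg member loss - ambiguity, so ensemble <= average member.
import numpy as np

rng = np.random.default_rng(1)
y = 1.0                                       # made-up target at a test point
f = rng.normal(loc=1.2, scale=0.5, size=9)    # made-up predictions, M = 9
fbar = f.mean()                               # ensemble prediction

avg_loss = np.mean((y - f) ** 2)              # average member loss
ambiguity = np.mean((f - fbar) ** 2)          # member disagreement (diversity)
ens_loss = (y - fbar) ** 2                    # ensemble loss

assert np.isclose(ens_loss, avg_loss - ambiguity)   # the identity
assert ens_loss <= avg_loss                         # never worse than average
print(ens_loss, avg_loss, ambiguity)
```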
Empirical Risk Minimisation
- empirical risk minimiser: $\hat{f} = \arg\min_{f \in \mathcal{F}} \hat{R}_n(f)$
- best model in a given family (lowest population risk in the family): $f^*_{\mathcal{F}} = \arg\min_{f \in \mathcal{F}} R(f)$
- Bayes model (lowest possible population risk over all functions): $f^* = \arg\min_{f} R(f)$
Excess Risk: $R(\hat{f}) - R(f^*)$
Estimation/Approximation Decomposition:
$R(\hat{f}) - R(f^*) = \big(R(\hat{f}) - R(f^*_{\mathcal{F}})\big) + \big(R(f^*_{\mathcal{F}}) - R(f^*)\big)$
- Estimation Error: $R(\hat{f}) - R(f^*_{\mathcal{F}})$ (random, depends on the sample size)
- Approximation Error: $R(f^*_{\mathcal{F}}) - R(f^*)$ (constant, depends on the choice of model family)
Optimisation/Estimation/Approximation Decomposition
$R(\tilde{f}) - R(f^*) = \big(R(\tilde{f}) - R(\hat{f})\big) + \big(R(\hat{f}) - R(f^*_{\mathcal{F}})\big) + \big(R(f^*_{\mathcal{F}}) - R(f^*)\big)$, where $\tilde{f}$ is the model actually returned by the optimiser
- Optimisation Error: $R(\tilde{f}) - R(\hat{f})$ (depends on the optimisation algorithm and how long it is run)
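A sketch making all three error terms concrete; every choice here is an illustrative assumption (family $\mathcal{F}$ = straight lines, Bayes model = $\sin(2\pi x)$, population risks estimated by Monte Carlo, optimiser = 20 GD steps).

```python
# Illustrative decomposition on synthetic data where the Bayes model is known.
# Family F = straight lines; Bayes model = sin(2*pi*x); R(f*) = sigma^2.
import numpy as np

rng = np.random.default_rng(2)
f_bayes = lambda x: np.sin(2 * np.pi * x)
sigma = 0.2

def risk(w):                          # Monte-Carlo estimate of population risk
    x = rng.uniform(0, 1, 200_000)
    y = f_bayes(x) + sigma * rng.normal(size=x.size)
    return np.mean((w[0] * x + w[1] - y) ** 2)

# best model in the family: ERM on a huge sample approximates it
xb = rng.uniform(0, 1, 200_000)
yb = f_bayes(xb) + sigma * rng.normal(size=xb.size)
w_family = np.polyfit(xb, yb, deg=1)

# empirical risk minimiser on a small sample
x = rng.uniform(0, 1, 30)
y = f_bayes(x) + sigma * rng.normal(size=30)
w_erm = np.polyfit(x, y, deg=1)

# optimiser output: only 20 GD steps on the empirical risk, from zero init
A = np.stack([x, np.ones_like(x)], axis=1)
w_gd = np.zeros(2)
for _ in range(20):
    w_gd -= 0.1 * A.T @ (A @ w_gd - y) / len(y)

print("optimisation error:", risk(w_gd) - risk(w_erm))
print("estimation error:  ", risk(w_erm) - risk(w_family))
print("approximation error:", risk(w_family) - sigma ** 2)
```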
Linear Regression
GD
- Gradient Descent update: $w_{t+1} = w_t - \eta \nabla L(w_t)$, with step size $\eta > 0$
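A minimal sketch of this update on the least-squares loss; the data and step size are illustrative choices.

```python
# Minimal GD on the least-squares loss L(w) = ||Xw - y||^2 / (2n).
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 3))                     # made-up design matrix
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

w, eta = np.zeros(3), 0.1                        # zero init, assumed step size
for _ in range(500):
    grad = X.T @ (X @ w - y) / len(y)            # gradient of L at w
    w = w - eta * grad                           # the GD update
print(w)                                         # close to [1.0, -2.0, 0.5]
```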
GD in Over-parameterised Linear Regression
- If $X \in \mathbb{R}^{n \times d}$ is of rank $n$ (full row rank), then $XX^\top$ is invertible
Objective/Loss Function
- $L(w) = \frac{1}{2}\|Xw - y\|_2^2$, minimised over $w \in \mathbb{R}^d$
- Assumptions:
    - $X \in \mathbb{R}^{n \times d}$ is full rank
    - over-parameterised, i.e. $d > n$ (more parameters than samples)
NOTE:
- $L(w) = 0$ iff $Xw = y$, i.e. $w$ interpolates the data
- multiplicity of global minima: there are uncountably many global minima, which can be described as $\{w^{\dagger} + v : v \in \ker(X)\}$, where $w^{\dagger} = X^\top (XX^\top)^{-1} y$
- for any $v \in \ker(X)$ with $v \neq 0$, $\|w^{\dagger} + v\|_2^2 = \|w^{\dagger}\|_2^2 + \|v\|_2^2 > \|w^{\dagger}\|_2^2$ since $w^{\dagger} \perp \ker(X)$, i.e. the 2-norm of all other global minimisers of $L$ is bigger than that of $w^{\dagger}$
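A quick numerical check of both points, using a random direction projected onto $\ker(X)$; the dimensions are illustrative.

```python
# Check: solutions of Xw = y form w_dagger + ker(X), and w_dagger is
# orthogonal to ker(X), so any other interpolant has a strictly larger norm.
import numpy as np

rng = np.random.default_rng(4)
n, d = 10, 50                                    # over-parameterised: d > n
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

w_dagger = X.T @ np.linalg.inv(X @ X.T) @ y      # minimum 2-norm interpolant
v = rng.normal(size=d)
v -= X.T @ np.linalg.inv(X @ X.T) @ (X @ v)      # project v onto ker(X)

w_other = w_dagger + v                           # another global minimiser
assert np.allclose(X @ w_other, y)               # it still interpolates
print(np.linalg.norm(w_dagger), np.linalg.norm(w_other))  # second is larger
```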
Full-rank over-parameterised Linear Regression can be set up (initialise at $w_0 = 0$) such that GD finds the minimum 2-norm interpolant ($w^{\dagger} = X^\top (XX^\top)^{-1} y$)
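A sketch verifying this numerically: with $w_0 = 0$ every GD iterate stays in the row space of $X$, so the limit is $w^{\dagger}$. The step size and iteration count below are illustrative choices.

```python
# GD on L(w) = ||Xw - y||^2 / 2 from w0 = 0 converges to the minimum
# 2-norm interpolant w_dagger = X^T (X X^T)^{-1} y.
import numpy as np

rng = np.random.default_rng(5)
n, d = 10, 50
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

w = np.zeros(d)                                  # crucial: zero initialisation
eta = 1.0 / np.linalg.norm(X, 2) ** 2            # step size below 2/L
for _ in range(20_000):
    w -= eta * X.T @ (X @ w - y)                 # gradient of L at w

w_dagger = X.T @ np.linalg.inv(X @ X.T) @ y
print(np.allclose(X @ w, y), np.allclose(w, w_dagger))   # True True
```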