Covariance Matrices in High Dimensions¶

Sang-Yun Oh¶

Covariance Matrix¶

  • The covariance matrix is defined as $$ \Sigma=E(\boldsymbol{X}-\boldsymbol{\mu})(\boldsymbol{X}-\boldsymbol{\mu})^{\prime}=\left(\sigma_{i j}\right), $$
  • Sample covariance matrix is $$ \mathbf{S}=\frac{1}{n} \sum_{i=1}^{n}\left(\boldsymbol{X}_{i}-\overline{\boldsymbol{X}}\right)\left(\boldsymbol{X}_{i}-\overline{\boldsymbol{X}}\right)^{\prime}, $$
  • Other types of covariance estimates
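As a quick illustration, $\mathbf{S}$ can be computed directly in NumPy (a minimal sketch on simulated data; note that `np.cov` uses the $1/(n-1)$ normalization by default, so `bias=True` is needed to match the $1/n$ convention above):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))   # n = 200 observations, p = 5 variables

# Sample covariance with the 1/n normalization used above
Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / X.shape[0]

# np.cov uses 1/(n-1) by default; bias=True matches the 1/n convention
assert np.allclose(S, np.cov(X, rowvar=False, bias=True))
```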

Covariance Matrices in Practice¶

  • Stocks: $p=4300+$ companies and 20 days per month
    Relationship between stocks (volatility structure)
  • Genomics: $p\approx 20000$ genes and 100s of subjects
    Co-expression of genes (gene relevance network)

Gene Co-expression Network

Shows protein-protein interaction network. Source: https://doi.org/10.1038/npjschz.2016.12

  • Neuroimaging: $p=90000$ voxels (or hundreds of aggregated ROIs) and thousands of time points
    Functional connectivity network

Functional Connectivity

Source: https://doi.org/10.1073/pnas.1912034117

  • Ecology: $p=23$ environmental variables and $n=12$ locations
    Community abundance (Warton 2008)

Data are often high dimensional

Other Covariance Matrix Uses¶

  • Markowitz Portfolio Problem (encodes market volatility)
  • Regression (OLS)
    $$ \widehat{\beta}_{\text {OLS}}=\left(\boldsymbol{X}^{\prime} \boldsymbol{X}\right)^{-1} \boldsymbol{X}^{\prime} \boldsymbol{Y} $$
  • Canonical Correlation Analysis (CCA): associations among two sets of variables (examples)
  • Input to clustering algorithm
  • The inverse of $\Sigma$ is often needed, but its sample estimate $\mathbf{S}$ is not invertible in the high-dimensional setting ($p > n$)
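For instance, a minimal Markowitz-style sketch (with hypothetical numbers) computes minimum-variance portfolio weights from $\Sigma^{-1}$, which is exactly where an unstable or singular estimate causes trouble:

```python
import numpy as np

# Hypothetical 3-asset covariance matrix (illustrative numbers only)
Sigma = np.array([[0.04, 0.01, 0.00],
                  [0.01, 0.09, 0.02],
                  [0.00, 0.02, 0.16]])

# Minimum-variance portfolio: weights proportional to Sigma^{-1} 1,
# rescaled so they sum to one
ones = np.ones(3)
w = np.linalg.solve(Sigma, ones)
w = w / w.sum()
```

Here the lowest-variance asset receives the largest weight, as expected.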

Mathematical Properties¶

  • Matrix $\Sigma$ is symmetric
  • All eigenvalues are nonnegative
  • Nonnegative definite
  • If $\Sigma$ is singular, the data lie in a lower-dimensional subspace, so dimensionality can be reduced (given enough data to detect this)
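These properties are easy to check numerically (a small sketch with simulated data):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 4))
S = np.cov(X, rowvar=False)

# Symmetry
assert np.allclose(S, S.T)

# Nonnegative definiteness: all eigenvalues are >= 0 up to round-off
eigvals = np.linalg.eigvalsh(S)
assert eigvals.min() > -1e-10
```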

Difficulty with Estimation of Covariance Matrices¶

  • High dimensionality: the number of parameters, $O(p^2)$, grows faster than the available sample size ($n \ll p$)
  • Sample deficiency: $\mathbf{S}$ is singular when $p>n$; poorly conditioned when $p$ close to $n$
  • Eigenvalue distortion: extreme spread, noisy small eigenvalues (Marchenko–Pastur effects)
  • Computational burden: forming/storing $\mathbf{S}$ is $O(p^2)$ memory, eigendecompositions $O(p^3)$
  • Stability: high sensitivity to outliers, heavy tails, regime shifts, and nonstationarity
  • Spurious correlations: many pairwise correlations appear significant by chance
  • Missing data and asynchronous sampling (finance, genomics) complicate estimation
  • Interpretability: dense estimates hinder structural insight (networks, factor structure)
  • Downstream impact: inversion instability propagates to regression, portfolio optimization, CCA
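The eigenvalue-distortion point can be seen directly: even when the true covariance is the identity, the sample eigenvalues spread out over an interval predicted by the Marchenko–Pastur law (a sketch with simulated data):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 100                   # aspect ratio c = p/n = 0.5
X = rng.standard_normal((n, p))   # true covariance is the identity
S = X.T @ X / n

# Every true eigenvalue equals 1, yet the sample eigenvalues spread out;
# asymptotically they fill [(1-sqrt(c))^2, (1+sqrt(c))^2], roughly [0.09, 2.91] here
lam = np.linalg.eigvalsh(S)
print(lam.min(), lam.max())
```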

Regularized Estimation¶

  • Eigenvalue structure (Ledoit-Wolf, condition number)
  • Sparsity pattern (graphical models)
  • Structural assumptions (banding, tapering, low-rank)

All three stabilize estimates via a bias-variance trade-off.

Eigen-structure changes¶

  • Eigenvector regularization (sparse PCA, SVD, etc)
  • Ledoit-Wolf Estimator (Ledoit and Wolf, 2004) $$ \widehat{\mathbf{\Sigma}}=(1-\alpha) \mathbf{S}+\alpha \frac{\operatorname{tr} \mathbf{S}}{p} I $$
  • Condition number regularization (Won et al., 2009)

    $$ \begin{array}{ll} \operatorname{maximize} & l(\Sigma) \\ \text { subject to } & \operatorname{cond}(\Sigma) \leq \kappa_{\max }, \end{array} $$
    where $l(\Sigma)$ is the Gaussian log-likelihood

  • Ridge regression: $\min_\beta \|Y - X\beta\|^2 +\gamma\|\beta\|_2^2$ $$ \widehat{\beta}_{\text {Ridge}}=\left(X^{\prime} X + \gamma I\right)^{-1} X^{\prime} Y $$
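The same pattern in code: when $p > n$ the normal equations are rank deficient, but the ridge system is invertible for any positive penalty (a sketch with simulated data; `gamma` is an arbitrary illustrative value):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 50, 100                        # p > n, so X'X is rank deficient
X = rng.standard_normal((n, p))
y = X[:, 0] + 0.1 * rng.standard_normal(n)

# OLS normal equations cannot be solved: rank(X'X) <= n < p
assert np.linalg.matrix_rank(X.T @ X) < p

# Ridge adds gamma*I, which makes the system invertible for any gamma > 0
gamma = 1.0                           # illustrative value
beta_ridge = np.linalg.solve(X.T @ X + gamma * np.eye(p), X.T @ y)
```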

Sparsity Assumption (Graphical Model)¶

  • The precision matrix $\mathbf{\Omega}=\mathbf{\Sigma}^{-1}$ appears in many situations
  • $\mathbf{\Omega}$ can be regularized directly
  • Assume $\boldsymbol{X}_{1}, \ldots, \boldsymbol{X}_{n} \sim N_{p}(\boldsymbol{0}, \boldsymbol{\Sigma})$ and $$ L(\boldsymbol{\Sigma})=\frac{1}{(2 \pi)^{n p / 2}|\boldsymbol{\Sigma}|^{n / 2}} \exp \left\{-\frac{1}{2} \sum_{i=1}^{n} \boldsymbol{X}_{i}^{\prime} \boldsymbol{\Sigma}^{-1} \boldsymbol{X}_{i}\right\} . $$
  • Estimate $\boldsymbol{\Omega}$ by maximizing the $\ell_1$-penalized log-likelihood: $$ \ell_{P}(\boldsymbol{\Omega})=\log |\boldsymbol{\Omega}|-\operatorname{tr}(\mathbf{S} \boldsymbol{\Omega})-\lambda\|\boldsymbol{\Omega}\|_{1}, $$

Reference: High‐Dimensional Covariance Estimation by Mohsen Pourahmadi
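A minimal sketch of this objective in practice, using scikit-learn's `GraphicalLasso` on synthetic data whose true precision matrix is sparse (tridiagonal); all numbers are illustrative:

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(4)

# Sparse (tridiagonal) true precision matrix, positive definite by construction
p = 10
Omega_true = np.eye(p) + 0.4 * (np.eye(p, k=1) + np.eye(p, k=-1))
Sigma_true = np.linalg.inv(Omega_true)

X = rng.multivariate_normal(np.zeros(p), Sigma_true, size=500)

# Graphical lasso maximizes log|Omega| - tr(S Omega) - lambda ||Omega||_1
model = GraphicalLasso(alpha=0.1).fit(X)   # alpha plays the role of lambda
Omega_hat = model.precision_

# The l1 penalty zeroes out many off-diagonal entries of Omega_hat
offdiag = Omega_hat[~np.eye(p, dtype=bool)]
sparsity = float(np.mean(np.abs(offdiag) < 1e-8))
print(f"fraction of off-diagonal zeros: {sparsity:.2f}")
```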

Multivariate Gaussian¶

The Setup: Partitioning the Vectors and Matrices¶

Let $x$ be a $D$-dimensional Gaussian random vector with mean $\mu$ and covariance $\Sigma$. We partition $x$ into two disjoint subsets, $x_a$ and $x_b$ (where we want to find the distribution of $x_a$ given $x_b$).

$$ x = \begin{bmatrix} x_a \\ x_b \end{bmatrix}, \quad \mu = \begin{bmatrix} \mu_a \\ \mu_b \end{bmatrix} $$

The joint distribution is defined as $p(x) = \mathcal{N}(x | \mu, \Sigma)$.


1. In Terms of the Covariance Matrix ($\Sigma$)¶

We partition the covariance matrix $\Sigma$ to correspond with $x_a$ and $x_b$:

$$ \Sigma = \begin{bmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{bmatrix} $$

The conditional distribution $p(x_a | x_b)$ is also a Gaussian distribution $\mathcal{N}(x_a | \mu_{a|b}, \Sigma_{a|b})$.

The Conditional Mean¶

The mean of $x_a$ is adjusted by the deviation of $x_b$ from its mean, scaled by the correlation between the two:

$$ \mu_{a|b} = \mu_a + \Sigma_{ab}\Sigma_{bb}^{-1}(x_b - \mu_b) $$

The Conditional Covariance¶

The uncertainty in $x_a$ is reduced by the knowledge of $x_b$. This is mathematically represented by the Schur Complement of $\Sigma_{bb}$ in $\Sigma$:

$$ \Sigma_{a|b} = \Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba} $$

Note: Calculating the conditional distribution using the covariance matrix requires computing the inverse of $\Sigma_{bb}$. If $x_b$ is high-dimensional, this is computationally expensive.
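A small numerical sketch of these two formulas (the partition indices and all numbers are illustrative):

```python
import numpy as np

# Illustrative 3-dimensional example: x_a = (x_1), x_b = (x_2, x_3)
mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[2.0, 0.6, 0.3],
                  [0.6, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
a, b = [0], [1, 2]

x_b = np.array([1.5, -0.5])          # observed value of x_b

Sigma_bb_inv = np.linalg.inv(Sigma[np.ix_(b, b)])
mu_cond = mu[a] + Sigma[np.ix_(a, b)] @ Sigma_bb_inv @ (x_b - mu[b])
Sigma_cond = Sigma[np.ix_(a, a)] - Sigma[np.ix_(a, b)] @ Sigma_bb_inv @ Sigma[np.ix_(b, a)]

# Conditioning reduces uncertainty: Sigma_{a|b} is smaller than Sigma_aa
assert Sigma_cond[0, 0] < Sigma[0, 0]
```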

2. In Terms of the Precision Matrix ($\Omega$)¶

The Precision matrix (also called the information matrix) is the inverse of the covariance matrix: $\Omega = \Sigma^{-1}$. We partition $\Omega$ similarly:

$$ \Omega = \begin{bmatrix} \Omega_{aa} & \Omega_{ab} \\ \Omega_{ba} & \Omega_{bb} \end{bmatrix} $$

Note that because the inverse of a block matrix involves complex interactions between blocks, $\Omega_{aa}$ is not simply $\Sigma_{aa}^{-1}$.

When the conditional distribution $p(x_a | x_b)$ is expressed using the precision matrix, the formulas become simpler and, for the conditional covariance in particular, computationally cheaper.

The Conditional Mean¶

$$ \mu_{a|b} = \mu_a - \Omega_{aa}^{-1}\Omega_{ab}(x_b - \mu_b) $$

The Conditional Covariance¶

Here is the elegant property of the precision matrix: The conditional precision is simply the corresponding block of the joint precision matrix. Therefore, the conditional covariance is just the inverse of that block:

$$ \Sigma_{a|b} = \Omega_{aa}^{-1} $$

Summary Comparison¶

| Parameter | Using Covariance Matrix ($\Sigma$) | Using Precision Matrix ($\Omega$) |
| --- | --- | --- |
| Conditional Mean ($\mu_{a \mid b}$) | $\mu_a + \Sigma_{ab}\Sigma_{bb}^{-1}(x_b - \mu_b)$ | $\mu_a - \Omega_{aa}^{-1}\Omega_{ab}(x_b - \mu_b)$ |
| Conditional Covariance ($\Sigma_{a \mid b}$) | $\Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba}$ | $\Omega_{aa}^{-1}$ |

Key Insight:

  • Marginalization is easy with the Covariance matrix (you just take the sub-block $\Sigma_{aa}$).
  • Conditioning is easy with the Precision matrix (you just take the sub-block $\Omega_{aa}$ and invert it).
  • Zeros in the Covariance matrix imply marginal independence.
  • Zeros in the Precision matrix imply conditional independence (essential for Gaussian Markov Random Fields).
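The equivalence of the two conditional-covariance formulas can be verified numerically (illustrative numbers):

```python
import numpy as np

Sigma = np.array([[2.0, 0.6, 0.3],
                  [0.6, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
Omega = np.linalg.inv(Sigma)

a, b = [0], [1, 2]

# Covariance route: Schur complement of Sigma_bb in Sigma
schur = (Sigma[np.ix_(a, a)]
         - Sigma[np.ix_(a, b)] @ np.linalg.inv(Sigma[np.ix_(b, b)]) @ Sigma[np.ix_(b, a)])

# Precision route: simply invert the (a, a) block of Omega
via_precision = np.linalg.inv(Omega[np.ix_(a, a)])

assert np.allclose(schur, via_precision)
```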

Numerical Experiments¶

In [1]:
import pandas as pd
import numpy as np

store = pd.HDFStore('data/yfinance-data-store.h5')
logret = store['logret'].dropna().head(100)

Sample Covariance Matrix¶

$$ \mathbf{S}=\frac{1}{n} \sum_{i=1}^{n}\left(\boldsymbol{X}_{i}-\overline{\boldsymbol{X}}\right)\left(\boldsymbol{X}_{i}-\overline{\boldsymbol{X}}\right)^{\prime}, $$

  • MLE of covariance matrix for Gaussian data
  • Singular in high dimensional setting: Zero eigenvalues
  • When $p>n$, other covariance estimates are needed
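A quick check of the singularity claim on simulated data: when $p > n$, at least $p - n + 1$ eigenvalues of $\mathbf{S}$ are exactly zero:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 30, 50                         # p > n
X = rng.standard_normal((n, p))
S = np.cov(X, rowvar=False, bias=True)

# Centering costs one degree of freedom: rank(S) <= n - 1 < p,
# so at least p - n + 1 eigenvalues are (numerically) zero
eigvals = np.linalg.eigvalsh(S)
num_zero = int(np.sum(np.abs(eigvals) < 1e-10))
assert num_zero >= p - n + 1
```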
In [2]:
import sklearn.covariance as skcov
import sklearn.model_selection as sksel
import numpy.linalg as npl

cov_sample = skcov.empirical_covariance(logret)

Ledoit-Wolf Estimator¶

$$ \Sigma_{\text{LW}}=(1-\alpha) \mathbf{S}+\alpha \frac{\operatorname{tr} \mathbf{S}}{p} I $$

  • Ledoit and Wolf, 2004 (documentation page)
  • Weighted average between sample covariance and identity (times average eigenvalue)
In [3]:
cov_lw, shrinkage_lw = skcov.ledoit_wolf(logret)
In [4]:
shrinkage_lw
Out[4]:
0.19241305589502752
  • Large eigenvalues decrease
  • Zero eigenvalues increase (the estimate is no longer singular)

Graphical Model¶

In [5]:
# !pip install bokeh==2.4.1
# !pip install networkx
# !pip install robust-selection
In [6]:
edge_model = skcov.GraphicalLassoCV(n_refinements=2)
edge_model.fit(logret.head(30))
omega = edge_model.precision_.copy()
cov_ggm = npl.inv(edge_model.precision_)
/opt/conda/lib/python3.11/site-packages/numpy/core/_methods.py:173: RuntimeWarning: invalid value encountered in subtract
  x = asanyarray(arr - arrmean)
In [7]:
from bokeh.io import show, output_file, output_notebook
from bokeh.models import (BoxSelectTool, Circle, EdgesAndLinkedNodes, 
                          HoverTool, MultiLine, NodesAndLinkedEdges, 
                          Plot, Range1d, TapTool, LabelSet, 
                          ColumnDataSource)
from bokeh.plotting import from_networkx
from bokeh.palettes import Spectral4
output_notebook()

import networkx as nx
G = nx.from_pandas_adjacency(pd.DataFrame(omega, columns=logret.columns.to_list(), index=logret.columns.to_list()))
Loading BokehJS ...
In [8]:
plot = Plot(plot_width=600, plot_height=600,
            x_range=Range1d(-1.1,1.1), y_range=Range1d(-1.1,1.1))
plot.title.text = "Dow Jones Component Stock Graphical Models"

graph_renderer = from_networkx(G, nx.circular_layout, scale=1, center=(0,0))

graph_renderer.node_renderer.glyph = Circle(size=15, fill_color=Spectral4[0])
graph_renderer.node_renderer.selection_glyph = Circle(size=15, fill_color=Spectral4[2])

graph_renderer.edge_renderer.glyph = MultiLine(line_color="#CCCCCC", line_alpha=0.8, line_width=5)
graph_renderer.edge_renderer.selection_glyph = MultiLine(line_color=Spectral4[2], line_width=5)

graph_renderer.selection_policy = NodesAndLinkedEdges()
graph_renderer.inspection_policy = EdgesAndLinkedNodes()

plot.add_tools(HoverTool(tooltips=None), TapTool(), BoxSelectTool())
plot.renderers.append(graph_renderer)

# add labels
pos = nx.circular_layout(G)
x, y=zip(*pos.values())

source = ColumnDataSource({'x':x,'y':y,'ticker':logret.columns.to_list()})
labels = LabelSet(x='x', y='y', text='ticker', source=source)

plot.renderers.append(labels)

show(plot)

Compare Eigenvalues¶

  • Different estimates perform different regularization
  • There are similarities as well as differences
In [9]:
eigval_sample = npl.eigvalsh(cov_sample)
eigval_lw = npl.eigvalsh(cov_lw)
eigval_ggm = npl.eigvalsh(cov_ggm)

allevals = pd.DataFrame({
    "sample": np.real(eigval_sample),
    "lw": eigval_lw,
    "ggm": eigval_ggm
})
ax = allevals.plot(logy=True)
ax.set_xlabel('index')
ax.set_ylabel('eigenvalue');

Regularization makes small eigenvalues larger (away from zero) and large eigenvalues smaller.