Bernstein-von Mises Theorem

These notes aim to fill in the gaps (potentially trivial to others) in the proof of the Bernstein-von Mises Theorem found in van der Vaart’s Asymptotic Statistics (van der Vaart 1998) and in Szabó and van der Vaart’s Bayesian statistics lecture notes (Szabó and van der Vaart 2023).

Definition

Definition 1 (Differentiability in quadratic mean (DQM)) A family $\{P_\theta\}_{\theta\in \Theta}$, with densities $p_\theta$ with respect to a dominating measure $\mu$, is differentiable in quadratic mean (DQM) if for every $\theta$ in the interior of $\Theta \subset \RR^d$, there exists a vector-valued measurable function $\dot{\ell}_\theta: \mathfrak{X}\to \RR^d$ such that, as $h\to0$, \[ \int \Big[ p_{\theta+h}^{1/2} - p_\theta^{1/2} - \frac{1}{2} h^\top \dot{\ell}_\theta p_\theta^{1/2}\Big]^2 \, d\mu = o(\|h\|^2). \tag{1}\]
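
As an illustration (not taken from either reference; the Gaussian location model is chosen here only because everything can be computed in closed form), take $d=1$ and $P_\theta = N(\theta, 1)$ with Lebesgue density $p_\theta$, and the candidate score $\dot{\ell}_\theta(x) = x-\theta$. Writing $\phi$ for the standard normal density, one has $\sqrt{p_{\theta+h}\,p_\theta}(x) = e^{-h^2/8}\phi(x-\theta-h/2)$, so the three pieces of the expanded square in (1) are

\[ \begin{aligned} \int \big(\sqrt{p_{\theta+h}} - \sqrt{p_\theta}\big)^2 \, d\mu &= 2 - 2e^{-h^2/8}, \\ \int \big(\sqrt{p_{\theta+h}} - \sqrt{p_\theta}\big) \, \tfrac{1}{2} h (x-\theta) \sqrt{p_\theta} \, d\mu &= \tfrac{h^2}{4} e^{-h^2/8}, \\ \int \tfrac{1}{4} h^2 (x-\theta)^2 p_\theta \, d\mu &= \tfrac{h^2}{4}. \end{aligned} \]

Combining them (first, minus twice the second, plus the third) and expanding the exponential gives

\[ 2 - 2e^{-h^2/8} - \tfrac{h^2}{2} e^{-h^2/8} + \tfrac{h^2}{4} = \tfrac{3}{64} h^4 + O(h^6) = o(h^2), \]

so (1) holds in this model, with Fisher information $I_\theta = 1$.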

The Bernstein-von Mises Theorem is as follows:

Theorem

Theorem 1 (Bernstein-von Mises (Theorem 2.16 in Szabó and van der Vaart)) Suppose that for some compact neighborhood $\Theta_0 \subset \Theta$ of $\theta_0$, there exists a sequence of tests $\phi_n$ such that

\[ P^n_{\theta_0} \phi_n \to 0, \qquad \qquad \sup_{\theta \notin \Theta_0} P_\theta^n (1- \phi_n) \to 0. \tag{2}\]

Furthermore, assume that (1) holds at every $\theta$ in the interior of $\Theta$ with nonsingular Fisher information, and that the map $\theta\mapsto P_\theta$ is one-to-one. If $\theta_0$ is an inner point of $\Theta$ and the prior measure is absolutely continuous with a bounded density that is continuous and positive in a neighborhood of $\theta_0$, then the corresponding posterior distributions satisfy

\[ \Bigg\| \Pi(\sqrt{n}(\theta-\theta_0) \in \cdot \mid X_1,\dots,X_n) - N_d(I_{\theta_0}^{-1} \Delta_{n,\theta_0}, I^{-1}_{\theta_0}) \Bigg\|_{TV} \stackrel{P^n_{\theta_0}}{\rightarrow}0, \]

where $\Delta_{n,\theta_0} = \frac{1}{\sqrt{n}} \sum_{i=1}^n \dot{\ell}_{\theta_0}(X_i)$ converges under $\theta_0$ in distribution to a $N_d(0, I_{\theta_0})$-distribution.

We first look at the proof of the following intermediate result.

Lemma

Lemma 1 (Tests with an exponential rate (lemma 10.3 in van der Vaart / lemma 2.35 in Szabó and van der Vaart)) Under the conditions of Theorem 1, there exists for every $M_n\to \infty$ a sequence of tests $\phi_n$ and a constant $c > 0$ such that, for every sufficiently large $n$ and every $\|\theta -\theta_0\| \geq M_n / \sqrt{n}$,

\[ P^n_{\theta_0} \phi_n \to 0, \qquad \qquad P^n_{\theta} (1- \phi_n) \leq e^{-cn (\|\theta-\theta_0\|^2 \wedge 1)}. \]

The goal is to define a sequence of tests $\omega_n$ such that $P^n_{\theta_0} \omega_n \to 0$ and $P^n_{\theta} (1-\omega_n) \to 0$ exponentially fast.

The proof proceeds by dividing the alternatives $\{\theta: \Vert \theta-\theta_0\Vert \geq M_n/\sqrt{n}\}$ into two disjoint ranges and defining a test sequence for each.

The range of $M_n / \sqrt{n} \leq \Vert \theta-\theta_0\Vert \leq \epsilon$

They define $\omega_n = \mathbb{1}\{\Vert (\mathbb{P}_n-P_{\theta_0})\dot{\ell}^L_{\theta_0}\Vert \geq \sqrt{M_n/n}\}$ and claim that $P^n_{\theta_0} \omega_n \to 0$ (that is, the type-I error vanishes) by the central limit theorem (CLT). To see this, note that the CLT implies

\[ \sqrt{n} (\mathbb{P}_n-P_{\theta_0})\dot{\ell}^L_{\theta_0} \rightsquigarrow \mathcal{N}(0, \text{Var}(\dot{\ell}^L_{\theta_0})). \]

This holds because $\dot{\ell}^L_{\theta_0} = \mathbb{1}\{\Vert \dot{\ell}_{\theta_0}\Vert \leq L\} \dot{\ell}_{\theta_0}$ is bounded and thus has finite second moment. This means that

\[ (\mathbb{P}_n-P_{\theta_0})\dot{\ell}^L_{\theta_0} = \mathcal{O}_{P_{\theta_0}}\Big(\frac{1}{\sqrt{n}}\Big). \]

Since $M_n \to \infty$, we have $1=o(\sqrt{M_n})$, and therefore

\[ (\mathbb{P}_n-P_{\theta_0})\dot{\ell}^L_{\theta_0} = \mathcal{O}_{P_{\theta_0}}\Big(\frac{1}{\sqrt{n}}\Big)\, o(\sqrt{M_n}) = o_{P_{\theta_0}}\left(\sqrt{\frac{M_n}{n}}\right) . \]

That is, the type-I error $P^n_{\theta_0}\omega_n$ converges to 0:

\[ P^n_{\theta_0} \omega_n = P^n_{\theta_0}\left[\Vert (\mathbb{P}_n-P_{\theta_0})\dot{\ell}^L_{\theta_0}\Vert \geq \sqrt{\frac{M_n}{n}}\right] \to 0. \]
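
The last step can be spelled out as follows (a short sketch that uses only the CLT limit above and the portmanteau theorem; the symbols $Z$ and $K$ are introduced here for illustration). Let $Z \sim \mathcal{N}(0, \text{Var}(\dot{\ell}^L_{\theta_0}))$ and fix $K>0$. Since $M_n\to\infty$, we have $\sqrt{M_n}\geq K$ for all large $n$, so

\[ \limsup_{n\to\infty} P^n_{\theta_0}\left[\Vert \sqrt{n}(\mathbb{P}_n-P_{\theta_0})\dot{\ell}^L_{\theta_0}\Vert \geq \sqrt{M_n}\right] \leq \limsup_{n\to\infty} P^n_{\theta_0}\left[\Vert \sqrt{n}(\mathbb{P}_n-P_{\theta_0})\dot{\ell}^L_{\theta_0}\Vert \geq K\right] \leq \mathbb{P}\left[\Vert Z\Vert \geq K\right]. \]

The second inequality is the portmanteau theorem applied to the closed set $\{z: \Vert z\Vert \geq K\}$. Letting $K\to\infty$ makes the right-hand side arbitrarily small, which is exactly the statement that the type-I error vanishes.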

We then need to show that the type-II error converges to 0 exponentially fast. A detail left out of the references concerns the claim

\[ P_\theta \dot{\ell}^L_{\theta_0} - P_{\theta_0} \dot{\ell}^L_{\theta_0} = (P_{\theta_0} \dot{\ell}^L_{\theta_0} \dot{\ell}^\top_{\theta_0} +o(1)) (\theta - \theta_0), \]

which holds under the DQM assumption (Definition 1). To see this, recall that DQM means there exists $\dot{\ell}_{\theta_0}$ such that

\[ \int \left[ \sqrt{p_{\theta}} - \sqrt{p_{\theta_0}} - \frac{1}{2}\dot{\ell}_{\theta_0}^\top (\theta-\theta_0) \sqrt{p_{\theta_0}} \right]^2 d\mu = o(\Vert \theta-\theta_0\Vert ^2). \]

That implies the following expansion:

\[ \begin{aligned} p_{\theta} &= (\sqrt{p_{\theta}})^2 = \Big(\sqrt{p_{\theta_0}} + \frac{1}{2}\dot{\ell}_{\theta_0}^\top (\theta-\theta_0) \sqrt{p_{\theta_0}} + r_{\theta} \Big)^2 \\ &= p_{\theta_0} + p_{\theta_0} \dot{\ell}_{\theta_0}^\top (\theta-\theta_0) \\ & \quad \quad + r^2_{\theta} + \frac{1}{4} p_{\theta_0} (\dot{\ell}_{\theta_0}^\top (\theta-\theta_0))^2 + 2\sqrt{p_{\theta_0}} r_{\theta} + \dot{\ell}_{\theta_0}^\top (\theta-\theta_0) \sqrt{p_{\theta_0}} r_\theta, \end{aligned} \]

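where $r_\theta := \sqrt{p_{\theta}} - \sqrt{p_{\theta_0}} - \frac{1}{2}\dot{\ell}_{\theta_0}^\top (\theta-\theta_0) \sqrt{p_{\theta_0}}$ is the DQM remainder, which satisfies $\Vert r_\theta\Vert _{L_2} =\big(\int r^2_{\theta}\,d\mu\big)^{1/2}= o(\Vert \theta-\theta_0\Vert )$. If we apply the integral operator $\int (\cdot)\dot{\ell}^L_{\theta_0}d\mu$ to both sides, we get an expression similar to the desired one, with some extra terms. We analyze the extra terms one by one: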

\[ \left\Vert \int r^2_\theta \dot{\ell}^L_{\theta_0}d\mu\right\Vert \leq \int \left\Vert r^2_\theta\dot{\ell}^L_{\theta_0}\right\Vert d\mu \leq \int r^2_\theta \left\Vert \dot{\ell}^L_{\theta_0}\right\Vert d\mu \leq \int L r^2_{\theta} d\mu = o(\Vert \theta-\theta_0\Vert ^2) = o(\Vert \theta-\theta_0\Vert ); \]

\[ \begin{aligned} \left\Vert \int p_{\theta_0} (\dot{\ell}_{\theta_0}^\top (\theta-\theta_0))^2 \dot{\ell}_{\theta_0}^L d\mu \right\Vert &\leq \int p_{\theta_0} (\dot{\ell}_{\theta_0}^\top (\theta-\theta_0))^2\left\Vert \dot{\ell}_{\theta_0}^L\right\Vert d\mu \leq \int p_{\theta_0} \Vert \theta-\theta_0\Vert ^2 \left\Vert \dot{\ell}_{\theta_0} \right\Vert ^2 \left\Vert \dot{\ell}_{\theta_0}^L \right\Vert d\mu \\ & = \Vert \theta-\theta_0\Vert ^2 \cdot \mathbb{E}_{P_{\theta_0}} \left[\left\Vert \dot{\ell}_{\theta_0} \right\Vert ^3 \mathbb{1}\left\{\left\Vert \dot{\ell}_{\theta_0} \right\Vert \leq L\right\}\right] \\ &=\mathcal{O}(\Vert \theta-\theta_0\Vert ^2) = o(\Vert \theta-\theta_0\Vert ), \end{aligned} \]

because the random variable $\dot{\ell}^L_{\theta_0}$ is bounded and thus has finite third moment;

\[ \left\Vert\int \sqrt{p_{\theta_0}}\, r_{\theta}\, \dot{\ell}^L_{\theta_0} d\mu\right\Vert \leq L\int\sqrt{p_{\theta_0}} \left| r_{\theta}\right| d\mu \leq L\Big(\int p_{\theta_0} d\mu\Big)^{1/2} \Vert r_{\theta}\Vert _{L_2} = L\Vert r_{\theta}\Vert _{L_2} = o(\Vert \theta-\theta_0\Vert ); \]

\[ \begin{aligned} \left\Vert \int \dot{\ell}_{\theta_0}^L \dot{\ell}_{\theta_0}^\top (\theta-\theta_0) \sqrt{p_{\theta_0}} r_{\theta}d\mu \right\Vert &\leq \int \left\Vert \dot{\ell}_{\theta_0}^L\right\Vert \left| \dot{\ell}_{\theta_0}^\top (\theta-\theta_0)\right| \sqrt{p_{\theta_0}} |r_{\theta}| d\mu \leq \int \left\Vert \dot{\ell}_{\theta_0}^L\right\Vert ^2 \left\Vert \theta-\theta_0\right\Vert \sqrt{p_{\theta_0}} |r_{\theta}| d\mu \\ &\leq \Vert \theta-\theta_0 \Vert \cdot \mathbb{E}_{P_{\theta_0}}\left[ \left\Vert \dot{\ell}_{\theta_0}^L\right\Vert ^4 \right]^\frac{1}{2} \Big(\int r_{\theta}^2 d\mu\Big)^{\frac{1}{2}} \\ & = \mathcal{O}(\Vert \theta-\theta_0\Vert )\cdot o(\Vert \theta-\theta_0\Vert ) \\ &= o(\Vert \theta-\theta_0\Vert ^2) \\ &= o(\Vert \theta-\theta_0\Vert ). \end{aligned} \]

Thus, we have

\[ \begin{aligned} P_{\theta} \dot{\ell}_{\theta_0}^L &= P_{\theta_0}\dot{\ell}_{\theta_0}^L + P_{\theta_0}\dot{\ell}_{\theta_0}^L \dot{\ell}_{\theta_0}^\top (\theta-\theta_0) + o(\Vert \theta-\theta_0\Vert ) \\ &=P_{\theta_0}\dot{\ell}_{\theta_0}^L + (P_{\theta_0}\dot{\ell}_{\theta_0}^L \dot{\ell}_{\theta_0}^\top + o(1))(\theta-\theta_0). \end{aligned} \]

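Note that the $o(1)$ term is a square matrix, say $A_h$, whose entries are $o(1)$ and depend on $h:=\theta-\theta_0$. For instance, writing $R_h$ for the $o(\Vert h\Vert)$ remainder vector in the first line, one can take $A_h = R_h h^\top / \Vert h\Vert ^2$, so that $A_h h = R_h$ and $\Vert A_h\Vert _2 = \Vert R_h\Vert / \Vert h\Vert = o(1)$. Then we have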

\[ \sigma_{\rm{min}}(P_{\theta_0}\dot{\ell}_{\theta_0}^L \dot{\ell}_{\theta_0}^\top+A_h) \geq \sigma_{\rm{min}}(P_{\theta_0}\dot{\ell}_{\theta_0}^L \dot{\ell}_{\theta_0}^\top) - \sigma_{\rm{max}}(A_h). \]
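
This is a standard perturbation bound for singular values. A short derivation for generic square matrices $A$ and $B$ (placeholders for the two matrices above), using $\sigma_{\rm{min}}(A) = \min_{\Vert x\Vert =1}\Vert Ax\Vert$ and $\sigma_{\rm{max}}(B) = \max_{\Vert x\Vert =1}\Vert Bx\Vert$, reads

\[ \sigma_{\rm{min}}(A+B) = \min_{\Vert x\Vert =1}\Vert (A+B)x\Vert \geq \min_{\Vert x\Vert =1}\big(\Vert Ax\Vert - \Vert Bx\Vert \big) \geq \min_{\Vert x\Vert =1}\Vert Ax\Vert - \max_{\Vert x\Vert =1}\Vert Bx\Vert = \sigma_{\rm{min}}(A) - \sigma_{\rm{max}}(B). \]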

For $L$ sufficiently large, $P_{\theta_0}\dot{\ell}_{\theta_0}^L \dot{\ell}_{\theta_0}^\top$ is nonsingular (it converges to the nonsingular Fisher information $I_{\theta_0}$ as $L\to\infty$), so $\sigma_{\rm{min}}(P_{\theta_0}\dot{\ell}_{\theta_0}^L \dot{\ell}_{\theta_0}^\top)>c_0$ for some positive $c_0$. Also, for every $\theta$ that is sufficiently close to $\theta_0$, say, $\Vert h\Vert \leq \epsilon$, we have $\sigma_{\rm{max}}(A_h)=\Vert A_h\Vert _2 \leq c_1 < c_0$ for some positive $c_1$. Picking $0 < c \leq c_0 - c_1$, we have

\[ \left\Vert P_{\theta} \dot{\ell}_{\theta_0}^L - P_{\theta_0}\dot{\ell}_{\theta_0}^L\right\Vert = \left\Vert (P_{\theta_0}\dot{\ell}_{\theta_0}^L \dot{\ell}_{\theta_0}^\top + A_h)(\theta-\theta_0)\right\Vert \geq \sigma_{\rm{min}}(P_{\theta_0}\dot{\ell}_{\theta_0}^L \dot{\ell}_{\theta_0}^\top+A_h)\,\Vert \theta-\theta_0\Vert \geq c\Vert \theta-\theta_0\Vert . \]

Moving on, if $\omega_n = 0$, we have $-\Vert (\mathbb{P}_n-P_{\theta_0})\dot{\ell}^L_{\theta_0}\Vert > -\sqrt{M_n/n}$. So

\[ \left\Vert (\mathbb{P}_n - P_\theta) \dot{\ell}_{\theta_0}^L \right\Vert \geq \left\Vert (P_{\theta_0} - P_\theta) \dot{\ell}_{\theta_0}^L \right\Vert - \left\Vert (\mathbb{P}_n - P_{\theta_0}) \dot{\ell}_{\theta_0}^L \right\Vert > c\Vert \theta-\theta_0\Vert -\sqrt{M_n/n}. \]

For $c\Vert \theta-\theta_0\Vert \geq 2 \sqrt{M_n/n}$, i.e., $-c\Vert \theta-\theta_0\Vert / 2\leq -\sqrt{M_n/n}$ (this holds for every $\Vert \theta-\theta_0\Vert \geq M_n/\sqrt{n}$ as soon as $M_n \geq 4/c^2$), we have

\[ \left\Vert (\mathbb{P}_n - P_\theta) \dot{\ell}_{\theta_0}^L \right\Vert \geq \frac{c}{2}\Vert \theta-\theta_0\Vert . \]

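Since $\Vert \cdot \Vert _2 \leq \sqrt{d} \Vert \cdot \Vert _\infty$, a union bound over the $d$ coordinates together with Hoeffding's inequality (each coordinate of $\dot{\ell}_{\theta_0}^L$ takes values in $[-L, L]$) gives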

\[ \begin{aligned} P^n_{\theta} (1-\omega_n) &\leq P_\theta \left[ \left\Vert (\mathbb{P}_n - P_\theta) \dot{\ell}_{\theta_0}^L \right\Vert \geq \frac{c}{2}\Vert \theta-\theta_0\Vert \right] \\ &\leq P_\theta \left[ \left\Vert (\mathbb{P}_n - P_\theta) \dot{\ell}_{\theta_0}^L \right\Vert _\infty \geq \frac{c}{2\sqrt{d}}\Vert \theta-\theta_0\Vert \right] \\ &\leq \sum_{i=1}^d P_\theta \left[ \left| (\mathbb{P}_n - P_\theta) \dot{\ell}_{\theta_0;i}^L \right| \geq \frac{c}{2\sqrt{d}}\Vert \theta-\theta_0\Vert \right] \\ &\leq 2d e^{-n\frac{c^2}{8dL^2} \Vert \theta-\theta_0\Vert ^2} \\ &= \exp\left\{-n\frac{c^2}{8L^2 d} \Vert \theta-\theta_0\Vert ^2+\log 2d\right\}. \end{aligned} \]
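
The constants in the fourth line are the ones produced by Hoeffding's inequality, which is presumably the bound being applied. For i.i.d. random variables $Y_1,\dots,Y_n$ taking values in $[-L, L]$ and any $t>0$ (the symbols $Y_j$ and $t$ are introduced here for illustration),

\[ \mathbb{P}\left[\left|\frac{1}{n}\sum_{j=1}^n (Y_j - \mathbb{E}Y_j)\right| \geq t\right] \leq 2\exp\left\{-\frac{2nt^2}{(2L)^2}\right\} = 2\exp\left\{-\frac{nt^2}{2L^2}\right\}. \]

Taking $Y_j = \dot{\ell}_{\theta_0;i}^L(X_j)$ and $t = \frac{c}{2\sqrt{d}}\Vert \theta-\theta_0\Vert$ gives $t^2 = \frac{c^2}{4d}\Vert \theta-\theta_0\Vert ^2$, hence the exponent $-n\frac{c^2}{8dL^2}\Vert \theta-\theta_0\Vert ^2$ for each coordinate, and summing over the $d$ coordinates yields the factor $2d$.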

We want $C_1$ such that

\[ \exp\left\{-n\frac{c^2}{8L^2 d} \Vert \theta-\theta_0\Vert ^2+\log 2d\right\} \leq \exp\left\{-nC_1 \Vert \theta-\theta_0\Vert ^2\right\} . \]

Taking logarithms on both sides, this is equivalent to

\[ -n\frac{c^2}{8L^2 d} \Vert \theta-\theta_0\Vert ^2+\log 2d \leq -nC_1 \Vert \theta-\theta_0\Vert ^2, \]

so we must have

\[ C_1 \leq \frac{c^2}{8L^2 d} - \frac{\log 2 d }{n \Vert \theta-\theta_0\Vert ^2}. \]

For sufficiently large $n$, since $c\Vert \theta-\theta_0\Vert \geq 2 \sqrt{M_n/n}$, i.e., $n\Vert \theta-\theta_0\Vert ^2\geq 4 M_n/c^2 \to \infty$, the second term is negligible, uniformly over the $\theta$ in this range. One can therefore pick $C_1 = \frac{c^2}{16L^2 d}$, so that

\[ P^n_{\theta} (1-\omega_n) \leq e^{-nC_1 \Vert \theta-\theta_0\Vert ^2} \leq e^{-4C_1 M_n/c^2} \]

holds for sufficiently large $n$.

The range of $\Vert \theta-\theta_0\Vert > \epsilon$

This follows from Lemma 2.33 in Szabó and van der Vaart (Lemma 2 below) by taking $\Theta_1=\{\theta: \Vert \theta-\theta_0\Vert > \epsilon\}$.

Lemma

Lemma 2 (Lemma 2.33 in Szabó and van der Vaart) If there exist tests $\phi_n$ such that $\sup_{\theta\in \Theta_0} P^n_\theta \phi_n \to 0$ and $\sup_{\theta\in \Theta_1} P^n_\theta (1-\phi_n) \to 0$, for given fixed sets $\Theta_0$ and $\Theta_1$ and a given statistical model, then there exist tests $\psi_n$ and $c>0$ such that $\sup_{\theta\in \Theta_0} P^n_\theta \psi_n \leq e^{-cn}$ and $\sup_{\theta\in \Theta_1} P^n_\theta (1-\psi_n) \leq e^{-cn}$.

Proof of Lemma 2.32 in Szabó and van der Vaart

We next look at the proof of another intermediate result.

Lemma

Lemma 3 (Convergence of Laplace transforms (Lemma 2.32 in Szabó and van der Vaart)) If $\Delta_n$ are random vectors with $\Delta_n \rightsquigarrow N_d(0, J)$, then there exists $M_n \to \infty$ such that $\Delta_n \mathbb{1}_{\|\Delta_n \|\leq M_n} - \Delta_n \to 0$ in probability and, for every $C > 0$,

\[ \sup_{\Vert h\Vert <C} \Bigg|\EE e^{h^\top \Delta_n \indicator{\Vert \Delta_n\Vert \leq M_n}-h^\top J h /2}-1\Bigg|\to 0. \label{eq:laplace-tran} \tag{3}\]
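
To see where the normalization in this display comes from, consider the exactly Gaussian case (a side computation, not part of the lemma's proof). If $\Delta \sim N_d(0, J)$, then $h^\top \Delta \sim N(0, h^\top J h)$, and the Gaussian moment generating function gives

\[ \mathbb{E}\, e^{h^\top \Delta} = e^{h^\top J h/2}, \qquad \text{i.e.,} \qquad \mathbb{E}\, e^{h^\top \Delta - h^\top J h/2} = 1. \]

The lemma therefore says that the Laplace transforms of the truncated vectors approach the Gaussian one, uniformly over bounded sets of $h$. The truncation matters because convergence in distribution alone does not imply convergence of exponential moments.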

For $\Delta \sim N_d(0, J)$, by Markov’s inequality applied to $\Vert \Delta\Vert ^2$,

\[ \mathbb{P}\left[ \Vert \Delta\Vert > M \right] \leq \frac{\mathbb{E}\Vert \Delta\Vert ^2}{M^2} = \frac{\operatorname{tr}(J)}{M^2} \stackrel{M \to \infty}{\longrightarrow} 0. \]

Since $\Delta_n \rightsquigarrow \Delta$, by the Continuous Mapping Theorem, $\Vert \Delta_n\Vert \rightsquigarrow \Vert \Delta\Vert$, so, for every $M>0$ with $\mathbb{P}[\Vert \Delta\Vert = M]=0$ (every $M>0$ when $J$ is nonsingular),

\[ \mathbb{P}\left[ \Vert \Delta_n\Vert > M \right] \to \mathbb{P}\left[ \Vert \Delta\Vert > M \right]. \]

It follows that, for fixed $M$, $\Delta'_n:=\Delta_n \mathbb{1}_{\Vert \Delta_n\Vert \leq M}$ converges in distribution to $\Delta \mathbb{1}_{\Vert \Delta\Vert \leq M}$. This is because the function $g(x)=x \mathbb{1}_{\Vert x\Vert \leq M}$ is continuous except on the sphere $\{x: \Vert x\Vert = M\}$, and the event $\{\omega: \Vert \Delta(\omega)\Vert = M\}$ has probability $0$, so the Continuous Mapping Theorem applies.

Remark

Remark 1. An important part of the proof of BvM (details in the next section) is that the truncated version of $\Delta_n$, $\Delta_n':=\Delta_n \indicator{\Vert \Delta_n\Vert \leq M_n}$, preserves the distributional limit of $\Delta_n$.

Proof

We can write $\Delta_n' = \Delta_n + (\Delta_n'-\Delta_n)$. The first part converges in distribution to $\Delta \sim N_d(0, J)$ by assumption. The second part $\stackrel{\mathbb{P}}{\to}0$ by Lemma 3. We then conclude by Slutsky’s Theorem.

To show Equation 3 we first show it for fixed $M$. To find a sequence $M_n$, we rely on the following result.

Lemma

Lemma 4 (Diagonalization argument) Suppose $\lim_{n\to \infty}f(M,n) = 0$ for every $M>0$. Then there exists a sequence $M_n \to \infty$ such that $f(M_n, n) \to 0$.

Proof

For every fixed $M$, we know that $\lim_{n\to\infty} f(M, n) = 0.$

Fix $M=k\in \mathbb{Z}^+$. There exists $N_k$ (chosen such that $N_k > N_{k-1}$) such that $\vert f(k,n)\vert \leq \frac{1}{k}$ for all $n\geq N_k$.

So now we have a strictly increasing sequence $N_1 < N_2 < \cdots$. We define $M_n\to \infty$ as follows:

- For $N_k \leq n < N_{k+1}$, pick $M_n:=k$.
- For $n < N_1$, pick $M_n:=1$.

To show that this sequence works, fix $\delta>0$ and pick an integer $K$ such that $1/K \leq \delta$. Every $n \geq N_K$ lies in $[N_{K+q}, N_{K+q+1})$ for exactly one $q\in\{0, 1, 2, \dots\}$, so $M_n = K+q$ and, since $n \geq N_{K+q}$, we have $\vert f(M_n,n)\vert =\vert f(K+q,n)\vert \leq \frac{1}{K+q} \leq \frac{1}{K} \leq \delta$. Since $\delta$ is arbitrary, we have $\lim_{n\to\infty} f(M_n, n) = 0.$ Moreover, $M_n \geq k$ for all $n \geq N_k$, so $M_n \to \infty$.
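
A toy instance may help fix ideas (the function below is chosen purely for illustration and does not appear in the references). Take $f(M,n) = M/n$, which tends to $0$ as $n\to\infty$ for every fixed $M$. Since $\vert f(k,n)\vert = k/n \leq 1/k$ exactly when $n \geq k^2$, one admissible choice is $N_k = k^2$, and the construction gives

\[ M_n = k \quad \text{for } k^2 \leq n < (k+1)^2, \qquad \text{i.e.,} \qquad M_n = \lfloor \sqrt{n} \rfloor, \qquad f(M_n, n) = \frac{\lfloor \sqrt{n} \rfloor}{n} \leq \frac{1}{\sqrt{n}} \to 0. \]

By contrast, the faster-growing choice $M_n = n$ gives $f(M_n, n) = 1$ for all $n$, which shows that $M_n$ has to diverge slowly enough relative to the rates at which $f(M,\cdot)$ converges.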

To be continued.

References

Szabó, Botond, and Aad van der Vaart. 2023. Bayesian Statistics.

van der Vaart, Aad. 1998. Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press.