Bernstein-von Mises Theorem
These notes aim to fill in the gaps (potentially trivial to others) in the proof of the Bernstein-von Mises Theorem found in van der Vaart’s Asymptotic Statistics (van der Vaart, 1998) and in Szabó and van der Vaart’s Bayesian statistics lecture notes (Szabó & van der Vaart, 2023).
A family \(\{P_\theta\}_{\theta\in \Theta}\) is differentiable in quadratic mean (DQM) if for every \(\theta\) in the interior of \(\Theta \subset \RR^d\), there exists a vector-valued measurable function \(\dot{\ell}_\theta: \mathfrak{X}\mapsto \RR^d\) such that, as \(h\to0\),
\[\begin{equation} \int \Big[ p_{\theta+h}^{1/2} - p_\theta^{1/2} - \frac{1}{2} h^\top \dot{\ell}_\theta p_\theta^{1/2}\Big]^2 \, d\mu = \mathcal{o}(\|h\|^2). \label{ass:dqm} \end{equation}\]The Bernstein-von Mises Theorem is as follows:
Suppose that for some compact neighborhood \(\Theta_0 \subset \Theta\) of \(\theta_0\), there exists a sequence of tests $\phi_n$ such that
\[\begin{equation} P^n_{\theta_0} \phi_n \to 0, \qquad \qquad \sup_{\theta \notin \Theta_0} P_\theta^n (1- \phi_n) \to 0. \label{ass:test-cond} \end{equation}\]Furthermore, assume that $\eqref{ass:dqm}$ holds at every \(\theta\) in the interior of \(\Theta\) with nonsingular Fisher information, and that the map \(\theta\mapsto P_\theta\) is one-to-one. If \(\theta_0\) is an inner point of \(\Theta\) and the prior measure is absolutely continuous with a bounded density that is continuous and positive in a neighborhood of \(\theta_0\), then the corresponding posterior distributions satisfy
\[\Bigg\| \Pi(\sqrt{n}(\theta-\theta_0) \in \cdot \mid X_1,\dots,X_n) - N_d(I_{\theta_0}^{-1} \Delta_{n,\theta_0}, I^{-1}_{\theta_0}) \Bigg\|_{TV} \stackrel{P^n_{\theta_0}}{\rightarrow}0,\]where \(\Delta_{n,\theta_0} = \frac{1}{\sqrt{n}} \sum_{i=1}^n \dot{\ell}_{\theta_0}(X_i)\) converges under \(\theta_0\) in distribution to a \(N_d(0, I_{\theta_0})\)-distribution.
Lemma 10.3 in van der Vaart / lemma 2.35 in Szabó and van der Vaart
We first look at the proof of the following intermediate result.
Under the conditions of the Bernstein-von Mises Theorem stated above, there exists for every \(M_n\to \infty\) a sequence of tests \(\phi_n\) and a constant $c > 0$ such that, for every sufficiently large $n$ and every \(\|\theta -\theta_0\| \geq M_n / \sqrt{n}\),
\[\begin{equation*} P^n_{\theta_0} \phi_n \to 0, \qquad \qquad P^n_{\theta} (1- \phi_n) \leq e^{-cn (\|\theta-\theta_0\|^2 \wedge 1)}. \end{equation*}\]The goal is to define a sequence of tests \(\omega_n\) such that \(P^n_{\theta_0} \omega_n \to 0\) and \(P^n_{\theta} (1-\omega_n) \to 0\) exponentially fast.
The proof proceeds by dividing the parameter space \(\Theta\) into two disjoint ranges and defining a test sequence for each.
The range of \(M_n / \sqrt{n} \leq \Vert \theta-\theta_0\Vert \leq \epsilon\)
They define \(\omega_n = \mathbb{1}\{\Vert (\mathbb{P}_n-P_{\theta_0})\dot{\ell}^L_{\theta_0}\Vert \geq \sqrt{M_n/n}\}\), and claim that \(P^n_{\theta_0} \omega_n \to 0\), i.e., the type-I error vanishes, via the CLT. To see this, note that the CLT implies
\[\sqrt{n} (\mathbb{P}_n-P_{\theta_0})\dot{\ell}^L_{\theta_0} \rightsquigarrow \mathcal{N}(0, \text{Var}(\dot{\ell}^L_{\theta_0})).\]This holds because \(\dot{\ell}^L_{\theta_0} = \mathbb{1}\{\Vert \dot{\ell}_{\theta_0}\Vert \leq L\} \dot{\ell}_{\theta_0}\) is bounded and thus has finite second moment. This means that
\[(\mathbb{P}_n-P_{\theta_0})\dot{\ell}^L_{\theta_0} = \mathcal{O}_{P_{\theta_0}}\Big(\frac{1}{\sqrt{n}}\Big).\]Since \(M_n \to \infty\), we have \(1=\mathcal{o}(\sqrt{M_n})\), and therefore
\[\begin{align*} (\mathbb{P}_n-P_{\theta_0})\dot{\ell}^L_{\theta_0} = \mathcal{O}_{P_{\theta_0}}\Big(\frac{1}{\sqrt{n}}\Big)\mathcal{o}(\sqrt{M_n}) = \mathcal{o}_{P_{\theta_0}}\left(\sqrt{\frac{M_n}{n}}\right) . \end{align*}\]That is, the type-I error of \(\omega_n\) tends to 0:
\[P^n_{\theta_0}\left[\Vert (\mathbb{P}_n-P_{\theta_0})\dot{\ell}^L_{\theta_0}\Vert \geq \sqrt{\frac{M_n}{n}}\right] = P^n_{\theta_0} \omega_n \to 0.\]We then need to show that the type-II error converges to 0 exponentially fast. A detail left out of the original proof is the claim
\[P_\theta \dot{\ell}^L_{\theta_0} - P_{\theta_0} \dot{\ell}^L_{\theta_0} = (P_{\theta_0} \dot{\ell}^L_{\theta_0} \dot{\ell}^\top_{\theta_0} +o(1)) (\theta - \theta_0),\]which holds due to the DQM assumption $\eqref{ass:dqm}$. To see this, recall that DQM means there exists \(\dot{\ell}_{\theta_0}\) such that
\[\int \left[ \sqrt{p_{\theta}} - \sqrt{p_{\theta_0}} - \frac{1}{2}\dot{\ell}_{\theta_0}^\top (\theta-\theta_0) \sqrt{p_{\theta_0}} \right]^2 d\mu = o(\Vert \theta-\theta_0\Vert ^2).\]That implies the following expansion:
\[\begin{aligned} p_{\theta} &= (\sqrt{p_{\theta}})^2 = (\sqrt{p_{\theta_0}} + \frac{1}{2}\dot{\ell}_{\theta_0}^\top (\theta-\theta_0) \sqrt{p_{\theta_0}} + r_{\theta} )^2 \\ &= p_{\theta_0} + p_{\theta_0} \dot{\ell}_{\theta_0}^\top (\theta-\theta_0) + \text{} \\ & \quad \quad r^2_{\theta} + \frac{1}{4} p_{\theta_0} (\dot{\ell}_{\theta_0}^\top (\theta-\theta_0))^2 + 2\sqrt{p_{\theta_0}} r_{\theta} + \dot{\ell}_{\theta_0}^\top (\theta-\theta_0) \sqrt{p_{\theta_0}} r_\theta, \end{aligned}\]where \(\Vert r_\theta\Vert _{L_2}^2 =\int r^2_{\theta}(x)\,d\mu(x)= o(\Vert \theta-\theta_0\Vert ^2)\), i.e., \(\Vert r_\theta\Vert _{L_2} = o(\Vert \theta-\theta_0\Vert )\). If we apply the integral operator \(\int (\cdot)\dot{\ell}^L_{\theta_0}d\mu\) to both sides, we get an expression similar to the desired one, with some extra terms. We analyze the extra terms one by one:
\[\left\Vert \int r^2_\theta \dot{\ell}^L_{\theta_0}d\mu\right\Vert \leq \int \left\Vert r^2_\theta\dot{\ell}^L_{\theta_0}\right\Vert d\mu \leq \int r^2_\theta \left\Vert \dot{\ell}^L_{\theta_0}\right\Vert d\mu \leq \int L r^2_{\theta} d\mu = o(\Vert \theta-\theta_0\Vert ^2) = o(\Vert \theta-\theta_0\Vert );\] \[\begin{aligned} \left\Vert \int p_{\theta_0} (\dot{\ell}_{\theta_0}^\top (\theta-\theta_0))^2 \dot{\ell}_{\theta_0}^L d\mu \right\Vert &\leq \int p_{\theta_0} (\dot{\ell}_{\theta_0}^\top (\theta-\theta_0))^2\left\Vert \dot{\ell}_{\theta_0}^L\right\Vert d\mu \leq \int p_{\theta_0} \Vert \theta-\theta_0\Vert ^2 \left\Vert \dot{\ell}_{\theta_0} \right\Vert ^2 \left\Vert \dot{\ell}_{\theta_0}^L \right\Vert d\mu \\ & = \Vert \theta-\theta_0\Vert ^2 \cdot \mathbb{E}_{P_{\theta_0}} \left[\left\Vert \dot{\ell}_{\theta_0} \right\Vert ^3 \mathbb{1}\left\{\left\Vert \dot{\ell}_{\theta_0} \right\Vert \leq L\right\}\right] \\ &=\mathcal{O}(\Vert \theta-\theta_0\Vert ^2) = o(\Vert \theta-\theta_0\Vert ), \end{aligned}\]because \(\left\Vert \dot{\ell}_{\theta_0} \right\Vert \mathbb{1}\left\{\left\Vert \dot{\ell}_{\theta_0} \right\Vert \leq L\right\}\) is bounded by \(L\), so the expectation is at most \(L^3\);
\[\left\Vert \int \sqrt{p_{\theta_0}}\, r_{\theta}\, \dot{\ell}^L_{\theta_0}\, d\mu\right\Vert \leq L \int\sqrt{p_{\theta_0}} \left| r_{\theta}\right| d\mu \leq L \Big(\int p_{\theta_0}\, d\mu\Big)^{\frac{1}{2}} \Vert r_{\theta}\Vert _{L_2} = L \Vert r_{\theta}\Vert _{L_2} = o(\Vert \theta-\theta_0\Vert ) \quad \text{(Cauchy-Schwarz)};\] \[\begin{aligned} \left\Vert \int \dot{\ell}_{\theta_0}^L \dot{\ell}_{\theta_0}^\top (\theta-\theta_0) \sqrt{p_{\theta_0}} r_{\theta}d\mu \right\Vert &\leq \int \left\Vert \dot{\ell}_{\theta_0}^L\right\Vert \left| \dot{\ell}_{\theta_0}^\top (\theta-\theta_0)\right| \sqrt{p_{\theta_0}} |r_{\theta}| d\mu \leq \int \left\Vert \dot{\ell}_{\theta_0}^L\right\Vert ^2 \left\Vert \theta-\theta_0\right\Vert \sqrt{p_{\theta_0}} |r_{\theta}| d\mu \\ &\leq \Vert \theta-\theta_0 \Vert \cdot \mathbb{E}_{P_{\theta_0}}\left[ \left\Vert \dot{\ell}_{\theta_0}^L\right\Vert ^4 \right]^\frac{1}{2} \Big(\int r_{\theta}^2 d\mu\Big)^{\frac{1}{2}} \\ & = \mathcal{O}(\Vert \theta-\theta_0\Vert ) \cdot o(\Vert \theta-\theta_0\Vert ) \\ &= o(\Vert \theta-\theta_0\Vert ^2) \\ &= o(\Vert \theta-\theta_0\Vert ). \end{aligned}\]Thus, we have
\[\begin{aligned} P_{\theta} \dot{\ell}_{\theta_0}^L &= P_{\theta_0}\dot{\ell}_{\theta_0}^L + P_{\theta_0}\dot{\ell}_{\theta_0}^L \dot{\ell}_{\theta_0}^\top (\theta-\theta_0) + o(\Vert \theta-\theta_0\Vert ) \\ &=P_{\theta_0}\dot{\ell}_{\theta_0}^L + (P_{\theta_0}\dot{\ell}_{\theta_0}^L \dot{\ell}_{\theta_0}^\top + o(1))(\theta-\theta_0). \end{aligned}\]Note that the \(o(1)\) term is some square matrix, say \(A_h\), whose entries are \(o(1)\) and depend on \(h:=\theta-\theta_0\). Then we have
\[\sigma_{\rm{min}}(P_{\theta_0}\dot{\ell}_{\theta_0}^L \dot{\ell}_{\theta_0}^\top+A_h) \geq \sigma_{\rm{min}}(P_{\theta_0}\dot{\ell}_{\theta_0}^L \dot{\ell}_{\theta_0}^\top) - \sigma_{\rm{max}}(A_h).\]For \(L\) large enough, \(P_{\theta_0}\dot{\ell}_{\theta_0}^L \dot{\ell}_{\theta_0}^\top\) is nonsingular (it converges to the nonsingular Fisher information \(I_{\theta_0}\) as \(L\to\infty\)), so \(\sigma_{\rm{min}}(P_{\theta_0}\dot{\ell}_{\theta_0}^L \dot{\ell}_{\theta_0}^\top)>c_0\) for some positive \(c_0\). Also, for every \(\theta\) that is sufficiently close to \(\theta_0\), say, \(\Vert h\Vert \leq \epsilon\), we have \(\sigma_{\rm{max}}(A_h)=\Vert A_h\Vert _2 \leq c_1 < c_0\) for some positive \(c_1\). Picking \(0 < c \leq c_0 - c_1\), we have
\[\left\Vert P_{\theta} \dot{\ell}_{\theta_0}^L - P_{\theta_0}\dot{\ell}_{\theta_0}^L\right\Vert = \left\Vert (P_{\theta_0}\dot{\ell}_{\theta_0}^L \dot{\ell}_{\theta_0}^\top + A_h)(\theta-\theta_0)\right\Vert \geq \sigma_{\rm{min}}(P_{\theta_0}\dot{\ell}_{\theta_0}^L \dot{\ell}_{\theta_0}^\top+A_h) \Vert \theta-\theta_0\Vert \geq c\Vert \theta-\theta_0\Vert .\]Moving on, if \(\omega_n = 0\), we have \(-\Vert (\mathbb{P}_n-P_{\theta_0})\dot{\ell}^L_{\theta_0}\Vert > -\sqrt{M_n/n}\). So
\[\left\Vert (\mathbb{P}_n - P_\theta) \dot{\ell}_{\theta_0}^L \right\Vert \geq \left\Vert (P_{\theta_0} - P_\theta) \dot{\ell}_{\theta_0}^L \right\Vert - \left\Vert (\mathbb{P}_n - P_{\theta_0}) \dot{\ell}_{\theta_0}^L \right\Vert > c\Vert \theta-\theta_0\Vert -\sqrt{M_n/n}.\]For \(c\Vert \theta-\theta_0\Vert \geq 2 \sqrt{M_n/n}\), i.e., \(-c\Vert \theta-\theta_0\Vert / 2\leq -\sqrt{M_n/n}\), we have
\[\left\Vert (\mathbb{P}_n - P_\theta) \dot{\ell}_{\theta_0}^L \right\Vert \geq \frac{c}{2}\Vert \theta-\theta_0\Vert .\]Since \(\Vert \cdot \Vert _2 \leq \sqrt{d} \Vert \cdot \Vert _\infty\), with a union bound, we have
\[\begin{aligned} P^n_{\theta} (1-\omega_n) &\leq P_\theta \left[ \left\Vert (\mathbb{P}_n - P_\theta) \dot{\ell}_{\theta_0}^L \right\Vert \geq \frac{c}{2}\Vert \theta-\theta_0\Vert \right] \\ &\leq P_\theta \left[ \left\Vert (\mathbb{P}_n - P_\theta) \dot{\ell}_{\theta_0}^L \right\Vert _\infty \geq \frac{c}{2\sqrt{d}}\Vert \theta-\theta_0\Vert \right] \\ &\leq \sum_{i=1}^d P_\theta \left[ \left| (\mathbb{P}_n - P_\theta) \dot{\ell}_{\theta_0;i}^L \right| \geq \frac{c}{2\sqrt{d}}\Vert \theta-\theta_0\Vert \right] \\ &\leq 2d e^{-n\frac{c^2}{8dL^2} \Vert \theta-\theta_0\Vert ^2} \\ &= \exp\left\{-n\frac{c^2}{8L^2 d} \Vert \theta-\theta_0\Vert ^2+\log 2d\right\}. \end{aligned}\]We want \(C_1\) such that
\[\exp\left\{-n\frac{c^2}{8L^2 d} \Vert \theta-\theta_0\Vert ^2+\log 2d\right\} \leq \exp\left\{-nC_1 \Vert \theta-\theta_0\Vert ^2\right\} .\]Take log on both sides,
\[-n\frac{c^2}{8L^2 d} \Vert \theta-\theta_0\Vert ^2+\log 2d \leq -nC_1 \Vert \theta-\theta_0\Vert ^2,\]so we must have
\[C_1 \leq \frac{c^2}{8L^2 d} - \frac{\log 2 d }{n \Vert \theta-\theta_0\Vert ^2}.\]For sufficiently large \(n\), since \(c\Vert \theta-\theta_0\Vert \geq 2 \sqrt{M_n/n}\), i.e., \(n\Vert \theta-\theta_0\Vert ^2\geq 4 M_n/c^2 \to \infty\), the second term is negligible. One can pick \(C_1 \leq \frac{c^2}{16L^2 d}\) so that
\[\begin{equation*} P^n_{\theta} (1-\omega_n) \leq e^{-nC_1 \Vert \theta-\theta_0\Vert ^2} \leq e^{-C_1 M_n^2} \end{equation*}\]holds for sufficiently large \(n\), where the last inequality uses \(\Vert \theta-\theta_0\Vert \geq M_n/\sqrt{n}\).
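The coordinatewise exponential bound in the chain above is not named in the text. It matches Hoeffding's inequality, which we assume is the intended tool. As a sketch: each coordinate \(\dot{\ell}^L_{\theta_0;i}\) takes values in \([-L, L]\), so for i.i.d. observations under \(P_\theta\) and any \(t>0\) (the symbol \(t\) is introduced here for illustration),
\[\begin{aligned} P_\theta\left[ \left| (\mathbb{P}_n - P_\theta) \dot{\ell}_{\theta_0;i}^L \right| \geq t \right] &\leq 2\exp\left\{-\frac{2n t^2}{(2L)^2}\right\} = 2\exp\left\{-\frac{n t^2}{2L^2}\right\}, \\ t = \frac{c}{2\sqrt{d}}\Vert \theta-\theta_0\Vert \quad &\Longrightarrow \quad \frac{n t^2}{2L^2} = \frac{n c^2}{8 d L^2}\Vert \theta-\theta_0\Vert ^2 . \end{aligned}\]Summing over the \(d\) coordinates gives the factor \(2d\) in front of the exponential.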
The range of \(\Vert \theta-\theta_0\Vert > \epsilon\)
This follows from lemma 2.33 in Szabó and van der Vaart by taking \(\Theta_1=\{\Vert \theta-\theta_0\Vert > \epsilon\}\).
If there exist tests \(\phi_n\) such that \(\sup_{\theta\in \Theta_0} P^n_\theta \phi_n \to 0\) and \(\sup_{\theta\in \Theta_1} P^n_\theta (1-\phi_n) \to 0\), for given fixed sets \(\Theta_0\) and \(\Theta_1\) and a given statistical model, then there exist tests \(\psi_n\) and \(c>0\) such that \(\sup_{\theta\in \Theta_0} P^n_\theta \psi_n \leq e^{-cn}\) and \(\sup_{\theta\in \Theta_1} P^n_\theta (1-\psi_n) \leq e^{-cn}\).
Proof of Lemma 2.32 in Szabó and van der Vaart
We next look at the proof of another intermediate result.
If \(\Delta_n\) are random vectors with \(\Delta_n \rightsquigarrow N_d(0, J)\), then there exists \(M_n \to \infty\) such that \(\Delta_n \mathbb{1}_{\|\Delta_n \|\leq M_n} - \Delta_n \to 0\) in probability and, for every $C > 0$,
\[\begin{equation} \sup_{\Vert h\Vert <C} \Bigg|\EE e^{h^\top \Delta_n \indicator{\Vert \Delta_n\Vert \leq M_n}-h^\top J h /2}-1\Bigg|\to 0. \label{eq:laplace-tran} \end{equation}\]For \(\Delta \sim {N}(0, J)\), by Markov’s inequality applied to \(\Vert \Delta\Vert ^2\),
\[\mathbb{P}\left[ \Vert \Delta\Vert > M \right] \leq \frac{\mathbb{E}\Vert \Delta\Vert ^2}{M^2} = \frac{\text{tr}(J)}{M^2} \stackrel{M \to \infty}{\longrightarrow} 0.\]Since \(\Delta_n \rightsquigarrow \Delta\), by the Continuous Mapping Theorem, \(\Vert \Delta_n\Vert \rightsquigarrow \Vert \Delta\Vert\), so
\[\mathbb{P}\left[ \Vert \Delta_n\Vert > M \right] \to \mathbb{P}\left[ \Vert \Delta\Vert > M \right].\]It follows that \(\Delta'_n:=\Delta_n \mathbb{1}_{\Vert \Delta_n\Vert \leq M}\) converges in distribution to \(\Delta \mathbb{1}_{\Vert \Delta\Vert \leq M}\). This is because the function \(g(x)=x \mathbb{1}_{\Vert x\Vert \leq M}\) is continuous almost everywhere (the set \(\{\omega: \Vert \Delta(\omega)\Vert = M\}\) has measure \(0\)), so the Continuous Mapping Theorem applies.
An important part of the proof of BvM (details in the next section) is that the truncated version of \(\Delta_n\), \(\Delta_n':=\Delta_n \mathbb{1}\{\Vert \Delta_n\Vert \leq M_n\}\), preserves the distributional limit of \(\Delta_n\).
We can write \(\Delta_n' = \Delta_n + (\Delta_n'-\Delta_n)\). The first part \(\leadsto \Delta\) by assumption. The second part \(\stackrel{\mathbb{P}}{\to}0\) by Lemma 2.32. We then argue by Slutsky’s Theorem.
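As a sketch of why the second part vanishes, assuming only \(\Delta_n \rightsquigarrow \Delta \sim N_d(0, J)\), \(M_n \to \infty\), and the portmanteau lemma for closed sets: the vectors \(\Delta_n'\) and \(\Delta_n\) differ only on the event \(\{\Vert \Delta_n\Vert > M_n\}\), and for every fixed \(M\) we eventually have \(M_n \geq M\). Hence, for every \(\varepsilon > 0\),
\[\limsup_{n\to\infty} \mathbb{P}\left[ \Vert \Delta_n' - \Delta_n\Vert > \varepsilon \right] \leq \limsup_{n\to\infty} \mathbb{P}\left[ \Vert \Delta_n\Vert > M_n \right] \leq \limsup_{n\to\infty} \mathbb{P}\left[ \Vert \Delta_n\Vert \geq M \right] \leq \mathbb{P}\left[ \Vert \Delta\Vert \geq M \right] \leq \frac{\text{tr}(J)}{M^2}.\]Letting \(M \to \infty\) shows that the left-hand side is 0.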
To show $\eqref{eq:laplace-tran}$ we first show it for fixed \(M\). To find a sequence \(M_n\), we rely on the following result.
Suppose \(\lim_{n\to \infty}f(M,n) = 0\) for every \(M>0\). Then there exists \(M_n \to \infty\) such that \(f(M_n, n) \to 0\).
Proof.
For every fixed \(M\), we know that \(\lim_{n\to\infty} f(M, n) = 0.\)
Fix \(M=k\in \mathbb{Z}^+\). There exists \(N_k\) (chosen such that \(N_k > N_{k-1}\)) such that \(\vert f(k,n)\vert \leq \frac{1}{k}\) for all \(n\geq N_k\).
So now we have two sequences \(M_k\)’s and \(N_k\)’s. To define \(M_n\to \infty\):
- For \(N_k \leq n < N_{k+1}\), pick \(M_n:=k\).
- For \(n < N_1\), pick \(M_n:=1\).
To show that this sequence works, fix \(\delta>0\). Pick an integer \(K\) such that \(1/K \leq \delta\). For \(n \in [N_K, N_{K+1})\), we have \(\vert f(M_n,n)\vert=\vert f(K,n)\vert \leq \frac{1}{K}\); for \(n \in [N_{K+q}, N_{K+q+1})\) with \(q\in\{1, 2, \dots\}\), we have \(\vert f(M_n,n)\vert =\vert f(K+q,n)\vert \leq \frac{1}{K+q} < \frac{1}{K}\). So \(\vert f(M_n,n)\vert \leq \frac{1}{K}\leq \delta\) for all \(n \geq N_K\). Since \(\delta\) is arbitrary, we have \(\lim_{n\to\infty} f(M_n, n) = 0.\)
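As a toy illustration of the construction (the function \(f\) below is a hypothetical example, not taken from the source), consider \(f(M,n) = M/n\), which tends to 0 for every fixed \(M\):
\[\begin{aligned} &\vert f(k,n)\vert = \frac{k}{n} \leq \frac{1}{k} \iff n \geq k^2, \quad \text{so we may take } N_k := k^2; \\ &M_n = k \ \text{ for } k^2 \leq n < (k+1)^2, \quad \text{i.e., } M_n = \lfloor \sqrt{n} \rfloor \to \infty; \\ &f(M_n, n) = \frac{\lfloor \sqrt{n} \rfloor}{n} \leq \frac{1}{\sqrt{n}} \to 0. \end{aligned}\]By contrast, the faster choice \(M_n = n\) gives \(f(M_n, n) = 1\) for all \(n\), which shows why \(M_n\) has to grow slowly enough.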
The same diagonal argument can also be used to find \(C_n\to \infty\) such that the property holds, with
\[f(C,n):=\sup_{\| h\| <C} |\EE e^{h^\top \Delta_n \indicator{\| \Delta_n\| \leq M_n}-h^\top J h /2}-1|.\]Proof of Theorem 2.16 (Bernstein-von Mises) in Szabó and van der Vaart
The proof considers a ball in the parameter space, \(\Theta_n = \{\theta: \Vert \theta-\theta_0\Vert < M_n / \sqrt{n}\}\), and argues that (1) the posterior mass outside of the ball is negligible, and that (2) the posterior measures restricted to this ball
\[\begin{equation} \Pi_n(B\mid X^{(n)})=\frac{\int_{\Theta_n \cap (\theta_0+B/\sqrt{n})} \prod (p_\theta / p_{\theta_0}) (X_i)\, d\Pi(\theta)}{\int_{\Theta_n} \prod (p_\theta / p_{\theta_0}) (X_i) \,d\Pi(\theta)} \label{eq:post-mea} \end{equation}\]tends to a Gaussian distribution.
Details on step 1
We want to show that \(\Pi(\sqrt{n}(\theta-\theta_0)\in\Theta_n^c \mid X^{(n)})\to 0\) in probability, so it suffices to show that
\[\mathbb{E}_{\theta_0}[\Pi(\sqrt{n}(\theta-\theta_0)\in\Theta_n^c \mid X^{(n)})] \to 0\]and then argue by Markov’s inequality.
Note that for a sequence of tests \(\phi_n\) whose existence is guaranteed by the lemma above (Lemma 10.3 in van der Vaart):
\[\begin{align*} \mathbb{E}_{\theta_0}[\Pi(\sqrt{n}(\theta-\theta_0)\in\Theta_n^c \mid X^{(n)})] &= \mathbb{E}_{\theta_0}[\Pi(\sqrt{n}(\theta-\theta_0)\in\Theta_n^c \mid X^{(n)}) \indicator{A_n}] + \mathbb{E}_{\theta_0}[\Pi(\sqrt{n}(\theta-\theta_0)\in\Theta_n^c \mid X^{(n)})(1-\indicator{A_n})] \\ &\leq \mathbb{E}_{\theta_0}[\Pi(\sqrt{n}(\theta-\theta_0)\in\Theta_n^c \mid X^{(n)}) \phi_n \indicator{A_n}] \\ & \qquad \text{ }+ \mathbb{E}_{\theta_0}[\Pi(\sqrt{n}(\theta-\theta_0)\in\Theta_n^c \mid X^{(n)}) (1-\phi_n) \indicator{A_n}] + \PP_{\theta_0}[A_n^c] \\ & \leq \mathbb{P}_{\theta_0}[\phi_n] + \mathbb{E}_{\theta_0}[\Pi(\sqrt{n}(\theta-\theta_0)\in\Theta_n^c \mid X^{(n)}) \mathbb{1}_{A_n}(1-\phi_n)] + \mathbb{P}_{\theta_0}[A_n^c]\\ &=: \text{I} + \text{II} + \text{III}, \end{align*}\]where \(A_n\) is the event where \(\int \prod (p_\theta /p_{\theta_0}) (X_i) \,d\Pi(\theta) \geq n^{-d/2} \epsilon_n\). This is to control the denominator of \(\eqref{eq:post-mea}\). We need to show \(\mathbb{P}_{\theta_0}[A_n]\to 1\) and that \(\text{II}\) tends to 0.
For \(\text{II}\to 0\), we need to pick \(\epsilon_n\to 0\) such that
\[\underbrace{\epsilon_n^{-1} \int_{M_n < \Vert h\Vert < \sqrt{n}} e^{-c\Vert h\Vert ^2} \left\Vert \frac{d\Pi}{d\mu}\right\Vert _\infty\,dh}_{(a)} + \underbrace{n^{d/2} \epsilon_n^{-1} \int_{\|\theta-\theta_0\|\geq 1} e^{-cn} \pi(\theta)\,d\theta}_{(b)}\to 0.\]Note that \((b) \leq n^{d/2} e^{-cn} \epsilon_n^{-1}\). By dominated convergence,
\[\lim_{n\to \infty}\int_{M_n < \Vert h\Vert < \sqrt{n}} e^{-c\Vert h\Vert ^2} \left\Vert \frac{d\Pi}{d\mu}\right\Vert_\infty\,dh = \int \lim_{n\to \infty} \indicator{M_n < \Vert h\Vert < \sqrt{n}} \,e^{-c\Vert h\Vert^2} \left\Vert \frac{d\Pi}{d\mu}\right\Vert_\infty\,dh = 0.\]So we need \(\epsilon_n\to 0\) more slowly than the above convergence and also more slowly than \(n^{d/2} e^{-cn} \to 0\) for \((a)+(b)\) to go to zero.
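One concrete choice of \(\epsilon_n\) that meets both requirements, offered as a sketch (the symbol \(a_n\) is introduced here for illustration and is not in the source): write \(a_n\) for the integral in the dominated convergence display, so that \(a_n \to 0\), and set
\[a_n := \int_{M_n < \Vert h\Vert < \sqrt{n}} e^{-c\Vert h\Vert ^2} \left\Vert \frac{d\Pi}{d\mu}\right\Vert _\infty\,dh, \qquad \epsilon_n := \max\left\{\sqrt{a_n},\ n^{-1}\right\}.\]Then \(\epsilon_n \to 0\), and
\[(a) = \frac{a_n}{\epsilon_n} \leq \sqrt{a_n} \to 0, \qquad (b) \leq n^{d/2} e^{-cn} \epsilon_n^{-1} \leq n^{d/2+1} e^{-cn} \to 0.\]Any other sequence that tends to 0 more slowly than both \(a_n\) and \(n^{d/2}e^{-cn}\) would serve equally well.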
Details on step 2
We first establish (the fourth paragraph in the original proof)
\[\begin{equation} \EE_\theta \sup_B \Bigg| \int_{h\in B:\Vert h\Vert < M_n} \prod \frac{p_{\theta+h/\sqrt{n}}}{p_\theta} (X_i) \,dh - \int_{h\in B:\Vert h\Vert < M_n} e^{h^\top \Delta_{n,\theta}-\frac{1}{2}h^\top I_\theta h}\, dh \Bigg| \to 0. \label{eq:conv-in-mean-sup} \end{equation}\]We need the following convergence in mean:
\[\begin{equation} \lim_{n\to \infty} \EE_\theta \Bigg| \prod \frac{p_{\theta+h/\sqrt{n}}}{p_\theta} (X_i) - e^{h^\top \Delta_{n,\theta}-\frac{1}{2}h^\top I_\theta h} \Bigg| = 0. \label{eq:conv-in-mean} \end{equation}\]
Proof of $\eqref{eq:conv-in-mean}$.
Denote \(L_n=\prod \frac{p_{\theta+h/\sqrt{n}}}{p_\theta} (X_i)\). We look at the truncated version \(\Delta_{n, \theta}'=\Delta_{n, \theta} \indicator{\Vert \Delta_{n, \theta}\Vert \leq C_n}\), so that it has finite moments of all orders and the MGF or Laplace transform \(\EE e^{h^\top \Delta_{n, \theta}'-h^\top I_\theta h/2}\) is defined for all \(h\in \mathbb{R}^d\), which is required for later steps. There are three steps:
- Show that \(L_n-e^{h^\top \Delta_{n, \theta}'-h^\top I_\theta h/2}\) is uniformly integrable. To do so, we show UI for each of them and argue by triangle inequality.
- Note: We need to invoke Lemma 2.32, which requires convergence in first absolute moment. It’s tempting to use Portmanteau’s Lemma, but the exponential function is neither bounded nor Lipschitz, so it doesn’t apply here.
- Show that \(L_n-e^{h^\top \Delta_{n, \theta}'-h^\top I_\theta h/2}\leadsto 0\), or \(\stackrel{\PP}{\to}0\).
- Then argue by the fact that UI + \(\leadsto\) implies \(\stackrel{L_1}{\to}\).
Lemma 2.32 shows that for every \(M\),
\[\begin{equation*} \sup_{\Vert h\Vert \leq M} \Big|\EE e^{h^\top \Delta_{n, \theta}'-h^\top I_\theta h/2} - \EE e^{h^\top \Delta_\theta-h^\top I_\theta h/2} \Big|\to 0, \end{equation*}\]where \(\Delta_\theta \sim N(0, I_\theta)\), so that \(\EE e^{h^\top \Delta_\theta-h^\top I_\theta h/2} = 1\). The previous expression means that for any \(h\) such that \(\Vert h\Vert \leq M\)
\[\EE e^{h^\top \Delta_{n, \theta}'-h^\top I_\theta h/2} \to \EE e^{h^\top \Delta_\theta-h^\top I_\theta h/2}.\]We have seen that \(\Delta'_{n, \theta} \leadsto \Delta_{\theta}\) in the previous section. By the Continuous Mapping Theorem, we have that \(e^{h^\top \Delta_{n, \theta}'-h^\top I_\theta h/2} \leadsto e^{h^\top \Delta_\theta-h^\top I_\theta h/2}\). Together with the fact that \(e^{h^\top \Delta_{n, \theta}'-h^\top I_\theta h/2} \geq 0\), by Lemma 2.38, we have that \(e^{h^\top \Delta_{n, \theta}'-h^\top I_\theta h/2}\) is uniformly integrable. Following the logic in the proof, \(L_n\) is uniformly integrable. So \(L_n-e^{h^\top \Delta_{n, \theta}'-\frac{1}{2}h^\top I_\theta h}\) is uniformly integrable. Step 1 is done.
Step 2 follows from the fact that \(\Delta_{n, \theta}' - \Delta_{n, \theta} \stackrel{\PP}{\to} 0\), the Continuous Mapping Theorem, and LAN.
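For reference, a sketch of how these ingredients combine for a fixed \(h\). The LAN expansion, a standard consequence of DQM (Theorem 7.2 in van der Vaart, 1998), reads
\[\log \prod \frac{p_{\theta+h/\sqrt{n}}}{p_\theta} (X_i) = h^\top \Delta_{n,\theta} - \frac{1}{2} h^\top I_\theta h + \mathcal{o}_{P_\theta}(1).\]Writing \(L_n\) for the likelihood ratio and \(\Delta_{n,\theta}'\) for the truncated score, and using that \(e^{h^\top \Delta_{n,\theta} - \frac{1}{2}h^\top I_\theta h} = \mathcal{O}_{P_\theta}(1)\), we get
\[L_n - e^{h^\top \Delta_{n,\theta}' - \frac{1}{2}h^\top I_\theta h} = e^{h^\top \Delta_{n,\theta} - \frac{1}{2}h^\top I_\theta h}\left(e^{\mathcal{o}_{P_\theta}(1)} - 1\right) + \left(e^{h^\top \Delta_{n,\theta} - \frac{1}{2}h^\top I_\theta h} - e^{h^\top \Delta_{n,\theta}' - \frac{1}{2}h^\top I_\theta h}\right) \stackrel{\PP}{\to} 0,\]where the second bracket vanishes in probability because the two exponents coincide on the event \(\{\Delta_{n,\theta}' = \Delta_{n,\theta}\}\), whose probability tends to 1.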
Dominated convergence then implies
\[\int_{\| h\| <M} \EE_\theta \Bigg| \prod \frac{p_{\theta+h/\sqrt{n}}}{p_\theta} (X_i) - e^{h^\top \Delta_{n,\theta}-\frac{1}{2}h^\top I_\theta h} \Bigg| \,dh \to 0.\]By Fubini, we have for every measurable \(B\):
\[\begin{align*} \int_{\Vert h\Vert <M} \EE_\theta \Bigg| \prod \frac{p_{\theta+h/\sqrt{n}}}{p_\theta} (X_i) - e^{h^\top \Delta_{n,\theta}-\frac{1}{2}h^\top I_\theta h} \Bigg| \,dh &= \EE_\theta \int_{\Vert h\Vert <M} \Bigg| \prod \frac{p_{\theta+h/\sqrt{n}}}{p_\theta} (X_i) - e^{h^\top \Delta_{n,\theta}-\frac{1}{2}h^\top I_\theta h} \Bigg| \,dh \\ &\geq \EE_\theta \int_{h\in B:\,\Vert h\Vert <M} \Bigg| \prod \frac{p_{\theta+h/\sqrt{n}}}{p_\theta} (X_i) - e^{h^\top \Delta_{n,\theta}-\frac{1}{2}h^\top I_\theta h} \Bigg| \,dh. \end{align*}\]Since the middle expression does not depend on \(B\), we may take the supremum over \(B\) inside the expectation; then by Jensen’s inequality:
\[\begin{align*} \int_{\Vert h\Vert <M} \EE_\theta \Bigg| \prod \frac{p_{\theta+h/\sqrt{n}}}{p_\theta} (X_i) - e^{h^\top \Delta_{n,\theta}-\frac{1}{2}h^\top I_\theta h} \Bigg| \,dh &\geq \EE_\theta \sup_B \int_{h\in B:\,\Vert h\Vert <M} \Bigg| \prod \frac{p_{\theta+h/\sqrt{n}}}{p_\theta} (X_i) - e^{h^\top \Delta_{n,\theta}-\frac{1}{2}h^\top I_\theta h} \Bigg| \,dh \\ &\geq \EE_\theta \sup_B \Bigg\vert \int_{h\in B:\,\Vert h\Vert <M} \prod \frac{p_{\theta+h/\sqrt{n}}}{p_\theta} (X_i) - e^{h^\top \Delta_{n,\theta}-\frac{1}{2}h^\top I_\theta h} \,dh \Bigg\vert . \end{align*}\]So
\[\EE_\theta \sup_B \Bigg| \int_{h\in B:\,\Vert h\Vert <M} \Big(\prod \frac{p_{\theta+h/\sqrt{n}}}{p_\theta} (X_i) - e^{h^\top \Delta_{n,\theta}-\frac{1}{2}h^\top I_\theta h}\Big)\,dh \Bigg| \to 0.\]This holds for each fixed \(M\). One can choose \(M_n\to \infty\) using the diagonal argument as before, such that $\eqref{eq:conv-in-mean-sup}$ holds.
Now we are ready to show the convergence of $\eqref{eq:post-mea}$ to a Gaussian in total variation distance. To this end, we consider the change of variable \(\sqrt{n}(\theta-\theta_0) \mapsto h\). So $\eqref{eq:post-mea} = Y_n(B)/Y_n(\RR^d)$, where (the factor \(1/\sqrt{n}\) cancels out)
\[Y_n(B) = \int_{h\in B:\, \|h\|\leq M_n} \prod \frac{p_{\theta_0+h/\sqrt{n}}}{p_{\theta_0}}(X_i) \frac{\pi(\theta_0+h/\sqrt{n})}{\pi(\theta_0)} \,dh.\]Define
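\[Z_n(B):= \int_{h\in B:\, \|h\|\leq M_n} e^{h^\top \Delta_{n, \theta_0} - h^\top I_{\theta_0} h/2} \,dh.\]Note that \(c_n := \Phi_{I_{\theta_0}^{-1} \Delta_{n,\theta_0}, I_{\theta_0}^{-1}} (h: \|h\| \ge M_n)\), the mass that \(N_d(I^{-1}_{\theta_0} \Delta_{n, \theta_0}, I^{-1}_{\theta_0})\) puts outside the ball of radius \(M_n\), tends in mean to 0 for \(M_n\to \infty\), hence in probability by Markov’s inequality.
Proof.
Let \(H_n = I^{-1}_{\theta_0} \Delta_{n, \theta_0} + G\) where \(G\sim N_d(0, I^{-1}_{\theta_0})\) is independent of \(\Delta_{n, \theta_0}\).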
Then we have \(c_n = \PP[ \| H_n \| \geq M_n \mid \Delta_{n, \theta_0} ]\), and we want to show
\[\EE[c_n] = \PP[ \| H_n \| \geq M_n ] \to 0.\]Note that \(\{\| H_n \| \geq M_n\} \subset \{\|I^{-1}_{\theta_0} \Delta_{n, \theta_0}\| \geq M_n / 2\} \cup \{\|G\| \geq M_n / 2\}\). Since \(\Delta_{n, \theta_0} \rightsquigarrow N_d(0, I_{\theta_0})\) by the CLT, we have \(\Delta_{n, \theta_0} = \mathcal{O}_{\PP}(1)\), so \(\PP[\|I^{-1}_{\theta_0} \Delta_{n, \theta_0}\| \geq M_n/2] \to 0\). Similarly, \(\PP[\|G\| \geq M_n/2] \to 0\). Thus, we have
\[\EE |c_n| \leq \PP[\|I^{-1}_{\theta_0} \Delta_{n, \theta_0}\| \geq M_n/2] + \PP[\|G\| \geq M_n/2] \to 0.\]Note that the density of \(N_d(I_{\theta_0}^{-1} \Delta_{n, \theta_0} , I_{\theta_0}^{-1})\) is
\[\frac{1}{(2\pi)^{d/2} \sqrt{\det I_{\theta_0}^{-1}}} e^{-\frac{1}{2} h^\top I_{\theta_0} h + h^\top \Delta_{n, \theta_0}-\frac{1}{2} \Delta_{n, \theta_0}^\top I_{\theta_0}^{-1} \Delta_{n, \theta_0} }.\]So the quotient
\[\frac{Z_n(B)(1-c_n)}{Z_n(\RR^d)} = N_d(I^{-1}_{\theta_0} \Delta_{n, \theta_0} , I^{-1}_{\theta_0})(B\ \cap\ \{h: \|h\| < M_n\} )= \Phi_{I^{-1}_{\theta_0} \Delta_{n, \theta_0} , I^{-1}_{\theta_0}}(B\ \cap\ \{h: \|h\| < M_n\}),\]whose distance to \(\Phi_{I^{-1}_{\theta_0} \Delta_{n, \theta_0} , I^{-1}_{\theta_0}}(B)\) tends to 0 uniformly in $B$. Indeed,
\[0 \leq \Phi_{I^{-1}_{\theta_0} \Delta_{n, \theta_0} , I^{-1}_{\theta_0}}(B) - \Phi_{I^{-1}_{\theta_0} \Delta_{n, \theta_0} , I^{-1}_{\theta_0}}(B\ \cap\ \{h: \|h\| < M_n\}) \leq c_n \to 0\]in probability, where the bound doesn’t depend on $B$, hence the convergence is uniform.
In light of this observation, we want to show
\[\begin{equation} \sup_B \Bigg| \frac{Z_n(B)}{Z_n(\RR^d)} - \frac{Y_n(B)(1-c_n)}{Y_n(\RR^d)} \Bigg| \to 0. \label{eq:conv-in-TV} \end{equation}\]The two look almost the same because $\prod \frac{p_{\theta_0+h/\sqrt{n}}}{p_{\theta_0}}(X_i) \approx e^{h^\top \Delta_{n,\theta_0} - h^\top I_{\theta_0} h/2}$ in mean, as in $\eqref{eq:conv-in-mean-sup}$, except for the extra quotient $\frac{\pi(\theta_0+h/\sqrt{n})}{\pi(\theta_0)}$, which is asymptotically equal to 1. More concretely, if $M_n/\sqrt{n} \to 0$, then \(\theta_0 + h/\sqrt{n} \to \theta_0\) uniformly over \(\{\|h\| \leq M_n\}\), so by continuity and positivity of \(\pi\) at \(\theta_0\) we have
\[\sup_{\|h\|\leq M_n}\left|\frac{\pi(\theta_0+h/\sqrt{n})}{\pi(\theta_0)} - 1\right| \leq \frac{1}{\pi(\theta_0)} \sup_{\|\theta-\theta_0\|\leq M_n/\sqrt{n}} \left|\pi(\theta) - \pi(\theta_0)\right| \to 0.\]As a result, we have
\[\begin{equation} \begin{aligned} \sup_B \Bigg|Y_n(B)- \int_{h\in B:\, \|h\|\leq M_n} \prod \frac{p_{\theta_0+h/\sqrt{n}}}{p_{\theta_0}}(X_i) \,dh\Bigg| &= \sup_B \Bigg|\int_{h\in B:\, \|h\|\leq M_n} \prod \frac{p_{\theta_0+h/\sqrt{n}}}{p_{\theta_0}}(X_i) \left(1-\frac{\pi(\theta_0+h/\sqrt{n})}{\pi(\theta_0)} \right)\,dh \Bigg| \\ & \leq \int_{\|h\|\leq M_n} \prod \frac{p_{\theta_0+h/\sqrt{n}}}{p_{\theta_0}}(X_i) \,dh \cdot \sup_{\|h\|\leq M_n} \left|1-\frac{\pi(\theta_0+h/\sqrt{n})}{\pi(\theta_0)}\right| \to 0 \end{aligned} \label{eq:Yn-with-quotient} \end{equation}\]in probability, because the first factor is \(\mathcal{O}_\PP(1)\) by $\eqref{eq:conv-in-mean-sup}$ and the tightness of \(\Delta_{n,\theta_0}\), while the second factor tends to 0. Now
\[Z_n(\mathbb{R}^d) = \frac{(2\pi)^{d/2}}{\sqrt{\det I_{\theta_0}}} e^{\frac{1}{2}\Delta_{n,\theta_0}^T I_{\theta_0}^{-1} \Delta_{n,\theta_0}} (1 - c_n),\qquad c_n = \Phi_{I_{\theta_0}^{-1} \Delta_{n,\theta_0}, I_{\theta_0}^{-1}} (h: \|h\| \ge M_n).\]Observe that
\[\Bigg| \frac{Z_n(B)}{Z_n(\RR^d)} - \frac{Y_n(B)(1-c_n)}{Y_n(\RR^d)} \Bigg| = \Bigg| \underbrace{\frac{Z_n(B)-Y_n(B)}{Y_n(\RR^d)}}_{\Lambda_B} - \underbrace{\frac{Z_n(B)}{Z_n(\RR^d)}}_{\Xi_B}\Big(\underbrace{\frac{Z_n(\RR^d)-Y_n(\RR^d)}{Y_n(\RR^d)}}_\Psi \Big) +\underbrace{\frac{Y_n(B)c_n}{Y_n(\RR^d)}}_{\Omega_B} \Bigg|.\]Using $\eqref{eq:conv-in-mean-sup}$, $\eqref{eq:Yn-with-quotient}$, the triangle inequality, and Markov’s inequality, we have \(\sup_B \lvert Z_n(B)-Y_n(B)\rvert \to 0\) in probability; in particular, \(\lvert Z_n(\RR^d)-Y_n(\RR^d)\rvert \to 0\) in probability. Recall that $c_n$ converges in mean to 0, hence in probability. By the formula for \(Z_n(\RR^d)\) above, \(Z_n(\RR^d) \geq (2\pi)^{d/2} (\det I_{\theta_0})^{-1/2} (1-c_n)\), so \(1/Z_n(\RR^d) = \mathcal{O}_{\PP}(1)\). Since \(Y_n(\RR^d) = Z_n(\RR^d) + \mathcal{o}_\PP(1)\), we also have \(1/Y_n(\RR^d) = \mathcal{O}_{\PP}(1)\). Thus,
\[\sup_B |\Lambda_B| = \mathcal{o}_\PP (1) \mathcal{O}_\PP(1) = \mathcal{o}_\PP (1).\]Also, \(0 \leq Z_n(B) \leq Z_n(\RR^d)\) for every \(B\). So
\[\sup_B |\Xi_B| \leq 1 = \mathcal{O}_\PP(1).\]Moreover,
\[|\Psi| = \mathcal{o}_\PP(1) \mathcal{O}_\PP(1) = \mathcal{o}_\PP(1)\]and, since \(0 \leq Y_n(B) \leq Y_n(\RR^d)\),
\[\sup_B |\Omega_B| \leq c_n = \mathcal{o}_\PP(1).\]Hence,
\[\sup_B \Bigg| \frac{Z_n(B)}{Z_n(\RR^d)} - \frac{Y_n(B)(1-c_n)}{Y_n(\RR^d)} \Bigg| \leq \sup_B |\Lambda_B| + \sup_B |\Xi_B| \cdot |\Psi|+\sup_B |\Omega_B| = \mathcal{o}_\PP(1).\]We are almost done, except that we have not shown $\PP_{\theta_0}[A_n] \to 1$ from step 1. To this end, restrict the integral defining \(A_n\) to \(\Theta_n\) and change variables \(\theta = \theta_0 + h/\sqrt{n}\), which gives \(\int \prod (p_\theta /p_{\theta_0}) (X_i) \,d\Pi(\theta) \geq n^{-d/2} \pi(\theta_0) Y_n(\RR^d)\). Hence \(\{\pi(\theta_0) Y_n(\RR^d) \geq \epsilon_n\} \subset A_n\). Since \(1/Y_n(\RR^d) = \mathcal{O}_{\PP}(1)\) as we argued earlier and \(\epsilon_n \to 0\), we have \(\PP_{\theta_0}[A_n] \geq \PP_{\theta_0}[\pi(\theta_0) Y_n(\RR^d)\geq \epsilon_n] \to 1\).