The answer above using stochastic equicontinuity works very well, but here I am answering my own question by using a uniform law of large numbers to show that the observed information matrix is a strongly consistent estimator of the information matrix, i.e. $N^{-1}J_N(\hat{\theta}_N(Y)) \xrightarrow{\text{a.s.}} I(\theta_0)$, if we plug in a strongly consistent sequence of estimators. I hope it is correct in all details.
We will use $I_N = \{1, 2, \ldots, N\}$ as an index set, and let us temporarily adopt the notation $J(\tilde{Y}, \theta) := J(\theta)$ in order to be explicit about the dependence of $J(\theta)$ on the random vector $\tilde{Y}$. We shall also work elementwise with $(J(\tilde{Y}, \theta))_{rs}$ and $(J_N(\theta))_{rs} = \sum_{i=1}^N (J(Y_i, \theta))_{rs}$, $r, s = 1, \ldots, k$, for this discussion. The function $(J(\cdot, \theta))_{rs}$ is real-valued on the set $\mathbb{R}^n \times \Theta^\circ$, and we will suppose that it is Lebesgue measurable for every $\theta \in \Theta^\circ$. A uniform (strong) law of large numbers gives a set of conditions under which
$$\sup_{\theta \in \Theta^\circ} \left| N^{-1} (J_N(\theta))_{rs} - E_\theta\left[(J(Y_1, \theta))_{rs}\right] \right| = \sup_{\theta \in \Theta^\circ} \left| N^{-1} \sum_{i=1}^N (J(Y_i, \theta))_{rs} - (I(\theta))_{rs} \right| \xrightarrow{\text{a.s.}} 0. \tag{1}$$
The conditions that must be satisfied in order that (1) hold are: (a) $\Theta^\circ$ is a compact set; (b) $(J(\tilde{Y}, \theta))_{rs}$ is a continuous function of $\theta$ on $\Theta^\circ$ with probability 1; (c) $(J(\tilde{Y}, \theta))_{rs}$ is dominated by a function $h(\tilde{Y})$, i.e. $|(J(\tilde{Y}, \theta))_{rs}| < h(\tilde{Y})$ for each $\theta \in \Theta^\circ$; and (d) $E_\theta[h(\tilde{Y})] < \infty$ for each $\theta \in \Theta^\circ$. These conditions come from Jennrich (1969, Theorem 2).
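To make conditions (a)–(d) concrete, here is a small illustration of my own (it is not needed for the argument): take i.i.d. $Y_i \sim \text{Poisson}(\theta)$ with $k = n = 1$, so that
$$(J(\tilde{Y}, \theta))_{11} = -\frac{\partial^2}{\partial \theta^2}\left(\tilde{Y} \log \theta - \theta - \log \tilde{Y}!\right) = \frac{\tilde{Y}}{\theta^2}.$$
On a compact set $\Theta^\circ = [a, b]$ with $0 < a < b$, this is continuous in $\theta$ with probability 1 and is dominated by $h(\tilde{Y}) = \tilde{Y}/a^2 + 1$, which satisfies $E_\theta[h(\tilde{Y})] = \theta/a^2 + 1 < \infty$, so (a)–(d) all hold.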
Now for any $y_i \in \mathbb{R}^n$, $i \in I_N$, and $\theta' \in S \subseteq \Theta^\circ$, the following inequality holds:
$$\left| N^{-1} \sum_{i=1}^N (J(y_i, \theta'))_{rs} - (I(\theta'))_{rs} \right| \leq \sup_{\theta \in S} \left| N^{-1} \sum_{i=1}^N (J(y_i, \theta))_{rs} - (I(\theta))_{rs} \right|. \tag{2}$$
Suppose that $\{\hat{\theta}_N(Y)\}$ is a strongly consistent sequence of estimators for $\theta_0$, and let $\Theta_{N_1} = B_{\delta_{N_1}}(\theta_0) \subseteq K \subseteq \Theta^\circ$ be an open ball in $\mathbb{R}^k$ with radius $\delta_{N_1} \to 0$ as $N_1 \to \infty$, where $K$ is compact. Then, since strong consistency implies that $\hat{\theta}_N(Y) \in \Theta_{N_1}$ for all sufficiently large $N$ almost surely, we have $P\left[\lim_N \{\hat{\theta}_N(Y) \in \Theta_{N_1}\}\right] = 1$. Together with (2) this implies
$$P\left[\lim_{N \to \infty} \left\{ \left| N^{-1} \sum_{i=1}^N (J(Y_i, \hat{\theta}_N(Y)))_{rs} - (I(\hat{\theta}_N(Y)))_{rs} \right| \leq \sup_{\theta \in \Theta_{N_1}} \left| N^{-1} \sum_{i=1}^N (J(Y_i, \theta))_{rs} - (I(\theta))_{rs} \right| \right\}\right] = 1. \tag{3}$$
Now $\Theta_{N_1} \subseteq \Theta^\circ$ (indeed its closure $\overline{\Theta}_{N_1} \subseteq K \subseteq \Theta^\circ$ is compact) implies that conditions (a)–(d) of Jennrich (1969, Theorem 2) apply to $\Theta_{N_1}$, so the supremum on the right-hand side of (3) converges to zero almost surely. Thus (1) and (3) imply
$$P\left[\lim_{N \to \infty} \left\{ \left| N^{-1} \sum_{i=1}^N (J(Y_i, \hat{\theta}_N(Y)))_{rs} - (I(\hat{\theta}_N(Y)))_{rs} \right| = 0 \right\}\right] = 1. \tag{4}$$
Since $(I(\hat{\theta}_N(Y)))_{rs} \xrightarrow{\text{a.s.}} (I(\theta_0))_{rs}$ (by strong consistency of $\hat{\theta}_N(Y)$ and continuity of $(I(\theta))_{rs}$ at $\theta_0$), (4) implies that $N^{-1}(J_N(\hat{\theta}_N(Y)))_{rs} \xrightarrow{\text{a.s.}} (I(\theta_0))_{rs}$. Note that (3) holds however small $\Theta_{N_1}$ is, and so the result in (4) is independent of the choice of $N_1$, other than that $N_1$ must be chosen so that $\Theta_{N_1} \subseteq \Theta^\circ$. This result holds for all $r, s = 1, \ldots, k$, and so in terms of matrices we have $N^{-1} J_N(\hat{\theta}_N(Y)) \xrightarrow{\text{a.s.}} I(\theta_0)$.
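To see the conclusion numerically, here is a minimal simulation sketch continuing the Poisson illustration above; the model, seed, and sample sizes are assumptions of mine, not part of the argument. In that model the MLE is $\hat{\theta}_N = \bar{Y}$, $N^{-1} J_N(\theta) = \bar{Y}/\theta^2$, and $I(\theta_0) = 1/\theta_0$, so the plug-in observed information $N^{-1} J_N(\hat{\theta}_N) = 1/\bar{Y}$ should approach $1/\theta_0$ as $N$ grows.

```python
import numpy as np

# Minimal sketch (illustrative assumptions, not part of the proof):
# for Y_i ~ Poisson(theta_0), the per-observation observed information is
# Y_i / theta^2, so N^{-1} J_N(theta) = mean(Y) / theta^2, the MLE is
# theta_hat_N = mean(Y), and the Fisher information is I(theta_0) = 1/theta_0.

rng = np.random.default_rng(1)
theta0 = 3.0
fisher_info = 1.0 / theta0  # I(theta_0)

for N in [10**2, 10**4, 10**6]:
    y = rng.poisson(lam=theta0, size=N)
    theta_hat = y.mean()                 # strongly consistent MLE of theta_0
    obs_info = y.mean() / theta_hat**2   # N^{-1} J_N(theta_hat) = 1 / mean(Y)
    print(f"N={N:>7}: N^-1 J_N(theta_hat) = {obs_info:.5f} vs I(theta_0) = {fisher_info:.5f}")
```

In this one-dimensional example the convergence follows from the strong law for $\bar{Y}$ together with continuity of $x \mapsto 1/x$; the uniform-law argument above is what makes the same plug-in step valid in general.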