Does MLE require i.i.d. data, or just independent parameters?

16

Estimating parameters using maximum likelihood estimation (MLE) involves evaluating the likelihood function, which maps the probability of the sample (X) taking the values (x) onto the parameter space (θ), given a distribution family (P(X = x | θ)), over the possible values of θ (note: am I right on this?). All the examples I have seen compute P(X = x | θ) by taking the product of F(X), where F is the distribution with the local value for θ and X is the sample (a vector).

Since we are just multiplying the data, does it follow that the data must be independent? For example, could we not use MLE to fit time-series data? Or is it only the parameters that have to be independent?

maximum-likelihood

— Felix

14

The likelihood function is defined as the probability of an event $E$ (the data set ${\bf x}$ ) as a function of the model parameters $\theta$ :

${\mathcal L}(\theta;{\bf x})\propto {\mathbb P}(\text{Event }E;\theta)= {\mathbb P}(\text{observing } {\bf x};\theta).$

Therefore, there is no assumption of independence of the observations. In the classical approach there is no definition of independence of parameters, since they are not random variables. Some related concepts are identifiability, parameter orthogonality, and independence of the maximum likelihood estimators (which are random variables).

Some examples:

(1) Discrete case. Suppose ${\bf x}=(x_1,...,x_n)$ is a sample of (independent) discrete observations with ${\mathbb P}(\text{observing } x_j ; \theta)>0$ . Then

${\mathcal L}(\theta;{\bf x})\propto \prod_{j=1}^n{\mathbb P}(\text{observing } x_j ; \theta).$

In particular, if $x_j\sim \text{Binomial}(N,\theta)$ with $N$ known, we have

${\mathcal L}(\theta;{\bf x})\propto \prod_{j=1}^n \theta^{x_j}(1-\theta)^{N-x_j}.$
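
To make the binomial case concrete, here is a minimal Python sketch. It assumes made-up counts `xs` and a known `N` (both hypothetical, not from the thread) and compares the closed-form maximiser, total successes over total trials, with a crude grid search over the log of the likelihood above.

```python
import math

def binomial_log_likelihood(theta, xs, N):
    """Log of prod_j theta^x_j (1 - theta)^(N - x_j), i.e. up to an additive constant."""
    return sum(x * math.log(theta) + (N - x) * math.log(1.0 - theta) for x in xs)

def binomial_mle(xs, N):
    """Closed-form maximiser: total successes divided by total trials."""
    return sum(xs) / (len(xs) * N)

# Hypothetical data: five independent counts, each out of N = 10 trials.
xs = [3, 5, 4, 6, 2]
N = 10

theta_hat = binomial_mle(xs, N)

# Crude check: maximise the log-likelihood over a grid on (0, 1).
grid = [k / 1000 for k in range(1, 1000)]
theta_grid = max(grid, key=lambda t: binomial_log_likelihood(t, xs, N))

print(theta_hat, theta_grid)
```

The product form is used here only because the counts are assumed independent; nothing in the definition of the likelihood requires it.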

(2) Continuous approximation. Let ${\bf x}=(x_1,...,x_n)$ be a sample from a continuous random variable $X$ with distribution $F$ and density $f$ , observed with measurement error $\epsilon$ ; that is, you observe the sets $(x_j-\epsilon,x_j+\epsilon)$ . Then

${\mathcal L}(\theta;{\bf x})\propto \prod_{j=1}^n {\mathbb P}[\text{observing } (x_j-\epsilon,x_j+\epsilon);\theta] = \prod_{j=1}^n[F(x_j+\epsilon;\theta)-F(x_j-\epsilon;\theta)].$

When $\epsilon$ is small, this can be approximated (using the Mean Value Theorem) by

${\mathcal L}(\theta;{\bf x})\propto \prod_{j=1}^n f(x_j;\theta).$

For an example with the normal case, take a look at this.
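
As a rough numerical illustration of the approximation in (2), the sketch below assumes a normal model and made-up observations (the data, `eps`, and all function names are hypothetical). It compares the exact interval log-likelihood with the density log-likelihood plus the constant $n\ln(2\epsilon)$ . Because that constant does not involve $\theta$ , both versions have the same maximiser.

```python
import math

def normal_cdf(x, mu, sigma):
    """Normal distribution function F(x; mu, sigma) via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def normal_pdf(x, mu, sigma):
    """Normal density f(x; mu, sigma)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def interval_log_likelihood(mu, sigma, xs, eps):
    """Sum over j of log[F(x_j + eps) - F(x_j - eps)]."""
    return sum(
        math.log(normal_cdf(x + eps, mu, sigma) - normal_cdf(x - eps, mu, sigma))
        for x in xs
    )

def density_log_likelihood(mu, sigma, xs):
    """Sum over j of log f(x_j)."""
    return sum(math.log(normal_pdf(x, mu, sigma)) for x in xs)

# Hypothetical observations and measurement precision.
xs = [0.3, -1.1, 0.8, 1.9, -0.4]
eps = 1e-3
mu, sigma = 0.2, 1.3

exact = interval_log_likelihood(mu, sigma, xs, eps)
approx = density_log_likelihood(mu, sigma, xs) + len(xs) * math.log(2.0 * eps)
gap = abs(exact - approx)

print(exact, approx, gap)
```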

(3) Dependent and Markov models. Suppose that ${\bf x}=(x_1,...,x_n)$ is a set of observations, possibly dependent, and let $f$ be the joint density of ${\bf x}$ . Then

${\mathcal L}(\theta;{\bf x})\propto f({\bf x}; \theta).$

If, in addition, the Markov property is satisfied, then

${\mathcal L}(\theta;{\bf x})\propto f({\bf x}; \theta) = f(x_1;\theta)\prod_{j=1}^{n-1} f(x_{j+1} \vert x_j ;\theta).$

Take a look at this as well.
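
One way to see the Markov factorization in (3) at work is a stationary, zero-mean Gaussian AR(1) series. This is only a sketch: the AR(1) model, the data, and the function names are assumptions chosen for illustration, not something the answer specifies. The first function builds the log-likelihood as $\ln f(x_1)$ plus the sum of the conditional terms; the second multiplies the marginal densities as if the data were i.i.d., which gives a different value unless the autoregressive coefficient is zero.

```python
import math

def normal_logpdf(x, mean, var):
    """Log density of N(mean, var) at x."""
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mean) ** 2 / (2.0 * var)

def ar1_log_likelihood(phi, sigma2, xs):
    """log f(x_1) + sum_j log f(x_{j+1} | x_j) for a stationary zero-mean Gaussian AR(1)."""
    stationary_var = sigma2 / (1.0 - phi ** 2)
    total = normal_logpdf(xs[0], 0.0, stationary_var)
    for prev, curr in zip(xs, xs[1:]):
        # Given x_j, the next value is N(phi * x_j, sigma2).
        total += normal_logpdf(curr, phi * prev, sigma2)
    return total

def iid_log_likelihood(phi, sigma2, xs):
    """Product of marginal densities, which ignores the dependence."""
    stationary_var = sigma2 / (1.0 - phi ** 2)
    return sum(normal_logpdf(x, 0.0, stationary_var) for x in xs)

# Hypothetical short series.
xs = [0.5, 0.9, 0.4, -0.2, -0.7, -0.1]

markov_ll = ar1_log_likelihood(0.6, 1.0, xs)
naive_ll = iid_log_likelihood(0.6, 1.0, xs)

# With phi = 0 the model is i.i.d. and the two expressions coincide.
zero_phi_gap = abs(ar1_log_likelihood(0.0, 1.0, xs) - iid_log_likelihood(0.0, 1.0, xs))

print(markov_ll, naive_ll, zero_phi_gap)
```

Maximising `ar1_log_likelihood` over `phi` and `sigma2` would then give the MLE for this dependent-data model.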

— Community

3

By writing the likelihood function as a product, you implicitly assume a dependence structure among the observations. So for MLE one needs two assumptions: (a) one on the distribution of each individual outcome and (b) one on the dependence among the outcomes.

10

(+1) Very good question.

A minor point: MLE stands for maximum likelihood estimate (not multiple), which means that you simply maximize the likelihood. This does not specify that the likelihood has to be produced by IID sampling.

If the dependence in the sampling can be written into the statistical model, you just write the likelihood accordingly and maximize it as usual.

The one case worth mentioning when you do not assume independence is multivariate Gaussian sampling (in time series analysis, for example). The dependence between two Gaussian variables can be modelled by their covariance term, which you incorporate into the likelihood.

To give a simplistic example, assume that you draw a sample of size $2$ from correlated Gaussian variables with the same mean and variance. You would write the likelihood as

$\frac{1}{2\pi\sigma^2\sqrt{1-\rho^2}}\exp\left(-\frac{z}{2\sigma^2(1-\rho^2)}\right),$

where $z$ is

$z = (x_1-\mu)^2-2\rho(x_1-\mu)(x_2-\mu)+(x_2-\mu)^2.$

This is not the product of the individual likelihoods. Still, you would maximize this with parameters $(\mu, \sigma, \rho)$ to get their MLE.
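
A small Python sketch of this point, using made-up values for $x_1$ , $x_2$ and the parameters (all hypothetical): the joint log-likelihood above equals the sum of the two univariate log-likelihoods when $\rho = 0$ , and differs from it otherwise.

```python
import math

def bivariate_log_likelihood(mu, sigma, rho, x1, x2):
    """Log of the bivariate Gaussian likelihood with common mean and variance."""
    z = (x1 - mu) ** 2 - 2.0 * rho * (x1 - mu) * (x2 - mu) + (x2 - mu) ** 2
    log_norm = math.log(2.0 * math.pi * sigma ** 2 * math.sqrt(1.0 - rho ** 2))
    return -log_norm - z / (2.0 * sigma ** 2 * (1.0 - rho ** 2))

def univariate_log_likelihood(mu, sigma, x):
    """Log of the N(mu, sigma^2) density at x."""
    return -0.5 * math.log(2.0 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2.0 * sigma ** 2)

# Hypothetical sample of size 2 and parameter values.
x1, x2 = 1.2, 0.7
mu, sigma = 1.0, 0.5

product_of_marginals = (
    univariate_log_likelihood(mu, sigma, x1) + univariate_log_likelihood(mu, sigma, x2)
)
joint_uncorrelated = bivariate_log_likelihood(mu, sigma, 0.0, x1, x2)
joint_correlated = bivariate_log_likelihood(mu, sigma, 0.8, x1, x2)

print(product_of_marginals, joint_uncorrelated, joint_correlated)
```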

— gui11aume

2

These are good answers and examples. The only thing I would add, to put this in simple terms, is that likelihood estimation only requires that a model for the generation of the data be specified in functional form in terms of some unknown parameters.

— Michael R. Chernick

(+1) Absolutely true! Do you have an example of model that cannot be specified in those terms?

— gui11aume

@gui11aume I think you are referring to my remark. I would say that I was not giving a direct answer to the question. The answer to the question is yes because there are examples that can be shown where the likelihood function can be expressed when the data are generated by dependent random variables.

— Michael R. Chernick

2

Examples where this cannot be done would be where the data are given without any description of the data-generating mechanism, or where the model is not presented in a parametric form, such as when you are given two iid data sets and are asked to test whether they come from the same distribution, specifying only that the distributions are absolutely continuous.

— Michael R. Chernick

4

Of course, Gaussian ARMA models possess a likelihood, as their covariance function can be derived explicitly. This is basically an extension of gui11aume's answer to more than 2 observations. Minimal googling produces papers like this one where the likelihood is given in the general form.

Another, to an extent, more intriguing, class of examples is given by multilevel random effect models. If you have data of the form

$y_{ij} = x_{ij}'\beta + u_i + \epsilon_{ij},$

where indices $j$ are nested in $i$ (think of students $j$ in classrooms $i$ , say, for a classic application of multilevel models), then, assuming $\epsilon_{ij} \perp u_i$ , the likelihood is

$\ln L \sim \sum_i \ln \int \prod_j f(y_{ij}|\beta,u_i) {\rm d}F(u_i)$

and is a sum over the likelihood contributions defined at the level of clusters, not individual observations. (Of course, in the Gaussian case, you can push the integrals around to produce an analytic ANOVA-like solution. However, if you have, say, a logit model for your response $y_{ij}$ , then there is no way out of numerical integration.)
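
To illustrate the numerical integration for the logit case, here is a rough sketch under several assumptions that are mine rather than the answer's: a single scalar covariate, a random intercept $u_i \sim N(0, \tau^2)$ , and a plain midpoint rule on a wide grid (real software would more likely use Gauss-Hermite quadrature). All names and data are hypothetical.

```python
import math

def logistic(t):
    return 1.0 / (1.0 + math.exp(-t))

def cluster_likelihood(ys, xs, beta, tau, n_grid=2001, width=8.0):
    """Midpoint-rule approximation of the integral over u of
    prod_j f(y_ij | beta, u) dF(u), assuming u ~ N(0, tau^2)."""
    step = 2.0 * width * tau / n_grid
    total = 0.0
    for k in range(n_grid):
        u = -width * tau + (k + 0.5) * step
        density = math.exp(-0.5 * (u / tau) ** 2) / (tau * math.sqrt(2.0 * math.pi))
        prod = 1.0
        for y, x in zip(ys, xs):
            p = logistic(x * beta + u)
            prod *= p if y == 1 else 1.0 - p
        total += prod * density * step
    return total

def log_likelihood(clusters, beta, tau):
    """Sum over clusters i of the log of the integrated cluster likelihood."""
    return sum(math.log(cluster_likelihood(ys, xs, beta, tau)) for ys, xs in clusters)

# Hypothetical data: two clusters, each a pair (binary responses, covariate values).
clusters = [
    ([1, 0, 1], [0.5, -1.0, 2.0]),
    ([0, 0], [1.5, 0.2]),
]

ll = log_likelihood(clusters, beta=0.3, tau=1.0)
print(ll)
```

Each call to `cluster_likelihood` is one integral, so the observations within a cluster enter the likelihood jointly rather than as separate factors.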

— StasK

2

@StasK and @gui11aume, these three answers are nice, but I think they miss a point: what about the consistency of the MLE for dependent data?

— Stéphane Laurent