What is empirical entropy?

In the definition of jointly typical sets (in "Elements of Information Theory", ch. 7.6, p. 195), we use

$$-\frac{1}{n} \log{p(x^n)}$$

as the empirical entropy of an $n$-sequence with $p(x^n) = \prod_{i=1}^{n}{p(x_i)}$. I have never come across this terminology before. According to the book's index, it is not defined explicitly anywhere.

My question basically is: why is the empirical entropy not $-\sum_{x}{\hat p (x) \log(\hat p(x))}$, where $\hat p(x)$ is the empirical distribution?

What are the most interesting differences and similarities between these two formulas (in terms of the properties they share or do not share)?

information-theory entropy

— blubb

Aren't the two expressions algebraically equivalent?

— whuber

@whuber: No, I believe they are different quantities with different purposes. Note that the first uses the true measure $p$, assumed known a priori. The second does not.

— cardinal

The former concerns the accumulation of entropy over time and how it compares to the true entropy of the system. The SLLN and CLT tell one a lot about how it behaves. The second concerns estimating the entropy from data, and some of its properties can also be obtained via the same two tools. But while the first is unbiased, the second is not, under any $p$. I can fill in some details if that would be helpful.

— cardinal

@cardinal: If you would post the above comments as an answer (perhaps also explaining what the SLLN and CLT are, since I don't know them), I would happily upvote it.

OK, I'll try to post some more later. In the meantime, SLLN = "strong law of large numbers" and CLT = "central limit theorem". These are fairly standard abbreviations that you will likely encounter again. Cheers. :)

— cardinal

Answers:

If the data is $x^n = x_1 \ldots x_n$, that is, an $n$-sequence from a sample space $\mathcal{X}$, the empirical point probabilities are

$$\hat{p}(x) = \frac{1}{n}|\{ i \mid x_i = x\}| = \frac{1}{n} \sum_{i=1}^n \delta_x(x_i)$$

for $x \in \mathcal{X}$. Here $\delta_x(x_i)$ is one if $x_i = x$ and zero otherwise. That is, $\hat{p}(x)$ is the relative frequency of $x$ in the observed sequence. The entropy of the probability distribution given by the empirical point probabilities is

$$H(\hat{p}) = - \sum_{x \in \mathcal{X}} \hat{p}(x) \log \hat{p}(x) = - \sum_{x \in \mathcal{X}} \frac{1}{n} \sum_{i=1}^n \delta_x(x_i) \log \hat{p}(x) = -\frac{1}{n} \sum_{i=1}^n \log\hat{p}(x_i).$$

The last identity follows by interchanging the two sums and noting that

$$\sum_{x \in \mathcal{X}} \delta_x(x_i) \log\hat{p}(x) = \log\hat{p}(x_i).$$

From this we see that

$$H(\hat{p}) = - \frac{1}{n} \log \hat{p}(x^n)$$

with $\hat{p}(x^n) = \prod_{i=1}^n \hat{p}(x_i)$, and using the terminology from the question this is the empirical entropy of the empirical probability distribution. As pointed out by @cardinal in a comment, $- \frac{1}{n} \log p(x^n)$ is the empirical entropy of a given probability distribution with point probabilities $p$.
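A minimal Python sketch, not part of the original answer, that checks this identity numerically and contrasts it with the quantity from the question. The sequence `seq`, the distribution `p_true`, and both function names are made up for illustration; logarithms are natural, so results are in nats.

```python
import math
from collections import Counter

def plug_in_entropy(seq):
    """Entropy H(p_hat) of the empirical distribution of seq, in nats."""
    n = len(seq)
    counts = Counter(seq)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def empirical_entropy(seq, p):
    """-(1/n) log p(x^n) under an i.i.d. model with point probabilities p (a dict)."""
    n = len(seq)
    return -sum(math.log(p[x]) for x in seq) / n

# Hypothetical observed sequence over the sample space {a, b, c}.
seq = list("aabacbaa")
n = len(seq)

# Empirical point probabilities p_hat(x) = relative frequency of x.
p_hat = {x: c / n for x, c in Counter(seq).items()}

# Plug-in entropy, computed directly from the definition of H(p_hat).
h_plug_in = plug_in_entropy(seq)

# Same quantity via -(1/n) log p_hat(x^n): the model is the empirical distribution.
h_self = empirical_entropy(seq, p_hat)

# -(1/n) log p(x^n) for an assumed "known" distribution p (illustrative values).
p_true = {"a": 0.5, "b": 0.25, "c": 0.25}
h_true_model = empirical_entropy(seq, p_true)

print(h_plug_in, h_self, h_true_model)
```

The first two values agree up to rounding, as the derivation predicts. The third is never smaller than the first: $-\frac{1}{n}\log p(x^n) = H(\hat{p}) + D(\hat{p}\,\|\,p)$, and the relative entropy $D(\hat{p}\,\|\,p)$ is nonnegative.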

— NRH

(+1) This provides a nice illustration of what Cover and Thomas refer to as the "strange self-referential character" of the entropy. However, I'm not sure the answer actually addresses (directly) the OP's apparent concerns. :)

— cardinal

@cardinal, I know, and the answer was just a long comment to make this particular point. I did not want to repeat your points.

— NRH

You should not feel bad or hesitate to post your own answer including expansion on my comments or those of others. I'm particularly slow and bad about posting answers, and will never take offense if you or others post answers that incorporate aspects of things I may have previously commented briefly on. Quite the contrary, in fact. Cheers.

— cardinal

Entropy is defined for probability distributions. When you do not have one, but only data, and plug in a naive estimator of the probability distribution, you get empirical entropy. This is easiest for discrete (multinomial) distributions, as shown in another answer, but can also be done for other distributions by binning, etc.

A problem with empirical entropy is that it is biased for small samples. The naive estimate of the probability distribution shows extra variation due to sampling noise. Of course one can use a better estimator, e.g., a suitable prior for the multinomial parameters, but getting it really unbiased is not easy.

The above applies to conditional distributions as well. In addition, everything is relative to binning (or kernelization), so you actually have a kind of differential entropy.
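To make the bias concrete, here is a small sketch, assuming the simplest possible case of i.i.d. Bernoulli draws (the fair coin and the sample sizes are illustrative choices, and the function names are hypothetical). Because the plug-in estimate depends on the sample only through the number of ones, its expectation can be computed exactly from the binomial distribution, with no simulation noise.

```python
import math

def binary_entropy(q):
    """Entropy (in nats) of a Bernoulli(q) distribution, with 0 log 0 = 0."""
    return -sum(t * math.log(t) for t in (q, 1.0 - q) if t > 0.0)

def expected_plug_in_entropy(n, q):
    """Exact E[H(p_hat)] for n i.i.d. Bernoulli(q) draws.

    The number of ones k is Binomial(n, q), and the plug-in estimate
    equals binary_entropy(k / n).
    """
    return sum(
        math.comb(n, k) * q**k * (1.0 - q) ** (n - k) * binary_entropy(k / n)
        for k in range(n + 1)
    )

q = 0.5                           # assumed true parameter (fair coin)
true_entropy = binary_entropy(q)  # equals log 2

# Bias of the plug-in estimator, E[H(p_hat)] - H(p), for several sample sizes.
biases = {n: expected_plug_in_entropy(n, q) - true_entropy
          for n in (5, 20, 100, 500)}

for n, b in biases.items():
    print(f"n={n:4d}  bias={b:+.5f}")
```

In this example the bias is negative at every sample size and shrinks toward zero as $n$ grows, so the plug-in estimate underestimates the entropy on average.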

— scellus

We should be careful with what we are referring to as the empirical entropy here. Note that the plug-in estimator is always biased low for all sample sizes, though the bias will decrease as the sample size increases. It's not only difficult to get unbiased estimators for the entropy, but rather impossible in the general case. There has been fairly intense research in this area over the last several years, particularly in the neuroscience literature. Lots of negative results exist, in fact.

— cardinal
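The two bias claims made in the comments can be checked in a few lines. This is a sketch under the assumption that $X_1, \ldots, X_n$ are i.i.d. with point probabilities $p$ on a finite sample space $\mathcal{X}$; it is not taken from the thread itself. For the quantity that uses the known $p$, linearity of expectation gives

$$E\left[-\frac{1}{n} \log p(X^n)\right] = -\frac{1}{n} \sum_{i=1}^n E\left[\log p(X_i)\right] = -\sum_{x \in \mathcal{X}} p(x) \log p(x) = H(p),$$

so it is unbiased for $H(p)$. For the plug-in quantity, $E[\hat{p}(x)] = p(x)$ for every $x$, and $H$ is a concave function of the point probabilities, so Jensen's inequality gives

$$E\left[H(\hat{p})\right] \le H\left(E[\hat{p}]\right) = H(p),$$

with strict inequality whenever $p$ is not a point mass. This is the sense in which the plug-in estimator is biased low at every sample size.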