Heuristically, the probability density function on $\{x_1, x_2, \ldots, x_n\}$ with maximum entropy turns out to be the one that corresponds to the least amount of knowledge of $\{x_1, x_2, \ldots, x_n\}$, in other words the uniform distribution.
Now, for a more formal proof consider the following:
A probability density function on $\{x_1, \ldots, x_n\}$ is a set of nonnegative real numbers $p_1, \ldots, p_n$ that add up to $1$; its entropy is $H(p_1, \ldots, p_n) = -\sum_{i=1}^n p_i \log p_i$. Entropy is a continuous function of the $n$-tuples $(p_1, \ldots, p_n)$, and these points lie in a compact subset of $\mathbb{R}^n$, so there is an $n$-tuple where entropy is maximized. We want to show this occurs at $(1/n, \ldots, 1/n)$ and nowhere else.
Suppose the $p_j$ are not all equal, say $p_1 < p_2$. (Clearly $n \neq 1$.) We will find a new probability density with higher entropy. It then follows, since entropy is maximized at some $n$-tuple, that entropy is uniquely maximized at the $n$-tuple with $p_i = 1/n$ for all $i$.
Since $p_1 < p_2$, for small positive $\varepsilon$ we have $p_1 + \varepsilon < p_2 - \varepsilon$. The entropy of $\{p_1 + \varepsilon, p_2 - \varepsilon, p_3, \ldots, p_n\}$ minus the entropy of $\{p_1, p_2, p_3, \ldots, p_n\}$ equals
$$-p_1 \log\left(\frac{p_1+\varepsilon}{p_1}\right) - \varepsilon \log(p_1+\varepsilon) - p_2 \log\left(\frac{p_2-\varepsilon}{p_2}\right) + \varepsilon \log(p_2-\varepsilon)$$
To complete the proof, we want to show this is positive for small enough $\varepsilon$. Rewrite the above expression as
$$-p_1 \log\left(1+\frac{\varepsilon}{p_1}\right) - \varepsilon\left(\log p_1 + \log\left(1+\frac{\varepsilon}{p_1}\right)\right) - p_2 \log\left(1-\frac{\varepsilon}{p_2}\right) + \varepsilon\left(\log p_2 + \log\left(1-\frac{\varepsilon}{p_2}\right)\right)$$
Recalling that $\log(1+x) = x + O(x^2)$ for small $x$, the above expression equals
$$-\varepsilon - \varepsilon \log p_1 + \varepsilon + \varepsilon \log p_2 + O(\varepsilon^2) = \varepsilon \log(p_2/p_1) + O(\varepsilon^2),$$
which is positive when $\varepsilon$ is small enough, since $p_1 < p_2$.
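As a quick numerical sanity check (not part of the proof), here is a small Python sketch that compares the actual entropy gain from such a perturbation with the first-order prediction $\varepsilon \log(p_2/p_1)$. The test distribution and the `entropy` helper are just illustrative choices, and logs are natural:

```python
import numpy as np

def entropy(p):
    """Shannon entropy -sum p_i log p_i (natural log), with 0 log 0 = 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

# A distribution with p1 < p2; move mass epsilon from p2 to p1.
p = np.array([0.1, 0.5, 0.4])
for eps in [0.1, 0.01, 0.001]:
    q = p.copy()
    q[0] += eps          # p1 + eps
    q[1] -= eps          # p2 - eps
    gain = entropy(q) - entropy(p)
    predicted = eps * np.log(p[1] / p[0])   # first-order term from the proof
    print(f"eps={eps}: gain={gain:.6f}, first-order={predicted:.6f}")
```

As $\varepsilon$ shrinks, the two numbers agree up to the $O(\varepsilon^2)$ error, as the computation above predicts.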
A less rigorous proof goes as follows. Consider first the following Lemma:
Let $p(x)$ and $q(x)$ be continuous probability density functions on an interval $I$ in the real numbers, with $p \geq 0$ and $q > 0$ on $I$. We have
$$-\int_I p \log p \, dx \leq -\int_I p \log q \, dx$$
if both integrals exist. Moreover, there is equality if and only if $p(x) = q(x)$ for all $x$.
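The Lemma is not proved here; for completeness, a standard one-line derivation (a sketch, using only the elementary bound $\log t \leq t - 1$, with equality iff $t = 1$) is
$$\int_I p \log\frac{q}{p}\,dx \;\leq\; \int_I p\left(\frac{q}{p} - 1\right)dx \;=\; \int_I q\,dx - \int_I p\,dx \;=\; 1 - 1 \;=\; 0,$$
and moving $\int_I p \log p \, dx$ to the other side gives the stated inequality. Equality forces $q = p$ wherever $p > 0$, and continuity then gives $p = q$ on all of $I$.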
Now, let $p$ be any probability density function on $\{x_1, \ldots, x_n\}$, with $p_i = p(x_i)$; here we apply the discrete analogue of the Lemma, with sums in place of integrals. Letting $q_i = 1/n$ for all $i$,
$$-\sum_{i=1}^n p_i \log q_i = \sum_{i=1}^n p_i \log n = \log n,$$
which is the entropy of $q$. Therefore our Lemma says $h(p) \leq h(q)$, with equality if and only if $p$ is uniform.
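Again purely as an illustration (Python, natural-log entropy, random test distributions, all assumed choices), one can check numerically that $h(p) \leq \log n$, with equality at the uniform distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
bound = np.log(n)  # entropy of the uniform distribution q on n points

for _ in range(5):
    p = rng.dirichlet(np.ones(n))       # a random probability vector
    h = -np.sum(p * np.log(p))          # its entropy h(p)
    print(f"h(p)={h:.6f} <= log n={bound:.6f}: {h <= bound + 1e-12}")

u = np.full(n, 1.0 / n)                 # the uniform distribution
print(f"h(uniform)={-np.sum(u * np.log(u)):.6f}")  # equals log n
```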
Wikipedia also has a brief discussion of this: wiki