Heuristically, the probability density function on $\{x_1, x_2, \ldots, x_n\}$ with maximum entropy turns out to be the one that corresponds to the least amount of knowledge of $\{x_1, x_2, \ldots, x_n\}$, in other words the uniform distribution.
Now, for a more formal proof consider the following:
A probability density function on $\{x_1, \ldots, x_n\}$ is a set of nonnegative real numbers $p_1, \ldots, p_n$ that add up to $1$; its entropy is $H(p_1, \ldots, p_n) = -\sum_{i=1}^n p_i \log p_i$. Entropy is a continuous function of the $n$-tuples $(p_1, \ldots, p_n)$, and these points lie in a compact subset of $\mathbb{R}^n$, so there is an $n$-tuple where entropy is maximized. We want to show this occurs at $(1/n, \ldots, 1/n)$ and nowhere else.
Suppose the $p_j$ are not all equal, say $p_1 < p_2$. (Clearly $n \neq 1$.) We will find a new probability density with higher entropy. It then follows, since entropy is maximized at some $n$-tuple, that entropy is uniquely maximized at the $n$-tuple with $p_i = 1/n$ for all $i$.
Since $p_1 < p_2$, for small positive $\varepsilon$ we have $p_1 + \varepsilon < p_2 - \varepsilon$. The entropy of $\{p_1 + \varepsilon, p_2 - \varepsilon, p_3, \ldots, p_n\}$ minus the entropy of $\{p_1, p_2, p_3, \ldots, p_n\}$ equals
$$-p_1 \log\left(\frac{p_1+\varepsilon}{p_1}\right) - \varepsilon \log(p_1+\varepsilon) - p_2 \log\left(\frac{p_2-\varepsilon}{p_2}\right) + \varepsilon \log(p_2-\varepsilon)$$
To complete the proof, we want to show this is positive for small enough $\varepsilon$. Rewrite the above expression as
$$-p_1 \log\left(1+\frac{\varepsilon}{p_1}\right) - \varepsilon\left(\log p_1 + \log\left(1+\frac{\varepsilon}{p_1}\right)\right) - p_2 \log\left(1-\frac{\varepsilon}{p_2}\right) + \varepsilon\left(\log p_2 + \log\left(1-\frac{\varepsilon}{p_2}\right)\right)$$
Recalling that $\log(1+x) = x + O(x^2)$ for small $x$, the above expression equals
$$-\varepsilon - \varepsilon \log p_1 + \varepsilon + \varepsilon \log p_2 + O(\varepsilon^2) = \varepsilon \log(p_2/p_1) + O(\varepsilon^2),$$
which is positive when $\varepsilon$ is small enough, since $p_1 < p_2$.
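As a quick numerical sanity check (not part of the proof), here is a small Python sketch that compares the actual entropy gain from such a perturbation with the first-order prediction $\varepsilon \log(p_2/p_1)$. The test distribution and the `entropy` helper are just illustrative choices, and logs are natural:

```python
import numpy as np

def entropy(p):
    """Shannon entropy -sum p_i log p_i (natural log), with 0 log 0 = 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

# A distribution with p1 < p2; move mass epsilon from p2 to p1.
p = np.array([0.1, 0.5, 0.4])
for eps in [0.1, 0.01, 0.001]:
    q = p.copy()
    q[0] += eps          # p1 + eps
    q[1] -= eps          # p2 - eps
    gain = entropy(q) - entropy(p)
    predicted = eps * np.log(p[1] / p[0])   # first-order term from the proof
    print(f"eps={eps}: gain={gain:.6f}, first-order={predicted:.6f}")
```

As $\varepsilon$ shrinks, the two numbers agree up to the $O(\varepsilon^2)$ error, as the computation above predicts.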
A less rigorous proof goes as follows. Consider first the following Lemma:
Let $p(x)$ and $q(x)$ be continuous probability density functions on an interval $I$ in the real numbers, with $p \geq 0$ and $q > 0$ on $I$. We have
$$-\int_I p \log p \, dx \leq -\int_I p \log q \, dx$$
if both integrals exist. Moreover, there is equality if and only if $p(x) = q(x)$ for all $x$.
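The Lemma is not proved here; for completeness, a standard one-line derivation (a sketch, using only the elementary bound $\log t \leq t - 1$, with equality iff $t = 1$) is
$$\int_I p \log\frac{q}{p}\,dx \;\leq\; \int_I p\left(\frac{q}{p} - 1\right)dx \;=\; \int_I q\,dx - \int_I p\,dx \;=\; 1 - 1 \;=\; 0,$$
and moving $\int_I p \log p \, dx$ to the other side gives the stated inequality. Equality forces $q = p$ wherever $p > 0$, and continuity then gives $p = q$ on all of $I$.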
Now, let $p$ be any probability density function on $\{x_1, \ldots, x_n\}$, with $p_i = p(x_i)$; here we apply the discrete analogue of the Lemma, with sums in place of integrals. Letting $q_i = 1/n$ for all $i$,
$$-\sum_{i=1}^n p_i \log q_i = \sum_{i=1}^n p_i \log n = \log n,$$
which is the entropy of $q$. Therefore our Lemma says $h(p) \leq h(q)$, with equality if and only if $p$ is uniform.
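Again purely as an illustration (Python, natural-log entropy, random test distributions, all assumed choices), one can check numerically that $h(p) \leq \log n$, with equality at the uniform distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
bound = np.log(n)  # entropy of the uniform distribution q on n points

for _ in range(5):
    p = rng.dirichlet(np.ones(n))       # a random probability vector
    h = -np.sum(p * np.log(p))          # its entropy h(p)
    print(f"h(p)={h:.6f} <= log n={bound:.6f}: {h <= bound + 1e-12}")

u = np.full(n, 1.0 / n)                 # the uniform distribution
print(f"h(uniform)={-np.sum(u * np.log(u)):.6f}")  # equals log n
```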
Wikipedia also has a brief discussion of this: wiki