The Petersen Graph: March 2014

In my recent post on the other bell curves, I wondered if there was some familiar processes underlying the logistic distribution that would parallel coin-flipping for the normal distribution. I noted there that logistic distribution works better for the ELO rating system than the normal distribution did.

I also noted that the curve that describes the logistic distribution arises from population growth models that account for competition for resources. The ELO rating system is also used in the context of competition, but for games. I wondered whether there was something to this. Was it a mere linguistic coincidence, or was there a deeper connection?

Well, I still don't have a complete answer, but I can see at least one connection between the two scenarios. In both cases, there is competition for a finite resource. For the population growth model it is food supply, and for games, it is wins.

The parallel is not exact, however. In particular, the logistic model is used to describe the size of a population, but for a game, the logistic distribution is only used to compute the probable outcomes within a set of competitors, not the number of competitors. So the connection is not complete.

I've never taken a course in information theory, but recently I've been trying to learn more about it on my own. I reached a satisfactory understanding of most of the concepts, but an acceptable understanding of self-information proved elusive for quite a while. If it were just another term of many, I might have been content to ignore it, but the definitions of all of the other terms rest on this concept, so I didn't consider ignoring it to be an option.

Most introductory texts on information theory that I've seen take the definition as axiomatic. If an event occurs with probability $p$, then self-information of that event is defined to be $-\log p$, where the logarithm is to some chosen base (often 2). What I wanted to understand better is why the self-information is defined that way. In my experience so far, this is rarely explained and never explained well.

It did make some intuitive sense, as various authors claim. One property of this definition is that the self-information of an event increases as the probability decreases. This is a reasonable property (I would even say necessary) for any self-information function. Suppose, for example, that we are playing a game of hangman. The letter "z" is far less frequent in English than the letter "s", so if a "z" is showing, then it is easier to narrow down the word than it would be if an "s" were showing. Thus, the rare event "contains the letter 'z'" gives more information than the common event "contains the letter 's'."

But $-\log p$ is not the only function that works like that. We could also use $1-p,$ which is far simpler and avoids the dreaded logarithms. Simple is always preferred to complicated, so there must be a reason why $-\log p$ is the function of choice.

Well, in order for information to be transmitted, there must be some way to represent it. Spoken languages use sounds. Written languages use words with letters some alphabet. Computers use strings of 0s and 1s, which we could think of as words from a two letter alphabet.

The number of possible strings (words) of length $k$ from an alphabet with $n$ symbols is $n^k$. Let's represent this number by $w$, so $w=n^k.$ We could also write this relationship between $w,$ $n$, and $k$ with the equation $k=\log_n w$. Although this equation is equivalent to $w=n^k$, we can interpret it differently as follows. If we want to assign unique labels to $w$ different objects using strings of the same length from an alphabet of size $n$ (that's a mouthful!), then $k$ is the smallest possible length of the strings.

Now, if these $w$ strings are chosen randomly with equal probability, say $p$, then $p=1/w$. Therefore, $w=1/p,$ and so $k=\log_n \frac{1}{p}=-\log_n p$. Aside from the base appearing explicitly in this expression for $k$, it is identical to the self-information. So the self-information of an event with a given probability $p$ can be interpreted as the length of a string that is needed to represent that event if it is one of many events that all occur with equal probability. Of course, it is possible that some event other than $w$ would have a different probability, but that doesn't change the amount of information provided by by the occurrence of $w$.

I don't know if this (rather long-winded) explanation makes the definition any more sensible for you, the reader. But once I arrived at this realization, I finally felt like I had understood why self-information is defined that way it is.

-----------------

Another difficulty that I had with the definition is the probabilistic model through which is usually explained. I can see why using this model is sensible from the perspective of the working information theorist. But as someone trying to learn about it, I couldn't see how a randomly transmitted symbol could provide any information at all. If I flip a coin and it shows up heads, so what, even if it's an unfair coin and heads is rare? And information is not usually transmitted randomly anyway.

The Petersen Graph

Pages

Sunday, 9 March 2014

Competition and the Logistic Distribution

Thursday, 6 March 2014

Understanding Self-Information