Suppose you receive a letter. The outside says you may already be a winner. To check, you would open the letter and read it. Now, a normal person would probably throw the letter away without opening it. But mathematicians are not normal people. We'd ask: why would we discard the letter without opening it? In 1948, Claude Shannon made a crucial realization: information is the answer you don't already know. So we can measure information by measuring the reduction of uncertainty that an answer gives. In this case, we already know what the letter will say: no, you didn't win. So the letter contains zero information. Consequently, it's not worth the effort of opening it.

To measure the quantity of information, consider asking a sequence of yes-no questions. The information content of a message can be viewed as the number of yes-no questions it answers, assuming the questions are asked efficiently. Equivalently, it's the number of yes-no questions you'd need to ask to learn the contents.

For example, suppose a channel always outputs two. How much information does it deliver? Suppose you get a message. The number of yes-no questions you'd have to ask to obtain the content is zero, because we already know the output is two. Remember, information is an answer you don't already know.

Let's consider a different case. Suppose a channel outputs A or B with equal frequency. How much information does the output convey? Suppose we get a message. While we know it's either A or B, we don't know which one, but we can determine it by asking one question: is it A? If yes, then we know it's A. And if no, since it could only be A or B, then we know it's B. So the message answers one question, and therefore contains one bit of information.

Let's consider another case. Suppose the message could be one of three letters: A, B, or C. Half the time the message is A, one quarter of the time it's B, and one quarter of the time it's C. How much information is contained in the message? From our study of data compression, which we'll talk about elsewhere, we know our first question should be: is it A? Half the time the answer will be yes and we'll have one question answered. But half the time the answer will be no and we'll need to ask a follow-up question. If the message isn't A, then it's either B or C, so a follow-up question could be: is it B? A yes tells us the message is B, and a no tells us it's C. This means the message answers one question half the time and two questions the other half. So on average, the message answers 1/2 × 1 + 1/2 × 2 = 1.5 questions and contains 1.5 bits of information.

Shannon formalized this concept using what is now called the Shannon entropy. Imagine asking a question of a system. If the possible answers are A₁, A₂, and so on, then the entropy of the system is

H = −Σᵢ p(Aᵢ) log₂ p(Aᵢ),

where p(Aᵢ) is the probability that the output is the i-th message. Rather than take a long but exciting detour into the theory of probability, we'll use the frequency of a message as a stand-in for its probability.
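To make the formula concrete, here is a minimal sketch in Python. It's our own illustration, not part of the original lecture; the function name `entropy` and the example distributions are assumptions chosen to match the cases discussed above.

```python
import math

def entropy(probs):
    """Shannon entropy in bits: H = sum(p * log2(1/p)),
    equivalent to -sum(p * log2(p)); zero-probability outcomes are skipped."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

# A channel that always outputs the same symbol: no questions needed.
print(entropy([1.0]))              # 0.0 bits

# A or B with equal frequency: one yes-no question.
print(entropy([0.5, 0.5]))         # 1.0 bit

# A, B, C with probabilities 1/2, 1/4, 1/4: 1.5 questions on average.
print(entropy([0.5, 0.25, 0.25]))  # 1.5 bits
```

Each printed value matches the question-counting argument above, which is exactly what the entropy formula is packaging up.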
So suppose a channel outputs A with probability one-third and B with probability two-thirds. How much information does a message from the channel convey? The question "what letter is the output?" has two possible answers: A with probability one-third and B with probability two-thirds. So the entropy will be the negative of the sum of the products of the probabilities and their logs to base two:

H = −(1/3 · log₂ 1/3 + 2/3 · log₂ 2/3) = log₂ 3 − 2/3 ≈ 0.918 bits.

And so we find that each message from this channel carries about 0.918 bits of information.
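As a quick numerical check (again our own sketch, not from the lecture), the same computation in Python confirms the value:

```python
import math

# Entropy of the channel with p(A) = 1/3, p(B) = 2/3.
H = sum(p * math.log2(1 / p) for p in [1/3, 2/3])
print(H)  # ≈ 0.9183 bits, matching the hand calculation
```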