I am reading the book "Elements of Information Theory" by Cover and Thomas and I am having trouble understanding the various ideas conceptually.

For example, I know that H(X) can be interpreted as the average encoding length. But what does H(Y|X) intuitively mean?

And what is mutual information? I read things like "It is the reduction in the uncertainty of one random variable due to the knowledge of the other". This doesn't mean anything to me, as it doesn't help me explain in words why I(X;Y) = H(Y) − H(Y|X), or explain the chain rule for mutual information.

I also encountered the data processing inequality, explained as something that can be used to show that no clever manipulation of the data can improve the inferences that can be made from the data: if X → Y → Z, then I(X;Y) ≥ I(X;Z). If I had to explain this result to someone in words and say why it should be intuitively true, I would have absolutely no idea what to say. Even explaining how "data processing" is related to Markov chains and mutual information would baffle me.

I can imagine explaining a result in algebraic topology to someone, since there is usually an intuitive geometric picture that can be drawn. But with information theory, if I had to explain a result to someone at a level comparable to a picture, I would not be able to.

When I do problems, it's just abstract symbolic manipulation and trial and error. I am looking for an explanation (not these "blah gives information about blah" explanations) of the various terms that will make the solutions to problems appear in a meaningful way.

Right now I feel like someone trying to do algebraic topology purely symbolically, without thinking about geometric pictures.

Is there a book that will help my curse?


Bria Berg · Accepted Answer

I believe that focusing on the first part of your question is a good starting point.

Suppose we are dealing with a given process X that we would like to better understand and characterize. The mutual information measures how much our uncertainty about X is reduced when a second process, say Y, is available. If X and Y are independent, then knowing Y gives us no extra information about X, and no uncertainty reduction occurs. Conversely, if X and Y are related in some way, then information from Y is useful to "better define" the original process X.

The mutual information formalizes these statements through I(X;Y) = H(X) − H(X|Y). If X and Y are independent, then H(X|Y) = H(X), so I(X;Y) = 0: we have no improvement in our knowledge of X.

If X and Y are not independent, then I(X;Y) > 0. This follows from Jensen's inequality, which gives I(X;Y) ≥ 0 with equality only in the independent case. We have an uncertainty reduction, and the knowledge of Y is useful to better understand X.

In this framework the "absolute" uncertainty of a given process, say X, is measured by its entropy H(X). If this concept is not so clear, I suggest reading the wiki page on self-information and "surprisal": http://en.wikipedia.org/wiki/Surprisal

Note that mutual information is not a distance in the strict mathematical sense: it is a measure of "distance", or a distance-like function. If you want to define a distance between the processes X and Y, you need to introduce the variation of information.

This last fact can be a bit "disturbing/annoying": why am I supposed to talk about distances when I use functions that are not distances themselves? A related topic is the use of divergences (which are related to mutual information) versus distances in information geometry.

The Cover and Thomas book is a very good textbook. If you are interested in the geometry behind information theory, you can read "Methods of Information Geometry" by Amari and Nagaoka. If you are interested in applications of entropies and reduction of uncertainty, why not consult the book "Introduction to Clustering Large and High-Dimensional Data" by Kogan? Chapters 6, 7 and 8 provide useful applications.
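To make the uncertainty-reduction picture concrete, here is a small numerical sketch. It is only an illustration under assumptions of my own: the two 2×2 joint tables and all function names are made up for this example and do not come from Cover and Thomas. It computes H(X), H(X|Y) and I(X;Y) directly from a joint table p(x, y), so you can check that I(X;Y) = H(X) − H(X|Y) and that it vanishes for an independent pair.

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * log2(p) for p in probs if p > 0)

def marginals(pxy):
    """Row sums give p(x), column sums give p(y) for a table pxy[x][y]."""
    px = [sum(row) for row in pxy]
    py = [sum(col) for col in zip(*pxy)]
    return px, py

def joint_entropy(pxy):
    """H(X,Y): entropy of all cells of the joint table."""
    return entropy(p for row in pxy for p in row)

def conditional_entropy(pxy):
    """H(X|Y) = H(X,Y) - H(Y): uncertainty left in X once Y is known."""
    _, py = marginals(pxy)
    return joint_entropy(pxy) - entropy(py)

def mutual_information(pxy):
    """I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    px, py = marginals(pxy)
    return entropy(px) + entropy(py) - joint_entropy(pxy)

def variation_of_information(pxy):
    """VI(X,Y) = H(X|Y) + H(Y|X) = H(X,Y) - I(X;Y), which is a true metric."""
    return joint_entropy(pxy) - mutual_information(pxy)

# Hypothetical example tables (illustrative values only).
independent = [[0.125, 0.375],   # p(x, y) = p(x) * p(y)
               [0.125, 0.375]]
dependent = [[0.4, 0.1],         # X and Y agree 80% of the time
             [0.1, 0.4]]

for name, pxy in [("independent", independent), ("dependent", dependent)]:
    px, _ = marginals(pxy)
    print(name,
          "H(X) =", round(entropy(px), 4),
          "H(X|Y) =", round(conditional_entropy(pxy), 4),
          "I(X;Y) =", round(mutual_information(pxy), 4),
          "VI =", round(variation_of_information(pxy), 4))
```

In the independent table, H(X|Y) equals H(X), so nothing is gained by observing Y. In the dependent table, H(X|Y) drops below H(X), and the size of that drop is exactly I(X;Y).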

enclinesbnnbk · Accepted Answer

Since you have an intuitive understanding of entropy based on the compression theorem, you should look into the operational meaning of mutual information, which is the channel coding theorem. It says that if a noisy channel with input X and output Y induces a joint distribution p(x,y), then information encoded in X can be transmitted reliably to a receiving party with access to Y at any rate up to I(X;Y) bits per channel use.
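As a rough sketch of what this means numerically, consider a binary symmetric channel that flips each bit with probability eps. The channel, the crossover values 0.1 and 0.2, and the helper names below are assumptions chosen for illustration. With a uniform input, I(X;Y) should match the closed form 1 − H(eps); for this particular channel the uniform input is also the one that maximizes I(X;Y). Sending Y through a second such channel to get Z gives a Markov chain X → Y → Z, which lets you see the data processing inequality from the question in numbers.

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * log2(p) for p in probs if p > 0)

def mutual_information(pxy):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for a joint table pxy[x][y]."""
    px = [sum(row) for row in pxy]
    py = [sum(col) for col in zip(*pxy)]
    hxy = entropy(p for row in pxy for p in row)
    return entropy(px) + entropy(py) - hxy

def bsc_joint(eps, p1=0.5):
    """Joint p(x, y) when P(X=1) = p1 and the channel flips a bit with prob. eps."""
    p0 = 1 - p1
    return [[p0 * (1 - eps), p0 * eps],
            [p1 * eps, p1 * (1 - eps)]]

# Illustrative crossover probabilities (assumed values).
eps_xy = 0.1   # channel X -> Y
eps_yz = 0.2   # channel Y -> Z

i_xy = mutual_information(bsc_joint(eps_xy))
closed_form = 1 - entropy([eps_xy, 1 - eps_xy])

# Two cascaded BSCs act like one BSC: a net flip needs exactly one of the two flips.
eps_xz = eps_xy * (1 - eps_yz) + (1 - eps_xy) * eps_yz
i_xz = mutual_information(bsc_joint(eps_xz))

print("I(X;Y) =", round(i_xy, 4), " closed form 1 - H(eps) =", round(closed_form, 4))
print("I(X;Z) =", round(i_xz, 4), " (never larger than I(X;Y))")
```

The second channel can only add noise, so Z tells you less about X than Y did. That is the content of I(X;Y) ≥ I(X;Z): no processing applied to Y can raise the rate at which it carries information about X.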
