One of the axioms of Shannon's information theory is that (Shannon's) entropy satisfies the coarse-graining (grouping) property:
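In its standard grouping form, writing \(H(p_1, \dots, p_n)\) for the entropy of a discrete distribution:

$$ H(p_1, p_2, p_3, \ldots, p_n) = H(p_1 + p_2, p_3, \ldots, p_n) + (p_1 + p_2)\, H\!\left(\frac{p_1}{p_1 + p_2}, \frac{p_2}{p_1 + p_2}\right) $$

That is, the entropy of the fine-grained distribution equals the entropy of the coarse-grained distribution plus the probability-weighted entropy within the merged group.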
This property is closely related to conditional probabilities.
In communication -- regardless of the types of agents involved (e.g. between two people over a phone, from a parent cell's DNA to a daughter cell's DNA, between a disk's contents at time T and at time T+10, or between me, the writer of this article, and you, the reader) --
there is some 'tolerance' bound that allows "good-enough" intention/semantics to be transmitted and understood between the sender and the receiver.
How is this idea related to rate-distortion theory or to error-correcting codes?
Can this idea help us understand/define "semantic" information? (By contrast, Shannon's information measure is often called "syntactic" because it is ignorant of/invariant to the identities of the events whose probabilities define the process whose uncertainty we are measuring.)
Pondering...
Coarse-graining/level of detail when describing a process
As we 'abstract' away from the particular representational form of an event/instance, we move from the semantics+form domain → → → to a semantics+less-form domain. This allows me to say "The chair is blue" and you to understand what general color the chair is.
At which level of abstraction on this ladder of coarse-graining do we get a sufficient (i.e. good enough to communicate our intentions) level of semantics?
If we measure $H(\tilde{X})$ at that level, can we say that quantity measures 'semantic information'?
The difference $H(G)$ (the average within-group entropy that the coarse-graining discards, per the grouping property above) is the force/gradient that drives the flow of information -- information of what?
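A minimal numerical sketch of this ladder (the shade names, grouping, and probabilities below are made-up illustrations, not from any source): coarse-graining lowers the entropy, and the gap \(H(X) - H(\tilde{X})\) is exactly the average within-group entropy from the grouping property above.

```python
import math

def entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical fine-grained distribution over specific shades (made-up numbers).
fine = {"navy": 0.30, "sky blue": 0.20, "teal": 0.10,
        "crimson": 0.25, "scarlet": 0.15}

# Hypothetical coarse-graining: abstract shades up to everyday color words.
group_of = {"navy": "blue", "sky blue": "blue", "teal": "blue",
            "crimson": "red", "scarlet": "red"}

# Coarse-grained distribution: sum the probability mass within each group.
coarse = {}
for shade, p in fine.items():
    coarse[group_of[shade]] = coarse.get(group_of[shade], 0.0) + p

H_fine, H_coarse = entropy(fine.values()), entropy(coarse.values())

# Average within-group entropy: the residual term in the grouping property.
H_within = sum(q * entropy([fine[s] / q for s in fine if group_of[s] == g])
               for g, q in coarse.items())

print(f"H(X)                 = {H_fine:.4f} bits")    # fine-grained
print(f"H(X~)                = {H_coarse:.4f} bits")  # coarse-grained
print(f"H(X~) + within-group = {H_coarse + H_within:.4f} bits")  # recovers H(X)
```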
Measure
A measure \(\mu\) assigns a (nonnegative) 'size' to each measurable set. For example, in the geometric case, \(\mu(A)\) can be interpreted as the (physical) length (if \(A\) is one-dimensional), area (if two-dimensional), or volume (if three-dimensional) of a region \(A\). In the case that \(\mu\) is a probability measure, \(\mu(A)\) is the probability mass of the event "the random variable \(X\) takes values in the set \(A\)" (also called event \(A\)).
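As a concrete sketch for the probability case (the standard normal here is an arbitrary choice, not from the source): for a real-valued \(X\) with CDF \(F\), the mass of an interval event \(A = [a, b]\) is \(\mu(A) = F(b) - F(a)\).

```python
from scipy.stats import norm  # standard normal: an arbitrary example measure

a, b = -1.0, 1.0  # event A = [a, b]: "X takes a value in [a, b]"

# mu(A) = P(X in A) = F(b) - F(a), using the CDF F.
mass = norm.cdf(b) - norm.cdf(a)
print(f"mu([{a}, {b}]) = {mass:.4f}")  # ~0.6827 for the standard normal
```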
Density
A (probability) density is a function that transforms one measure to another measure by pointwise reweighting (on the abstract sample space \(\Omega\))
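Concretely, in standard notation (not the source's): if \(f\) is the density of a measure \(\nu\) with respect to a base measure \(\mu\), then for every measurable set \(A\)

$$ \nu(A) = \int_A f \, d\mu, \qquad f = \frac{d\nu}{d\mu} $$

so \(f\) reweights \(\mu\) pointwise; \(f\) is the Radon-Nikodym derivative of \(\nu\) with respect to \(\mu\). The familiar probability density on \(\mathbb{R}\) is the special case where \(\mu\) is the Lebesgue measure.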
Measure-theoretic formalism for Probability
abstract probability space vs. observation space
Think of the abstract probability space as the entire system of the universe. A point in the space is a state of the universe (e.g. a long vector of values assigned to the states of all existing atoms). We often don't have direct access to this "state", i.e. it is not fully observable to us. Instead we observe/measure variables that are functions of this atomic configuration/state \(\omega\). This mapping from a state of the universe to the value that the variable of our interest is observed/measured to take is called a "Random Variable".
Random Variable
- is a function that maps an outcome \(\omega\) in the abstract probability space's sample space \(\Omega\) to a point in the observation space's sample space (often \(\mathbb{R}\))
- it is the key component that connects the abstract probability space (which we don't get to directly observe) to the observation space
- Image measure \(\mu_{X}\) is the (derived/induced) measure on the observation space that is related to the abstract probability space via the random variable \(X\).
- We need it because the measure \(\mathbb{P}\) on the abstract probability space is not known explicitly, yet we need a way to describe the measure of sets in the Borel \(\sigma\)-algebra of the observation space.
- To assign a measure to an event in the observation space, we use the "image measure" \(\mu_{X}\), which is linked to \(\mathbb{P}\) via:
$$ \mu_{X}(A) := \mathbb{P}(X^{-1}(A))$$
- In other words, we compute the probability measure of an event \(A\) (i.e. the probability that the random variable \(X\) takes a value in the set \(A\)) in two steps (see the sketch after this list):
1. Map the set \(A\) in the observation space back to a set in the abstract probability space, \(A^{\leftarrow} = X^{-1}(A)\)
2. Compute the probability of the event \(A^{\leftarrow}\) using \(\mathbb{P}\)
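A minimal finite-space sketch of this two-step recipe (the states, \(X\), and \(\mathbb{P}\) below are made-up toy values):

```python
# Toy abstract sample space: four "states of the universe" and their measure P.
P = {"w1": 0.1, "w2": 0.2, "w3": 0.3, "w4": 0.4}

# Random variable X: a function from abstract states to observed values.
X = {"w1": 0, "w2": 1, "w3": 1, "w4": 2}

def image_measure(A):
    """mu_X(A) = P(X^{-1}(A)): pull A back through X, then measure with P."""
    preimage = {w for w in P if X[w] in A}  # step 1: A_pre = X^{-1}(A)
    return sum(P[w] for w in preimage)      # step 2: P of the preimage

print(image_measure({1}))     # mu_X({1})    = P({w2, w3}) = 0.5
print(image_measure({0, 2}))  # mu_X({0, 2}) = P({w1, w4}) = 0.5
```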
Relationship between two random variables and their image measures
"Density" describes the relationship between two random variables and their image measures:
Source
Theoretical Foundations of Nonparametric Bayesian Models, by P. Orbanz. MLSS 2009: video parts 1, 2; slides 1, 2. A great introduction to measure theory, in just enough detail to be relevant for statistics (and nonparametric Bayesian models).
Lecture on VAE, by Ali Ghodsi: This lecture motivates the KL divergence as measuring the difference in the average information content of two random variables, whose distributions are \(p\) and \(q\) in this article.
Wiki: It clears up the different terminologies that are (mis)used to refer to the KL divergence.
It gives a great example of the "answering 20 questions" problem as a way to think about basic concepts in information theory, including entropy, KL divergence, and mutual information.
\(H(X)\) is (approximately) the average depth of an optimal question tree, i.e. the expected number of yes/no questions needed to get to the choice \(x\)
"(Using \(H(X)\),) (f)or any probability distribution, we can now talk about "how uncertain we are about the outcome", "how much information is in the process", or "how much entropy the process has", and even measure it, in bits" (p.3)