Chap. 12 An Introduction to Information Theory
12.1 Modeling Information Sources
12.1.1 Measure of Information
DMS
A Discrete Memoryless Source (DMS) is the simplest model of an information source.
It is a discrete-time, discrete-amplitude i.i.d. random process.
Alphabet
A full description of the DMS is given by its alphabet $\mathcal{A}$ together with the symbol probabilities:
$\mathcal{A} = \{a_i\}_{i=1}^{N}$ with probabilities $\{p_i\}_{i=1}^{N}$
Self information $I(p_j)$
1. Information content depends only on the probability $p_j$, not the value $a_j$, of the outcome.
2. $I(\cdot)$ is a continuous function.
3. Self-information decreases as probability increases.
4. If $p_j = p_{j1} p_{j2}$, then $I(p_j) = I(p_{j1}) + I(p_{j2})$.
$\Rightarrow I(x) = -\log(x)$
Entropy
$H(X) = \sum_{i=1}^{N} p_i I(p_i) = -\sum_{i=1}^{N} p_i \log p_i = \sum_{i=1}^{N} p_i \log\left(\frac{1}{p_i}\right)$
$H(X)$ is a function of the probability mass function (PMF) of the random variable $X$ and is just a number.
Entropy of uniform distribution
$H(X) = -\sum_{i=1}^{N} \frac{1}{N} \log \frac{1}{N} = \log N$
$\therefore$ bounds for any distribution $Y$: $0 \leq H(Y) \leq \log N$, where $N$ is the number of elements in the alphabet.
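A minimal Python sketch of the entropy formula (the PMF below is an arbitrary example, not taken from the text):

```python
import math

def entropy(pmf, base=2):
    """H(X) = -sum_i p_i log p_i, skipping zero-probability symbols."""
    return -sum(p * math.log(p, base) for p in pmf if p > 0)

pmf = [0.5, 0.25, 0.125, 0.125]      # arbitrary example PMF over N = 4 symbols
N = len(pmf)

print(entropy(pmf))                  # 1.75 bits
print(entropy([1 / N] * N))          # uniform case: log2(4) = 2 bits, the maximum
assert 0 <= entropy(pmf) <= math.log2(N)
```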
12.1.2 Joint and Conditional Entropy
Joint entropy
$H(X, Y) = -\sum_{x, y} p(x, y) \log p(x, y)$
Conditional entropy
$H(X \mid Y) = -\sum_{x, y} p(x, y) \log p(x \mid y)$
$\Leftarrow$ weighted average of $H(X \mid Y = y) = -\sum_x p(x \mid y) \log p(x \mid y)$
Chain rule for entropies
$H(X, Y) = H(Y) + H(X \mid Y)$
$H(\mathbf{X}) = H(X_1) + H(X_2 \mid X_1) + \cdots + H(X_n \mid X_1, X_2, \ldots, X_{n-1})$
if independent: $H(\mathbf{X}) = \sum_{i=1}^{n} H(X_i)$
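A small sketch that checks the chain rule $H(X, Y) = H(Y) + H(X \mid Y)$ numerically (the joint PMF is an arbitrary example):

```python
import math

def H(probs):
    """Entropy in bits of an iterable of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Arbitrary example joint PMF p(x, y) over X, Y in {0, 1}.
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
p_y = {y: sum(p for (x, yy), p in p_xy.items() if yy == y) for y in (0, 1)}

H_XY = H(p_xy.values())                       # joint entropy H(X, Y)
H_Y = H(p_y.values())
# Conditional entropy H(X|Y) = -sum_{x,y} p(x, y) log p(x|y)
H_X_given_Y = -sum(p * math.log2(p / p_y[y]) for (x, y), p in p_xy.items())

assert abs(H_XY - (H_Y + H_X_given_Y)) < 1e-12   # chain rule holds
```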
Entropy rate
For a stationary source, the entropy rate is defined as:
$H = \lim_{n \to \infty} H(X_n \mid X_1, X_2, \ldots, X_{n-1})$
Equivalently (also under stationarity):
$H = \lim_{n \to \infty} \frac{1}{n} H(X_1, X_2, \ldots, X_n)$
Entropy rate plays the role of entropy for sources with memory.
12.1.3 Mutual Information
Mutual information
$I(X; Y) = H(X) - H(X \mid Y)$
$I(X; Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{p(x \mid y)}{p(x)} = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{p(x, y)}{p(x)p(y)}$
$I(X; Y) = I(Y; X) = H(X) + H(Y) - H(X, Y)$
$H(X) - H(X \mid Y)$ is the amount of information provided by the random variable $Y$ about the random variable $X$.
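A sketch of the equivalent expressions for $I(X; Y)$ (same kind of arbitrary joint PMF as above):

```python
import math

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}   # arbitrary example
p_x = {x: sum(p for (xx, y), p in p_xy.items() if xx == x) for x in (0, 1)}
p_y = {y: sum(p for (x, yy), p in p_xy.items() if yy == y) for y in (0, 1)}

# I(X;Y) = sum p(x,y) log[ p(x,y) / (p(x) p(y)) ]
I = sum(p * math.log2(p / (p_x[x] * p_y[y])) for (x, y), p in p_xy.items())

# I(X;Y) = H(X) + H(Y) - H(X, Y)
I_alt = H(p_x.values()) + H(p_y.values()) - H(p_xy.values())
assert abs(I - I_alt) < 1e-12
```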
12.1.4 Differential Entropy
Differential entropy
$h(X) = - \int_{-\infty}^{\infty} f_X(x) \log f_X(x) \, dx$
where $X$ is a continuous random variable with probability density function $f_X(x)$.
Joint differential entropy
$h(X, Y) = - \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x, y) \log f(x, y) \, dx \, dy$
Conditional differential entropy
$h(X \mid Y) = h(X, Y) - h(Y)$
Mutual information
$I(X; Y) = h(Y) - h(Y \mid X) = h(X) - h(X \mid Y)$
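A numerical sketch of differential entropy for a Gaussian density, compared with the known closed form $h(X) = \frac{1}{2}\log_2(2\pi e \sigma^2)$ (the value of $\sigma$ and the integration range are arbitrary choices):

```python
import math

sigma = 1.5   # arbitrary example standard deviation

def f(x):
    """Gaussian pdf with mean 0 and variance sigma^2."""
    return math.exp(-x**2 / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

dx = 1e-3
xs = [i * dx for i in range(-20000, 20001)]                  # integrate over [-20, 20]
h_num = -sum(f(x) * math.log2(f(x)) * dx for x in xs)        # Riemann-sum estimate
h_closed = 0.5 * math.log2(2 * math.pi * math.e * sigma**2)

print(h_num, h_closed)   # both approximately 2.63 bits
```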
12.2 The Source Coding Theorem
Typical sequences
In the case of a DMS, when the sequence length $n$ is large, most sequences will contain each symbol $a_i$ approximately $np_i$ times, where $p_i$ is the probability of symbol $a_i$.
The sequences $\mathbf{x}$ that have this structure are called typical sequences.
$\begin{align*}
P(\mathbf{X} = \mathbf{x}) &\approx \prod_{i=1}^{N} p_i^{n p_i} \\
&= \prod_{i=1}^{N} 2^{n p_i \log p_i} \\
&= 2^{n \sum_{i=1}^{N} p_i \log p_i} \\
&= 2^{-n H(X)}
\end{align*}$
Effective outputs
Probability of typical sequences is approximately $2^{-n H(X)}$.
Probability of non-typical sequences is negligible.
The number of typical sequences is approximately $2^{n H(X)}$.
Thus, even though a source with alphabet size $N$ can produce $N^n$ sequences, only about $2^{n H(X)}$ are effectively likely — these are the "effective" outputs.
We need $n H(X)$ bits to represent them with negligible error. Alternatively, since the source output has length $n$, this corresponds to $H(X)$ bits per source output.
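An empirical sketch of this (the asymptotic equipartition property): for a long i.i.d. sequence, $-\frac{1}{n}\log_2 P(\mathbf{x}) \approx H(X)$. The PMF below is an arbitrary example:

```python
import math, random

pmf = [0.5, 0.3, 0.2]          # arbitrary example DMS with a 3-symbol alphabet
H = -sum(p * math.log2(p) for p in pmf)

n = 100_000
seq = random.choices(range(len(pmf)), weights=pmf, k=n)      # i.i.d. source outputs
log_prob = sum(math.log2(pmf[i]) for i in seq)               # log2 P(X = x)

print(-log_prob / n, H)        # the two numbers should be close (≈ 1.485 bits)
```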
Source Coding Theorem
If $R > H$, the source can be compressed with arbitrarily low error.
If $R < H$, no matter how smart the encoder/decoder is, the error probability cannot be made small.
where $R$ is the code rate in bits per source output and $H$ is the entropy (or entropy rate) of the source.
12.3 Source Coding Algorithms
The entropy of a source sets a fundamental compression limit ($R > H$), and Huffman and Lempel-Ziv algorithms are practical methods that approach this limit.
12.3.1 The Huffman Source Coding Algorithm
Fixed-to-variable length coding
More frequent → shorter code words
Less frequent → longer code words
Self-synchronizing
In the case of Code 1, each code word starts with a 1.
Instantaneous
Code 2 is also self-synchronizing, but decoding requires the first bit of the next code word. In contrast, in Code 1, the end of a code word is recognized immediately.
Uniquely decodable
Can be decoded in only one way.
Not uniquely decodable
In the case of Code 4, the sequence 110110 can be decoded in two ways, as $a_5 a_5$ or as $a_4 a_2 a_3$.
Prefix condition
No code word is a prefix of any other code word; such a code is both uniquely decodable and instantaneous.
Average code word length
Codes 1 and 3 both satisfy the prefix condition, but Code 3 has a smaller average code word length.
Huffman Coding Algorithm
Average code word length of a Huffman code
$$\bar{R} = \text{E}[L] = \sum_{x \in \mathcal{X}} p(x) \, l(x)$$
Bounds: $H(X) \leq \bar{R} < H(X) + 1$, where $l(x)$ is the length of the code word corresponding to the source output $x$.
Efficiency: $\eta = \frac{H(X)}{\bar{R}}$
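A minimal sketch of the Huffman procedure and the resulting average length and efficiency (the source PMF is an arbitrary example):

```python
import heapq, math

def huffman_code(pmf):
    """Build a binary Huffman code; pmf maps symbol -> probability."""
    # Heap entries: (probability, tie-breaker, {symbol: partial code word}).
    heap = [(p, i, {s: ''}) for i, (s, p) in enumerate(pmf.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)       # merge the two least probable nodes
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: '0' + w for s, w in c1.items()}
        merged.update({s: '1' + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, count, merged))
        count += 1
    return heap[0][2]

pmf = {'a1': 0.45, 'a2': 0.25, 'a3': 0.15, 'a4': 0.10, 'a5': 0.05}   # arbitrary example
code = huffman_code(pmf)

R_bar = sum(pmf[s] * len(w) for s, w in code.items())   # average code word length
H_X = -sum(p * math.log2(p) for p in pmf.values())
print(code)
print(R_bar, H_X, H_X / R_bar)   # H(X) <= R_bar < H(X) + 1; efficiency ≈ 0.99
```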
Extension of the source
For sequences of source letters of length $n$ (the $n$th extension of the source):
$$H(X^n) \leq \bar{R}_n < H(X^n) + 1 \quad \text{or} \quad H(X) \leq \bar{R} < H(X) + \frac{1}{n}$$
$\because \bar{R} = \frac{1}{n} \bar{R}_n$ and $H(X^n) = n H(X)$
$\Rightarrow$ Increasing $n$ improves Huffman coding efficiency, but also raises code design and encoding/decoding processing complexity.
Limitations
1. The code depends strongly on the source statistics, which must be known in advance.
2. Exploiting source memory requires coding the $n$th extension of the source, but the complexity grows exponentially with $n$.
$\Rightarrow$ In high-speed storage systems, such as magnetic or optical media, the speed of Huffman coding becomes a bottleneck.
12.3.2 The Lempel-Ziv Source Coding Algorithm
Universal source coding algorithm
Independent of the source statistics.
Variable-to-fixed length coding
Source outputs are uniquely parsed into variable-length phrases and encoded with fixed-length code words.
Parsing the sequence
new phrase = concatenate(known phrase, new symbol)
Encoding
code word of new phrase = concatenate(dictionary location of known phrase, new symbol)
Example
sequence: 01000011000010100000101000001100000101000010
phrases: 0, 1, 00, 001, 10, 000, 101, 0000, 01, 010, 00001, 100, 0001, 0100, 0010
code words: 0000 0, 0000 1, 0001 0, 0011 1, 0010 0, 0011 0, 0101 1, 0110 0, 0001 1, 1001 0, 1000 1, 0101 0, 0110 1, 1010 0, 0100 0
$\Rightarrow$ With 15 phrases, the dictionary location requires 4 bits, plus 1 bit for the new symbol, so each code word is 5 bits long.
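A sketch of the parsing and encoding rules above; it reproduces the phrase list of the example (a trailing incomplete phrase, if any, would simply be dropped by this sketch):

```python
def lz78_parse(bits):
    """Split a bit string into phrases, each = (previously seen phrase) + (one new bit)."""
    phrases, current = [], ''
    for b in bits:
        current += b
        if current not in phrases:      # longest known prefix extended by one new bit
            phrases.append(current)
            current = ''
    return phrases

seq = '01000011000010100000101000001100000101000010'
phrases = lz78_parse(seq)
print(phrases)   # ['0', '1', '00', '001', '10', '000', '101', '0000', '01', ...]

# Code word = 4-bit dictionary location of the known prefix + the new bit.
for ph in phrases:
    prefix, new_bit = ph[:-1], ph[-1]
    location = phrases.index(prefix) + 1 if prefix else 0
    print(f'{location:04b} {new_bit}', end='  ')
```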
Performance of source compression
For stationary ergodic sources, compressed size approaches $nH(X)$ as sequence length grows.
$\Rightarrow$ Widely used in UNIX compression utilities and other programs such as zip and gzip.
Limitations
If the number of phrases is fixed, the dictionary can overflow. To avoid this, the encoder and decoder must remove unused entries and replace them with new ones.
12.4 Modeling of Communication Channels
About channel
A channel receives signals at its input and delivers signals at its output
1) at a later time (storage case)
2) or at another location (transmission case)
Examples of channels: coaxial cable, ionospheric propagation, free space, fiber-optic cables, and magnetic and optical disks.
Due to the presence of fading and noise, the input-output relation in a communication channel is generally a stochastic relation.
Discrete memoryless channel (DMC)
$P(\mathbf{y} \mid \mathbf{x}) = \prod_{i=1}^{n} P(y_i \mid x_i)$
where $\mathbf{x} \in \mathcal{X}^n$, and $\mathbf{y} \in \mathcal{Y}^n$
Binary symmetric channel (BSC)
A special case of a discrete memoryless channel.
$\epsilon = P(0 \mid 1) = P(1 \mid 0)$ and $1-\epsilon = P(0 \mid 0) = P(1 \mid 1)$
$\epsilon$ is called the crossover probability.
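A small simulation sketch of a BSC (the crossover probability 0.1 and the block length are arbitrary examples):

```python
import random

def bsc(bits, eps):
    """Binary symmetric channel: flip each bit independently with probability eps."""
    return [b ^ (random.random() < eps) for b in bits]

eps = 0.1                                             # arbitrary example
x = [random.randint(0, 1) for _ in range(100_000)]
y = bsc(x, eps)
print(sum(a != b for a, b in zip(x, y)) / len(x))     # empirical error rate ≈ 0.1
```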
12.5 Channel Capacity
Channel capacity
The channel capacity $C$ is the fundamental limit on the rate of reliable information transmission over a communication channel.
Reliable transmission (that is, transmission with an arbitrarily small error probability) is possible even over noisy channels as long as $R < C$.
Noisy channel coding theorem
The basic limitation that noise causes in a communication channel is not on the reliability of communication, but on its speed.
Example of a discrete channel
In the channel with inputs a, b, c, d, the outputs overlap (for example, receiving a could mean that a or d was sent), so errors are unavoidable.
But if we use only a and c, their outputs don’t overlap:
- a or b → came from a
- c or d → came from c
This avoids confusion and allows reliable transmission.
Reliable communication over a noisy BSC
By the law of large numbers, a given input sequence is likely to produce an output sequence with approximately $n\epsilon$ errors. The number of such output sequences is given by:
$\binom{n}{n\epsilon} = \frac{n!}{(n - n\epsilon)!(n\epsilon)!}$
where $n$ is the length of sequence, and $\epsilon$ is the crossover probability.
Using Stirling's approximation $n! \approx n^n e^{-n} \sqrt{2\pi n}$, we obtain
$\binom{n}{n\epsilon} \approx 2^{n H_b(\epsilon)}$
where $H_b$ is the binary entropy function.
The number of typical output sequences: $2^{n H(Y)}$
The maximum number of input sequences with nearly nonoverlapping outputs: $M = \frac{2^{n H(Y)}}{2^{n H_b(\epsilon)}} = 2^{n(H(Y) - H_b(\epsilon))}$
Transmission rate per channel use: $R = \frac{\log_2 M}{n} = H(Y) - H_b(\epsilon)$
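A quick numerical check of the approximation $\binom{n}{n\epsilon} \approx 2^{n H_b(\epsilon)}$ (the values of $n$ and $\epsilon$ are arbitrary):

```python
import math

def Hb(eps):
    """Binary entropy function in bits."""
    return -eps * math.log2(eps) - (1 - eps) * math.log2(1 - eps)

n, eps = 1000, 0.1                       # arbitrary example values
k = round(n * eps)
exact = math.log2(math.comb(n, k))       # log2 of the exact binomial coefficient
approx = n * Hb(eps)
print(exact, approx)                     # ≈ 464.4 vs 469.0; the gap grows only like log n
```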
Maximum reliable transmission rate (= channel capacity) over a BSC
In relation $R = H(Y) - H_b(\epsilon)$,
$\epsilon$ depends on the channel and we cannot control it.
$H(Y)$ is maximized when $P(X = 0) = P(X = 1) = 0.5$; in that case, $H(Y) = 1$.
proof)
$P(Y = y \mid X = x) =
\begin{cases}
1 - \epsilon & \text{if } y = x \\
\epsilon & \text{if } y \neq x
\end{cases}$
$P(X = x) =
\begin{cases}
p & \text{if } x = 0 \\
1 - p & \text{if } x = 1
\end{cases}$
$\Rightarrow P(Y = 0) = P(Y = 1) = 0.5$ when $p = 0.5$
$\therefore C = R_{\max} = 1 - H_b(\epsilon)$
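A sketch evaluating $C = 1 - H_b(\epsilon)$ for a few crossover probabilities:

```python
import math

def Hb(eps):
    """Binary entropy function in bits."""
    if eps in (0.0, 1.0):
        return 0.0
    return -eps * math.log2(eps) - (1 - eps) * math.log2(1 - eps)

for eps in (0.0, 0.01, 0.1, 0.5):
    print(eps, 1 - Hb(eps))   # 1.0, ≈0.919, ≈0.531, 0.0 bits per channel use
```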
Noisy channel coding theorem
The capacity of a discrete memoryless channel: $C = \max_{P(x)} I(X; Y)$
Both $R$ (rate) and $C$ (capacity) are measured in bits per transmission or bits per channel use.
12.5.1 Gaussian Channel Capacity
Capacity of a discrete-time Gaussian channel with power constraint
A discrete-time Gaussian channel: for input, output, and noise blocks of length $n$, $\mathbf{y} = \mathbf{x} + \mathbf{z}$
If $n$ is large, by the law of large numbers,
Input power constraint: $\frac{1}{n} \sum_{i=1}^n x_i^2 \leq P$
Noise power: $\frac{1}{n} \sum_{i=1}^{n} z_i^2 = \frac{1}{n} \sum_{i=1}^{n} |y_i - x_i|^2 \leq P_N$
Output power: $\frac{1}{n} \sum_{i=1}^{n} y_i^2 \leq P + P_N$
With high probability (as $n \to \infty$), the output sequences lie inside an $n$-dimensional hypersphere of radius $\sqrt{n(P + P_N)}$ centered at the origin.
If we denote the volume of an $n$-dimensional hypersphere by
$V_n = K_n R^n$
where $R$ denotes the radius and $K_n$ is independent of $R$,
the number of non-overlapping spheres we can pack is:
$M = \frac{K_n (n(P_N + P))^{n/2}}{K_n (nP_N)^{n/2}} = \left( 1 + \frac{P}{P_N} \right)^{n/2}$
The capacity of a discrete-time, additive white Gaussian noise (AWGN) channel with power constraint $P$ and noise power $P_N$ is:
$C = \frac{1}{n} \log M = \frac{1}{2} \log \left( 1 + \frac{P}{P_N} \right)$
Bandlimited Gaussian Waveform Channels
The total noise power in the bandwidth is given by: $P_N = \int_{-W}^{+W} \frac{N_0}{2} \, df = W N_0$
where the noise power spectral density is $N_0/2$.
Discrete-time capacity: $C = \frac{1}{2} \log \left( 1 + \frac{P}{N_0 W} \right) \quad \text{bits/transmission}$
Using the Nyquist sampling theorem, we can make $2W$ transmissions per second.
Continuous-time capacity: $C = W \log \left( 1 + \frac{P}{N_0 W} \right) \quad \text{bits/sec}$
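A sketch of the capacity formula (the values of $P$, $N_0$, and $W$ are arbitrary examples chosen so that the SNR $P/(N_0 W)$ equals 1, i.e. 0 dB):

```python
import math

def awgn_capacity(P, N0, W):
    """C = W log2(1 + P / (N0 * W)) in bits/sec for a bandlimited AWGN channel."""
    return W * math.log2(1 + P / (N0 * W))

# Arbitrary example: P = 1e-6 (watts), N0 = 1e-12 (watts/Hz), W = 1 MHz  =>  SNR = 1 (0 dB)
print(awgn_capacity(1e-6, 1e-12, 1e6))   # 1e6 * log2(2) = 1,000,000 bits/sec
```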
12.6 Bounds on Communication
Effect of power and bandwidth on channel capacity
Capacity of a bandlimited additive white Gaussian noise channel is given by:
$C = W \log \left( 1 + \frac{P}{N_0 W} \right)$
Increasing the input power $P$
When we have more power to spend, we can use more input levels that are far apart, enabling higher transmission rates.
However, the increase in capacity is logarithmic and slow with respect to power.
To keep input levels sufficiently spaced (Δ) while increasing their number, we must use higher amplitude signals, which requires much more power.
Still, the channel capacity can be increased to any value by increasing the input power (in contrast, increasing the bandwidth has a performance limit).
Increasing the bandwidth $W$
This has two contrasting effects:
- we can transmit more samples per second, increasing the transmission rate,
- but higher bandwidth means higher noise power at the receiver, which deteriorates its performance.
By increasing the bandwidth alone, we cannot increase the capacity to any desired value.
$\because \lim_{W \to \infty} W \log\left(1 + \frac{P}{N_0 W} \right) = \frac{P}{N_0} \log e \approx 1.44 \cdot \frac{P}{N_0}$
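A numerical check of this limit (with $P/N_0 = 1$ in arbitrary units):

```python
import math

P, N0 = 1.0, 1.0                        # arbitrary units, so P / N0 = 1
limit = (P / N0) * math.log2(math.e)    # ≈ 1.4427 bits/sec

for W in (1, 10, 100, 10_000):
    print(W, W * math.log2(1 + P / (N0 * W)))   # 1.0, 1.375, 1.436, 1.443 -> limit
print(limit)
```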
Bandwidth efficiency and power efficiency
$R < W \log \left( 1 + \frac{P}{N_0 W} \right)$
Bandwidth efficiency (or spectral bit rate): $r = \frac{R}{W} \quad \frac{\text{bit/sec}}{\text{Hz}} = \text{bit}$
Energy per bit: $\mathcal{E}_b = \frac{P}{R} \quad \frac{\text{energy/sec}}{\text{bit/sec}} = \text{energy/bit}$
Power efficiency: $\frac{\mathcal{E}_b}{N_0} \quad \frac{\text{signal energy/bit}}{\text{noise energy}} = \text{SNR/bit}$
$r < \log \left( 1 + r \frac{\mathcal{E}_b}{N_0} \right)$ or $\frac{\mathcal{E}_b}{N_0} > \frac{2^r - 1}{r}$
Two important parameters in a communication system: $r$, $\frac{\mathcal{E}_b}{N_0}$
- With a higher value of $r$, the system is more bandwidth efficient.
- With a lower value of $\frac{\mathcal{E}_b}{N_0}$ required to achieve a given error probability, the system is more power efficient.
Below the curve: reliable communication is possible.
Above the curve: reliable communication is not possible.
The closer a system's point lies to the curve, the closer its performance is to optimal.
$\lim_{r \to 0} \frac{2^r - 1}{r} = \ln 2 \approx 0.693$ is the absolute minimum value of $\frac{\mathcal{E}_b}{N_0}$ for reliable communication.
$\therefore \frac{\mathcal{E}_b}{N_0} > 0.693 \approx -1.6\text{ dB}$
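A sketch of the bound $\frac{\mathcal{E}_b}{N_0} > \frac{2^r - 1}{r}$ for a few spectral bit rates (the chosen values of $r$ are arbitrary):

```python
import math

def min_ebn0(r):
    """Minimum E_b/N_0 (linear scale) for reliable communication at spectral bit rate r."""
    return (2**r - 1) / r

for r in (0.01, 0.5, 1, 2, 6):
    ebn0 = min_ebn0(r)
    print(r, ebn0, 10 * math.log10(ebn0))   # in dB; as r -> 0 this approaches -1.59 dB
```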
Power-limited case
When $r \ll 1$: the bandwidth is large and the main concern is the limitation on power.
Examples: orthogonal, biorthogonal, and simplex signaling schemes
Bandwidth-limited case
When $r \gg 1$: the bandwidth of the channel is small.
Examples: PAM, QAM, and PSK
The information transmission theorem
If we want to transmit a source $U$ reliably via a channel with capacity $C$, we require that: $H(U) < C$
$\because$ From the Source Coding Theorem, we have $H(U) \le R$
From the Channel Coding Theorem, we have $R < C$