
[Channel Study] Fundamentals of Communication Systems : Chap. 12

Sechan Oh 2025. 5. 19. 09:09

Chap. 12 An Introduction to Information Theory

12.1 Modeling Information Sources

12.1.1 Measure of Information

DMS

A Discrete Memoryless Source (DMS) is the simplest model of an information source.

It is a discrete-time, discrete-amplitude i.i.d. random process.

 

Alphabet

A full description of the DMS is given by its alphabet $\mathcal{A}$ together with the probabilities of its letters:

$\mathcal{A} = \{a_i\}_{i=1}^{N}$ with $\{p_i\}_{i=1}^{N}$

 

Self information $I(p_j)$

1. Information content depends only on the probability $p_j$, not the value $a_j$, of the outcome.
2. $I(\cdot)$ is a continuous function.
3. Self-information decreases as probability increases.
4. If $p_j = p_{j1} p_{j2}$, then $I(p_j) = I(p_{j1}) + I(p_{j2})$.

$\Rightarrow I(x) = -\log(x)$

 

Entropy

$H(X) = \sum_{i=1}^{N} p_i I(p_i) = -\sum_{i=1}^{N} p_i \log p_i = \sum_{i=1}^{N} p_i \log\left(\frac{1}{p_i}\right)$

$H(X)$ is a function of the probability mass function (PMF) of the random variable $X$ and is just a number.

 

Entropy of uniform distribution

$H(X) = -\sum_{i=1}^{N} \frac{1}{N} \log \frac{1}{N} = \log N$

$\therefore$ bounds for any discrete random variable $Y$ with $N$ possible values: $0 \leq H(Y) \leq \log N$.
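
To make the definition concrete, here is a minimal Python sketch (not from the textbook; the example PMF and the helper name `entropy` are made up) that computes $H(X)$ and checks that the uniform PMF attains the upper bound $\log N$:

```python
import math

def entropy(pmf, base=2):
    """H(X) = -sum_i p_i log p_i for a discrete PMF (terms with p_i = 0 are skipped)."""
    assert abs(sum(pmf) - 1.0) < 1e-9, "probabilities must sum to 1"
    return -sum(p * math.log(p, base) for p in pmf if p > 0)

# Example PMF over a 4-letter alphabet (hypothetical)
print(entropy([0.5, 0.25, 0.125, 0.125]))   # 1.75 bits

# The uniform PMF attains the upper bound log2 N
N = 8
print(entropy([1 / N] * N), math.log2(N))   # both 3.0
```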

 

12.1.2 Joint and Conditional Entropy

Joint entropy

$H(X, Y) = -\sum_{x, y} p(x, y) \log p(x, y)$

 

Conditional entropy

$H(X \mid Y) = -\sum_{x, y} p(x, y) \log p(x \mid y)$

$\Leftarrow$ weighted average of $H(X \mid Y = y) = -\sum_x p(x \mid y) \log p(x \mid y)$

 

Chain rule for entropies

$H(X, Y) = H(Y) + H(X \mid Y)$

$H(\mathbf{X}) = H(X_1) + H(X_2 \mid X_1) + \cdots + H(X_n \mid X_1, X_2, \ldots, X_{n-1})$

if independent: $H(\mathbf{X}) = \sum_{i=1}^{n} H(X_i)$
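
As a quick numerical check of the chain rule, here is a small Python sketch (the joint PMF below is made up purely for illustration):

```python
import math

def H(probs):
    """Entropy in bits of an iterable of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical joint PMF p(x, y) for X, Y in {0, 1}
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
p_y = {y: sum(p for (_, y2), p in p_xy.items() if y2 == y) for y in (0, 1)}

H_xy = H(p_xy.values())   # H(X, Y)
H_y = H(p_y.values())     # H(Y)
# H(X|Y) = -sum_{x,y} p(x, y) log p(x|y),  with p(x|y) = p(x, y) / p(y)
H_x_given_y = -sum(p * math.log2(p / p_y[y]) for (x, y), p in p_xy.items() if p > 0)

print(H_xy, H_y + H_x_given_y)   # the two values agree: H(X,Y) = H(Y) + H(X|Y)
```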

 

Entropy rate

For a stationary source, the entropy rate is defined as

$H = \lim_{n \to \infty} H(X_n \mid X_1, X_2, \ldots, X_{n-1})$

and it can equivalently be computed as

$H = \lim_{n \to \infty} \frac{1}{n} H(X_1, X_2, \ldots, X_n)$

Entropy rate plays the role of entropy for sources with memory.

 

12.1.3 Mutual Information

Mutual information

$I(X; Y) = H(X) - H(X \mid Y)$

$I(X; Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{p(x \mid y)}{p(x)} = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{p(x, y)}{p(x)p(y)}$

$I(X; Y) = I(Y; X) = H(X) + H(Y) - H(X, Y)$

$I(X; Y) = H(X) - H(X \mid Y)$ is the amount of information that the random variable $Y$ provides about the random variable $X$.
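
A companion sketch (same kind of made-up joint PMF as above) verifying that the definition of $I(X;Y)$ and the identity $H(X) + H(Y) - H(X, Y)$ give the same number:

```python
import math

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical joint PMF p(x, y)
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
p_x = {x: sum(p for (x2, _), p in p_xy.items() if x2 == x) for x in (0, 1)}
p_y = {y: sum(p for (_, y2), p in p_xy.items() if y2 == y) for y in (0, 1)}

# Definition: I(X;Y) = sum_{x,y} p(x,y) log [ p(x,y) / (p(x) p(y)) ]
I_def = sum(p * math.log2(p / (p_x[x] * p_y[y])) for (x, y), p in p_xy.items() if p > 0)

# Identity: I(X;Y) = H(X) + H(Y) - H(X,Y)
I_id = H(p_x.values()) + H(p_y.values()) - H(p_xy.values())

print(I_def, I_id)   # equal up to floating-point error
```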

 

12.1.4 Differential Entropy

Differential entropy

$h(X) = - \int_{-\infty}^{\infty} f_X(x) \log f_X(x) \, dx$

where $X$ is a continuous random variable with probability density function $f_X(x)$.
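
A standard worked example (not in the original notes): for a Gaussian random variable $X \sim \mathcal{N}(0, \sigma^2)$ with density $f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-x^2/2\sigma^2}$,

$$h(X) = -\int_{-\infty}^{\infty} f_X(x) \log f_X(x) \, dx = \frac{1}{2}\log\left(2\pi\sigma^2\right) + \frac{\log e}{2\sigma^2}\,\text{E}[X^2] = \frac{1}{2}\log\left(2\pi e \sigma^2\right)$$

Unlike the entropy of a discrete source, $h(X)$ can be negative (here, whenever $2\pi e \sigma^2 < 1$).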

 

Joint differential entropy

$h(X, Y) = - \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x, y) \log f(x, y) \, dx \, dy$

 

Conditional differential entropy

$h(X \mid Y) = h(X, Y) - h(Y)$

 

Mutual information

$I(X; Y) = h(Y) - h(Y \mid X) = h(X) - h(X \mid Y)$

 

12.2 The Source Coding Theorem

Typical sequences

In the case of a DMS, when the sequence length $n$ is large, most sequences will contain each symbol $a_i$ approximately $np_i$ times, where $p_i$ is the probability of symbol $a_i$.

The sequences $\mathbf{x}$ that have this structure are called typical sequences.

$$\begin{aligned}
P(\mathbf{X} = \mathbf{x}) &\approx \prod_{i=1}^{N} p_i^{n p_i} \\
&= \prod_{i=1}^{N} 2^{n p_i \log p_i} \\
&= 2^{n \sum_{i=1}^{N} p_i \log p_i} \\
&= 2^{-n H(X)}
\end{aligned}$$

 

Effective outputs

Probability of typical sequences is approximately $2^{-n H(X)}$.

Probability of non-typical sequences is negligible.

The number of typical sequences is approximately $2^{n H(X)}$.

Thus, even though a source with alphabet size $N$ can produce $N^n$ sequences, only about $2^{n H(X)}$ are effectively likely — these are the "effective" outputs.

We need $n H(X)$ bits to represent them with negligible error. Alternatively, since the source output has length $n$, this corresponds to $H(X)$ bits per source output.
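
A short simulation sketch (Python; the 3-letter DMS below is made up) illustrating the typicality statement: the per-symbol log-probability $-\frac{1}{n}\log_2 P(\mathbf{x})$ of a long generated sequence concentrates around $H(X)$:

```python
import math
import random

probs = {"a": 0.5, "b": 0.3, "c": 0.2}                       # hypothetical DMS
H = -sum(p * math.log2(p) for p in probs.values())           # ~1.485 bits

random.seed(0)
n = 10_000
x = random.choices(list(probs), weights=list(probs.values()), k=n)

# log2-probability of the particular sequence that was generated
log_prob = sum(math.log2(probs[s]) for s in x)
print(-log_prob / n, H)   # both close to H(X), as the typical-sequence argument predicts
```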

 

Source Coding Theorem

If $R > H$, the source can be compressed with arbitrarily low error.

If $R < H$, no matter how smart the encoder/decoder is, the error probability cannot be made small.

where $R$ is the rate in bits per source output and $H$ is the entropy (or entropy rate) of the source.

 

12.3 Source Coding Algorithms

The entropy of a source sets a fundamental compression limit ($R > H$), and Huffman and Lempel-Ziv algorithms are practical methods that approach this limit.

 

12.3.1 The Huffman Source Coding Algorithm

Fixed-to-variable length coding:
- More frequent symbols → shorter code words
- Less frequent symbols → longer code words

 

 

Self-synchronizing

In the case of Code 1, each code word starts with a 1.

 

Instantaneous

Code 2 is also self-synchronizing, but decoding requires the first bit of the next code word. In contrast, in Code 1, the end of a code word is recognized immediately.

 

Uniquely decodable

Can be decoded in only one way.

 

Not uniquely decodable

In the case of Code 4, the sequence 110110 can be decoded in two ways, as $a_5 a_5$ or as $a_4 a_2 a_3$.

 

Prefix condition

No code word is a prefix of another code word; a code satisfying the prefix condition is both uniquely decodable and instantaneous.

 

Average code word length

Codes 1 and 3 both satisfy the prefix condition, but Code 3 has a smaller average code word length.

 

Huffman Coding Algorithm

 

Average code word length of a Huffman code

$$\bar{R} = \text{E}[L] = \sum_{x \in \mathcal{X}} p(x) \, l(x)$$

Bounds: $H(X) \leq \bar{R} < H(X) + 1$, where $l(x)$ is the length of the code word corresponding to the source output $x$.

Efficiency: $\eta = \frac{H(X)}{\bar{R}}$
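
A compact Python sketch of binary Huffman coding (an illustration, not the textbook's construction; the 5-letter PMF and the helper name `huffman_code` are made up). It also reports $\bar{R}$, $H(X)$, and the efficiency $\eta = H(X)/\bar{R}$:

```python
import heapq
import math

def huffman_code(pmf):
    """Return {symbol: binary code word} for the given PMF (binary Huffman coding)."""
    # Heap entries: (probability, unique tie-breaker, {symbol: partial code word})
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(pmf.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)                # two least probable nodes
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}   # prepend one bit on each merge
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

pmf = {"a1": 0.45, "a2": 0.25, "a3": 0.15, "a4": 0.10, "a5": 0.05}   # hypothetical source
code = huffman_code(pmf)

R_bar = sum(p * len(code[s]) for s, p in pmf.items())   # average code word length
H = -sum(p * math.log2(p) for p in pmf.values())        # source entropy
print(code)
print(f"R_bar = {R_bar:.3f}, H(X) = {H:.3f}, efficiency = {H / R_bar:.3f}")
# R_bar = 2.000, H(X) = 1.977, efficiency = 0.989
```

For this particular PMF, $\bar{R}$ indeed falls between $H(X)$ and $H(X) + 1$, as the bound above requires.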

 

Extension of the source

For sequences of source letters of length $n$ (the $n$th extension of the source):

$$H(X^n) \leq \bar{R}_n < H(X^n) + 1 \quad \text{or} \quad H(X) \leq \bar{R} < H(X) + \frac{1}{n}$$

$\because \bar{R} = \frac{1}{n} \bar{R}_n$ and $H(X^n) = n H(X)$
$\Rightarrow$ Increasing $n$ improves Huffman coding efficiency, but also raises code design and encoding/decoding processing complexity.

 

Limitations

1. The code depends strongly on the source statistics, which must be known in advance.

2. Exploiting the memory of the source requires the $n$th extension, but the complexity grows exponentially with $n$.

$\Rightarrow$ In high-speed storage systems, such as magnetic or optical media, the speed of Huffman coding becomes a bottleneck.

 

12.3.2 The Lempel-Ziv Source Coding Algorithm

Universal source coding algorithm

Independent of the source statistics.

 

Variable-to-fixed length coding

Source outputs are uniquely parsed into variable-length phrases and encoded with fixed-length code words.

 

Parsing the sequence

new phrase = concatenate(known phrase, new symbol)

 

Encoding

code word of new phrase = concatenate(dictionary location of known phrase, new symbol)

 

Example

sequence: 01000011000010100000101000001100000101000010

phrases: 0, 1, 00, 001, 10, 000, 101, 0000, 01, 010, 00001, 100, 0001, 0100, 0010

code words: 0000 0, 0000 1, 0001 0, 0011 1, 0010 0, 0011 0, 0101 1, 0110 0, 0001 1, 1001 0, 1000 1, 0101 0, 0110 1, 1010 0, 0100 0

$\Rightarrow$ With 15 phrases, the dictionary location of the known phrase requires 4 bits, plus 1 bit for the new symbol, so each phrase is encoded into a 5-bit code word.
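
A minimal Python sketch of the parsing step (illustration only; the function name `lz_parse` is mine, and the sketch reproduces the phrase list above but does not perform the bit-level packing of the code words):

```python
def lz_parse(bits: str):
    """Parse a binary string into phrases: each new phrase = known phrase + one new bit."""
    dictionary = {"": 0}          # phrase -> dictionary location (0 is the empty phrase)
    phrases = []
    current = ""
    for b in bits:
        if current + b in dictionary:
            current += b          # keep extending an already-known phrase
        else:
            phrases.append(current + b)
            dictionary[current + b] = len(dictionary)   # next free dictionary location
            current = ""
    return phrases

seq = "01000011000010100000101000001100000101000010"
print(lz_parse(seq))
# ['0', '1', '00', '001', '10', '000', '101', '0000', '01', '010',
#  '00001', '100', '0001', '0100', '0010']
```

(If the sequence ended in the middle of a phrase, the leftover bits would need separate handling; the example above happens to parse exactly.)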

 

Performance of source compression

For stationary ergodic sources, the length of the compressed sequence asymptotically approaches $nH$ (the entropy rate $H$ times the sequence length $n$) as $n$ grows.

$\Rightarrow$ Widely used in UNIX compression utilities and other programs such as zip and gzip.

 

Limitations

If the number of phrases is fixed, the dictionary can overflow. To avoid this, the encoder and decoder must remove unused entries and replace them with new ones.

 

12.4 Modeling of Communication Channels

About channel

A channel receives signals at its input and delivers signals at its output

1) at a later time (storage case)

2) or at another location (transmission case).

Examples of channels: coaxial cable, ionospheric propagation, free space, fiber-optic cables, and magnetic and optical disks.

Due to the presence of fading and noise, the input-output relation of a communication channel is generally a stochastic relation.

 

Discrete memoryless channel (DMC)

$P(\mathbf{y} \mid \mathbf{x}) = \prod_{i=1}^{n} P(y_i \mid x_i)$

where $\mathbf{x} \in \mathcal{X}^n$, and $\mathbf{y} \in \mathcal{Y}^n$

 

Binary symmetric channel (BSC)

A special case of a discrete memoryless channel.

$\epsilon = P(0 \mid 1) = P(1 \mid 0)$ and $1-\epsilon = P(0 \mid 0) = P(1 \mid 1)$

$\epsilon$ is called the crossover probability.

 

12.5 Channel Capacity

Channel capacity

The channel capacity $C$ is a fundamental limit on the rate at which information can be transmitted over a communication channel.

Reliable transmission (that is, transmission with an arbitrarily small error probability) is possible even over noisy channels as long as $R < C$.

 

Noisy channel coding theorem

The basic limitation that noise causes in a communication channel is not on the reliability of communication, but on its speed.

 

Example of a discrete channel

 

In a channel with inputs a, b, c, d whose outputs overlap (for example, receiving a could mean that either a or d was sent), errors are unavoidable.
But if we use only a and c, their outputs do not overlap:
- a or b → came from a
- c or d → came from c
This avoids confusion and allows reliable transmission.

 

Reliable communication over a noisy BSC

By the law of large numbers, a single input sequence of length $n$ is likely to produce an output sequence that differs from it in approximately $n\epsilon$ positions. The number of such output sequences is given by:

$\binom{n}{n\epsilon} = \frac{n!}{(n - n\epsilon)!(n\epsilon)!}$

where $n$ is the length of the sequence and $\epsilon$ is the crossover probability.

 

Using Stirling's approximation $n! \approx n^n e^{-n} \sqrt{2\pi n}$, we obtain
$\binom{n}{n\epsilon} \approx 2^{n H_b(\epsilon)}$

where $H_b$ is the binary entropy function.

 

The number of typical output sequences: $2^{n H(Y)}$

The maximum number of input sequences with nearly nonoverlapping outputs: $M = \frac{2^{n H(Y)}}{2^{n H_b(\epsilon)}} = 2^{n(H(Y) - H_b(\epsilon))}$

Transmission rate per channel use: $R = \frac{\log_2 M}{n} = H(Y) - H_b(\epsilon)$

 

Maximum reliable transmission rate (= channel capacity) over a BSC

In relation $R = H(Y) - H_b(\epsilon)$,

$\epsilon$ depends on the channel and we cannot control it.

$H(Y)$ is maximized when $P(X = 0) = P(X = 1) = 0.5$; in that case, $H(Y) = 1$.

proof)

$P(Y = y \mid X = x) = 
\begin{cases}
1 - \epsilon & \text{if } y = x \\
\epsilon & \text{if } y \neq x
\end{cases}$

$P(X = x) = 
\begin{cases}
p       & \text{if } x = 0 \\
1 - p   & \text{if } x = 1
\end{cases}$

$\Rightarrow P(Y = 0) = P(Y = 1) = 0.5$ when $p = 0.5$

 

$\therefore C = R_{\max} = 1 - H_b(\epsilon)$
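
A quick numerical sketch (Python, illustrative; the helper names `Hb` and `I_bsc` are mine) that evaluates $C = 1 - H_b(\epsilon)$ and cross-checks it by maximizing $I(X;Y) = H(Y) - H_b(\epsilon)$ over a grid of input distributions:

```python
import math

def Hb(p):
    """Binary entropy function in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def I_bsc(p, eps):
    """I(X;Y) for a BSC with P(X = 1) = p and crossover probability eps."""
    p_y1 = p * (1 - eps) + (1 - p) * eps     # P(Y = 1)
    return Hb(p_y1) - Hb(eps)                # H(Y) - H(Y|X)

eps = 0.1
C_formula = 1 - Hb(eps)
C_search = max(I_bsc(p / 1000, eps) for p in range(1001))
print(C_formula, C_search)   # both ~0.531 bits per channel use, attained at p = 0.5
```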

 

Noisy channel coding theorem

The capacity of a discrete memoryless channel: $C = \max_{P(x)} I(X; Y)$

Both $R$ (rate) and $C$ (capacity) are measured in bits per transmission or bits per channel use.

 

12.5.1 Gaussian Channel Capacity

Capacity of a discrete-time Gaussian channel with power constraint

A discrete-time Gaussian channel: writing the input, output, and noise as blocks of length $n$, $\mathbf{y} = \mathbf{x} + \mathbf{z}$

If $n$ is large, by the law of large numbers,

Input power constraint: $\frac{1}{n} \sum_{i=1}^n x_i^2 \leq P$

Noise power: $\frac{1}{n} \sum_{i=1}^{n} z_i^2 = \frac{1}{n} \sum_{i=1}^{n} |y_i - x_i|^2 \leq P_N$

Output power: $\frac{1}{n} \sum_{i=1}^{n} y_i^2 \leq P + P_N$

With high probability (as $n \to \infty$), the output sequences lie inside an $n$-dimensional hypersphere of radius $\sqrt{n(P + P_N)}$ centered at the origin.

If we denote the volume of an $n$-dimensional hypersphere by

$V_n = K_n R^n$

where $R$ denotes the radius and $K_n$ is independent of $R$,

the number of non-overlapping spheres we can pack is:

$M = \frac{K_n (n(P_N + P))^{n/2}}{K_n (nP_N)^{n/2}} = \left( 1 + \frac{P}{P_N} \right)^{n/2}$

 

The capacity of a discrete-time, additive white Gaussian noise (AWGN) channel with power constraint $P$ and noise power $P_N$ is:

$C = \frac{1}{n} \log M = \frac{1}{2} \log \left( 1 + \frac{P}{P_N} \right)$

 

Bandlimited Gaussian Waveform Channels

The total noise power in the bandwidth is given by: $P_N = \int_{-W}^{+W} \frac{N_0}{2} \, df = W N_0$

where the noise power spectral density is $N_0/2$.

Discrete-time capacity: $C = \frac{1}{2} \log \left( 1 + \frac{P}{N_0 W} \right) \quad \text{bits/transmission}$

Using the Nyquist sampling theorem, we can make $2W$ transmissions per second.

Continuous-time capacity: $C = W \log \left( 1 + \frac{P}{N_0 W} \right) \quad \text{bits/sec}$
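
A tiny numerical sketch (Python; the parameter values and helper names are made up) evaluating both forms of the capacity formula:

```python
import math

def capacity_per_use(P, P_N):
    """Discrete-time AWGN capacity in bits per transmission."""
    return 0.5 * math.log2(1 + P / P_N)

def capacity_bps(P, N0, W):
    """Band-limited AWGN capacity in bits per second."""
    return W * math.log2(1 + P / (N0 * W))

# Hypothetical values: P = 1 mW, N0 = 1e-12 W/Hz, W = 1 MHz  (SNR = P / (N0 W) = 1000)
P, N0, W = 1e-3, 1e-12, 1e6
print(capacity_per_use(P, N0 * W))   # ~4.98 bits/transmission
print(capacity_bps(P, N0, W))        # ~9.97e6 bits/sec = 2W x (bits/transmission)
```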

 

12.6 Bounds on Communication

Effect of power and bandwidth on channel capacity

Capacity of a bandlimited additive white Gaussian noise channel is given by:

$C = W \log \left( 1 + \frac{P}{N_0 W} \right)$

 

Increasing the input power $P$

When we have more power to spend, we can use more input levels that are far apart, enabling higher transmission rates.
However, the increase in capacity is logarithmic and slow with respect to power.
To keep input levels sufficiently spaced (Δ) while increasing their number, we must use higher amplitude signals, which requires much more power.
Still, the channel capacity can be increased to any value by increasing the input power (in contrast, increasing the bandwidth has a performance limit).

 

Increasing the bandwidth $W$

This has two contrasting effects:
- we can transmit more samples per second, increasing the transmission rate,
- but higher bandwidth means higher noise power at the receiver, which deteriorates its performance.
By increasing the bandwidth alone, we cannot increase the capacity to any desired value.

$\because \lim_{W \to \infty} W \log\left(1 + \frac{P}{N_0 W} \right) = \frac{P}{N_0} \log e \approx 1.44 \cdot \frac{P}{N_0}$
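
A short numerical illustration of this saturation (Python; the values of $P$ and $N_0$ are assumed):

```python
import math

P, N0 = 1e-3, 1e-9                       # hypothetical signal power and noise PSD
limit = (P / N0) * math.log2(math.e)     # ~1.44 * P/N0

for W in (1e3, 1e4, 1e5, 1e6, 1e7, 1e8):
    C = W * math.log2(1 + P / (N0 * W))
    print(f"W = {W:10.0f} Hz  ->  C = {C:12.1f} bits/sec")
print(f"limit as W -> infinity: {limit:.1f} bits/sec")
```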

 

Bandwidth efficiency and power efficiency

$R < W \log \left( 1 + \frac{P}{N_0 W} \right)$

Bandwidth efficiency (or spectral bit rate): $r = \frac{R}{W} \quad \frac{\text{bit/sec}}{\text{Hz}} = \text{bit}$

Energy per bit: $\mathcal{E}_b = \frac{P}{R} \quad \frac{\text{energy/sec}}{\text{bit/sec}} = \text{energy/bit}$

Power efficiency: $\frac{\mathcal{E}_b}{N_0} \quad \frac{\text{signal energy/bit}}{\text{noise energy}} = \text{SNR/bit}$

$r < \log \left( 1 + r \frac{\mathcal{E}_b}{N_0} \right)$ or $\frac{\mathcal{E}_b}{N_0} > \frac{2^r - 1}{r}$

 

Two important parameters in a communication system: $r$, $\frac{\mathcal{E}_b}{N_0}$

- With a higher value of $r$, the system is more bandwidth efficient.

- With a lower value of $\frac{\mathcal{E}_b}{N_0}$ required to achieve a given error probability, the system is more power efficient.

 

Below the curve: reliable communication is possible.

Above the curve: reliable communication is not possible.

The closer a system's point lies to the curve, the closer its performance is to optimal.

$\lim_{r \to 0} \frac{2^r - 1}{r} = \ln 2 \approx 0.693$ is an absolute minimum for reliable communication.

$\therefore \frac{\mathcal{E}_b}{N_0} > 0.693 \approx -1.6\text{ dB}$
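
A small Python check (illustrative; the helper name `min_ebn0` is mine) of the minimum required $\frac{\mathcal{E}_b}{N_0}$ as a function of the spectral bit rate $r$, including the $-1.6$ dB limit as $r \to 0$:

```python
import math

def min_ebn0(r):
    """Minimum Eb/N0 (linear scale) for reliable communication at spectral bit rate r."""
    return (2 ** r - 1) / r

for r in (0.001, 0.1, 1, 2, 4, 8):
    ebn0 = min_ebn0(r)
    print(f"r = {r:5}:  Eb/N0 > {ebn0:8.3f}  ({10 * math.log10(ebn0):6.2f} dB)")

# r -> 0 limit: ln 2 ~ 0.693, i.e. about -1.6 dB
print(math.log(2), 10 * math.log10(math.log(2)))
```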

 

Power-limited case

When $r \ll 1$: bandwidth is large and the main concern is limitation on power

Example: orthogonal, biorthogonal, and simplex schemes

 

Bandwidth-limited case

When $r \gg 1$: the bandwidth of the channel is small

Example: PAM, QAM, and PSK

 

The information transmission theorem

If we want to transmit a source $U$ reliably via a channel with capacity $C$, we require that: $H(U) < C$

$\because$ From the Source Coding Theorem, we need $R \geq H(U)$;

from the Channel Coding Theorem, we need $R < C$. Combining the two gives $H(U) \leq R < C$.