ADVANCED PICTURE CODING
Amarnath Dutta (B. E. V E. C. Roll No. 4507)
Desai Sameer H. (B. E. V E. C. Roll No. 4514)
Pandya Nishit D. (B.E. V E. C. Roll No. 4523)
(L.D.College Of Engineering, Ahmedabad - 380015)
Living in the era of the Internet and moving towards a paperless
society, we see that a distributed form of information leads to a more
inferential and unconstrained exchange of ideas. This in turn leads to information
networks, where multimedia is emerging as an essential tool for efficient
information exchange. Unfortunately, multimedia applications require more
storage space and larger bandwidths during transmission. These
requirements lead to multimedia compression.
Though computer graphics present a colorful picture, the world of computer
graphics formats (the standards in which a graphic file is stored) is a confused
mass of incompatibility between various standards and their environments.
Basically, all standards require two algorithms: one for compressing
the data at the source and the other for decompressing it at the destination,
referred to as the encoding and decoding algorithms respectively. The essentials of
these algorithms are compatibility, faithful reproduction, a substantial
amount of compression, and user-friendliness.
IDEA BEHIND THE COMPRESSION
WHY TO COMPRESS?
Video compression mainly consists of still image compression and full motion
compression. A still picture is stored in a pixel format: during compression it is
broken down into thousands of small points called pixels, each with its
individual color and brightness. The information related to each pixel, namely
its chrominance, position and luminance, is stored in a file upon which the encoding
algorithm is applied. For a full motion picture, digitizing and storing even a 10-second
clip of full motion video in a computer requires the transfer of an enormous amount
of data in a very short period of time. Reproducing just one frame at 24 bits per pixel
requires almost 1MB of computer data, and hence a 30-second video clip will fill up
a 1.2GB hard disk. The additional information that a motion picture requires is
the frame rate and synchronization timings. The algorithm for motion picture
encoding is mathematically designed to handle such a format synchronously.
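The storage figures above can be checked with a little arithmetic. The sketch below assumes a 640 x 480 frame at 30 frames per second; neither value is stated in the text, so treat both as illustrative assumptions.

```python
def raw_frame_bytes(width, height, bits_per_pixel=24):
    """Uncompressed size of one frame in bytes."""
    return width * height * bits_per_pixel // 8


def raw_clip_bytes(width, height, fps, seconds, bits_per_pixel=24):
    """Uncompressed size of a video clip in bytes (picture data only, no audio)."""
    return raw_frame_bytes(width, height, bits_per_pixel) * fps * seconds


# Assumed frame size and frame rate (not given in the text).
frame = raw_frame_bytes(640, 480)
clip = raw_clip_bytes(640, 480, fps=30, seconds=30)

print(f"one frame : {frame / 1e6:.2f} MB")
print(f"30 s clip : {clip / 1e6:.0f} MB")
```

Under these assumptions one frame is a little under 1MB and a 30-second clip runs to several hundred megabytes, which is the order of magnitude quoted above.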
Real-time video compression
has faced a speed bottleneck because typical hard disk drives transfer
data at only about 1MB per second and quad-speed CD-ROMs at a paltry 600KB per
second. This technology bottleneck is currently overcome by digital
video compression techniques. Real-time video compression algorithms such as
JPEG, MPEG, P*64, DVI-INDEO, and C-CUBE are now available with compression
ratios that range from 50:1 to 200:1. JPEG, MPEG and P*64 compression
techniques use the Discrete Cosine Transform (DCT), an encoding technique that
takes into account the human eye's limited ability to detect color and image distortions.
TECHNIQUE BEHIND COMPRESSION
While the depth of color directly defines the file size, most formats offer
mechanisms to reduce file sizes, following the logic that in a
picture with 256 shades it is quite likely that two or more neighboring pixels
have the same color. The Lempel-Ziv-Welch (LZW) procedure uses this
redundancy to reduce the size of the graphic file. The
information for identical color data of neighboring pixels is stored only once,
thus avoiding repetitive storage.
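The idea of storing identical neighboring pixels only once can be illustrated with run-length encoding. Note that this is a deliberately simpler technique than LZW (which builds a dictionary of repeated sequences); it is shown here only as a minimal sketch of the principle, with made-up pixel values.

```python
def rle_encode(pixels):
    """Collapse runs of identical neighboring values into [value, count] pairs."""
    runs = []
    for p in pixels:
        if runs and runs[-1][0] == p:
            runs[-1][1] += 1          # same color as the previous pixel: extend the run
        else:
            runs.append([p, 1])       # new color: start a new run
    return runs


def rle_decode(runs):
    """Expand [value, count] pairs back into the original pixel sequence."""
    out = []
    for value, count in runs:
        out.extend([value] * count)
    return out


# One hypothetical row of 8-bit palette indices.
row = [7, 7, 7, 7, 12, 12, 200, 200, 200, 7]
encoded = rle_encode(row)
print(encoded)
```

Ten pixels collapse into four runs here, and decoding restores the row exactly, so nothing is lost.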
Another method, known as "lossy compression", makes
use of the inability of the human eye to perceive different levels of brightness
in close proximity and its limited ability to distinguish between colors in a small
area. The JPEG format uses this method to offer superior compression. The number of
pixels that can be combined depends on the compression factor. When a file is
compressed to about 15% of its original size, the loss of picture
quality is hardly evident even after lossy compression through the JPEG format. The
rule of thumb: the larger the output file, the lower the perceivable loss.
JPEG: A STANDARD THAT WORKS
JPEG uses a lossy compression algorithm that operates in three
successive stages: the Discrete Cosine Transform (DCT), quantization, and coding.

These steps combine to form a powerful compressor, reducing an image to as little as 15% of
its original size while losing only a little of the original fidelity. The first block,
the Discrete Cosine Transform (DCT), belongs to a class of mathematical operations that includes
the well-known Fast Fourier Transform (FFT) and many others, which transform digital
audio/video samples into another form of representation.
The basic input block, typically taken from a gray scale image, is fed to the DCT
algorithm, creating an output DCT matrix from the input pixel matrix which shows
the spectral compression characteristics the DCT is supposed to create. The
drastic action taken to reduce the number of bits required for storage of a DCT matrix
is referred to as "Quantization", which is simply a process of
reducing the number of bits required to store an integer value by reducing its
precision. The JPEG algorithm implements this with a Quantization matrix, in which the
corresponding value gives a quantum value for every
element position in the DCT matrix. The "DC coefficient" is located
at position (0,0), at the upper left corner of the matrix. The precision of each
integer is reduced further as we move away from the DC coefficient at
the origin, because the farther an element is from (0,0), the less it contributes to the
graphical image.
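A minimal sketch of the first two stages follows. The DCT is the standard 8 x 8 DCT-II written in its direct (slow) form for clarity. The quantum formula 1 + (1 + u + v) * quality is only an illustrative assumption that grows with distance from (0,0); it is not the quantization table of the JPEG standard, and the sample block is invented.

```python
import math

N = 8  # JPEG works on 8x8 blocks


def dct_2d(block):
    """Forward 2-D DCT-II of an NxN block (direct O(N^4) form, for clarity)."""
    def c(k):
        return math.sqrt(0.5) if k == 0 else 1.0

    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            s = 0.0
            for x in range(N):
                for y in range(N):
                    s += (block[x][y]
                          * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                          * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            out[u][v] = 0.25 * c(u) * c(v) * s
    return out


def quantize(coeffs, quality=2):
    """Divide each coefficient by a quantum that grows with distance from (0,0).

    The quantum formula is an assumption made for this sketch.
    """
    return [[round(coeffs[u][v] / (1 + (1 + u + v) * quality))
             for v in range(N)] for u in range(N)]


# A smooth gray scale gradient, typical of natural image content.
block = [[100 + 2 * x + 3 * y for y in range(N)] for x in range(N)]
coeffs = dct_2d(block)
quantized = quantize(coeffs)
zeros = sum(q == 0 for row in quantized for q in row)

print("DC coefficient:", round(coeffs[0][0], 3))
print("zero coefficients after quantization:", zeros, "of", N * N)
```

For a smooth block like this one, nearly all of the energy ends up in the DC coefficient and a few low-frequency terms, and most of the remaining entries quantize to zero.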
CODING
The final step in the JPEG process is coding the quantized image. The JPEG coding
phase combines three different steps to compress the image. The first changes
the DC coefficient at (0,0) from an absolute value to a relative value. Since
adjacent blocks in an image exhibit a high degree of correlation, coding the DC
element as the difference from the previous DC element typically produces a very
small number. Next, the coefficients of the image are arranged in the "zigzag
sequence". Then they are encoded using two different mechanisms. The first
is run-length encoding of zero values. The second is what JPEG calls
"entropy coding". This involves sending out the coefficient codes
using either Huffman codes or arithmetic coding. The reason that the JPEG
algorithm compresses so effectively is that a large number of coefficients in
the DCT image are truncated to zero during coefficient quantization.
Color images are generally composed of three components, such as RED, GREEN and
BLUE (RGB) or the luminance and chrominance of YUV. In this case JPEG treats
the image as if it were actually three separate images. Hence an RGB image would
first have its red component compressed, followed by compression of green and
blue components.
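The coding steps described above (relative DC value, zigzag ordering, run-length encoding of zeros) can be sketched as follows. The entropy coding stage is left out, and the "EOB" end-of-block marker and the sample coefficients are assumptions made for this illustration, not the exact JPEG bitstream format.

```python
def zigzag_order(n=8):
    """Return (row, col) pairs in zigzag order for an n x n block."""
    order = []
    for s in range(2 * n - 1):
        # All positions on the anti-diagonal row + col == s, top-right first.
        diag = [(i, s - i) for i in range(n) if 0 <= s - i < n]
        # Even diagonals are walked bottom-left to top-right, odd ones the reverse.
        order.extend(reversed(diag) if s % 2 == 0 else diag)
    return order


def run_length_zeros(values):
    """Encode values as (zero_run, value) pairs; trailing zeros become 'EOB'."""
    pairs, run = [], 0
    for v in values:
        if v == 0:
            run += 1
        else:
            pairs.append((run, v))
            run = 0
    if run:
        pairs.append("EOB")
    return pairs


def encode_block(block, prev_dc):
    """Return (DC difference, run-length coded AC coefficients) for one block."""
    scan = [block[r][c] for r, c in zigzag_order(len(block))]
    return scan[0] - prev_dc, run_length_zeros(scan[1:])


# Hypothetical quantized block: a few low-frequency values, the rest zero.
block = [[0] * 8 for _ in range(8)]
block[0][0], block[0][1], block[1][0], block[1][1] = 52, -3, 2, 1

result = encode_block(block, prev_dc=50)
print(result)
```

The 64 coefficients of the block shrink to one small DC difference and three (run, value) pairs, which is the form handed to the entropy coder.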
WHAT IS MPEG?
To the real world, MPEG (Moving
Picture Experts Group) is a generic means of compactly representing
digital video and audio signals for consumer distribution. The basic idea is to
transform a stream of discrete samples into a bitstream of tokens which
takes less space, but is just as filling to the eye (…or ear). This
"transformation," or better representing, exploits perceptual and even
some actual statistical redundancies. The orthogonal dimensions of Video and
Audio streams can be further linked with the Systems layer---MPEG's own means of
keeping the data types synchronized and multiplexed in a common serial bitstream.
The essence of MPEG is its syntax: the little
tokens that make up the bitstream. MPEG's semantics then tell you (if you happen
to be a decoder, that is) how to inverse represent the compact tokens back into
something resembling the original stream of samples. These semantics are merely
a collection of rules (which people like to call algorithms, but that would
imply there is a mathematical coherency to a scheme cooked up by trial and error….).
These rules are highly reactive to combinations of bitstream elements set in
headers and so forth.
PRE MPEG:
Before providence gave us MPEG, there was the looming threat of world domination
by proprietary standards cloaked in syntactic mystery. With lossy compression
being such an inexact science (which always boils down to visual tweaking and
implementation tradeoffs), you never know what's really behind any such scheme
(other than a lot of marketing hype).
A respected method developed by the old Sarnoff Princeton NJ
research group was purchased in 1988 by our friend Intel. (The August 1988 issue
of Stereo Review discusses the early days of compact disc digital video).
It then became known as DVI, or Digital Video Interactive.
Seeing this threat… that is,
need for world interoperability, the Fathers of MPEG sought the help of their
colleagues to form a committee to standardize a common means of representing
video and audio (a la DVI) onto compact discs…. and maybe it would be useful
for other things too.
MPEG borrowed significantly from JPEG and, more directly,
H.261.
Seeing how this MPEG thing was such a good deal, and not
wanting to be left behind in the industry, participants amassed, reaching a peak
of more than 200 people by 1992.
By the end of the third year (1990), a syntax emerged, which
when applied to represent SIF-rate video and compact disc-rate audio at a
combined bitrate of 1.5 Mbit/sec, approximated the pleasure-filled viewing
experience offered by the standard VHS format.
After demonstrations proved that the syntax was generic
enough to be applied to bit rates and sample rates far higher than the original
primary target application ("Hey, it actually works!"), a second phase
(MPEG-2) was initiated within the committee to define a syntax for efficient
representation of broadcast video, or SDTV as it is now known (Standard
Definition Television), not to mention the side benefits: frequent flier miles,
impress friends, job security, obnoxious party conversations.
Yet efficient representation of interlaced (broadcast) video
signals was more challenging than the progressive (non-interlaced) signals
thrown at MPEG-1. Similarly, MPEG-1 audio was capable of only directly
representing two channels of sound (although Dolby Surround Sound can be mixed
into the two channels like any other two channel system).
MPEG-2 would therefore introduce a scheme to decorrelate
multichannel discrete surround sound audio signals, exploiting the moderately
higher redundancy factor in such a scenario. Of course, proprietary schemes such
as Dolby AC-3 have become more popular in practice.
The need for a third phase (MPEG-3) was anticipated way
back in 1991 for High Definition Television, although it was discovered during
late 1992 and 1993 that the MPEG-2 syntax simply scaled with the bit rate,
obviating the third phase. MPEG-4 was launched in late 1992 to explore the
requirements of a more diverse set of applications (although originally its goal
seemed very much like that of the ITU-T SG15 group, which produced the new low-bitrate
videophone standard---H.263).
Today, MPEG (video and systems) is the exclusive syntax of the
United States Grand Alliance HDTV specification, the European Digital Video
Broadcasting group, and the Digital Versatile Disc (DVD).
WHAT IS MPEG VIDEO SYNTAX?
MPEG video syntax provides an efficient way to represent image
sequences in the form of more compact coded data. The language of the coded bits
is the "syntax." For example, a few tokens amounting to only, say, 100
bits can represent an entire block of 64 samples rather transparently ("you
can't tell the difference") which would otherwise normally consume (64*8), or
512 bits. MPEG also describes a decoding (reconstruction) process where the
coded bits are mapped from the compact representation into the original,
"raw" format of the image sequence. For example, a flag in the coded
bitstream signals whether the following bits are to be decoded with a DCT
algorithm or with a prediction algorithm. The algorithms comprising the decoding
process are regulated by the semantics defined by MPEG. This syntax can be
applied to exploit common video characteristics such as spatial redundancy,
temporal redundancy, uniform motion, spatial masking, etc.
MPEG-2 can represent interlaced or progressive video
sequences, whereas MPEG-1 is strictly meant for progressive sequences since the
target application was Compact Disc video coded at 1.2 Mbit/sec.
MPEG-2 changed the meaning behind the
aspect_ratio_information variable, while significantly reducing the number of
defined aspect ratios in the table. In MPEG-2, aspect_ratio_information refers
to the overall display aspect ratio (e.g. 4:3, 16:9), whereas in MPEG-1, the
ratio refers to the particular pixel. The reduction in the entries of the aspect
ratio table also helps interoperability by limiting the number of possible modes
to a practical set, much like frame_rate_code limits the number of display frame
rates that can be represented.
Optional picture header variables called
display_horizontal_size and display_vertical_size can be used to code unusual
display sizes.
Frame_rate_code in MPEG-2 refers to the intended display
rate, whereas in MPEG-1 it referred to the coded frame rate. In film source
video, there are often 24 coded frames per second. Prior to bitstream coding, a
good encoder will eliminate the redundant 6 frames or 12 fields from a 30
frame/sec video signal which encapsulates an inherently 24 frame/sec video
source. The MPEG decoder or display device will then repeat frames or fields to
recreate or synthesize the 30 frame/sec display rate. In MPEG-1, the decoder
could only infer the intended frame rate, or derive it based on the Systems
layer time stamps. MPEG-2 provides specific picture header variables called
repeat_first_field and top_field_first which explicitly signal which frames or
fields are to be repeated, and how many times.
To address the concern of software decoders which may operate
at rates lower or different than the common television rates, two new variables
in MPEG-2 called frame_rate_extension_d and frame_rate_extension_n can be
combined with frame_rate_code to specify a much wider variety of display frame
rates. However, in the current set of defined profiles and levels, these two
variables are not allowed to change the value specified by frame_rate_code.
Future extensions or Profiles of MPEG may enable them.
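How the two extension variables combine with frame_rate_code can be sketched as below. Both the formula (the base rate multiplied by (n + 1) / (d + 1)) and the code table are assumptions brought in from the MPEG-2 video specification as it is commonly described; they are not given in this text and should be checked against the standard before being relied upon.

```python
from fractions import Fraction

# Assumed frame_rate_code table (code -> frames per second).
FRAME_RATE_VALUE = {
    1: Fraction(24000, 1001),
    2: Fraction(24),
    3: Fraction(25),
    4: Fraction(30000, 1001),
    5: Fraction(30),
    6: Fraction(50),
    7: Fraction(60000, 1001),
    8: Fraction(60),
}


def display_frame_rate(frame_rate_code, frame_rate_extension_n=0,
                       frame_rate_extension_d=0):
    """Intended display rate, assuming rate = base * (n + 1) / (d + 1)."""
    base = FRAME_RATE_VALUE[frame_rate_code]
    return base * (frame_rate_extension_n + 1) / (frame_rate_extension_d + 1)


print(display_frame_rate(5))          # extensions at zero leave the base rate unchanged
print(display_frame_rate(5, 0, 1))    # halves the base rate
print(float(display_frame_rate(4)))   # the NTSC-style fractional rate
```

With both extensions left at zero, as the current profiles and levels require, the result is simply the rate named by frame_rate_code.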
In interlaced sequences, the coded macroblock height (mb_height)
of a picture must be a multiple of 32 pixels, while the width, like MPEG-1, is a
coded multiple of 16 pixels. A discrepancy between the coded width and height of
a picture and the variables horizontal_size and vertical_size, respectively,
occurs when either variable is not an integer multiple of macroblocks. All
pixels must be coded within macroblocks, since there cannot be such a thing as
"fractional" macroblocks.
Never intended for display, these "overhang" pixels
or lines exist along the right and bottom edges of the coded picture. The sample
values within these trims can be arbitrary, but they can affect the values of
samples within the current picture, and especially future coded pictures (since
all coded samples are fair game for the prediction process).
To drive this to the point of nausea: in the current picture,
pixels which reside within the same 8x8 block as the "overhang" pixels
are affected by the ripples of DCT quantization error. In future coded pictures,
their energy can propagate anywhere within an image sequence as a result of
motion compensated prediction. An encoder should fill in values which are easy
to code, and should probably avoid creating motion vectors which would cause the
Motion Compensated Prediction stage to extract samples from these areas. To help
avoid any confusion, the application should probably select horizontal_size and
vertical_size that are already multiples of 16 (or 32 in the vertical case of
interlaced sequences).
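The rounding rule described above can be sketched as a small helper that returns the coded picture size and the number of "overhang" pixels and lines. The function name and the example picture sizes are hypothetical.

```python
def coded_size(horizontal_size, vertical_size, interlaced):
    """Round picture dimensions up to whole macroblocks.

    Width is rounded up to a multiple of 16; height to a multiple of 32 for
    interlaced sequences and 16 otherwise. Returns
    (coded_width, coded_height, overhang_pixels, overhang_lines).
    """
    v_unit = 32 if interlaced else 16
    coded_width = -(-horizontal_size // 16) * 16        # ceiling division
    coded_height = -(-vertical_size // v_unit) * v_unit
    return (coded_width, coded_height,
            coded_width - horizontal_size, coded_height - vertical_size)


print(coded_size(720, 480, interlaced=True))     # already aligned: no overhang
print(coded_size(720, 486, interlaced=True))     # 486 lines do not fit evenly
print(coded_size(1920, 1080, interlaced=False))  # 1080 is not a multiple of 16
```

Choosing sizes for which the last two returned values are zero avoids the overhang problem altogether.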
GROUP OF PICTURES:
The concept of the "Group of Pictures" layer does not exist
in MPEG-2. It is an optional header useful only for establishing a SMPTE time
code base or for indicating that certain B pictures at the beginning of an
edited sequence comprise a broken_link. This occurs when the current B picture
requires prediction from a forward reference frame (previous in time to the
current picture) that has been removed from the bitstream by an editing process. In
MPEG-1, the Group of Pictures header is mandatory, and must follow a sequence
header.
PICTURE LAYER:
In MPEG-2, a frame may be coded progressively or interlaced, signaled
by the progressive_frame variable. In interlaced frames (progressive_frame==0),
frames may then be coded as either a frame picture (picture_structure==frame) or
as two separately coded field pictures (picture_structure==top_field or
picture_structure==bottom_field).
Progressive frames are a logical choice for video material
which originated from film, where all "pixels" are integrated or
captured at the same time instant. Most electronic cameras today capture
pictures in two separate stages: a top field consisting of all "odd
lines" of the picture is captured at nearly the same time instant, followed by
a bottom field of all "even lines." Frame pictures provide the option
of coding each macroblock locally as either field or frame. An encoder may
choose field pictures to save memory storage or reduce the end-to-end
encoder-decoder delay by one field period.
Repeat_first_field was introduced in MPEG-2 to signal that a
field or frame from the current frame is to be repeated for purposes of frame
rate conversion (as in the 30 Hz display vs. 24 Hz coded example above). On
average in a 24 frame/sec coded sequence, every other coded frame would signal
the repeat_first_field flag. Thus the 24 frame/sec (or 48 field/sec) coded
sequence would become a 30 frame/sec (60 field/sec) display sequence. This
process has been known for decades as 3:2 Pulldown. Most movies seen on NTSC
displays since the advent of television have been displayed this way. Only
within the past decade has it become possible to interpolate motion to create 30
truly unique frames from the original 24. Since the repeat_first_field flag is
independently determined in every frame structured picture, the actual pattern
can be irregular (it doesn't have to be every other frame literally). An
irregularity would occur during a scene cut, for example.
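The field counting behind 3:2 pulldown can be sketched in a few lines. This is a simplification: it only counts fields per coded frame, assuming a frame shows two fields normally and three when repeat_first_field is set, and it ignores field order (top_field_first).

```python
def displayed_fields(repeat_first_field_flags):
    """Fields shown per coded frame: 3 if repeat_first_field is set, else 2."""
    return [3 if rff else 2 for rff in repeat_first_field_flags]


# One second of 24 frame/sec film, with the flag set on every other frame.
flags = [i % 2 == 1 for i in range(24)]
fields = displayed_fields(flags)
total_fields = sum(fields)

print(fields[:4])                   # the repeating 2, 3, 2, 3 cadence
print(total_fields, "fields/sec")   # 60 fields, i.e. 30 interlaced frames
```

An irregular flag pattern, for example around a scene cut, simply changes the list of flags; the counting rule stays the same.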
METHOD TO OBTAIN HIGH COMPRESSION RATIOS:
MPEG video is often quoted as achieving compression ratios over 100:1,
when in reality the "sweet spot" rests between 8:1 and 30:1.
Here's how the fabled "greater than 100:1"
reduction ratio is derived for the popular Compact Disc Video (White Book)
bitrate of 1.15 Mbit/sec.
Step 1. Start with the oversampled rate!
Most MPEG video sources originate at a higher sample rate
than the "target" sample rate encoded into the final MPEG bitstream.
The most popular studio signal, known canonically as "D-1" or "CCIR
601" digital video, is coded at 270 Mbit/sec.
The constant, 270 Mbit/sec, can be derived as follows:
Luminance (Y): 858 samples/line x 525 lines/frame x 30 frames/sec x 10 bits/sample ~= 135 Mbit/sec
R-Y (Cr): 429 samples/line x 525 lines/frame x 30 frames/sec x 10 bits/sample ~= 68 Mbit/sec
B-Y (Cb): 429 samples/line x 525 lines/frame x 30 frames/sec x 10 bits/sample ~= 68 Mbit/sec
Total: 27 million samples/sec x 10 bits/sample = 270 Mbit/sec
So, we start with a compression ratio of: 270/1.15... an
amazing 235:1 !!!!!
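The arithmetic of Step 1 can be reproduced directly from the sample counts quoted above. The helper name is hypothetical; the numbers are those of the text, with the nominal 30 frames/sec.

```python
TARGET_MBIT = 1.15  # White Book video bitrate quoted in the text


def mbit_per_sec(samples_per_line, lines, fps, bits_per_sample):
    """Raw bit rate of one component in Mbit/sec."""
    return samples_per_line * lines * fps * bits_per_sample / 1e6


y = mbit_per_sec(858, 525, 30, 10)    # luminance
cr = mbit_per_sec(429, 525, 30, 10)   # R-Y
cb = mbit_per_sec(429, 525, 30, 10)   # B-Y
total = y + cr + cb

print(f"Y  = {y:.1f} Mbit/sec")
print(f"Cr = {cr:.1f} Mbit/sec, Cb = {cb:.1f} Mbit/sec")
print(f"total = {total:.1f} Mbit/sec, ratio = {total / TARGET_MBIT:.0f}:1")
```

The total lands at roughly 270 Mbit/sec, which against 1.15 Mbit/sec gives the headline ratio of about 235:1.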
Step 2. Throw in the blanking intervals!
Only 720 out of the 858 luminance samples per line contain
active picture information. In fact, the debate over the true number of active
samples is the trigger for many hair-pulling cat-fights at TV engineering
seminars and conventions, so it is healthier to say that the number lies
somewhere between 704 and 720. Likewise, only 480 lines out of the 525 lines
contain active picture information. Again, the actual number is somewhere
between 480 and 496. For the purposes of MPEG-1's and MPEG-2's famous
conformance points (Constrained Parameters Bitstreams and Main Level,
respectively), the number shall be 704 samples x 480 lines for luminance, and
352 samples x 480 lines for each of the two chrominance pictures. Recomputing
the source rate, we arrive at:
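Y: 720 samples/line x 480 lines x 30 fps x 10 bits/sample ~= 104 Mbit/sec
C: 2 components x 360 samples/line x 480 lines x 30 fps x 10 bits/sample ~= 104 Mbit/sec
Total: ~ 207 Mbit/sec
(These figures use the full 720 luminance and 360 chrominance samples per line;
with 704 and 352 samples the total would be about 203 Mbit/sec.)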
The ratio (207/1.15) is now only 180:1
Step 3. Let's Include higher bits/sample!
The MPEG sample precision is 8 bits. There has been some talk
of a 10-bit extension, but that's on hold (as of April 2, 1996, 10:35 PM GMT).
Studio equipment often quantizes samples with 10 bits of accuracy, because some
engineers and artists feel the extra dynamic range is needed in the iterative
content production loop.
Getting rid of this sneaky factor, the ratio is now
deflated to only 180 * (8/10), or 144:1
Step 4. Ok then, include higher chroma sampling ratio!
The famous CCIR-601 studio signal represents the chroma
signals (Cb, Cr) with half the horizontal sample density as the luminance
signal, but with full vertical "resolution." This particular ratio of
subsampled components is known as 4:2:2. However, MPEG-1 and MPEG-2 Main Profile
specify the exclusive use of the 4:2:0 format, deemed sufficient for consumer
applications, where both chrominance signals have exactly half the horizontal
and vertical resolution as luminance (the MPEG Studio Profile, however, centers
around the 4:2:2 macroblock structure). Seen from the perspective of pixels
being comprised of samples from multiple components, the 4:2:2 signal can be
expressed as having an average of 2 samples per pixel (1 for Y, 0.5 for Cb, and
0.5 for Cr). Thanks to the reduction in the vertical direction (resulting in a
352 x 240 chrominance frame), the 4:2:0 signal would, in effect, have an average
of 1.5 samples per pixel (1 for Y, and 0.25 for Cb and Cr each). Our source
video bit rate may now be recomputed as:
720 pixels x 480 lines x 30 fps x 8 bits/sample x 1.5
samples/pixel = 124 Mbit/sec
... and the ratio is now 108:1.
Step 5. Include pre-subsampled image size… yeah, that's the ticket!
As a final act of pre-compression, the CCIR 601 frame is
converted to the SIF frame by a subsampling of 2:1 in both the horizontal and
vertical directions.... or 4:1 overall. Quality horizontal subsampling can be
achieved by the application of a simple FIR filter (7 or 4 taps, for example),
and vertical subsampling by either dropping every other field (in effect,
dropping every other line) or again by an FIR filter (regulated by an interfield
motion detection algorithm). Our ratio now becomes:
352 pixels x 240 lines x 30 fps x 8 bits/sample x 1.5
samples/pixel ~= 30 Mbit/sec !!
.. and the ratio is now only 26:1
Thus, the true A/B comparison should be between the source
sequence at the 30 Mbit/sec stage just prior to encoding, which is also the
actual specified sample rate in the MPEG bitstream (sequence_header()), and the
reconstructed sequence produced from the 1.15 Mbit/sec coded bitstream. If you
can achieve compression through subsampling alone, it means you never really
needed the extra samples in the first place.
Step 6. Don't forget 3:2 pulldown!
A majority of high budget
programs originate from film, not video. Most of the movies encoded onto Compact
Disc Video were in fact captured and edited at 24 frames/sec. So, in such an
image sequence, 6 out of the 30 frames displayed on a television monitor (30
frame/sec or 60 field/sec is standard NTSC rate in North America and Japan) are
in fact redundant and need not be coded into the MPEG bitstream. This naturally
leads us to the shocking discovery that the actual source bit rate has really
been 24 Mbit/sec all along (24 fps/30 fps * 30 Mbit/sec), and the compression
ratio only a mere 21:1 !!! ("phone the police!").
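The deflation of the ratio through Steps 2 to 6 can be replayed in one short script. It assumes 720 active luminance samples per line in the full-size steps, since that is the value that reproduces the quoted figures, and 2.0 or 1.5 samples per pixel for 4:2:2 and 4:2:0 respectively. The step labels and helper name are made up for this sketch.

```python
TARGET_MBIT = 1.15  # coded bitrate the source rate is compared against


def source_rate(width, height, fps, bits_per_sample, samples_per_pixel):
    """Uncompressed source rate in Mbit/sec."""
    return width * height * fps * bits_per_sample * samples_per_pixel / 1e6


steps = [
    ("Step 2: active area, 10 bits, 4:2:2", source_rate(720, 480, 30, 10, 2.0)),
    ("Step 3: 8 bits/sample",               source_rate(720, 480, 30, 8, 2.0)),
    ("Step 4: 4:2:0 chroma",                source_rate(720, 480, 30, 8, 1.5)),
    ("Step 5: SIF 352 x 240",               source_rate(352, 240, 30, 8, 1.5)),
    ("Step 6: 24 frames/sec film",          source_rate(352, 240, 24, 8, 1.5)),
]
ratios = [round(rate / TARGET_MBIT) for _, rate in steps]

for (label, rate), ratio in zip(steps, ratios):
    print(f"{label:38s} {rate:6.1f} Mbit/sec  {ratio:3d}:1")
```

Each line removes one source of inflation, taking the ratio from about 180:1 down to about 21:1.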
Even at the seemingly modest 20:1 ratio,
"discrepancies" (in polite conversational terms) will appear between
the 24 Mbit/sec source sequence and the reconstructed sequence. Only
conservative ratios in the neighborhood of 12:1 and 8:1 have demonstrated true
transparency for sequences with complex spatial-temporal characteristics (i.e.
rapid, divergent motion and sharp edges, textures, etc.). However, if the video
is carefully encoded by means of pre-processing and intelligent distribution of
bits (no, really), higher ratios can be made to "appear at least
artifact-free."