
Sameer Desai, Papers written or presented by me... ADVANCED PICTURE CODING

 

ADVANCED PICTURE CODING

Amarnath Dutta (B.E. V E.C. Roll No. 4507)
Desai Sameer H. (B.E. V E.C. Roll No. 4514)
Pandya Nishit D. (B.E. V E.C. Roll No. 4523)
(L.D.College Of Engineering, Ahmedabad - 380015)

Living in the Internet era and moving towards a paperless society, we see that distributing information leads to a more inferential and unconstrained exchange of ideas. This gives rise to information networks, in which multimedia is emerging as an essential tool for efficient information exchange. Unfortunately, multimedia applications require large amounts of storage space and large bandwidths during transmission, and these requirements are what drive multimedia compression.

Though computer graphics present a colorful picture, the world of computer graphics formats (the standard in which a graphic file is stored) is a confused mass of incompatibility between various standards and their environments. Basically, every standard requires two algorithms: one for compressing the data at the source and one for decompressing it at the destination, referred to as the encoding and decoding algorithms respectively. The essentials of these algorithms are compatibility, faithful reproduction, a substantial amount of compression, and user-friendliness.

IDEA BEHIND THE COMPRESSION
WHY COMPRESS?
Video compression mainly consists of still image compression and full motion compression. A still picture is stored in pixel format: it is broken down into thousands of small points called pixels, each with its own color and brightness. The information related to each pixel (its position, luminance and chrominance) is stored in a file, upon which the encoding algorithm is applied. For full motion video, digitizing and storing even a 10-second clip in a computer requires transferring an enormous amount of data in a very short period of time. Reproducing just one frame at 24 bits per pixel requires almost 1MB of computer data, so 30 seconds of video at 30 frames per second amounts to roughly 900MB, enough to nearly fill a 1.2GB hard disk. The additional information that a motion picture requires is the frame rate and synchronization timing. The algorithm for motion picture encoding is mathematically designed to handle such a format synchronously.
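The storage figures above are easy to check with a few lines of arithmetic. The sketch below assumes a 640 x 480 frame at 24 bits per pixel and 30 frames per second; the resolution and frame rate are illustrative assumptions, not values given in the text.

```python
def raw_video_bytes(width: int, height: int, bits_per_pixel: int,
                    fps: int, seconds: int) -> int:
    """Uncompressed size in bytes of a clip stored as raw pixels."""
    bytes_per_frame = width * height * bits_per_pixel // 8
    return bytes_per_frame * fps * seconds


# Assumed parameters: 640 x 480 pixels, 24-bit color, 30 frames/sec.
one_frame = raw_video_bytes(640, 480, 24, fps=1, seconds=1)
clip_30s = raw_video_bytes(640, 480, 24, fps=30, seconds=30)

print(f"one frame: {one_frame / 1e6:.2f} MB")   # 0.92 MB, just under 1 MB
print(f"30 s clip: {clip_30s / 1e9:.2f} GB")    # 0.83 GB
```

Under these assumptions a single frame is 921,600 bytes, and 900 frames come to about 0.83GB (decimal units), which is the "almost 1MB per frame" and "nearly a gigabyte" scale described above.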

Real-time video compression faced a speed bottleneck because typical hard disk drives transfer data at only about 1MB per second, and quad-speed CD-ROMs at a paltry 600KB per second. Digital video compression techniques now overcome this bottleneck. Real-time video compression algorithms such as JPEG, MPEG, P*64, DVI-INDEO and C-CUBE are available, offering compression ratios that range from 50:1 to 200:1. JPEG, MPEG and P*64 use the Discrete Cosine Transform (DCT), a transform whose output can be quantized according to the human eye's limited ability to detect color and image distortions.
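To see why ratios of this order are needed, the sketch below divides an assumed raw data rate (640 x 480 pixels, 3 bytes per pixel, 30 frames/sec; an illustrative assumption, not a figure from the text) by the two transfer rates quoted above, taking 1MB as 1,048,576 bytes and 600KB as 614,400 bytes.

```python
KB = 1024
MB = 1024 * KB

# Assumed source: 640 x 480 pixels, 3 bytes per pixel, 30 frames/sec.
raw_rate = 640 * 480 * 3 * 30          # bytes per second

# Transfer rates quoted in the text.
channels = {
    "hard disk (about 1MB/s)": 1 * MB,
    "quad-speed CD-ROM (600KB/s)": 600 * KB,
}


def required_ratio(source_rate: float, channel_rate: float) -> float:
    """Minimum compression ratio for the source to fit through the channel."""
    return source_rate / channel_rate


for name, rate in channels.items():
    print(f"{name}: at least {required_ratio(raw_rate, rate):.0f}:1")
```

Under these assumptions the disk needs roughly 26:1 and the CD-ROM exactly 45:1, so even the low end of the quoted 50:1 to 200:1 range covers both devices.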

TECHNIQUE BEHIND COMPRESSION
While the depth of color directly defines the file size, most formats offer mechanisms to reduce file sizes, following the logic that in a picture with only 256 shades it is quite likely that two or more neighboring pixels have the same color. The Lempel-Ziv-Welch (LZW) procedure exploits this redundancy to reduce the graphic file: repeated sequences of pixel values are stored only once, in a dictionary, and later occurrences are replaced by short codes that refer back to them, thus avoiding repetitive storage.
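A minimal LZW encoder and decoder over a row of 8-bit pixel values illustrates the idea. This is a textbook sketch, not the exact variant used by any particular file format: real implementations (GIF's, for example) also use variable code widths, a maximum dictionary size and clear codes, all of which are omitted here.

```python
def lzw_encode(data: bytes) -> list[int]:
    """Encode bytes as LZW codes, growing the dictionary as sequences repeat."""
    dictionary = {bytes([i]): i for i in range(256)}   # codes 0-255: single values
    next_code = 256
    current = b""
    codes = []
    for value in data:
        candidate = current + bytes([value])
        if candidate in dictionary:
            current = candidate                  # keep extending a known sequence
        else:
            codes.append(dictionary[current])    # emit code for the longest match
            dictionary[candidate] = next_code    # remember the new, longer sequence
            next_code += 1
            current = bytes([value])
    if current:
        codes.append(dictionary[current])
    return codes


def lzw_decode(codes: list[int]) -> bytes:
    """Rebuild the same dictionary on the fly and expand the codes."""
    if not codes:
        return b""
    dictionary = {i: bytes([i]) for i in range(256)}
    next_code = 256
    previous = dictionary[codes[0]]
    out = [previous]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        elif code == next_code:                  # code defined by this very step
            entry = previous + previous[:1]
        else:
            raise ValueError(f"invalid LZW code {code}")
        out.append(entry)
        dictionary[next_code] = previous + entry[:1]
        next_code += 1
        previous = entry
    return b"".join(out)


row = bytes([7] * 12 + [200] * 4)   # 16 pixels with long runs of one color
codes = lzw_encode(row)
print(len(row), "pixels ->", len(codes), "codes")
assert lzw_decode(codes) == row
```

For this row the 16 pixel values collapse to 8 codes, because each repeat of a run reuses a longer dictionary entry than the last.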

Another method, known as "lossy compression", makes use of the inability of the human eye to perceive small differences in brightness in close proximity, and of its limited ability to distinguish between colors in a small area. The JPEG format uses this method to offer superior compression. The number of pixels that can be combined depends on the compression factor. When a file is compressed to about 15% of its original size, the loss of picture quality is hardly evident, even after lossy JPEG compression. The rule of thumb: the larger the output file, the lower the perceivable loss.

JPEG: A STANDARD THAT WORKS
JPEG uses a lossy compression algorithm that operates in three successive stages, as shown below:

[Figure: JPEG coding - a standard that works]

Together these steps form a powerful compressor, capable of reducing an image to about 15% of its original size while losing little of the original fidelity. The first stage, the Discrete Cosine Transform (DCT), belongs to a class of mathematical transforms that includes the well-known Fast Fourier Transform (FFT); such transforms convert pixel information into another form of representation, here a set of spatial frequencies. The input, typically an 8x8 block of a grayscale image, is fed to the DCT, which produces an 8x8 DCT matrix in which most of the block's energy is concentrated in a few low-frequency coefficients; this spectral compaction is what the DCT is meant to create. The drastic action that reduces the number of bits required to store the DCT matrix is called "Quantization", which is simply the process of reducing the number of bits needed to store an integer value by reducing its precision. JPEG implements this with a Quantization matrix: for every element position in the DCT matrix, the corresponding entry of the Quantization matrix gives the quantum value by which that element is divided. The "DC coefficient" is located, by convention, at position (0,0) in the upper left corner of the matrix. Precision is reduced more and more as we move away from the DC coefficient at the origin, because the farther an element is from (0,0), the less it contributes to the graphical image.
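The DCT and quantization stages can be sketched directly from their definitions. This is a straightforward (slow) textbook form of the 8x8 forward DCT with JPEG's scaling and its level shift of 128, followed by division by a quantization table. The table used is the example luminance table from Annex K of the JPEG standard; encoders are free to use other tables, and the test block is an illustrative assumption.

```python
import math

# Example luminance quantization table (JPEG standard, Annex K).
Q_LUMA = [
    [16, 11, 10, 16, 24, 40, 51, 61],
    [12, 12, 14, 19, 26, 58, 60, 55],
    [14, 13, 16, 24, 40, 57, 69, 56],
    [14, 17, 22, 29, 51, 87, 80, 62],
    [18, 22, 37, 56, 68, 109, 103, 77],
    [24, 35, 55, 64, 81, 104, 113, 92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103, 99],
]


def dct_8x8(block):
    """Forward 2-D DCT of an 8x8 block of 8-bit samples (block[y][x]).

    Returns out[v][u]: v is the vertical and u the horizontal frequency,
    so out[0][0] is the DC coefficient.
    """
    def c(k):
        return 1 / math.sqrt(2) if k == 0 else 1.0

    out = [[0.0] * 8 for _ in range(8)]
    for v in range(8):
        for u in range(8):
            total = 0.0
            for y in range(8):
                for x in range(8):
                    total += ((block[y][x] - 128)          # level shift to -128..127
                              * math.cos((2 * x + 1) * u * math.pi / 16)
                              * math.cos((2 * y + 1) * v * math.pi / 16))
            out[v][u] = 0.25 * c(u) * c(v) * total
    return out


def quantize(coeffs, table=Q_LUMA):
    """Divide each coefficient by its quantum and round to the nearest integer."""
    return [[round(coeffs[v][u] / table[v][u]) for u in range(8)] for v in range(8)]


# A smooth horizontal ramp: brightness rises by 2 per pixel from left to right.
ramp = [[100 + 2 * x for x in range(8)] for _ in range(8)]
q = quantize(dct_8x8(ramp))
nonzero = sum(1 for row in q for value in row if value != 0)
print(f"{nonzero} of 64 quantized coefficients are non-zero")
```

For this smooth block only the DC term and the first horizontal AC term survive quantization; the other 62 coefficients become zero, which is what the later coding stage exploits.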

CODING
The final step in the JPEG process is coding the quantized image. The JPEG coding phase combines three different steps to compress the image. The first changes the DC coefficient at (0,0) from an absolute value to a relative value: since adjacent blocks in an image exhibit a high degree of correlation, coding the DC element as the difference from the previous block's DC element typically produces a very small number. Next, the coefficients of the image are arranged in the "zigzag sequence". They are then encoded using two different mechanisms. The first is run-length encoding of zero values. The second is what JPEG calls "entropy coding", which sends out the coefficient codes using either Huffman codes or arithmetic coding. JPEG compresses so effectively because a large number of coefficients in the DCT image are truncated to zero during quantization. Color images are generally composed of three components, such as red, green and blue (RGB), or the luminance and two chrominance components of YUV. In this case JPEG treats the image as if it were actually three separate images: an RGB image, for example, would first have its red component compressed, followed by its green and blue components.
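The DC differencing, zigzag scan and run-length coding of zeros can be sketched as follows. This is a simplification: it stops at (zero run, value) pairs plus an end-of-block marker, and leaves out the Huffman or arithmetic coding stage, as well as JPEG's handling of zero runs longer than 15. The function names are hypothetical.

```python
def zigzag_order(n: int = 8) -> list[tuple[int, int]]:
    """(row, col) positions of an n x n block in zigzag order, starting at DC."""
    order = []
    for s in range(2 * n - 1):                   # s = row + col: one anti-diagonal at a time
        diagonal = [(r, s - r) for r in range(n) if 0 <= s - r < n]
        if s % 2 == 0:
            diagonal.reverse()                   # even diagonals run bottom-left to top-right
        order.extend(diagonal)
    return order


def encode_block(block, previous_dc: int):
    """Return (DC difference, list of (zero_run, value) pairs ending in 'EOB')."""
    scan = [block[r][c] for r, c in zigzag_order(len(block))]
    dc_difference = scan[0] - previous_dc        # DC coded relative to the previous block
    pairs = []
    run = 0
    for value in scan[1:]:                       # AC coefficients in zigzag order
        if value == 0:
            run += 1
        else:
            pairs.append((run, value))
            run = 0
    pairs.append("EOB")                          # all trailing zeros collapse to one marker
    return dc_difference, pairs


block = [[0] * 8 for _ in range(8)]
block[0][0], block[0][1], block[1][0], block[0][2] = -10, -3, 2, 5
print(encode_block(block, previous_dc=-12))
# (2, [(0, -3), (0, 2), (2, 5), 'EOB'])
```

Because quantization leaves most high-frequency coefficients at zero, the zigzag scan pushes them to the end of the sequence, where a single end-of-block marker replaces them.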

WHAT IS MPEG?
To the real world, MPEG (Moving Picture Experts Group) is a generic means of compactly representing digital video and audio signals for consumer distribution. The basic idea is to transform a stream of discrete samples into a bitstream of tokens that takes less space but is just as filling to the eye (…or ear). This "transformation", or better, representation, exploits perceptual and even some actual statistical redundancies. The orthogonal dimensions of the Video and Audio streams can be further linked by the Systems layer, MPEG's own means of keeping the data types synchronized and multiplexed in a common serial bitstream.

The essence of MPEG is its syntax: the little tokens that make up the bitstream. MPEG's semantics then tell you (if you happen to be a decoder, that is) how to inverse-represent the compact tokens back into something resembling the original stream of samples. These semantics are merely a collection of rules (which people like to call algorithms, but that would imply there is a mathematical coherency to a scheme cooked up by trial and error…). These rules are highly reactive to combinations of bitstream elements set in headers and so forth.

PRE MPEG:
Before providence gave us MPEG, there was the looming threat of world domination by proprietary standards cloaked in syntactic mystery. With lossy compression being such an inexact science (which always boils down to visual tweaking and implementation tradeoffs), you never know what's really behind any such scheme (other than a lot of marketing hype).

A respected method developed by the old Sarnoff research group in Princeton, NJ was purchased in 1988 by our friend Intel (the August 1988 issue of Stereo Review discusses the early days of compact disc digital video). It then became known as DVI, or Digital Video Interactive.

Seeing this threat (that is, the need for worldwide interoperability), the Fathers of MPEG sought the help of their colleagues to form a committee to standardize a common means of representing video and audio (à la DVI) onto compact discs… and maybe it would be useful for other things too.

MPEG borrowed significantly from JPEG and, more directly, from H.261.

Seeing how this MPEG thing was such a good deal, and not wanting to be left behind in the industry, participants amassed, reaching a peak of more than 200 people by 1992.

By the end of the third year (1990), a syntax emerged, which when applied to represent SIF-rate video and compact disc-rate audio at a combined bitrate of 1.5 Mbit/sec, approximated the pleasure-filled viewing experience offered by the standard VHS format.

After demonstrations proved that the syntax was generic enough to be applied to bit rates and sample rates far higher than the original primary target application ("Hey, it actually works!"), a second phase (MPEG-2) was initiated within the committee to define a syntax for efficient representation of broadcast video, or SDTV as it is now known (Standard Definition Television), not to mention the side benefits: frequent flier miles, impress friends, job security, obnoxious party conversations.

Yet efficient representation of interlaced (broadcast) video signals was more challenging than the progressive (non-interlaced) signals thrown at MPEG-1. Similarly, MPEG-1 audio was capable of only directly representing two channels of sound (although Dolby Surround Sound can be mixed into the two channels like any other two channel system).

MPEG-2 would therefore introduce a scheme to decorrelate multichannel discrete surround sound audio signals, exploiting the moderately higher redundancy factor in such a scenario. Of course, proprietary schemes such as Dolby AC-3 have become more popular in practice.

Need for a third phase (MPEG-3) was anticipated way back in 1991 for High Definition Television, although it was discovered in late 1992 and 1993 that the MPEG-2 syntax simply scaled with the bit rate, obviating the third phase. MPEG-4 was launched in late 1992 to explore the requirements of a more diverse set of applications (although originally its goal seemed very much like that of the ITU-T SG15 group, which produced the new low-bitrate videophone standard, H.263).

Today, MPEG (video and systems) is the exclusive syntax of the United States Grand Alliance HDTV specification, the European Digital Video Broadcasting group, and the Digital Versatile Disc (DVD).

WHAT IS MPEG VIDEO SYNTAX?
MPEG video syntax provides an efficient way to represent image sequences in the form of more compact coded data. The language of the coded bits is the "syntax". For example, a few tokens amounting to only, say, 100 bits can represent an entire block of 64 samples rather transparently ("you can't tell the difference"), where the raw samples would otherwise consume 64 x 8 = 512 bits. MPEG also describes a decoding (reconstruction) process in which the coded bits are mapped from the compact representation back into the original, "raw" format of the image sequence. For example, a flag in the coded bitstream signals whether the following bits are to be decoded with a DCT algorithm or with a prediction algorithm. The algorithms comprising the decoding process are regulated by the semantics defined by MPEG. This syntax can be applied to exploit common video characteristics such as spatial redundancy, temporal redundancy, uniform motion, spatial masking, etc.

MPEG-2 can represent interlaced or progressive video sequences, whereas MPEG-1 is strictly meant for progressive sequences since the target application was Compact Disc video coded at 1.2 Mbit/sec.

MPEG-2 changed the meaning behind the aspect_ratio_information variable, while significantly reducing the number of defined aspect ratios in the table. In MPEG-2, aspect_ratio_information refers to the overall display aspect ratio (e.g. 4:3, 16:9), whereas in MPEG-1 the ratio refers to the particular pixel. The reduction in the entries of the aspect ratio table also helps interoperability by limiting the number of possible modes to a practical set, much like frame_rate_code limits the number of display frame rates that can be represented.

Optional variables called display_horizontal_size and display_vertical_size, carried in MPEG-2's sequence display extension, can be used to code unusual display sizes.
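Because MPEG-2 codes the display aspect ratio rather than the pixel shape, a decoder that needs the pixel aspect ratio has to derive it. A simplified sketch, assuming the whole coded frame is displayed (it ignores the optional display size variables and the 704-versus-720 active-width question); the function name is hypothetical:

```python
from fractions import Fraction


def pixel_aspect_ratio(display_aspect: Fraction, width: int, height: int) -> Fraction:
    """Pixel (sample) aspect ratio implied by a display aspect ratio.

    A frame of width x height pixels shown at display_aspect needs pixels
    whose width/height ratio is display_aspect * height / width.
    """
    return display_aspect * Fraction(height, width)


print(pixel_aspect_ratio(Fraction(4, 3), 720, 480))    # 8/9
print(pixel_aspect_ratio(Fraction(16, 9), 720, 480))   # 32/27
```

Exact fractions are used so that ratios such as 8/9 are not blurred by floating-point rounding; a 640 x 480 frame at 4:3 comes out as 1, i.e. square pixels.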

In MPEG-2, frame_rate_code refers to the intended display rate, whereas in MPEG-1 it referred to the coded frame rate. Film source video often has 24 coded frames per second. Prior to bitstream coding, a good encoder will eliminate the 6 redundant frames (12 fields) per second from a 30 frame/sec video signal that encapsulates an inherently 24 frame/sec source. The MPEG decoder or display device will then repeat frames or fields to recreate, or synthesize, the 30 frame/sec display rate. In MPEG-1, the decoder could only infer the intended frame rate, or derive it from the Systems layer time stamps. MPEG-2 provides specific picture header variables called repeat_first_field and top_field_first, which explicitly signal which frames or fields are to be repeated, and how many times.

To address the concerns of software decoders, which may operate at rates lower than or different from the common television rates, two new variables in MPEG-2, frame_rate_extension_d and frame_rate_extension_n, can be combined with frame_rate_code to specify a much wider variety of display frame rates. However, in the current set of defined profiles and levels, these two variables are not allowed to change the value specified by frame_rate_code. Future extensions or Profiles of MPEG may enable them.
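As I understand ISO/IEC 13818-2, the display frame rate is frame_rate_value x (frame_rate_extension_n + 1) / (frame_rate_extension_d + 1), where frame_rate_value is looked up from frame_rate_code. The sketch below encodes that relation; the table reflects the eight defined codes as best I recall them and should be checked against the standard before being relied upon. The function name is hypothetical.

```python
from fractions import Fraction

# frame_rate_code -> frame_rate_value (frames/sec); codes 1-8 as I recall them.
FRAME_RATE_VALUE = {
    1: Fraction(24000, 1001), 2: Fraction(24), 3: Fraction(25),
    4: Fraction(30000, 1001), 5: Fraction(30), 6: Fraction(50),
    7: Fraction(60000, 1001), 8: Fraction(60),
}


def display_frame_rate(frame_rate_code: int,
                       frame_rate_extension_n: int = 0,
                       frame_rate_extension_d: int = 0) -> Fraction:
    """Display rate from frame_rate_code scaled by the two extension variables."""
    value = FRAME_RATE_VALUE[frame_rate_code]
    return value * (frame_rate_extension_n + 1) / (frame_rate_extension_d + 1)


print(float(display_frame_rate(4)))        # NTSC rate, about 29.97
print(display_frame_rate(5, 0, 1))         # 30 halved by the extension: 15
```

With both extensions left at zero, the value from frame_rate_code passes through unchanged, which matches the restriction on the current profiles and levels described above.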

In interlaced sequences, the coded height of a picture must be a multiple of 32 lines (that is, mb_height must be an even number of macroblock rows), while the coded width, as in MPEG-1, is a multiple of 16 pixels. A discrepancy between the coded width and height of a picture and the variables horizontal_size and vertical_size, respectively, occurs when either variable is not an integer multiple of the macroblock size. All pixels must be coded within macroblocks, since there is no such thing as a "fractional" macroblock.

Never intended for display, these "overhang" pixels or lines lie along the right and bottom edges of the coded picture. The sample values within these trims can be arbitrary, but they can affect the values of samples within the current picture and, especially, in future coded pictures (since all coded samples are fair game for the prediction process).

To drive this to the point of nausea: in the current picture, pixels that reside within the same 8x8 block as the "overhang" pixels are affected by the ripples of DCT quantization error. In future coded pictures, their energy can propagate anywhere within an image sequence as a result of motion-compensated prediction. An encoder should fill in values that are easy to code, and should probably avoid creating motion vectors that would cause the Motion Compensated Prediction stage to extract samples from these areas. To help avoid any confusion, the application should probably select a horizontal_size and vertical_size that are already multiples of 16 (or 32 in the vertical case of interlaced sequences).
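The rounding rule stated above (width rounded up to a multiple of 16, height to a multiple of 32 for interlaced sequences and 16 otherwise) can be sketched as follows; the function names are hypothetical.

```python
def coded_size(horizontal_size: int, vertical_size: int,
               interlaced: bool) -> tuple[int, int]:
    """Coded picture width and height in pixels: whole macroblocks only."""
    mb_width = (horizontal_size + 15) // 16
    if interlaced:
        # Height rounded up to a multiple of 32 lines (an even number of MB rows).
        mb_height = 2 * ((vertical_size + 31) // 32)
    else:
        mb_height = (vertical_size + 15) // 16
    return mb_width * 16, mb_height * 16


def overhang(horizontal_size: int, vertical_size: int,
             interlaced: bool) -> tuple[int, int]:
    """Columns (right edge) and lines (bottom edge) coded but never displayed."""
    width, height = coded_size(horizontal_size, vertical_size, interlaced)
    return width - horizontal_size, height - vertical_size


print(overhang(352, 240, interlaced=True))     # (0, 16)
print(overhang(352, 240, interlaced=False))    # (0, 0)
print(overhang(1920, 1080, interlaced=True))   # (0, 8)
```

A 240-line SIF picture fits exactly as progressive video but picks up 16 overhang lines if coded as interlaced, which is why choosing sizes that are already multiples of 32 avoids the problem.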

GROUP OF PICTURES:
In MPEG-2, the "Group of Pictures" layer is no longer a required layer. It survives only as an optional header, useful for establishing a SMPTE time code base or for indicating that certain B pictures at the beginning of an edited sequence comprise a broken_link. This occurs when a forward reference frame (previous in time to the current picture) that the current B picture requires for prediction has been removed from the bitstream by an editing process. In MPEG-1, the Group of Pictures header is mandatory, and must follow a sequence header.

PICTURE LAYER:
In MPEG-2, a frame may be coded progressively or interlaced, signaled by the progressive_frame variable. In interlaced frames (progressive_frame==0), frames may then be coded as either a frame picture (picture_structure==frame) or as two separately coded field pictures (picture_structure==top_field or picture_structure==bottom_field).

Progressive frames are a logical choice for video material that originated from film, where all "pixels" are integrated or captured at the same time instant. Most electronic cameras today capture pictures in two separate stages: a top field consisting of all the "odd lines" of the picture, captured at nearly the same time instant, followed by a bottom field of all the "even lines". Frame pictures provide the option of coding each macroblock locally as either field or frame. An encoder may choose field pictures to save memory storage or to reduce the end-to-end encoder-decoder delay by one field period.

The repeat_first_field flag was introduced in MPEG-2 to signal that a field or frame from the current frame is to be repeated for purposes of frame rate conversion (as in the 30 Hz display vs. 24 Hz coded example above). On average, in a 24 frame/sec coded sequence, every other coded frame would signal the repeat_first_field flag. Thus the 24 frame/sec (or 48 field/sec) coded sequence would become a 30 frame/sec (60 field/sec) display sequence. This process has been known for decades as 3:2 pulldown. Most movies seen on NTSC displays since the advent of television have been displayed this way. Only within the past decade has it become possible to interpolate motion to create 30 truly unique frames from the original 24. Since the repeat_first_field flag is independently determined in every frame-structured picture, the actual pattern can be irregular (it doesn't literally have to be every other frame). An irregularity would occur during a scene cut, for example.
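The field sequence produced by 3:2 pulldown can be simulated in a few lines. This sketch assumes the regular pattern in which every other coded frame sets repeat_first_field, and tracks top_field_first so that displayed fields keep alternating in parity; the function name is hypothetical.

```python
def pulldown_fields(frames):
    """Expand coded film frames into a display field sequence (3:2 pulldown).

    Returns a list of (frame, parity) pairs, parity being "top" or "bottom".
    """
    fields = []
    top_field_first = True
    for index, frame in enumerate(frames):
        repeat_first_field = index % 2 == 0          # assumed regular pattern
        first, second = ("top", "bottom") if top_field_first else ("bottom", "top")
        fields += [(frame, first), (frame, second)]
        if repeat_first_field:
            fields.append((frame, first))            # third field repeats the first
            top_field_first = not top_field_first    # odd field count flips the next frame
    return fields


display = pulldown_fields("ABCD")
print(len(display))                                  # 10 fields from 4 film frames
print(" ".join(f"{frame}{parity[0]}" for frame, parity in display))
# At Ab At Bb Bt Cb Ct Cb Dt Db
```

Four film frames become ten fields (3 + 2 + 3 + 2), so 24 coded frames per second yield the 60 fields per second of an NTSC display.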

METHOD TO OBTAIN HIGH COMPRESSION RATIOS:
MPEG video is often quoted as achieving compression ratios over 100:1, when in reality the "sweet spot" rests between 8:1 and 30:1.

Here's how the fabled "greater than 100:1" reduction ratio is derived for the popular Compact Disc Video (White Book) bitrate of 1.15 Mbit/sec.

Step 1. Start with the oversampled rate!
Most MPEG video sources originate at a higher sample rate than the "target" sample rate encoded into the final MPEG bitstream. The most popular studio signal, known canonically as "D-1" or "CCIR 601" digital video, is coded at 270 Mbit/sec.

The constant, 270 Mbit/sec, can be derived as follows:

Luminance (Y):

858 samples/line x 525 lines/frame x 30 frames/sec x 10 bits/sample ~= 135 Mbit/sec

R-Y (Cr):

429 samples/line x 525 lines/frame x 30 frames/sec x 10 bits/sample ~= 68 Mbit/sec

B-Y (Cb):

429 samples/line x 525 lines/frame x 30 frames/sec x 10 bits/sample ~= 68 Mbit/sec

Total:

27 million samples/sec x 10 bits/sample = 270 Mbit/sec.

So, we start with a compression ratio of: 270/1.15... an amazing 235:1 !!!!!

Step 2. Throw in the blanking intervals!
Only 720 out of the 858 luminance samples per line contain active picture information. In fact, the debate over the true number of active samples is the trigger for many hair-pulling cat-fights at TV engineering seminars and conventions, so it is healthier to say that the number lies somewhere between 704 and 720. Likewise, only 480 of the 525 lines contain active picture information; again, the actual number is somewhere between 480 and 496. For the purposes of MPEG-1's and MPEG-2's famous conformance points (Constrained Parameters Bitstreams and Main Level, respectively), the number shall be 704 samples x 480 lines for luminance, and 352 samples x 480 lines for each of the two chrominance pictures. Recomputing the source rate with the full 720-sample active line (360 samples for each chrominance picture), the width also used in Step 4, we arrive at:

Y

720 samples/line x 480 lines x 30 fps x 10 bits/sample ~= 104 Mbit/sec

C

2 components x 360 samples/line x 480 lines x 30 fps x 10 bits/sample ~= 104 Mbit/sec

Total:

~ 207 Mbit/sec

The ratio (207/1.15) is now only 180:1

Step 3. Let's include higher bits/sample!
The MPEG sample precision is 8 bits. There has been some talk of a 10-bit extension, but that's on hold (as of April 2, 1996, 10:35 PM GMT). Studio equipment often quantizes samples with 10 bits of accuracy, because some engineers and artists feel the extra dynamic range is needed in the iterative content production loop.

Getting rid of this sneaky factor, the ratio is now deflated to only 180 x (8/10), or 144:1.

Step 4. Ok then, include higher chroma sampling ratio!
The famous CCIR 601 studio signal represents the chroma signals (Cb, Cr) with half the horizontal sample density of the luminance signal, but with full vertical "resolution". This particular ratio of subsampled components is known as 4:2:2. However, MPEG-1 and MPEG-2 Main Profile specify the exclusive use of the 4:2:0 format, deemed sufficient for consumer applications, in which both chrominance signals have exactly half the horizontal and vertical resolution of luminance (the MPEG Studio Profile, however, centers around the 4:2:2 macroblock structure). Seen from the perspective of pixels being composed of samples from multiple components, the 4:2:2 signal can be expressed as having an average of 2 samples per pixel (1 for Y, 0.5 for Cb, and 0.5 for Cr). Thanks to the additional reduction in the vertical direction (resulting in a 352 x 240 chrominance frame), the 4:2:0 signal has, in effect, an average of 1.5 samples per pixel (1 for Y, and 0.25 each for Cb and Cr). Our source video bit rate may now be recomputed as:

720 pixels x 480 lines x 30 fps x 8 bits/sample x 1.5 samples/pixel = 124 Mbit/sec

... and the ratio is now 108:1.
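The samples-per-pixel figures follow directly from the chroma subsampling factors. In this small sketch, the dictionary of (horizontal, vertical) factors is my own encoding of the formats, not an MPEG data structure; it reproduces the Step 4 bit rate.

```python
from fractions import Fraction

# Chroma subsampling factors (horizontal, vertical) relative to luminance.
CHROMA_FORMATS = {
    "4:4:4": (1, 1),
    "4:2:2": (2, 1),
    "4:2:0": (2, 2),
}


def samples_per_pixel(fmt: str) -> Fraction:
    """One Y sample plus two chroma samples shared by h x v pixels."""
    h, v = CHROMA_FORMATS[fmt]
    return 1 + 2 * Fraction(1, h * v)


def source_bitrate(width: int, height: int, fps: int, bits: int, fmt: str) -> Fraction:
    """Uncompressed bit rate in bits/sec."""
    return width * height * fps * bits * samples_per_pixel(fmt)


for fmt in CHROMA_FORMATS:
    print(fmt, float(samples_per_pixel(fmt)), "samples/pixel")

rate = source_bitrate(720, 480, 30, 8, "4:2:0")
print(f"{float(rate) / 1e6:.1f} Mbit/sec, ratio {float(rate) / 1.15e6:.0f}:1")
```

This gives 3, 2 and 1.5 samples per pixel for 4:4:4, 4:2:2 and 4:2:0, and about 124.4 Mbit/sec (108:1 against 1.15 Mbit/sec) for the 720 x 480 4:2:0 source.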

Step 5. Include pre-subsampled image size… yeah, that's the ticket!
As a final act of pre-compression, the CCIR 601 frame is converted to the SIF frame by a subsampling of 2:1 in both the horizontal and vertical directions.... or 4:1 overall. Quality horizontal subsampling can be achieved by the application of a simple FIR filter (7 or 4 taps, for example), and vertical subsampling by either dropping every other field (in effect, dropping every other line) or again by an FIR filter (regulated by an interfield motion detection algorithm). Our ratio now becomes:

352 pixels x 240 lines x 30 fps x 8 bits/sample x 1.5 samples/pixel ~= 30 Mbit/sec !!

... and the ratio is now only 26:1
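The horizontal half of the CCIR 601 to SIF conversion can be sketched as a 7-tap FIR low-pass filter followed by dropping every other sample. The kernel below is an illustrative assumption (not mandated by MPEG) whose taps sum to 256; vertical subsampling by field dropping is not shown, and the function name is hypothetical.

```python
# Illustrative 7-tap low-pass kernel (an assumption, not mandated by MPEG);
# the taps sum to 256, so flat areas pass through unchanged.
TAPS = [-29, 0, 88, 138, 88, 0, -29]


def decimate_2to1(line: list[int]) -> list[int]:
    """Low-pass filter an 8-bit line and keep every other sample (2:1)."""
    half = len(TAPS) // 2
    out = []
    for center in range(0, len(line), 2):          # one output per pair of inputs
        acc = 0
        for k, tap in enumerate(TAPS):
            # Clamp indices so the filter repeats the edge samples.
            index = min(max(center + k - half, 0), len(line) - 1)
            acc += tap * line[index]
        value = (acc + 128) >> 8                   # divide by 256 with rounding
        out.append(min(max(value, 0), 255))        # clip back to the 8-bit range
    return out


flat = decimate_2to1([100] * 720)         # a 720-sample line -> 360 samples
stripes = decimate_2to1([0, 255] * 360)   # finest possible detail
print(len(flat), flat[0])                 # 360 100
print(stripes[2:6])                       # [118, 118, 118, 118]
```

A flat line passes through unchanged, while alternating black and white pixels, which plain sample dropping would alias to solid black, are smoothed toward mid-gray.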

Thus, the true A/B comparison should be between the source sequence at the 30 Mbit/sec stage just prior to encoding, which is also the actual specified sample rate in the MPEG bitstream (sequence_header()), and the reconstructed sequence produced from the 1.15 Mbit/sec coded bitstream. If you can achieve compression through subsampling alone, it means you never really needed the extra samples in the first place.

Step 6. Don't forget 3:2 pulldown!
A majority of high budget programs originate from film, not video. Most of the movies encoded onto Compact Disc Video were in fact captured and edited at 24 frames/sec. So, in such an image sequence, 6 out of the 30 frames displayed on a television monitor (30 frame/sec or 60 field/sec is standard NTSC rate in North America and Japan) are in fact redundant and need not be coded into the MPEG bitstream. This naturally leads us to the shocking discovery that the actual source bit rate has really been 24 Mbit/sec all along (24 fps/30 fps * 30 Mbit/sec), and the compression ratio only a mere 21:1 !!! ("phone the police!").

Even at the seemingly modest 20:1 ratio, "discrepancies" (in polite conversational terms) will appear between the 24 Mbit/sec source sequence and the reconstructed sequence. Only conservative ratios in the neighborhood of 8:1 to 12:1 have demonstrated true transparency for sequences with complex spatial-temporal characteristics (i.e. rapid, divergent motion and sharp edges, textures, etc.). However, if the video is carefully encoded by means of pre-processing and intelligent distribution of bits (no, really), higher ratios can be made to "appear at least artifact-free."
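The whole chain of "compression ratios" can be recomputed in one place. This sketch simply replays the arithmetic of Steps 1 to 6 against the 1.15 Mbit/sec White Book rate; stage 2 uses 720 luminance samples (360 per chrominance picture) per active line, which is what reproduces the roughly 104 Mbit/sec per-component figures quoted in Step 2.

```python
CODED_RATE = 1.15e6   # White Book coded bit rate, bits/sec

# Source bit rate (bits/sec) after each step of the argument.
stages = [
    ("1. full CCIR 601 raster, 10 bits", (858 + 429 + 429) * 525 * 30 * 10),
    ("2. active picture only",           (720 + 2 * 360) * 480 * 30 * 10),
    ("3. 8 bits/sample",                 (720 + 2 * 360) * 480 * 30 * 8),
    ("4. 4:2:0 chroma",                  720 * 480 * 30 * 8 * 1.5),
    ("5. SIF 352 x 240",                 352 * 240 * 30 * 8 * 1.5),
    ("6. 24 frame/sec film source",      352 * 240 * 24 * 8 * 1.5),
]

ratios = []
for name, rate in stages:
    ratio = rate / CODED_RATE
    ratios.append(ratio)
    print(f"{name:34s} {rate / 1e6:6.1f} Mbit/sec  ->  {ratio:4.0f}:1")
```

The ratios come out at about 235, 180, 144, 108, 26 and 21 to 1, matching the steps above: most of the headline "235:1" is accounted for by samples that were never going to be coded.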

