If the left image is encoded using pyramid encoding, it may be possible to reduce the total bit rate by exploiting redundancies between the levels of the image pair. This should have its greatest effect on the lowest image pyramid levels, since this (low-pass) information is common to greater areas of the original image, and will not be significantly affected by offsets of several pixels between the stereo images.
Two approaches were evaluated; in both cases the left image level differences were encoded using the DCT.
In the first case the left level differences (between successive planes) were subtracted from the right level differences, and the result was DCT encoded. This was unsuccessful: for a given SNR the joint rate was always higher than pure independent coding.
A point of note is that a uniform quantiser leads to slightly better rates than the CCIT-601 non-uniform quantiser. This is logical, since most of the level difference information is in the higher frequencies.
In the second case the rate required to encode the right level difference on its own was evaluated. The left level difference was then subtracted from the right level difference, and the rate required to code the result was also evaluated. Whichever of the two had the lower rate was encoded.
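The per-level choice described above can be sketched as follows. This is a minimal illustration only: it assumes the level differences are already quantised to integers and uses the empirical first-order entropy as a stand-in for the DCT-coded rate that was actually measured. The names `entropy_bits` and `choose_mode` are hypothetical.

```python
import math
from collections import Counter


def entropy_bits(values):
    """Total bits needed to code `values` at their empirical first-order entropy."""
    counts = Counter(values)
    n = len(values)
    return -sum(c * math.log2(c / n) for c in counts.values())


def choose_mode(left_diff, right_diff):
    """Pick the cheaper of independent and joint coding for one pyramid level.

    Both arguments are flat lists of quantised level-difference values of
    equal length. Entropy is only a proxy for the real coded rate (assumption).
    """
    joint = [r - l for r, l in zip(right_diff, left_diff)]
    rate_independent = entropy_bits(right_diff)
    rate_joint = entropy_bits(joint)
    if rate_joint < rate_independent:
        return "joint", joint, rate_joint
    return "independent", right_diff, rate_independent
```

In a real coder the chosen mode would also have to be signalled to the decoder, for example with one flag per level; that overhead is not counted in this sketch.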
Interestingly this case also did not improve on independent encoding, although in the lower levels the minimum rate was much lower than that for independent encoding, as shown in Figure 12.
This appears to be due to an accumulation of uncorrelated values from the DCT encoding of the joint differences as the levels increase in size, while the saving in rate applies only to a relatively small percentage of the whole pyramid -- the lowest levels are the smallest.
Another approach attempted to exploit the redundancy of the right image in a different way. The images were divided into blocks of size 8x4 pixels, and corresponding blocks were taken from the left and right images. The right image block was flipped left-to-right (to reduce discontinuities), and the two blocks were placed side-by-side in an 8x8 DCT block, which was then transformed. It was hoped that this would result in an entropy lower than that of independently coding the image pair.
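The block-packing step can be sketched as below. The sketch assumes that "8x4" means 8 rows of 4 pixels, so that two blocks fill one 8x8 transform block; the function names are hypothetical and the DCT itself is omitted.

```python
def pack_stereo_block(left_block, right_block):
    """Place a left block and a mirrored right block side by side.

    Each input is a list of 8 rows of 4 pixels; the result has 8 rows of 8.
    Mirroring the right block puts its left edge at the far side, so the
    seam in the middle joins the two blocks' right-hand columns.
    """
    if len(left_block) != 8 or len(right_block) != 8:
        raise ValueError("expected 8 rows per block")
    packed = []
    for l_row, r_row in zip(left_block, right_block):
        if len(l_row) != 4 or len(r_row) != 4:
            raise ValueError("expected 4 pixels per row")
        packed.append(list(l_row) + list(reversed(r_row)))
    return packed


def unpack_stereo_block(packed):
    """Invert pack_stereo_block: split the 8x8 block and un-mirror the right half."""
    left = [row[:4] for row in packed]
    right = [row[4:][::-1] for row in packed]
    return left, right
```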
Early experiments showed that not only did this approach increase the entropy, but SNR fell quickly. This attack on the problem was thus quickly abandoned.
Disparity estimation is similar to motion estimation in the sense that both are used to exploit the similarity between two images in order to reduce the bit rate. However, there are two main differences between disparity- and motion estimation. Firstly, if the parallel axis constraint is met, as is usually assumed, it is only necessary to find the disparity along one axis, thus simplifying the search. Secondly, in typical video scenes there are only a few moving objects and the background is either static or shows uniform motion due to panning. In contrast, disparity depends on the distance from the camera and thus different parts of the background will show different disparity. From a compression perspective, therefore, we can expect the disparity field to be less uniform than typical motion fields and thus to require a larger number of bits to be encoded for the same block size.
In stereo image processing and analysis, a major goal is to find the exact correspondence between the two images in order to reconstruct the depth information. This task is complicated by local ambiguities in the correspondence due to noise, occlusion, and repetition or lack of texture [12]. The algorithm described here, however, is motivated by the recognition that it is not necessary to establish exact correspondence between the two images if compression is to be performed without regard to context or image content. All that is required in order to reproduce the right image is an accurate representation of pixel intensity. Thus, given a pixel at position (x,y) in the right image, it is necessary only to find the disparity between this pixel and any pixel of similar intensity in the same row. It is not necessary to find the pixel which corresponds to the same point in the scene.
This fact has the advantage that in uniform areas of the image, there will be little or no disparity between pixels. Furthermore, in areas of the image in which the disparity between exactly corresponding pixels is large, it will often be possible to find a closer pixel exhibiting a similar intensity level, thus reducing the entropy of the final disparity map.
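A minimal sketch of this search follows, assuming that "similar intensity" means an absolute difference within a fixed tolerance and that the nearest qualifying pixel is preferred. The tolerance, the search range, the tie-break towards positive offsets and the function names are all assumptions made for illustration.

```python
def nearest_similar_offset(left_row, right_row, x, tolerance, max_offset):
    """Return the smallest-magnitude offset d such that left_row[x + d]
    is within `tolerance` of right_row[x], or None if no pixel qualifies.

    Ties between +d and -d are resolved in favour of +d (arbitrary choice).
    """
    target = right_row[x]
    for dist in range(max_offset + 1):
        for d in ((0,) if dist == 0 else (dist, -dist)):
            pos = x + d
            if 0 <= pos < len(left_row) and abs(left_row[pos] - target) <= tolerance:
                return d
    return None


def row_disparities(left_row, right_row, tolerance=2, max_offset=16):
    """Offsets for every pixel of one right-image row."""
    return [nearest_similar_offset(left_row, right_row, x, tolerance, max_offset)
            for x in range(len(right_row))]
```

Because the search stops at the nearest acceptable pixel, uniform regions produce runs of zero offsets, which is the property that keeps the entropy of the disparity map low.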
The use of a pixel-based algorithm might perhaps seem questionable in light of the widespread preference for block- and subband-based compression schemes. There are, though, instances in which block-based compression schemes are not wholly appropriate; the most notable example in the field of stereo is that of random dot stereograms, images characterized by a lack of structure and coherence.
Figure 14 shows the rate-distortion curve achieved using the pixel-based disparity map as the basis for the compression algorithm.
The bit rate required to transmit the block-based disparity map was a fairly constant 0.05 bits/pixel and was achieved using run-length chain encoding. The bit rate required to transmit the full set of errors (thus ensuring perfect reconstruction and lossless encoding) was approximately 5.5 bits/pixel. Thus, coding gains were realizable by compressing the error information.
Compression of the error information was achieved in two ways: firstly, the errors below a certain threshold were simply set to zero and discarded; secondly, the errors above the threshold were quantised. The final error matrix was encoded using zero run-length encoding. This yielded two parameters with which to manipulate the rate-distortion curve and a 2D optimization problem. The optimal solution was achieved experimentally by calculating the quantisation step level as a function of the error threshold. Figure 15 shows the resulting rate-distortion curve.
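The two-parameter error coder can be sketched as follows. This assumes a uniform quantiser with rounding to the nearest step and a simple (zero-run, value) pair format; the text does not specify either, and the function names are hypothetical.

```python
def threshold_and_quantise(errors, threshold, step):
    """Zero every error whose magnitude is below `threshold`; map the rest
    to integer quantiser indices (nearest multiple of `step`)."""
    out = []
    for e in errors:
        if abs(e) < threshold:
            out.append(0)
        else:
            q = int(abs(e) / step + 0.5)  # round half away from zero
            out.append(q if e >= 0 else -q)
    return out


def zero_run_length_encode(symbols):
    """Encode as (zero_run, value) pairs; a trailing run of zeros is
    stored as (run, None)."""
    pairs = []
    run = 0
    for s in symbols:
        if s == 0:
            run += 1
        else:
            pairs.append((run, s))
            run = 0
    if run:
        pairs.append((run, None))
    return pairs


def zero_run_length_decode(pairs):
    """Invert zero_run_length_encode."""
    out = []
    for run, value in pairs:
        out.extend([0] * run)
        if value is not None:
            out.append(value)
    return out
```

Raising `threshold` lengthens the zero runs and raising `step` shrinks the range of non-zero indices, so both lower the rate at the cost of distortion; these are the two parameters that form the 2D optimization problem.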
The superior performance of the block-based predictive encoder over the other methods can be attributed to two factors. Firstly, direct block prediction (as opposed to the indirect prediction of the following method) gave excellent block matching between the two frames and hence a more accurate disparity map. Secondly, the block-based disparity map exploited the intra-frame redundancy better than the pixel-based method.
In order to compress the inter-frame redundancy, pixels can be selected from the left frame to predict the right frame, as described above in pixel-based disparity map encoding. This prediction can be enhanced by using blocks rather than pixels to further exploit the intra-frame correlation, as outlined in block-based disparity map encoding. That section described a straightforward prediction scheme in which the block to be encoded in the right frame is used to search the left frame for its best match, from which the prediction errors are produced. This scheme, however, required extra overhead bits to inform the decoder which block was chosen as the best match in the left frame.
In order to take full advantage of the intra-frame correlation, Jiang proposed an alternative method which uses neighbouring blocks to select the predictor in the left frame and compress the corresponding block in the right frame [13].
The algorithm proceeds as follows: Let $R_{ij}$ represent the 8x8 block to be encoded in the right frame with the top left pixel in the $i$th row and $j$th column; $L_{ij}$ is then the corresponding block in the left frame. Two blocks, $R_{i,j-1}$ and $R_{i-1,j}$, can be selected to construct a pioneering block, since they are the immediate horizontal and vertical neighbours of the block $R_{ij}$.
Since the stereo pair is produced by observation of the same scene from two horizontally separated positions, disparity between the two frames will only occur horizontally rather than vertically. Consequently, the two neighbouring blocks, $R_{i,j-1}$ and $R_{i-1,j}$, can be used instead to search for the predictor. In this way, no overhead bits specifying the position of the predictor block in the left frame need to be transmitted, since the two blocks are always available at the decoding end before the block $R_{ij}$ is decoded.
To align the right frame with the left frame and to search for the best match between the two frames, a pioneering block is produced which is simply the mean of the horizontal and vertical neighbours. (Each block is given the same weight since both contribute equally to the prediction.) Thus, writing $P_{ij}$ for the pioneering block:
$$P_{ij} = \frac{1}{2}\left(R_{i,j-1} + R_{i-1,j}\right)$$
The pioneering block constructed from the right frame is then compared with each of the corresponding blocks constructed inside the search window of the left frame, and the distance between the two blocks is calculated as:

![]()
The algorithm thus constructs a pioneering block in the right frame and then conducts a windowed search in the left frame for its best match. A predictive block is then selected by exploiting both inter- and intra-frame correlations to produce predictive errors for further compression as in the previous algorithm. The rate-distortion curve for this encoder is shown in Figure 16.
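The search can be sketched as below under several stated assumptions: frames are lists of pixel rows, block positions are given by the top-left pixel, the neighbours are one block width to the left and one block height above, the distance measure is the sum of absolute differences (the measure actually used is not recoverable from the text here), and the search is horizontal only. All function names are hypothetical.

```python
def get_block(frame, top, left, n):
    """Extract the n x n block whose top-left pixel is (top, left)."""
    return [row[left:left + n] for row in frame[top:top + n]]


def mean_block(a, b):
    """Element-wise mean of two equally sized blocks."""
    return [[(p + q) / 2 for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def sad(a, b):
    """Sum of absolute differences (assumed distance measure)."""
    return sum(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))


def pioneering_block(frame, top, left, n):
    """Mean of the blocks immediately to the left of and above (top, left)."""
    return mean_block(get_block(frame, top, left - n, n),
                      get_block(frame, top - n, left, n))


def find_predictor(left_frame, right_decoded, top, left, n=8, search=4):
    """Return (offset, predictor) for the right-frame block at (top, left).

    Only already-decoded right-frame neighbours are used, so a decoder can
    repeat the search and no offset needs to be transmitted.
    Requires top >= n and left >= n.
    """
    width = len(left_frame[0])
    target = pioneering_block(right_decoded, top, left, n)
    best_offset, best_cost = 0, None
    for d in range(-search, search + 1):
        # Skip offsets whose blocks would fall outside the left frame.
        if left - n + d < 0 or left + d + n > width:
            continue
        cost = sad(target, pioneering_block(left_frame, top, left + d, n))
        if best_cost is None or cost < best_cost:
            best_offset, best_cost = d, cost
    return best_offset, get_block(left_frame, top, left + best_offset, n)
```

The prediction error for the block is then the difference between the actual right-frame block and the returned predictor.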
Notice the curious feature that this encoder can achieve a signal-to-noise ratio at the receiving end of greater than 15 dB without transmitting any bits for the description of the right image.
Naturally, this algorithm is sensitive to the quality of the DCT compression of the left image. This is because the left frame used for predictive coding at the decoding end is not exactly the same as that at the encoding end, since the DCT algorithm is a lossy compression process. When the quality is reduced below a certain limit, the errors introduced into the reconstructed left frame have a negative effect on the searching and matching processes. As a result, the wrong block is picked as the predictor, which makes the quality of the reconstructed right frame even worse. One simple solution suggested by Jiang [13] is to add a DCT decoder to the encoder before the pioneering block-based search is conducted. The left frame used by the predictive coder is then the same at both the encoder and the decoder.
While a stereo pair of images can be coded considerably more efficiently than two independent images due to the similarity between the images of a stereo pair and the consequent inherent redundancy, compression of stereo images can be facilitated further by exploiting the human singleness of vision property. That is, when the two eyes are presented with two different images, only one of the images is perceived at any given neighbourhood, while the second image is suppressed at that area [15]. This explanation of the binocular rivalry problem is known as the suppression theory [14].
When a stereo image pair is compressed, the left image can be conventionally compressed in a manner preserving the high-frequency features as desired. The right image can then be low-pass filtered and subsampled before being compressed. This yields a very low bit rate for the right frame. At the receiver, both images are decompressed, and the subsampled image is interpolated to its original size so that the two images can be stereoscopically displayed as illustrated in Figure 18.
The Gaussian pyramid is used as the means for low-pass filtering and subsampling. Thus, each pixel in a given level is a linear combination of a 5x5 window in the level below. The separable generating kernel used to construct the Gaussian pyramid is defined by W(m,n) = W(m)W(n) where W(0)=2, W(-1)=W(1)=0.25, W(-2)=W(2)=-0.75. Final compression of the left and right image is achieved using a discrete cosine transform. Figure 17 shows the results for the subsampling algorithm.
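One reduction step of the pyramid can be sketched as below, using the kernel weights exactly as quoted in the text. Clamping coordinates at the image border and subsampling by keeping every second sample are assumptions, and the function names are hypothetical.

```python
# W(-2) .. W(2) as quoted in the text; the weights sum to 1.
KERNEL = (-0.75, 0.25, 2.0, 0.25, -0.75)


def reduce_1d(signal, kernel=KERNEL):
    """Filter a 1-D signal with `kernel` and keep every second sample.

    Indices outside the signal are clamped to the nearest edge sample
    (assumed border handling).
    """
    n = len(signal)
    half = len(kernel) // 2
    out = []
    for centre in range(0, n, 2):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = min(max(centre + k - half, 0), n - 1)
            acc += w * signal[idx]
        out.append(acc)
    return out


def reduce_level(image, kernel=KERNEL):
    """Apply the separable 5x5 kernel: rows first, then columns."""
    rows = [reduce_1d(row, kernel) for row in image]
    cols = [reduce_1d(list(col), kernel) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]
```

Applying `reduce_level` once retains one quarter of the samples, and applying it twice retains one sixteenth.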
The evaluation of reconstructed compressed images is still an open problem. Objective measures such as RMS error and the rate-distortion function do not take into account the peculiarities of the human visual system, while subjective evaluation is unreliable and not very repeatable. When considering stereo image compression, it is important to measure how well depth perception is preserved in the reconstructed images. Dinstein et al. [15] have conducted subjective tests involving the measurement of the response time and accuracy with which subjects could perceive the depth of objects in reconstructed stereo pairs. Their results show that subsampling and retaining one fourth of the samples does not interfere with the 3D perception, whereas subsampling and retaining one sixteenth of the samples caused a degradation in the 3D perception. Proper subsampling rates depend on the spectral bandwidth of the depth information.