Evaluation of search strategies for block matching

EE 368b Project

Marcus Isaksson and Joakim Jalden
December 1, 2000

1. Abstract

We will present an evaluation of three different search strategies used for motion estimation using block matching. The strategies that we will compare are: full search, 2D logarithmic search and conjugate direction search, carried out for integer-pel, half-pel, and quarter-pel accuracy of motion compensation using bilinear interpolation for sub-pel positions.
The strategies will be compared in terms of prediction error and computational complexity.

2. Introduction

In order to efficiently encode a sequence of frames one could use block based motion estimation to predict frames from previous frames. This involves finding the block in the previous frame that in some sense is the most similar block to the block currently viewed in the present frame. The block in the previous frame would then be used as a prediction for the block in the present frame and the prediction error would be coded and transmitted along with the offset to the block in the previous frame. This offset will be referred to as the motion vector as it is an indication of the motion of the particular block.

An apparent problem with this scheme is that the process of finding the best block in the previous frame requires very heavy computation. If one were to work with frames of size 176x144 pixels and were allowed to chose 8x8 pixel blocks at 0.25 pixels apart there would be 366785 blocks to chose from. It is of course not plausible to believe that all blocks are equally likely to be the best match and one could limit the search to a smaller number of blocks. The number of blocks that still has to be tested in a practical system will however be rather large.

Several more or less clever ways of finding the best match or finding a match that is good enough without having to consider every possible block have been proposed. Two of these methods for finding the best match was evaluated in this project and the results of these were compared to the "brute force" method.

3. Search strategies

A. Full Search

The most general search strategy. Each possible motion vector within a given maximal search distance is evaluated. This is slow, but clearly gives an optimal motion vector within given maximal search distance. This algorithm is too slow to be of any practical use when run on a single processor system, although it's regularity makes it a good candidate for parallel systems or specially designed hardware. In our test sequences a maximal search distance of 18 pixels was used. This number is based on an estimation of the maximal motion between frames in those sequences. Further increase of the search distance would not have resulted in any significant (if any) improvement of the prediction.

B. 2D Logarithmic Search

The 2D logarithmic search as described by Jain and Jain [1] is a method of finding the motion of a block by successively moving towards a (not necessarily global) minimum using a shrinking step size. The method work as follows:

1. Chose an initial step size s and an initial starting point, (i,j), of the search.

2. Look at the distortion that would result from choosing the blocks at coordinates (i,j), (i-s,j), (i+s,j), (i,j-s) and (i,j+s) as prediction for the current block of interest. If the minimum distortion is at (i,j) the step size is changed to s/2. Otherwise the coordinates (i,j) is change to the coordinates that revealed the minimum distortion.

3. If s is larger than the resolution of the motion vectors then continue at 2.

A more thorough description of the method can be found in the paper by Jain and Jain [1] written in 1981. This paper does not discuss logsearch at a subpixel level but the implementation of this using logsearch is straight forward in that only the resolution of the motion vectors will be set to a value below 1 pixel.

C. Conjugate Direction Search

Conjugate direction search as described by Srinivasan and Rao [2] is based on the assumption that the goal function to be minimized (the prediction error variance) behaves quadratic close to an optimum.
In each step the algorithm tries to minimize the goal function along one direction from a starting point. Using a given step length it searches along that direction until a minimum is found. Then a search in another direction is performed. More specific, the algorithm always keeps track of two directions, say D and E, initially set to vertical resp. horizontal directions. First a search along E is performed. From that optimal point a search along D is performed. From the direction of the total movement of the search point during these two searches a new search direction C is found. Another search along C is then performed, and after that E is replaced by D and D by C and the algorithm starts all over again with the new set of directions. When the current search point is minimal in both the D and the E direction the algorithm terminates. After some steps this algorithm will produce non-integer search points. Since we only allow integer motion vectors (or half/quarter of an integer in sub-pixel resolution) these points have to be rounded of to the nearest integer value (or half/quarter of an integer).

4. Description of Test Data

To test the algorithms at hand 3 sets of test data were used. These were empirically selected in an effort to represent a wide variety of sequences while minimizing the number of data sets. The test sequences consisted of 6 qcif frames each which enables 5 different frames to be subject of the block matching techniques. Only the luminance part of the frames were used in the evaluation of the block matching techniques. This makes a data set to be 6 frame each of 176x144 pixels quantized to 8 bit values. These sets of frames will from here on be referred to as Suzie1, Suzie2 and Foreman.

Suzie1

This set of frames are taken from the suzie qcif animation frames 47 to 52. The sequence shows a woman that faces the camera while speaking on the telephone. In the frames selected for suzie1 the woman is just about to change position and the frames show her head going rapidly backwards while the she is lowering her telephone.

The circles in the plot above marks the center of a block in the current frame and the lines goes out from the center of a circle to the position of where this block was located in the previous frame. This sequence mainly has two moving parts that move in different directions. These are the head that goes to the right while turning and the telephone that goes down. The frames also consist of a rather large portion that is, except from some noise, constant from frame to frame. A typical motion between frames in this sequence is 5 pixels for the telephone and 0 for the background.

Suzie2

This set of frames are also taken from the suzie qcif animation. Suzie2 consist of frames 125 to 130. In these frames the woman is simply listening to the phone and the frames contains virtually no motion at all.

The only movement in these frames are small movements of the eyes and a gentle, nearly undetectable, movement of the head. This sequence of frames also consist of large parts of constant background. Motion vectors in this sequence are typically 0.

Foreman

The foreman data set are the frames 308 to 313 of the foreman qcif movie. These frames show a pan of the camera across a construction yard. The sequence consist of large diagonal motion. The frames, in contrast to the other sets, also have a fair bit of motion blur to them.

Large portions of new image is presented in the lover and right edges of each frame. The size of the motion between frames is somewhere between 8 and 9 pixels. All motion of this sequence is translatory.

5. The distortion function

In the implementation of the block matching algorithms mean square error was used as a measurement of distortion between blocks. More precisely a function D(x,y) which returned the sum of the squared differences between the pixels divided by the number of pixels in the current block at coordinates (i,j) and the block at position (i+x,j+y) in the previous frame was used. In order to be able to handle sub-pel accuracy in the motion, non integer values of x and y, bilinear interpolation was used to get pixel values of the previous frame at non integer coordinates.

The distortion function D(x,y) proved to be very different depending on what block was being used. Ideally one would wish that D(x,y) would prove to be an convex function with one single minimum corresponding to the motion of the block. This is something that both the papers by Jain and Jain [1] and Srinivasan [2] claim to be a good approximation of the real case and it is somewhat crucial in order for the search algorithms to work efficiently.

It was however found in this evaluation of search methods that this assumption about the distortion function is valid only for smooth sequences with very limited motion. In sequences containing higher frequency components and where the motion is larger this model of the distortion function is not a very good one. In order to illustrate this point two block from different sequences and the corresponding distortion function is shown below.

As noted above the distortion of the block in the higher motion sequence Foreman which also features higher frequency components (trees) does is not have a single minimum. It is however true that the global minimum at (x,y) = (7,4.5) correspond to the true motion. In the block taken from the Suzie1 sequence the model is better but by no means perfect.

An artifact that can be seen in both plots of the distortion is a set of small bumps in the surface. The tops of these bumps are located at integer coordinates. This is a result of filtering due to the bilinear interpolation. At integer positions the interpolation results in the value of the pixel at that position while at the non integer positions the value from the interpolation is a weighted average of 4 neighboring pixels. This has a low pass filtering effect on the image and consequently reduce the noise. As an effect of this the distortion of the block match corresponding to this position is lower. This effect has to be kept in mind if one is to use a search strategy based on the gradient of the distortion function at sub-pel accuracy.

6. Algorithm Performance

We have compared the distortion and the execution speed of the three algorithms using the three test sequences, and both 8x8 blocks and 16x16 blocks. For the distortion comparison the PSNR (Peak Signal To Noise Ratio) for the predicted images has been used.
For the speed comparison we have not really measured the execution time, since it's very implementation dependent (Matlab favors algorithms that can be implemented using matrix operations instead of loops). Instead the average number of block comparisons performed in the search for a given motion vector has been used. This is where a good non-matlab implementation would spend most of it's time, so ignoring the execution time of the other code doesn't matter that much.
The following plot shows the comparison for Foreman using 8x8 blocks.

One of the problems with 2D logarithmic search and conjugate direction search is that they tend to behave very bad if we start the search far away from the actual optimum. This is very clear in the Foreman sequence, where the optimum motion vector is 8-9 pixels away if we start with a zero motion vector. To reduce the impact of this problem, we have also implemented both algorithms "with memory", where starting point is chosen as either zero, or the motion vector from the block above or the motion vector from the block to the left, whichever gives the minimal distortion.
To see the impact of this change look at the following two images, where the motion vectors found by conjugate direction with, and without this change is shown (8x8 blocks, quarter-pel). Most motion vectors, except those in the upper left corner which are evaluated first, becomes much better, since we use the information provided by the previous evaluated motion vectors.

On the Foreman sequence using 8x8 blocks, conjugate direction search with memory outperforms the other algorithms (except, of course, full search). On the Suzie1 sequence, with motion only in parts of the image, all algorithms except conjugate directions search without memory behaves similarly.

On the Suzie2 sequence, which contains very little motion, the differences are (as expected) very small.

When it comes to speed the result is the same for all sequences. For any given PSNR value, conjugate direction search with memory has the lowest number of average block comparisons per motion vector. The use of 16x16 blocks instead, increases the distortion quite much. The PSNR values drops by 3-5 dB for most algorithms and test sequences. See the following diagrams with the distortion comparison of 16x16 blocks:

Suzie1
Suzie2
Foreman

7. Conclusions

Given a sequence that contains relatively small motion and with and few high frequency detail the search algorithms algorithms work reasonably well. Given a sequence with hight motion an a lot of detail the results are far from perfect. The main problem is the apparent difficulty of these algorithms to find a minimum that is more than an about block size away. If however these algorithms (and conjugate directions in particular) are given the extra information of a good guess as to where the minimum might be, in our case the use of previously calculated motion vectors, the algorithms do perform remarkably well. Especially considering that for the maximum search length of 18 pixels used in these measurements the number of block comparisons for conjugate direction versus full search are cut by a factor of about 100 for integer-pel to 500 for quarter-pel. And this with an decrease in prediction PSNR by only about 1 to 2 dB as compared to full search.

With the speedups available for the full search block these algorithms are probably not that useful if there is time to compress the sequence of frames and store the compressed data prior to transmission but in a software based encoder compressing in real time these algorithms might prove to be very useful indeed.

8. References

[1] Jain J.R., Jain A.K., Displacement Measurement and Its Application in Interframe Image Coding , IEEE Transactions on Communications, Vol. 29, No. 12, December 1981 <0br>
[2] Srinivasan R., Rao K.R., Predictive Coding Based on Efficient Motion Estimation, IEEE Transactions on Communications, Vol. 33, No. 8, August 1985