The outcome of a random variable can be regarded as carrying a certain amount of information, quantified as the logarithm of the reciprocal of the probability of that outcome, log(1/p) [1]. With 2 as the base of the logarithm, the unit of information is the bit.

 For a given random variable X, the entropy H(X) is defined as the expectation of this information,

$$H(X) = -\sum_{x} p(X=x)\,\log_2 p(X=x) \qquad (1)$$
 where p(X=x) denotes the probability that the random variable X takes the value x.
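
As a minimal sketch of this definition, the entropy of a discrete distribution can be computed directly from its list of probabilities. The function name `entropy_bits` and the example distributions are illustrative assumptions, not part of the original work.

```python
import math

def entropy_bits(probabilities):
    """Expected information, in bits, of a discrete distribution."""
    # Each outcome contributes p * log2(1/p); outcomes with p = 0 are
    # skipped because p * log2(1/p) tends to 0 as p tends to 0.
    return sum(p * math.log2(1.0 / p) for p in probabilities if p > 0)

# Hypothetical example distributions.
fair_coin = entropy_bits([0.5, 0.5])
biased_coin = entropy_bits([0.9, 0.1])
uniform_four = entropy_bits([0.25] * 4)
print(fair_coin, biased_coin, uniform_four)
```

The biased coin is more predictable than the fair one, so its entropy is lower.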

 Similarly, the joint entropy H(X,Y) and the conditional entropy H(X|Y) are defined as follows [2],

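$$H(X,Y) = -\sum_{x}\sum_{y} p(X=x,Y=y)\,\log_2 p(X=x,Y=y) \qquad (2)$$
$$H(X|Y) = -\sum_{x}\sum_{y} p(X=x,Y=y)\,\log_2 p(X=x|Y=y) \qquad (3)$$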

 and these two concepts are related by

$$H(X,Y) = H(Y) + H(X|Y) \qquad (4)$$

 Mutual information I(X,Y) is defined in terms of entropies,

$$I(X,Y) = H(X) + H(Y) - H(X,Y) \qquad (5)$$

 and can be rewritten using conditional entropy,

$$I(X,Y) = H(X) - H(X|Y) \qquad (6)$$
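
The two forms of mutual information can be checked numerically. The sketch below assumes the joint distribution is given as a dictionary mapping (x, y) pairs to probabilities; the function names and the two example distributions are hypothetical. It evaluates I(X,Y) once as H(X) + H(Y) - H(X,Y) and once as H(X) - H(X|Y).

```python
import math
from collections import defaultdict

def entropy_bits(probabilities):
    """Entropy in bits of a collection of probabilities."""
    return sum(p * math.log2(1.0 / p) for p in probabilities if p > 0)

def mutual_information_bits(joint):
    """joint maps (x, y) -> p(x, y). Returns I(X,Y) computed two ways."""
    p_x = defaultdict(float)
    p_y = defaultdict(float)
    for (x, y), p in joint.items():
        p_x[x] += p  # marginal of X
        p_y[y] += p  # marginal of Y
    h_x = entropy_bits(p_x.values())
    h_y = entropy_bits(p_y.values())
    h_xy = entropy_bits(joint.values())
    # H(X|Y) = sum p(x,y) * log2(1 / p(x|y)), with p(x|y) = p(x,y) / p(y)
    h_x_given_y = sum(p * math.log2(p_y[y] / p)
                      for (x, y), p in joint.items() if p > 0)
    return h_x + h_y - h_xy, h_x - h_x_given_y

# Hypothetical examples: independent variables, and Y identical to X.
independent = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
identical = {(0, 0): 0.5, (1, 1): 0.5}
mi_independent = mutual_information_bits(independent)
mi_identical = mutual_information_bits(identical)
print(mi_independent, mi_identical)
```

For independent variables, knowing Y does not reduce the uncertainty of X, so the mutual information is zero. When Y determines X completely, the mutual information equals H(X).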

 The more ordered a random variable is, the more predictable its outcomes are, so each outcome provides less information on average. Entropy, the expected information, can therefore be interpreted as a measure of uncertainty, and with the above definition mutual information is the reduction in the uncertainty of X due to knowledge of Y.

 Pixel values can be treated as outcomes of a random variable. A pixel whose value is common in the image carries little information, whereas a pixel with a rare value carries a lot.

 While the entropy of each image is fixed, the joint entropy and mutual information of two images change with the geometric alignment, because each alignment changes the one-to-one correspondence between their pixels. Mutual information is maximized at the geometric relationship under which one image explains the other most effectively. In other words, maximizing mutual information registers the images.

 The joint entropy of two images is calculated as follows. Fig. 1 and Fig. 2 show an image and its pixel values. Fig. 3 and Fig. 4 show the same image translated (misaligned) by one pixel. From Fig. 2 and Fig. 4, the joint outcomes for the 1-pixel misalignment and for exact alignment are obtained (Fig. 5 and Fig. 6). Fig. 7 and Fig. 8 are the joint histograms of Fig. 5 and Fig. 6. The height of a joint histogram bin, which is the number of occurrences of an outcome, is proportional to the probability of that outcome. Using the above definitions, the joint entropy is calculated to be 3.00 bits for Fig. 7 and 2.06 bits for Fig. 8.
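
The joint-histogram procedure can be sketched as follows. The 4x4 image below is a hypothetical toy example, not the pixel values of Fig. 2 and Fig. 4, and the one-pixel shift wraps around at the border, which is an assumption made only to keep the two images the same size.

```python
import math
from collections import Counter

def joint_entropy_bits(image_a, image_b):
    """Joint entropy from the joint histogram of two equally sized images
    (lists of rows). Pixels at the same position form one joint outcome."""
    pairs = [(a, b)
             for row_a, row_b in zip(image_a, image_b)
             for a, b in zip(row_a, row_b)]
    histogram = Counter(pairs)   # number of occurrences of each outcome
    total = len(pairs)
    # Bin height divided by the pixel count estimates the probability.
    return sum((n / total) * math.log2(total / n) for n in histogram.values())

# Hypothetical toy image with four uniform 2x2 blocks.
image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [2, 2, 3, 3],
         [2, 2, 3, 3]]
# One-pixel shift to the right, wrapping around at the border (assumption).
shifted = [row[-1:] + row[:-1] for row in image]

aligned_entropy = joint_entropy_bits(image, image)
misaligned_entropy = joint_entropy_bits(image, shifted)
print(aligned_entropy, misaligned_entropy)
```

For this toy image the aligned pair has four equally likely joint outcomes (2 bits), while the misaligned pair spreads the same pixels over eight joint outcomes (3 bits). The misalignment disperses the joint histogram and raises the joint entropy.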

 Fig. 9 shows the joint entropy as a function of misalignment for the 2D display of a pelvic CT image (Fig. 10) paired with itself. At the extremes, when the image is shifted 60 pixels to the left (Fig. 11) or right (Fig. 12), its joint entropy with the original is 9.2. For perfect alignment the joint entropy is 5.4. The full width at half minimum is 1.2 pixels. Exact alignment thus defines a sharp, narrow minimum of joint entropy, with a well-defined gradient only in the proximity of the alignment, which makes the minimum difficult to find. I used the MATLAB routine 'fminsearch' to find the maximum of mutual information; it succeeded only when the initial point was near the alignment (within ±5 pixels of center). A more efficient search algorithm is needed to implement image registration by maximization of mutual information.
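
To illustrate how narrow the extremum can be, the sketch below scans integer horizontal shifts exhaustively and evaluates mutual information over the overlapping columns. This is not the 'fminsearch' procedure used above; it is a brute-force scan under stated assumptions: the data are two windows cut from a hypothetical random scene (not the CT image), the shift is horizontal only, and all names are illustrative.

```python
import math
import random
from collections import Counter

def entropy_bits(samples):
    """Entropy in bits of the empirical distribution of a list of outcomes."""
    counts = Counter(samples)
    total = len(samples)
    return sum((n / total) * math.log2(total / n) for n in counts.values())

def mutual_information_at_shift(fixed, moving, shift):
    """I = H(A) + H(B) - H(A,B) over the columns that overlap when
    `moving` is displaced `shift` pixels to the right."""
    width = len(fixed[0])
    a, b = [], []
    for row_f, row_m in zip(fixed, moving):
        for col in range(max(0, shift), min(width, width + shift)):
            a.append(row_f[col])
            b.append(row_m[col - shift])
    return entropy_bits(a) + entropy_bits(b) - entropy_bits(list(zip(a, b)))

# Hypothetical test data: two 32-column windows cut from one random scene,
# offset by 3 columns, so displacing `moving` 3 pixels right realigns it.
rng = random.Random(0)
scene = [[rng.randrange(8) for _ in range(48)] for _ in range(24)]
true_shift = 3
fixed = [row[8:40] for row in scene]
moving = [row[8 + true_shift:40 + true_shift] for row in scene]

# Exhaustive scan over candidate shifts instead of a local optimizer.
scores = {s: mutual_information_at_shift(fixed, moving, s) for s in range(-8, 9)}
best_shift = max(scores, key=scores.get)
print(best_shift)
```

Because the pixels of this synthetic scene are uncorrelated, mutual information stays near zero at every shift except the correct one. A local optimizer started elsewhere has no slope to follow, which is an exaggerated version of the difficulty described above.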

 As discussed above, mutual information is a function of the transformation between the images. A 3D rigid transformation is specified by 6 variables (3 for translation, 3 for rotation). An algorithm that finds the maximum of mutual information as a function of these 6 independent variables can therefore achieve 3D (volumetric) registration.
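
The 6-variable parameterization can be sketched as a 4x4 homogeneous matrix. The rotation order (about x, then y, then z), the use of radians, and the function names are assumptions made for illustration; the text does not specify a convention. A registration routine would resample one volume with this matrix and evaluate mutual information for each candidate parameter vector; the resampling and the optimizer are omitted here.

```python
import math

def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def rigid_transform(tx, ty, tz, rx, ry, rz):
    """4x4 matrix from 3 translations and 3 rotation angles (radians).
    Assumed convention: R = Rz * Ry * Rx, followed by the translation."""
    cx, sx = math.cos(rx), math.sin(rx)
    cy, sy = math.cos(ry), math.sin(ry)
    cz, sz = math.cos(rz), math.sin(rz)
    rot_x = [[1.0, 0.0, 0.0], [0.0, cx, -sx], [0.0, sx, cx]]
    rot_y = [[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]]
    rot_z = [[cz, -sz, 0.0], [sz, cz, 0.0], [0.0, 0.0, 1.0]]
    r = matmul(rot_z, matmul(rot_y, rot_x))
    return [r[0] + [tx], r[1] + [ty], r[2] + [tz], [0.0, 0.0, 0.0, 1.0]]

def apply_transform(matrix, point):
    """Map a 3D point (x, y, z) through the homogeneous matrix."""
    x, y, z = point
    return tuple(row[0] * x + row[1] * y + row[2] * z + row[3]
                 for row in matrix[:3])

# All six parameters at zero leave a point unchanged.
unchanged = apply_transform(rigid_transform(0, 0, 0, 0, 0, 0), (1.0, 2.0, 3.0))
# A quarter turn about z followed by a translation of (1, 2, 3).
moved = apply_transform(rigid_transform(1, 2, 3, 0, 0, math.pi / 2), (1.0, 0.0, 0.0))
print(unchanged, moved)
```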

References:
[1] S. Haykin (1994), Communication Systems, 3rd ed., Chap. 10, John Wiley & Sons, Inc.
[2] A. Papoulis (1991), Probability, Random Variables, and Stochastic Processes, 2nd ed., Chap. 15, McGraw-Hill