It's easy to see why considering time is important for solving the invariance problem. Consider this: there are countless possible transformations in the world. Why did the visual system pick scaling, rotation and translation as the transformations it would be invariant to? Clearly this has to do with the fact that the brain has to deal with motion in the world: these are the changes that the image of an object undergoes as the object or the observer moves.
Suppose now that you show a brain image after image after image, all independent of one another. That is, first you show a banana, then an apple, then a car, then a face, then another banana, and so on. And you don't tell the system anything beyond showing it these images. There is no reason why such a system would learn to associate bananas with each other or cars with each other, or to perform the sort of categorization that we humans perform so easily. Clearly, this is not the way the cortex learns to recognize images.
When a baby looks at things, he or she sees continuous motion, not independent snapshots. I believe that temporal context is the cue that the cortex uses to form invariant representations of the sort we consider here. We elaborate our approach in the following few sections.