We haven't yet told you how the high-level concepts themselves are formed. To do this, we first need to understand how a cortical region can make predictions. Think of a region of cortex way down in the hierarchy, for example V1. The neurons in this region have a very small spatial extent: they see the world only through a narrow window. For example, this region would see a line segment moving through its visual field, but it doesn't know that the line segment actually belongs to a square.
Now suppose that this region learns all the most likely sequences that occur in its input. (Clearly, I am skipping some things here; in particular, I am assuming that the input is already quantized.)
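To make "learning the most likely sequences" a little more concrete, here is a minimal sketch in Python. It is only an illustration, not a model of what cortex actually does: it assumes the input has already been quantized into a small set of symbols, it simply counts fixed-length windows of those symbols, and it predicts the next symbol from the counts. The function names (`learn_sequences`, `predict_next`), the window length, and the symbol names are all hypothetical choices made for this example.

```python
from collections import Counter


def learn_sequences(stream, length=3):
    """Count every contiguous window of `length` symbols in the stream."""
    counts = Counter()
    for i in range(len(stream) - length + 1):
        counts[tuple(stream[i:i + length])] += 1
    return counts


def predict_next(counts, context):
    """Return the symbol that most often followed `context`, or None if unseen."""
    context = tuple(context)
    followers = Counter()
    for seq, n in counts.items():
        # A learned sequence supports a prediction when all but its last
        # symbol match the context we have just observed.
        if seq[:-1] == context:
            followers[seq[-1]] += n
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Hypothetical quantized input: where a line segment sits in the region's
# narrow window at successive time steps (an assumption for illustration).
stream = ["left", "center", "right",
          "left", "center", "right",
          "left", "center", "up"]

counts = learn_sequences(stream, length=3)
prediction = predict_next(counts, ["left", "center"])
print(prediction)
```

Here the region has seen "left, center" followed by "right" twice and by "up" once, so it predicts "right". A context it has never seen yields no prediction at all, which is one of the things this simple counting scheme glosses over.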