Time to Chase Computer Bugs
Isn't annoying when a computer program, like a word processor, suddenly crashes? Due to their complexity, more or less all computer programs contain errors or "bugs". Some bugs are harder to find than others, especially if they only pop-up on rare occasions. Although annoying, a bug might not be such a big issue in a word processing program. But in cases like medical equipment and banking software, you probably do not want any of these "once-in-a-while" bugs. My research is about making sure no such hard-to-find bugs are left undetected in computer programs.
As you might have guessed, I work in the field of computer science. A computer by itself is nothing but a stupid machine performing calculations on numbers. To be able to do anything useful, a programmer must tell the computer exactly what the numbers mean and what to do with them. I don't write these kind of programs. Instead I'm working on a special program to analyze other programs. My goal is to make sure computer programs are free of bugs before people start using them.
If your word processor suddenly crashes, it's obviously an error. But why did it crash? If you can cause a second crash by repeating the same steps that caused the first crash, you are starting to narrow in on the bug. But what if you can't make it crash again, no matter how many times you try? You know a bug is in there, but you can't get it out into the light again. This is what I call a "once-in-a-while" bug, and these are the specific ones I study.
In the old days, computer programs could only do one thing at the time. There was only one thread of "thought" or computation, pretty much as fun as juggling with only one ball. To keep more balls in the air, programmers came up with ways of having multiple threads of "thought" or computation going on at the same time. As an example, think of a big jigsaw puzzle. If you try to complete the puzzle on your own, using one thread of thought, it will take quit a while to finish. A different solution is to randomly make four piles of pieces, keep one for yourself and give the rest to three of your friends. Using four threads of thought, you are likely to finish much faster this time. However, from time to time, team members need to exchange pieces with each other before the final assembly. This is called communication between threads.
What if suddenly, every person needs a piece from someone else, but no one is willing to give a piece away before he or she gets a new piece first? Clearly the puzzle will never be finished and you end up sitting staring stubbornly at each other forever. This is called a dead lock. The reason for the dead lock is that all participants, at the same time, suddenly need a piece from someone else. The dead lock would never have happened if only one person had been a little slower or made one different choice during the process of solving the puzzle. The lesson learned is that although the problem can be solved faster using multiple threads, communication between threads makes hard-to-find "once-in-a-while" bugs possible.
When multiple threads are working together, the huge number of possible and varied timing of communications between threads makes the bug hunt complicated. This also explains the "once-in-a-while" nature of dead locks and other similar bugs common in multi-threaded programs. Simply running the same program over and over again may not be enough to find such a bug. Instead of waiting for the unusual occasion when a bug pops up, why not make the unusual happen more often? This is exactly what my research is about.
I try to build a sort of time-bending machine. When chasing after "once-in-a-while" bugs, the machine first tries to find out where in the program different threads communicate. Next, the machine runs the programs a few times, each time bending the relative timing between threads in a clever way. If the time-bending is clever enough, the machine should find the bug without trying all possible ways of running the program (all different ways you and your friends can solve the jig-saw puzzle).
The machine itself is a computer program running other computer programs.
Hence, my work is a combination of computer programming and computer science
research. At best, a buggy program is annoying. However, if the program
controls bank accounts, pace-makers, or nuclear power plants, the results
of a bug can be disastrous. Hopefully, my work will make future computer
programs more reliable.
|Modified 15 January 2003 * Contact Us|