Error Handling

Lecture Notes for CS 190
Spring 2015
John Ousterhout

  • Errors and exceptions come from many sources:
    • From above:
      • Bad input from user or client
      • Misconfiguration and operator errors
    • From below: underlying system facilities don't work as desired:
      • Disk I/O error
      • Can't open file (wrong permissions, missing directory, etc.)
      • Out of memory
      • Network socket already in use
    • From peers in a distributed system:
      • Server crashes
      • Slow communication
      • Lost network packets
    • From ourselves: internal bugs
  • Errors and exceptions are a major source of complexity and bugs
    • They account for a lot of code in large systems
    • They disrupt normal code flow:
      • They happen in the middle of other activities
      • Something didn't work like you expected
    • Hard to figure out how to handle them
      • May not be able to complete work in progress
    • Language support is clunky
      • Verbose
      • Makes code hard to read
    • Hard to test
    • Don't occur very often in running applications
      • So the handling code is rarely exercised and may not work when needed
  • Programmers often make the exception problem worse:
    • Defensive programming: throw exceptions for anything that looks a tiny bit suspicious. More errors are better?
    • Expediency: rather than figure out how to solve a problem, just throw an exception, punt it to the next level
    • Result: even more exceptions, many of which no one really knows how to handle.
  • Key idea: reduce the number of exceptions that must be handled. Specific techniques:
    • Whenever possible, define errors out of existence:
      • Deleting variables in Tcl
      • File deletion in Windows
      • Bounds checks in Java substring method
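One way to picture defining errors out of existence is to specify each operation so that the awkward case becomes part of its normal behavior. The sketch below is illustrative Python, not the Tcl, Windows, or Java APIs named above; `substring` and `unset` are hypothetical helpers invented for this example.

```python
def substring(text: str, begin: int, end: int) -> str:
    """Return the characters of text at positions [begin, end).

    Out-of-range or inverted indices are not errors: the result is the
    overlap between the requested range and the string, possibly empty.
    """
    begin = max(0, min(begin, len(text)))
    end = max(begin, min(end, len(text)))
    return text[begin:end]


def unset(variables: dict, name: str) -> None:
    """Ensure that name is not defined; a no-op if it never was."""
    variables.pop(name, None)
```

Because neither function can fail on these inputs, callers need no bounds checks or existence checks of their own, and there is no exception for them to handle.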
    • Mask errors (recover automatically so the error doesn't have to be reported)
      • E.g., if a server crashes, automatically fetch data from a backup server
      • Or, if a server crashes, wait until it restarts
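Masking can be sketched as a read that quietly moves on to a backup when a server is unreachable. This is a minimal sketch under stated assumptions: each replica is modeled as a hypothetical callable that returns the data or raises `ConnectionError`, and `fetch_with_failover` is a name made up for this example.

```python
from typing import Callable, Optional, Sequence


def fetch_with_failover(key: str,
                        replicas: Sequence[Callable[[str], bytes]]) -> bytes:
    """Return the value for key from the first replica that answers."""
    last_error: Optional[Exception] = None
    for read in replicas:
        try:
            return read(key)
        except ConnectionError as error:
            # Masked: the caller never sees a single-server failure.
            last_error = error
    # Only when every replica has failed does an error reach the caller.
    raise ConnectionError(
        f"all {len(replicas)} replicas failed for {key!r}") from last_error
```

The caller sees an error only when no recovery is possible, so the common failure (one crashed server) needs no handling code above this function.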
    • Collapse errors (handle several different cases with the same code)
      • Promote one error to another (in RAMCloud, many errors get promoted to "server crash").
        • Reuse existing handler
        • May not work for exceptions that happen frequently
      • Defer reporting to a place where other exceptions could already happen.
        • Example: in RAMCloud, report RPC errors only on wait, not send.
      • Advantages of collapsing:
        • Simplifies code (fewer cases).
        • Remaining handlers get invoked more often (will get debugged).
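Both forms of collapsing can be combined in one small sketch: several low-level failures are promoted to a single error, and reporting is deferred from send to wait. This is not RAMCloud's actual API; `Rpc`, `ServerCrashed`, and the `transport` callable are hypothetical names, and the transport is synchronous only to keep the example short.

```python
class ServerCrashed(Exception):
    """The single error callers must handle for any failed RPC."""


class Rpc:
    """A request whose failures are reported only by wait()."""

    def __init__(self, transport, request):
        self._reply = None
        self._error = None
        try:
            # Hypothetical transport: a callable that sends the request and
            # returns the reply. A real one would be asynchronous.
            self._reply = transport(request)
        except OSError as error:
            # In Python 3, OSError covers TimeoutError and ConnectionError,
            # so many distinct failures are recorded the same way here.
            self._error = error

    def wait(self):
        """Return the reply, or raise ServerCrashed for any recorded failure."""
        if self._error is not None:
            raise ServerCrashed(
                f"server presumed crashed: {self._error!r}") from self._error
        return self._reply
```

Callers write one handler, around `wait()`, instead of separate handlers for timeouts and connection failures at both send and wait.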
    • Just panic (crash app)?
      • Appropriate if there is no viable way to handle the error
      • Example: running out of memory in malloc
      • Or, throw an error, which isn't handled except at the very top level.
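The second option, an error that nothing handles except the very top level, might look like the following sketch. `run` is a hypothetical wrapper for this example; intermediate code contains no handlers at all.

```python
import logging


def run(main) -> int:
    """Run main() and turn any uncaught exception into a logged exit status."""
    try:
        main()
        return 0
    except Exception:
        # The only handler in the program: record the traceback and give up.
        logging.exception("fatal error, shutting down")
        return 1


# At program start, assuming the application defines main():
#     sys.exit(run(main))
```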
  • Before throwing an error, think about how the caller will handle it.
    • If you can't visualize how the caller will handle it, rethink the error
    • Choose between throwing an error and returning a value.
      • If the caller will almost always care about the error, might as well report it with a return value.
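The choice between a return value and an exception can be illustrated with a lookup where a missing entry is routine. A minimal sketch, assuming a hypothetical `lookup_port` helper and a configuration held as a mapping of strings:

```python
from typing import Mapping, Optional


def lookup_port(config: Mapping[str, str], service: str) -> Optional[int]:
    """Return the configured port for service, or None if it has no entry.

    A missing entry is something nearly every caller must deal with, so it
    is reported through the return value. A malformed entry is rare and the
    caller can do little about it, so int() is allowed to raise ValueError.
    """
    raw = config.get(service)
    if raw is None:
        return None
    return int(raw)
```

A caller handles the common case with an ordinary `if port is None:` test and needs no try block unless it specifically wants to deal with malformed entries.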
  • Make lots of information available after errors:
    • Include in the exception message?
    • Or, output to system log
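Both options can be used together, as in the sketch below: the exception message carries the details the caller might show to a user, and the same details go to the log. `read_block`, `ReadFailed`, and the logger name `storage` are hypothetical names chosen for this example.

```python
import logging

log = logging.getLogger("storage")


class ReadFailed(Exception):
    """Raised when a block cannot be read; the message says which one and why."""


def read_block(path: str, offset: int, length: int) -> bytes:
    """Read length bytes starting at offset in the file at path."""
    try:
        with open(path, "rb") as f:
            f.seek(offset)
            return f.read(length)
    except OSError as error:
        # Record everything known at the point of failure in the log...
        log.error("read failed: path=%s offset=%d length=%d errno=%s",
                  path, offset, length, error.errno)
        # ...and put the same details in the exception message, keeping the
        # original error attached as the cause.
        raise ReadFailed(
            f"cannot read {length} bytes at offset {offset} of {path}: {error}"
        ) from error
```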