File System Crash Recovery

Lecture Notes for CS 140
Winter 2012
John Ousterhout

  • Readings for this topic from Operating System Concepts: Section 11.7
  • The problem: crashes can happen anywhere, even in the middle of operations:
    • Lost data: information cached in main memory may not have been written to disk yet.
      • E.g. original Unix: up to 30 seconds' worth of changes could be lost
    • Inconsistency:
      • If an operation modifies multiple blocks, a crash could occur after some of the blocks have been written to disk but not others.
      • Examples:
        • Adding block to file: free list was updated to indicate block in use, but file descriptor wasn't yet written to point to block (see the sketch below).
        • Creating link to a file: new directory entry refers to file descriptor, but reference count wasn't updated in file descriptor.
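  • To make the crash window concrete, here is a minimal, self-contained C sketch of the add-block example (not from the original notes; the layout, block numbers, and write_block helper are hypothetical, and an in-memory array stands in for the disk):

      #include <stdio.h>
      #include <string.h>

      #define BLOCK_SIZE 512
      #define NUM_BLOCKS 64

      /* Hypothetical layout: block 0 holds the free-list bitmap,
         block 1 holds a file descriptor. */
      static unsigned char disk[NUM_BLOCKS][BLOCK_SIZE];

      static void write_block(int n, const void *data, size_t len) {
          memcpy(disk[n], data, len);   /* stands in for one disk write */
      }

      int main(void) {
          unsigned char free_bitmap[BLOCK_SIZE] = {0};
          int fd_blocks[BLOCK_SIZE / sizeof(int)] = {0};
          int b = 42;                                /* newly allocated block */

          free_bitmap[b / 8] |= 1 << (b % 8);        /* mark b as in use */
          write_block(0, free_bitmap, sizeof free_bitmap);   /* write #1 */

          /* A crash here leaves block 42 marked in use but pointed to
             by no file: the free list and file descriptor disagree. */

          fd_blocks[7] = b;                          /* file now points at b */
          write_block(1, fd_blocks, sizeof fd_blocks);       /* write #2 */

          printf("both writes completed; metadata consistent again\n");
          return 0;
      }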

Approach #1: repair during reboot

  • Example: Unix fsck ("file system check")
    • During every system boot fsck is executed.
    • Checks to see if disk was shut down cleanly; if so, no more work to do.
    • If the disk didn't shut down cleanly (e.g., system crash or power failure), then scan the disk contents, identify inconsistencies, and repair them (a sketch of such a scan appears at the end of this section).
    • Example: block in file and also in free list
    • Example: reference count for a file descriptor doesn't match the number of links in directories
    • Example: block in two different files
    • Example: file descriptor has a reference count > 0 but is not referenced in any directory.
  • Limitations of fsck:
    • Restores the disk to consistency, but doesn't prevent loss of information; the repaired system could even end up unusable.
    • Security issues: a block could migrate from the password file to some other random file.
    • Can take a long time: 1.5 hours to read every block in a medium-size disk today. Can't restart system until fsck completes. As disks get larger, recovery time increases.
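  • A minimal sketch of the kind of cross-check fsck performs, assuming hypothetical in-memory copies of the free list and each file's block pointers (a real fsck reads these from disk); it detects the first and third examples above:

      #include <stdio.h>

      #define NUM_BLOCKS 64
      #define NUM_FILES 4
      #define PTRS_PER_FILE 8

      int main(void) {
          int free_list[NUM_BLOCKS] = {0};             /* 1 = block is free */
          int files[NUM_FILES][PTRS_PER_FILE] = {{0}}; /* 0 = pointer unused */
          int refs[NUM_BLOCKS] = {0};                  /* pointers to each block */

          /* Inject two inconsistencies for demonstration. */
          files[0][0] = 10; free_list[10] = 1;  /* in a file AND on free list */
          files[1][0] = 20; files[2][0] = 20;   /* in two different files */

          /* Count how many file pointers reference each block. */
          for (int f = 0; f < NUM_FILES; f++)
              for (int j = 0; j < PTRS_PER_FILE; j++)
                  if (files[f][j] != 0)
                      refs[files[f][j]]++;

          for (int b = 0; b < NUM_BLOCKS; b++) {
              if (refs[b] > 0 && free_list[b])
                  printf("block %d is in a file but also on the free list\n", b);
              if (refs[b] > 1)
                  printf("block %d appears in %d different files\n", b, refs[b]);
          }
          return 0;
      }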

Approach #2: ordered writes

  • Prevent certain kinds of inconsistencies by making updates in a particular order.
    • For example, when adding a block to a file, first write back the free list so that it no longer contains the file's new block (see the sketch at the end of this section).
    • Then write the file descriptor, referring to the new block.
    • What can we say about the system state after a crash? At worst the new block is marked in use but referenced by no file: storage is leaked, but no dangerous inconsistency (such as a doubly-used block) can arise.
    • In general:
      • Never write a pointer before initializing the block it points to (e.g., indirect block).
      • Never reuse a resource before nullifying all pointers to it.
      • Never clear last pointer to a live resource before setting new pointer (e.g. mv).
  • Result: no need to wait for fsck when rebooting
    • Run fsck in background to reclaim lost storage.
  • Problem:
    • Requires lots of synchronous metadata writes, which slows down file operations.
  • Improvement:
    • Don't actually write the blocks synchronously, but record dependencies in the buffer cache.
    • For example, after adding a block to a file, add a dependency between the file descriptor block and the free list block.
      • When it's time to write the file descriptor back to disk, make sure that the free list block has been written first.
    • Tricky to get right: potentially end up with circular dependencies between blocks.
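  • A minimal sketch of the ordered-write sequence for adding a block, using POSIX pwrite/fsync on an ordinary file ("disk.img", a hypothetical stand-in for the raw disk; the offsets and helper name are illustrative). The fsync after each write is what makes the write synchronous, and also what makes this approach slow:

      #include <fcntl.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>
      #include <unistd.h>

      #define BLOCK_SIZE 512

      /* Write one block and wait until it is durable on disk. */
      static void sync_write_block(int fd, off_t blockno, const void *buf) {
          if (pwrite(fd, buf, BLOCK_SIZE, blockno * BLOCK_SIZE) != BLOCK_SIZE ||
              fsync(fd) != 0) {
              perror("sync_write_block");
              exit(1);
          }
      }

      int main(void) {
          int fd = open("disk.img", O_RDWR | O_CREAT, 0644);
          if (fd < 0) { perror("open"); return 1; }

          unsigned char free_bitmap[BLOCK_SIZE] = {0};
          unsigned char file_desc[BLOCK_SIZE] = {0};

          /* Rule: take the block off the on-disk free list *before*
             writing any pointer to it.  A crash between the two steps
             merely leaks the block; it is never doubly used. */
          free_bitmap[42 / 8] |= 1 << (42 % 8);     /* mark block 42 in use */
          sync_write_block(fd, 0, free_bitmap);     /* step 1: free list */

          memcpy(file_desc, "ptr to block 42", 16); /* stand-in pointer */
          sync_write_block(fd, 1, file_desc);       /* step 2: file descriptor */

          close(fd);
          return 0;
      }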

Approach #3: write-ahead logging

  • Also called journaling file systems
  • Similar in function to logs in database systems; allows inconsistencies to be corrected quickly during reboots
    • Before performing an operation, record information about the operation in a special append-only log file; flush this info to disk before modifying any other blocks (see the sketch at the end of this section).
    • Example: adding a block to a file
      • Log entry: "I'm about to add block 99421 to file descriptor 862 at block index 93"
    • Then the actual block updates can be carried out later.
    • If a crash occurs, replay the log to make sure all updates are completed on disk.
    • Problem: log grows over time, so recovery could be slow.
    • Solution #1: checkpoint
      • Occasionally stop and flush all dirty blocks to disk.
      • Once this is done, the log can be cleared.
    • Solution #2: keep track of which parts of the log correspond to which unwritten blocks; as blocks get written to disk, can gradually delete old portions of the log that are no longer needed.
    • Typically the log is used only for metadata (free list, file descriptors, indirect blocks), not for actual file data.
  • Logging advantages:
    • Recovery much faster.
    • Eliminate inconsistencies such as blocks confused between files.
    • Log can be localized in one area of disk, so writes are fast (no seeks).
    • Metadata writes can be delayed a long time, for better performance.
  • Logging disadvantages:
    • Synchronous disk write before every metadata operation.
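  • A minimal sketch of write-ahead logging for the example above, again using POSIX calls on an ordinary file ("journal.log") as the log; the record format and function names are hypothetical, and a real journaling file system would batch and checksum its records:

      #include <fcntl.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <unistd.h>

      /* Hypothetical record for "add block B to file descriptor F
         at block index I". */
      struct log_record {
          int fdesc;    /* file descriptor number, e.g. 862 */
          int index;    /* block index within the file, e.g. 93 */
          int block;    /* disk block being added, e.g. 99421 */
      };

      /* Append a record and force it to disk before any other block is
         modified.  This fsync is the synchronous write that journaling
         pays for each metadata operation. */
      static void log_append(int logfd, const struct log_record *rec) {
          if (write(logfd, rec, sizeof *rec) != sizeof *rec ||
              fsync(logfd) != 0) {
              perror("log_append");
              exit(1);
          }
      }

      /* After a crash, replay every record; each update must be
         idempotent, since some may already have reached disk. */
      static void log_replay(int logfd) {
          struct log_record rec;
          lseek(logfd, 0, SEEK_SET);
          while (read(logfd, &rec, sizeof rec) == (ssize_t) sizeof rec)
              printf("redo: add block %d to fdesc %d at index %d\n",
                     rec.block, rec.fdesc, rec.index);
      }

      int main(void) {
          int logfd = open("journal.log", O_RDWR | O_CREAT | O_TRUNC, 0644);
          if (logfd < 0) { perror("open"); return 1; }

          struct log_record rec = { 862, 93, 99421 };
          log_append(logfd, &rec);
          /* ... the free list and file descriptor blocks can now be
             written back lazily, in any order ... */

          log_replay(logfd);   /* what reboot-time recovery would do */

          /* Checkpoint (solution #1): once all dirty blocks have been
             flushed, the log can be cleared. */
          if (ftruncate(logfd, 0) != 0) perror("ftruncate");
          close(logfd);
          return 0;
      }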

Remaining problems

  • Can still lose recently-written data after crash
    • Solution: apps can use fsync to force data to disk.
  • Disks fail
    • One of the greatest causes of problems in large datacenters
    • Solution: replication or backup copies
  • Disk writes are not atomic:
    • If a block is being written at the time of the crash, it may be left in an inconsistent state (neither old contents nor new).
    • At the level of sectors, inconsistencies are detectable; after crash, sector will be either
      • Old contents
      • New contents
      • Unreadable trash
    • But, blocks are typically multiple sectors. After crash:
      • Sectors 0-5 of block may have new contents.
      • Sectors 6-7 of block may have old contents.
    • Example: appending to log
      • If adding new log entries to an existing log block, crash could cause old info in the block to be lost.
    • Solution:
      • Replicate log writes (if crash corrupts one of the logs, the other will still be safe).
      • Add checksums and/or versions to detect incomplete writes.
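  • A minimal sketch of checksummed log records for detecting torn writes (the record format is hypothetical, and FNV-1a is a toy stand-in for whatever checksum a real system would use, e.g. CRC32):

      #include <stdint.h>
      #include <stdio.h>

      /* Hypothetical log record that carries a checksum over its own
         payload so a torn (partially written) record is detectable. */
      struct log_record {
          uint32_t payload[3];   /* e.g. fdesc, index, block */
          uint32_t checksum;     /* checksum over payload */
      };

      /* FNV-1a hash: a simple stand-in for a real checksum. */
      static uint32_t fnv1a(const void *data, size_t len) {
          const unsigned char *p = data;
          uint32_t h = 2166136261u;
          while (len--) { h ^= *p++; h *= 16777619u; }
          return h;
      }

      int main(void) {
          struct log_record rec = { { 862, 93, 99421 }, 0 };
          rec.checksum = fnv1a(rec.payload, sizeof rec.payload);

          /* Simulate a torn write: only some sectors of the record
             reached disk before the crash. */
          struct log_record on_disk = rec;
          on_disk.payload[2] = 0;   /* this part was never written */

          if (fnv1a(on_disk.payload, sizeof on_disk.payload) != on_disk.checksum)
              printf("torn write detected: stop replay at this record\n");
          else
              printf("record intact: safe to replay\n");
          return 0;
      }
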
  • Conclusions:
    • Must decide what kinds of failures you want to recover from.
    • To get highest performance, must give up some crash recovery capability.