File System Crash Recovery

Optional readings for this topic from Operating Systems: Principles and Practice: Chapter 14 up through Section 14.1.

The problem: crashes can happen anywhere, even in the middle of critical sections:

  • Lost data: information cached in main memory may not have been written to disk yet.
    • E.g., original Unix: up to 30 seconds' worth of changes could be lost.
  • Inconsistency:
    • If a modification affects multiple blocks, a crash could occur when some of the blocks have been written to disk but not the others.
    • Examples:
      • Adding a block to a file: the free list was updated to show the block in use, but the inode wasn't yet written to point to the block.
      • Creating a link to a file: the new directory entry refers to the inode, but the inode's reference count wasn't yet updated.
    • Worse, the block cache may reorder writes, so blocks need not reach disk in the order they were modified.
  • Ideally, we'd like atomicity: a multi-block operation happens either in its entirety or not at all.

Approach #1: check consistency during reboot, repair problems

Example: Unix fsck ("file system check")

  • fsck runs during every system boot.
  • Checks to see if system was shut down cleanly; if so, no more work to do.
  • If system didn't shut down cleanly (e.g., system crash, power failure, etc.), then scan disk contents, identify inconsistencies, repair them.
  • Example: a block appears both in a file and on the free list.
  • Example: an inode's reference count doesn't match the number of directory entries that link to it (this check is sketched below).
  • Example: a block appears in two different files.
  • Example: an inode has a reference count > 0 but isn't referenced by any directory.
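
To make the reference-count check concrete, here is a minimal sketch in C, using simplified in-memory stand-ins for the inode table and directory entries (all structures and values are illustrative, not real fsck code):

    /* Sketch of one fsck-style check: recompute each inode's reference
     * count by scanning directory entries and compare it with the count
     * stored in the inode. */
    #include <stdio.h>

    #define NUM_INODES 16

    struct inode { int link_count; };    /* count stored in the inode */
    struct dirent { int inum; };         /* directory entry -> inode number */

    int main(void) {
        struct inode inodes[NUM_INODES] = { [1] = {2}, [2] = {1}, [3] = {1} };
        struct dirent entries[] = { {1}, {1}, {2} };   /* inode 3 is orphaned */
        int found[NUM_INODES] = {0};

        /* Pass 1: count directory references to each inode. */
        for (size_t i = 0; i < sizeof entries / sizeof entries[0]; i++)
            found[entries[i].inum]++;

        /* Pass 2: report mismatches; a real fsck would repair them,
         * e.g., by fixing the stored count or moving orphans to lost+found. */
        for (int i = 0; i < NUM_INODES; i++)
            if (inodes[i].link_count != found[i])
                printf("inode %d: stored count %d, actual %d\n",
                       i, inodes[i].link_count, found[i]);
        return 0;
    }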

Limitations of fsck:

  • Restores the disk to a consistent state, but doesn't prevent loss of information; the repaired system could even end up unusable.
  • Security issues: during repair, a block could migrate from the password file to some other random file.
  • Can take a long time: the system can't restart until fsck completes, and as disks get larger, recovery takes longer.

Approach #2: ordered writes

Prevent certain kinds of inconsistencies by making updates in a particular order.

  • For example, when adding a block to a file, first write back the free list so that it no longer contains the file's new block.
  • Then write the inode, referring to the new block (this two-step sequence is sketched after this list).
  • What can we say about the system state after a crash? At worst the new block is marked in use but not yet referenced by the inode, i.e., it is leaked; the block is never both free and part of a file.
  • Some general rules:
    • Make sure a block is properly initialized (e.g., indirect block) before storing a pointer to it.
    • Nullify existing pointers to a resource before reusing the resource
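
A minimal sketch of this ordering in C, assuming a hypothetical write_block_sync() helper built from pwrite and fsync; the block numbers and buffers are illustrative:

    #include <fcntl.h>
    #include <unistd.h>

    #define BLOCK_SIZE 4096

    /* Hypothetical helper: write one block and wait until it is durable. */
    static int write_block_sync(int fd, unsigned blockno, const void *buf) {
        if (pwrite(fd, buf, BLOCK_SIZE, (off_t)blockno * BLOCK_SIZE) != BLOCK_SIZE)
            return -1;
        return fsync(fd);   /* don't return until the block is on disk */
    }

    /* Adding a block to a file: the free list must reach the disk before
     * the inode that points at the new block, so a crash between the two
     * writes can only leak the block, never leave it both free and in use. */
    static int add_block_ordered(int fd, const void *freelist_blk,
                                 const void *inode_blk) {
        if (write_block_sync(fd, 2, freelist_blk) < 0)  /* 1: block no longer free */
            return -1;
        return write_block_sync(fd, 7, inode_blk);      /* 2: inode points to block */
    }

    int main(void) {
        static char freelist[BLOCK_SIZE], inode[BLOCK_SIZE];
        int fd = open("disk.img", O_RDWR | O_CREAT, 0644);
        if (fd < 0)
            return 1;
        int rc = add_block_ordered(fd, freelist, inode);
        close(fd);
        return rc < 0;
    }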

Result: no need to wait for fsck when rebooting

Problems:

  • Can leak resources (a background fsck can reclaim them).
  • Requires lots of synchronous metadata writes, which slows down file operations.

Improvement:

  • Don't actually write the blocks synchronously; instead, record dependencies in the buffer cache (this is the idea behind BSD's soft updates).
  • For example, after adding a block to a file, add dependency between inode block and free list block.
    • When it's time to write the inode back to disk, make sure that the free list block has been written first.
  • Tricky to get right: blocks can end up with circular dependencies, which must be detected and broken (see the sketch below).
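
A minimal sketch of dependency-aware write-back; the buf structure and helpers are hypothetical, and a real implementation tracks dependencies at finer granularity and must handle cycles:

    #include <stdbool.h>
    #include <stdio.h>

    struct buf {
        unsigned blockno;
        bool dirty;
        struct buf *must_write_first;   /* this block must reach disk before us */
    };

    /* Stub standing in for a synchronous disk-driver write. */
    static void disk_write(struct buf *b) { printf("wrote block %u\n", b->blockno); }

    /* Write back a block, honoring its ordering constraint: dependencies
     * go to disk first. A real implementation must also detect and break
     * dependency cycles, e.g., by rolling back part of a block. */
    static void writeback(struct buf *b) {
        if (!b->dirty)
            return;
        if (b->must_write_first)
            writeback(b->must_write_first);
        disk_write(b);
        b->dirty = false;
    }

    int main(void) {
        /* Adding a block to a file: the inode block depends on the free list. */
        struct buf freelist = { 2, true, NULL };
        struct buf inode    = { 7, true, &freelist };
        writeback(&inode);   /* prints the free-list write, then the inode write */
        return 0;
    }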

Approach #3: write-ahead logging

Also called journaling file systems

Implemented in Linux ext3/ext4 and Windows NTFS.

Similar in function to logs in database systems; allows inconsistencies to be corrected quickly during reboot.

  • Before performing an operation, record information about the operation in a special append-only log file; flush this info to disk before modifying any other blocks.
  • Example: log entries to add a block to a file
    • "Mark block 99421 allocated"
    • "Store 99421 as the location of block index 93 in inode 862"
  • Then the actual block updates can be carried out later, in any order.
  • If a crash occurs, replay the log to make sure all updates are completed on disk.
  • Guarantees that once an operation's log entries reach the disk, the operation will eventually complete, even across crashes (see the sketch below).
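
A minimal sketch of the write-ahead rule for the example above; log_append, log_flush, and write_block are hypothetical stubs, and only the ordering matters:

    #include <stdio.h>

    /* Hypothetical helpers: append a record to the in-memory log, force
     * the log to disk, and write an ordinary disk block. */
    static void log_append(const char *record) { printf("log: %s\n", record); }
    static void log_flush(void)                { printf("log flushed to disk\n"); }
    static void write_block(unsigned blockno)  { printf("writing block %u\n", blockno); }

    int main(void) {
        /* 1. Describe the whole operation in the log. */
        log_append("mark block 99421 allocated");
        log_append("store 99421 at block index 93 in inode 862");

        /* 2. Force the log to disk before modifying any other block. */
        log_flush();

        /* 3. The real updates can happen later, in any order; after a
         *    crash, replaying the log re-applies any that were lost. */
        write_block(5);     /* e.g., the free-list block covering 99421 */
        write_block(31);    /* e.g., the block holding inode 862 */
        return 0;
    }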

How do log entries describe changes?

  • One possibility: log entries describe logical operations such as "add block 99421 to inode 862 as block index 93"
  • Another approach: log physical disk updates, such as "patch 4 bytes at offset 324 in block 6159972 with the value 99421"
  • Assignment 8 uses a hybrid:
    • Patch disk blocks
    • Mark blocks free/allocated
  • It's important to identify consistent groups (e.g., all the changes required to add a block to a file):
    • Don't process any of the entries in a group unless all are present.
    • Assignment 8 uses a transaction mechanism to group related entries (a generic record layout is sketched below).
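
A generic sketch of what a hybrid record format with transaction markers might look like; this is illustrative, not Assignment 8's actual on-disk layout:

    #include <stdint.h>

    /* Record types for a hybrid log: physical patches plus logical
     * allocate/free entries, bracketed by transaction markers. */
    enum log_type {
        LOG_BEGIN,      /* start of a transaction */
        LOG_PATCH,      /* overwrite bytes within a block */
        LOG_ALLOC,      /* mark a block allocated */
        LOG_FREE,       /* mark a block free */
        LOG_COMMIT      /* transaction complete; safe to replay */
    };

    struct log_record {
        enum log_type type;
        uint64_t txn_id;      /* groups related records together */
        uint64_t blockno;     /* block being modified */
        uint32_t offset;      /* byte offset within the block (LOG_PATCH) */
        uint32_t len;         /* number of bytes patched */
        uint8_t  data[16];    /* new bytes (fixed size for the sketch) */
    };

    /* Recovery rule: replay a transaction's records only if its
     * LOG_COMMIT record is present in the log; otherwise skip them all. */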

Problem: log grows over time, so recovery could be slow.

  • Solution: occasional checkpoints
  • Record current log head (start of oldest incomplete transaction)
  • Flush all dirty blocks to disk
  • Once this is done, the log can be reclaimed up to the recorded position (this procedure is sketched below).
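
A sketch of the checkpoint procedure, with hypothetical helpers standing in for file-system internals:

    #include <stdio.h>

    typedef unsigned long log_pos_t;

    /* Hypothetical helpers standing in for file-system internals. */
    static log_pos_t oldest_incomplete_txn(void) { return 4096; }
    static void flush_all_dirty_blocks(void)     { printf("cache flushed\n"); }
    static void truncate_log_before(log_pos_t p) { printf("log freed up to %lu\n", p); }

    static void checkpoint(void) {
        /* 1. Record where the oldest incomplete transaction starts. */
        log_pos_t head = oldest_incomplete_txn();
        /* 2. Make sure every update the log describes is on disk. */
        flush_all_dirty_blocks();
        /* 3. Entries before `head` are now redundant; reclaim the space. */
        truncate_log_before(head);
    }

    int main(void) { checkpoint(); return 0; }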

How much to log?

  • Typically the log is used only for metadata (free list, inodes, indirect blocks), not for actual file data.
  • Logging all file data is much more expensive.

Logging advantages?

  • Fast recovery: replay the log tail instead of scanning the entire disk with fsck.
  • Atomicity: a multi-block operation is applied in its entirety or not at all.
  • Log writes are sequential, which disks handle efficiently.

Logging disadvantages?

  • Every logged change is written twice: once to the log and once to its home location.
  • Naively, the log must be flushed to disk before each operation completes, adding latency.

Solution: delay log writes

  • Safety requirement: a log entry must reach the disk before any of the blocks it describes.
  • Buffer log in memory
  • Before writing back a cache block, flush log
  • This separates durability from consistency: the file system stays consistent even though the most recent operations may not yet be durable (see the sketch below).
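
A sketch of this write-back rule; the structures and helpers are hypothetical:

    #include <stdio.h>

    /* Each cached block remembers the position of the newest log entry
     * that describes a change to it. */
    struct buf {
        unsigned blockno;
        unsigned long newest_log_entry;
    };

    static unsigned long log_flushed_to = 0;   /* log durable up to here */

    /* Hypothetical helper: force the in-memory log to disk through pos. */
    static void flush_log_through(unsigned long pos) {
        if (pos > log_flushed_to) {
            printf("flushing log through entry %lu\n", pos);
            log_flushed_to = pos;
        }
    }

    /* Write-ahead rule: flush the relevant log entries, then the block. */
    static void write_back(struct buf *b) {
        flush_log_through(b->newest_log_entry);
        printf("writing block %u\n", b->blockno);
    }

    int main(void) {
        struct buf inode_blk = { 862, 128 };
        write_back(&inode_blk);   /* forces log flush through entry 128 first */
        return 0;
    }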

Remaining problems

Crashes can still lose recently-written data if it hasn't been flushed to disk.

  • Solution: apps can call fsync to force data to disk, as in the example below.
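
For example, a minimal durable-write helper built on the POSIX write and fsync calls (on many systems, durably creating a new file also requires an fsync on its parent directory):

    #include <fcntl.h>
    #include <unistd.h>

    /* Write a buffer to a file and don't report success until the data
     * has been forced to disk; write() alone only reaches the kernel's
     * block cache, which a crash could lose. */
    int save_durably(const char *path, const void *buf, size_t len) {
        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return -1;
        if (write(fd, buf, len) != (ssize_t)len || fsync(fd) != 0) {
            close(fd);
            return -1;
        }
        return close(fd);
    }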

Disks fail

  • One of the greatest causes of problems in large datacenters
  • Solution: replication or backup copies (e.g., on tape)

Conclusions:

  • Interesting tradeoffs between performance, durability, and consistency
  • To get highest performance, must give up some crash recovery capability.
  • Must decide what kinds of failures you want to recover from.