Assignment 8: Journaling File System (Write-Ahead Log)

In this assignment you will implement file system recovery based on journaled metadata. We will be using the Unix V6 file system format from 1975 that you learned about in the previous assignment, but bringing it (at least partially) into the 21st century by journaling all metadata updates to a write-ahead log and tracking free blocks in a bitmap instead of a linked list.

This assignment also contains an additional ethics exercise related to long-term support for operating systems.

Here are the learning goals for this assignment:

  • Learn about write-ahead logging, which is an important concept in file systems, databases, and many flash devices.
  • Learn how to repair inconsistencies by replaying log entries.
  • Learn about user-space file systems.
  • Consider the ethical issues around long-term support (or lack thereof) for operating systems.

Project Overview

We will supply you with a C++ implementation of the V6 file system using FUSE (File system in USErspace). FUSE is a Linux feature allowing developers to implement file systems in ordinary user processes known as FUSE drivers, without modifying the kernel. When applications make file system calls, the kernel passes the system calls through as messages to the user-level FUSE driver, which then passes the results back to the kernel. The result is a fully functional file system that you can use just like the system's built-in file system to store files, run builds, and so on. Though this architecture incurs a bit of performance overhead, it is used by production file systems such as NTFS-3G and the exFAT implementation commonly used on Linux. The V6 file system doesn't store its data directly on a disk; instead, it reads and writes a disk image stored in a file, similar to what you used in Assignment 7.

The V6 file system we provide you has an inode cache and a block cache, and uses delayed writes to push modified data back to the underlying disk image. If the file system crashes, this is likely to result in metadata inconsistencies. For example, blocks still allocated in files might be marked free, posing a danger of cross-allocation, or directory entries might point at unallocated inodes.

We supply you with a file system scavenger program fsckv6 that can restore a disk image to a consistent state using techniques similar to those discussed in lecture, but it may result in heavy data loss. If there have been a lot of metadata changes, you will likely lose files.

The good news is that we've augmented the file system to keep a write-ahead log of all metadata changes. We've also created a program that can replay log entries after a crash to recover lost metadata and ensure that file system data structures are consistent.

Your task for this assignment is to write code to replay 3 different kinds of log entries. The actual code you will write is tiny (only about a dozen lines), but you'll need to spend a bit of time learning about how the log works and about other facilities in the file system that you will use to write your code, such as the block cache and the freemap.

Getting Started

Log in to the myth cluster and clone the starter repo with this command:

git clone /afs/ir/class/archive/cs/cs111/cs111.1236/repos/assign8/$USER assign8

This will create a new directory assign8 in your current directory and clone a Git starter repository into that directory. Do your work for the assignment in this directory. You will make all of your code changes in replay.cc, which is responsible for replaying log entries after a crash.

The directory also contains a Makefile; if you type make, it will create several executables, including the following:

    mkfsv6: Creates an image file for a disk with no files other than a root directory; the file system will use this as if it were a disk, to store file system data.

    mountv6: This is the file system driver.

    fsckv6: A version of fsck for the V6 file system. It will restore consistency to the disk, but it's likely to lose a lot of information.

    recover: This program will contain your code for crash recovery; it will run after crashes to replay the log and restore consistency to the disk.


Try running make now, followed by tools/sanitycheck: almost all of the tests should fail. See the section Testing Your Code below for more information about how to test your assignment (the infrastructure is a bit different from previous assignments).

Exercise 1: Using the FUSE Implementation of Unix V6

This exercise will walk you through how to run the FUSE implementation of the Unix V6 file system. In order to use a file system you must mount it. When a file system is mounted, its root directory overlays an existing directory in a parent file system; all of the information in the mounted file system will suddenly appear where the mount directory used to be.

Your starter repo contains a directory mnt; you'll mount the V6 file system on top of it. Make sure that mnt exists (it should have a README.txt file explaining the behavior of the mount directory). Now invoke the following commands:

cp samples/disk_images/assign1.img v6.img
./mountv6 -j v6.img mnt &

The first command makes a private copy of a disk image that we have created for you. The mountv6 command runs our implementation of the V6 file system using v6.img as the disk image: it mounts the file system over the mnt directory and will handle requests to access that file system (the -j option tells it to use its journaling mechanism to track metadata changes). This command will continue running in the background until the file system is unmounted. The output that appears on your screen may seem confusing because mountv6 is running concurrently with the shell and each program generates output (the shell writes a prompt and mountv6 generates an information message). This is a race, so the outputs may appear in any order. Nonetheless, the shell is ready for you to type more commands.

Now type

ls mnt

You will see that the old contents of the directory have disappeared. Instead, you'll see the root directory of the mounted file system. It contains one subdirectory, assign1, in which we have made a copy of the starter files for Assignment 1. The behavior of the mounted file system is indistinguishable from other directories and files except that performance will be a little worse because requests have to be forwarded to the mountv6 process. Type the following commands:

cd mnt/assign1
ls
make
./nearest

All of the file system kernel calls invoked by these commands were forwarded to the mountv6 process, which carried them out using its V6 implementation. Instead of reading and writing blocks on a disk, mountv6 read and wrote blocks in the disk image file.

Once you have run the commands above, answer Question 1 in questions.txt.

Finally, unmount the file system by invoking the following commands (you must run the fusermount command in the same directory where you invoked ./mountv6; otherwise it will fail):

cd ../..
fusermount -u mnt

This command will cause the mountv6 process to exit after shutting down the file system cleanly. The pre-existing contents of the mnt directory will now reappear. The changes you made to the V6 file system have been saved in the disk image file. If you invoke ./mountv6 to remount the file system, its contents will appear just as they were at the time it was unmounted.

Exercise 2: Experiments with Crash Recovery

In this exercise you will intentionally crash a V6 file system, leaving its on-disk state inconsistent. Then you will recover from the crash in two different ways, once with fsck and once with the log, and compare the results.

First, let's crash a file system. Working in the top-level directory for the assignment, create an empty disk image and mount it with the following commands:

./mkfsv6 crash.img
./mountv6 -j --crash 935 crash.img mnt &

The --crash argument tells the file system driver that it should crash itself (exit without flushing dirty cache blocks) after 935 writes to the disk image file. Now type the following commands:

cd mnt
../create_files.py

The create_files.py script tries to create a large number of files with names file0, file1, and so on. Each file contains a single line of the form

This line belongs to file 101

The script groups the files in subdirectories, with 10 files in each subdirectory. The script will generate enough data to cause the file system to reach its --crash limit; when this happens the mountv6 process will exit and the file system will be unmounted, so create_files.py will exit with an error message. Once the file system has crashed, answer Question 2A in questions.txt.

Now we are going to recover the crashed file system. We'll do it twice: once with fsck and once using the file system's log. First, make a copy of the crashed disk image:

cp crash.img crash2.img

We have implemented a version of fsck for the V6 file system; run it on the original crashed image:

./fsckv6 -y crash.img

The -y switch tells fsckv6 to print out information about any inconsistencies, and also repair the disk; without that switch it will print information about inconsistencies without actually repairing them. This version of fsck does not implement the lost+found recovery described in lecture: if it finds an inode that is allocated but there are no directory entries pointing to the inode, it just deletes the inode's file (this results in "freeing unreachable inode" messages in the output). Once you have run fsckv6, answer Questions 2B and 2C in questions.txt.

Next, remount the recovered image and explore it with the following commands:

./mountv6 -j crash.img mnt &
cd mnt
../check_files.py

The check_files.py script will scan over all of the files in the directory; its output indicates how many files it found, plus any errors it found in the files (such as a file whose contents are entirely zero, or a directory with no files in it). Once you have run check_files.py, answer Questions 2D, 2E, and 2F in questions.txt.

Now recover the second copy of the crashed image using log-based recovery:

cd ..
samples/recover_soln crash2.img

This program recovers the crashed file system by replaying the log that was generated by the file system. It includes our sample solution for the code you will write.

Finally, unmount the file system recovered with fsckv6, mount the file system recovered with recover_soln, and inspect that file system:

fusermount -u mnt
./mountv6 -j crash2.img mnt &
cd mnt
../check_files.py

Then answer Questions 2G, 2H, and 2I in questions.txt.

Exercise 3: Implementing Log Replay

In this exercise you will write code that replays individual log entries to restore consistency to a disk image. Your code will be in the file replay.cc; this file will be compiled with additional code we've written to form the recover executable. We've already written code that does the following things:

  • Reads the log from the disk image
  • Finds all the valid log entries that must be applied. For example, it skips any entries preceding the most recent checkpoint, and stops replaying either at the end of the log or if it finds an inconsistent entry. The log entries contain checksums that allow us to detect if an entry was not correctly written (e.g. if the system crashed in the middle of writing an entry).
  • Checks for complete transactions: the logging mechanism allows a collection of log entries to be grouped into an atomic transaction. If a transaction isn't complete (the system crashed before writing all of the entries in a transaction) then none of its entries will be replayed.
  • We have also added code to the file system to generate log entries as needed and ensure that they are written to disk before any of the disk blocks affected by the log entries are written.
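The transaction rule in the list above can be sketched as follows. This is only an illustration under simplifying assumptions: the Entry and Kind types, the valid flag (standing in for a verified checksum), and the committed_entries function are all hypothetical and are not the representation used by the provided code.

```cpp
#include <cassert>
#include <vector>

// Hypothetical, simplified log record; real entries carry payloads and checksums.
enum class Kind { Begin, Commit, Op };

struct Entry {
    Kind kind;
    int id;       // illustrative payload so operations can be told apart
    bool valid;   // stands in for "checksum verified"
};

// Return the ids of operations that belong to complete transactions, in log
// order. Scanning stops at the first entry that fails validation.
std::vector<int> committed_entries(const std::vector<Entry> &log) {
    std::vector<int> out, pending;
    bool in_txn = false;
    for (const Entry &e : log) {
        if (!e.valid)
            break;                  // torn write: nothing after it is trusted
        switch (e.kind) {
        case Kind::Begin:
            pending.clear();
            in_txn = true;
            break;
        case Kind::Op:
            if (in_txn)
                pending.push_back(e.id);
            break;
        case Kind::Commit:
            if (in_txn)
                out.insert(out.end(), pending.begin(), pending.end());
            pending.clear();
            in_txn = false;
            break;
        }
    }
    return out;   // operations still in `pending` were never committed
}
```

Buffering operations until the commit marker is seen is what makes a transaction all-or-nothing: a crash in the middle of a transaction leaves no commit entry, so none of its operations reach the output.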

If you're interested in learning more about the structure of the V6 file system and its log, check out this page with additional information.

Once our code has determined that a log entry should be applied, it will invoke your code, which consists of several V6Replay::apply methods in replay.cc. The methods are overloaded: they all have the same name, but each method has a different argument type reflecting the specific type of log entry that it must process. You must fill in the bodies of these methods.
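As a rough picture of how overloading selects a method by entry type, here is a small sketch. It assumes C++17 and uses std::variant for dispatch; whether the provided code dispatches this way is an assumption. ReplaySketch, AnyEntry, and dispatch are hypothetical names, and the three structs are simplified copies of the ones described in the sections that follow.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <variant>
#include <vector>

// Simplified copies of the log entry structs described in this handout.
struct LogPatch {
    uint16_t blockno;
    uint16_t offset_in_block;
    std::vector<uint8_t> bytes;
};
struct LogBlockFree { uint16_t blockno; };
struct LogBlockAlloc { uint16_t blockno; uint8_t zero_on_replay; };

// Hypothetical stand-in for V6Replay: each overload only records that it ran.
struct ReplaySketch {
    std::vector<std::string> trace;
    void apply(const LogPatch &) { trace.push_back("patch"); }
    void apply(const LogBlockFree &) { trace.push_back("free"); }
    void apply(const LogBlockAlloc &) { trace.push_back("alloc"); }
};

using AnyEntry = std::variant<LogPatch, LogBlockFree, LogBlockAlloc>;

// The overload is chosen from the type currently held in the variant.
void dispatch(ReplaySketch &r, const AnyEntry &e) {
    std::visit([&r](const auto &entry) { r.apply(entry); }, e);
}
```

Calling dispatch with a LogBlockFree value runs the LogBlockFree overload and no other, which is why each apply body only needs to handle its own entry type.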

There are 3 different types of log entry that you must process:

Patch bytes

struct LogPatch {
    uint16_t blockno;           // Block number to patch
    uint16_t offset_in_block;   // Offset within block of patch
    std::vector<uint8_t> bytes; // Bytes to place at offset_in_block
};

This log entry specifies a change to a portion of metadata on disk. This is used for operations like updating the block numbers stored in an inode or indirect block, updating an inode's last modified time, or updating directory entries. It contains specific bytes that must be written to the disk at position offset_in_block of block blockno; the length of bytes determines how many bytes are overwritten.

Note: the interface for reading and writing disk blocks is different in this assignment than in Assignment 7; see Essential Utility Modules below.
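The byte-level effect of a patch can be sketched in isolation, assuming a 512-byte block held in a plain array. patch_block and kSectorSize are names made up for this sketch; in replay.cc the bytes would instead live in a buffer obtained through the block cache.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kSectorSize = 512;   // assumed block size for this sketch

// Overwrite bytes.size() bytes of `block` starting at `offset`,
// leaving every other byte of the block untouched.
void patch_block(uint8_t (&block)[kSectorSize], uint16_t offset,
                 const std::vector<uint8_t> &bytes) {
    assert(offset + bytes.size() <= kSectorSize);   // patch must fit in the block
    std::copy(bytes.begin(), bytes.end(), block + offset);
}
```

Because a patch changes only part of a block, the bytes outside the patched range must still hold the block's existing contents when the block is eventually written back.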

Free block

struct LogBlockFree {
    uint16_t blockno;           // Block number of freed block
};

Specifies that block number blockno, which was previously allocated, should be marked free (1 or true in the freemap). See Essential Utility Modules below for information on how to manipulate the freemap.

Once you have implemented the apply methods for LogPatch and LogBlockFree you should be able to pass the first sanity test (repairing unlink-rmdir.img).

Allocate block

struct LogBlockAlloc {
    uint16_t blockno;       // Block number that was allocated
    uint8_t zero_on_replay; // Metadata--should zero out block on replay
};

Specifies that block number blockno, which was previously free, should now be marked as allocated (0 or false in the freemap). In addition, if zero_on_replay is non-zero then the contents of the block must be cleared to zeroes. If zero_on_replay is zero, then the contents of the block must be left as-is.

The zero_on_replay field is needed so that metadata and data blocks can be handled differently. It will be set whenever the allocated block is going to be used for metadata — i.e., as an indirect (or double-indirect) block, or for a directory. This is needed because new metadata blocks are always zeroed out when they are allocated during normal operation (e.g., all indirect block pointers in an indirect block will start off as zero, as will all entries in a new directory block). However, there is no guarantee that this zeroed-out block reached disk before the crash. If the block isn't cleared during crash recovery, it could contain arbitrary garbage that causes the file system to misbehave. If the block needs to contain any nonzero information, that information will be provided by subsequent LogPatch log entries.

For data blocks, it would be safe to zero out the block during crash recovery. However, this could lose valid data. The reason is that data block updates are not logged. If the data block was actually written safely to disk before the crash, zeroing the block will lose those contents. So, we don't clear data blocks during log replay; of course, this means that the file could inherit random garbage data, but this will not cause the file system to misbehave (presumably users will notice the garbage data and figure out what to do with it).
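The different treatment of metadata and data blocks can be illustrated with a toy in-memory model. DiskModel, replay_alloc, and kBlockSize are hypothetical; the real code works through the freemap and the block cache rather than std::map.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

constexpr std::size_t kBlockSize = 512;   // assumed block size for this sketch

// Hypothetical model of a disk: a freemap plus raw block contents.
struct DiskModel {
    std::map<uint16_t, bool> is_free;                  // true = free, false = allocated
    std::map<uint16_t, std::vector<uint8_t>> blocks;   // contents of each block
};

void replay_alloc(DiskModel &d, uint16_t blockno, uint8_t zero_on_replay) {
    d.is_free[blockno] = false;   // the block is now in use
    if (zero_on_replay) {
        // Metadata block: start from all zeroes; later patches fill in the rest.
        d.blocks[blockno].assign(kBlockSize, uint8_t{0});
    }
    // Data block: contents are left alone, since they may already be valid.
}
```

In this model, replaying an allocation with zero_on_replay set wipes whatever the block held before, while replaying one with the flag clear changes only the freemap.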

You may notice that whenever the log contains a LogBlockAlloc entry, there's also a LogPatch entry setting a pointer to that block in the same transaction. You may also find multiple LogBlockAlloc entries in the same transaction (one to allocate an indirect block, and another to allocate a data block, whose disk address will be stored in the indirect block using a LogPatch entry).

Once all three of the apply methods have been implemented, you should be able to pass all of the sanity tests.

Other log entry types

In addition to the log entry types above, which you must implement, there are three other types of log entry that we have implemented for you: LogBegin and LogCommit, which identify the start and end of each transaction, and LogRewind, which indicates that the log wraps back to the beginning of the log storage area. For more details on these entries, see the V6 extra information page.

Essential Utility Modules

In order to implement log entry replay you will need to interact with the file system buffer cache and the freemap.

Buffer cache interface

In the previous assignment you used diskimg_readsector to read blocks from the disk. For this assignment, however, we have a block cache, so you will need to interface with that instead. The methods you will write all have access to an instance variable V6FS &fs_, which you will use for file I/O. The main methods on this variable (defined in v6fs.h) are:

struct V6FS {
    ...
    Ref<Buffer> bread(uint16_t blockno);
    Ref<Buffer> bget(uint16_t blockno);
    ...
};

  • bread reads a block from disk, and returns the contents in a buffer in the file system block cache.

  • bget returns a buffer for a particular disk block, but doesn't actually read it from disk. The contents of the buffer could be garbage. However, if you are going to overwrite an entire block (for instance by zeroing it out), there is no reason to read the old contents from disk, so bget is more efficient.

You will need to use both bread and bget in your solution, and it's very important to understand the difference between them. One of the most common (and confusing) mistakes is using bget when bread is needed; this results in the sudden appearance of garbage in metadata.
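The difference between the two calls can be seen in a toy cache. ToyCache and its read and get methods are hypothetical stand-ins for bread and bget; the real cache returns Ref<Buffer> and is considerably more involved. The 0xCD filler simply stands in for whatever garbage a fresh buffer might contain.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <map>

using Block = std::array<uint8_t, 512>;

// Hypothetical toy cache used only to contrast the two access patterns.
struct ToyCache {
    std::map<uint16_t, Block> disk;    // backing store
    std::map<uint16_t, Block> cache;   // buffers currently cached

    // Like bread: the returned buffer holds the block's on-disk contents.
    Block &read(uint16_t bn) {
        auto it = cache.find(bn);
        if (it == cache.end())
            it = cache.emplace(bn, disk[bn]).first;
        return it->second;
    }

    // Like bget: no disk read; a fresh buffer holds filler, not the disk data.
    Block &get(uint16_t bn) {
        auto it = cache.find(bn);
        if (it == cache.end()) {
            Block junk;
            junk.fill(0xCD);
            it = cache.emplace(bn, junk).first;
        }
        return it->second;
    }
};
```

If a caller changed a few bytes of the buffer returned by get and then wrote it back, the filler in the remaining bytes would replace the block's real contents on disk. That is the failure mode described above, and it is harmless only when the whole block is overwritten.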

Both of the above methods return a Ref<Buffer>. You can treat a Ref<Buffer> as if it were a Buffer*, but it's actually a smart pointer that keeps a reference count of how many Ref's exist for a block in the cache. The cache won't evict a block while there exist Ref's for it. It's a clever class that is similar in some ways to std::unique_lock, which you used earlier in this class, and also to std::shared_ptr, which some of you may have seen before. If you have time, take a look at the implementation of Ref in cache.hh.

A Buffer contains the cached contents of a disk block. The two things you need to do to a buffer are to access the memory, and to tell the system when the buffer is dirty. The memory is in a simple byte array called mem_, while the method bdwrite() is used to tell the system that you have modified a buffer and it should at some point be written back. (bdwrite is a commonly used function name in Unix kernels, where the "d" stands for "delayed." A name such as mark_dirty() would arguably be more intuitive.)

struct Buffer {
    ...
    char mem_[SECTOR_SIZE];
    void bdwrite();
    ...
};

Bitmap interface

The freemap is a simple bitmap stored in the freemap_ instance variable of V6Replay. The freemap is compact enough to store the entire map in memory. It is read from disk for you in the constructor V6Replay::V6Replay, and written back to disk for you at the end of the V6Replay::replay method.

The freemap is implemented by the Bitmap structure defined in bitmap.hh, which is similar to a std::vector<bool>. Unlike std::vector<bool>, however, Bitmap offers a feature in which valid indices start at an arbitrary number. Since the first data block is INODE_START_SECTOR + s_isize (a.k.a. datastart()) rather than block zero, the first bit physically on disk in the freemap area corresponds to this "datastart" block. The Bitmap structure handles the translation because we construct it with a min_index. What this means in practice is that to mark block bn free, you just say freemap_.at(bn) = true; to mark it allocated, you say freemap_.at(bn) = false. Bitmap itself will do the work of translating bn to the appropriate bit.
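The index translation can be sketched with a miniature offset bitmap. OffsetBitmap and count_set are hypothetical and much simpler than the real Bitmap in bitmap.hh; the sketch only shows how a min_index lets callers use block numbers directly.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical miniature of a bitmap whose valid indices start at min_index.
class OffsetBitmap {
public:
    // Valid indices are min_index, min_index + 1, ..., max_index - 1.
    OffsetBitmap(std::size_t min_index, std::size_t max_index)
        : min_(min_index), bits_(max_index - min_index, false) {}

    // Callers pass block numbers; the subtraction maps them to bit positions.
    std::vector<bool>::reference at(std::size_t i) {
        assert(i >= min_ && i - min_ < bits_.size());
        return bits_[i - min_];
    }

    // Number of bits currently set (free blocks, in freemap terms).
    std::size_t count_set() const {
        std::size_t n = 0;
        for (bool b : bits_)
            n += b;
        return n;
    }

private:
    std::size_t min_;
    std::vector<bool> bits_;
};
```

With a bitmap constructed as OffsetBitmap fm(10, 20), the expression fm.at(10) = true sets the very first bit, mirroring how freemap_.at(bn) takes a block number rather than a bit position.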

If you have time, take a look through the implementation of Bitmap in bitmap.hh. It's interesting to see how the class allows you to address individual bits in the bitmap.

Testing Your Code

As with previous assignments, you can run tools/sanitycheck to exercise your replay code with a collection of test cases. We will run sanitycheck when we grade your assignment. However, the underlying infrastructure for testing is different from previous assignments. There is no test program. If you would like to run a single test in isolation, use the shell script examine, which can be invoked as follows:

./examine -r samples/disk_images/image

where image is the name of a corrupt disk image that we have prepared for you. This script will make a copy of the disk image, run the recover program on that image to repair it, and then print information about the repaired disk.

The output from examine contains the following information:

  • The first two lines print info about what was replayed and when it stopped.
  • The next line prints how many disk blocks are in use.
  • The next lines display the contents of each of those blocks in a side-by-side format showing hexadecimal on the left and the corresponding ASCII text on the right, with 16 bytes per line -- the same size as the direntv6 structure. The text version is useful to see the text in directory entries and text files, while the hexadecimal version is useful for block numbers and other information.
  • Next, it prints how many inodes are used; this is followed by the contents of each inode, starting with its number and including information like its permissions, last modified time, block numbers, size, etc.
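To help with reading the side-by-side block dump, here is a sketch of how one 16-byte row of such a dump can be produced. hex_row is a hypothetical helper, and its exact spacing is an assumption; the output of examine may be laid out differently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Format up to 16 bytes as one row: hex on the left, printable ASCII on the
// right. Non-printable bytes are shown as '.' in the text column.
std::string hex_row(const uint8_t *p, std::size_t n) {
    assert(n <= 16);
    std::string hex, text;
    char buf[4];
    for (std::size_t i = 0; i < 16; ++i) {
        if (i < n) {
            std::snprintf(buf, sizeof buf, "%02x ", p[i]);
            hex += buf;
            text += (p[i] >= 0x20 && p[i] < 0x7f) ? static_cast<char>(p[i]) : '.';
        } else {
            hex += "   ";   // pad a short final row so the text column lines up
        }
    }
    return hex + " " + text;
}
```

Since a directory entry is 16 bytes, each row of a directory block corresponds to one entry, with the file name readable in the text column.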

You can also run samples/examine_soln on a disk image; this program will use our sample solution for recover instead of your code, so you can compare its output with the output generated by ./examine.

If you wish to run gdb on your code, you will need to run recover (the executable containing your log replaying code) directly without using examine. To do this, you will first need to make a copy of the image to repair:

cp samples/disk_images/image crash.img

Then you can run recover, either directly or under gdb like this:

gdb recover
...
(gdb) run crash.img

There is also a tool dumplog that lets you view the file system log. You can run it like this (the "c" specifies that it should only display entries more recent than the last checkpoint):

./dumplog samples/disk_images/image c

As usual, we do not guarantee that the tests we have provided are exhaustive, so passing all of the tests is not necessarily sufficient to ensure a perfect score (CAs may discover other problems in reading through your code).

Troubleshooting

  • If your file system ever crashes and gets Linux into a bad state, you may need to remove the FUSE mount. The command to do so is:

    fusermount -u mnt

    (assuming mnt is the name of your mount point). This won't work if you still have a process in the directory, so make sure to cd out of the mnt directory if you are having problems.

  • Remember that .. won't work if your file system has crashed, so you'll need to give cd an absolute pathname. A convenient choice is cd $PWD or cd $PWD/.., since your shell keeps the $PWD environment variable set to the path of the current working directory.

Exercise 4: Long-Term Support and Trust

The final exercise for this project continues our discussion of ethical issues related to trust.

Operating systems and file systems have a very long lifespan. For instance, the current Windows filesystem, NTFS, was introduced in 1993. Once a system is adopted, it can be challenging for customers to transition to newer systems.

Although there is no requirement for operating systems to have a minimum support period, most organizations that develop operating systems provide support for around 10 years. This is typically enough for most users, but not all. For example, critical government infrastructure in the US, UK, and Netherlands relied on Windows XP (which debuted in 2001) through at least 2014.

Consider the following questions relating to OS minimum support periods and trust:

  • What is one argument for strengthening requirements for a minimum support period (e.g., requiring it by law)?
  • What is one argument against requiring a minimum support period?
  • Identify one way that an organization developing & maintaining an operating system could support users in inferring trust that the minimum support period will be respected.
  • Identify one way that an organization developing & maintaining an operating system could support users in substituting some need for trusting that the minimum support period would be respected.

Once you have considered these questions, answer Questions 4A-4D in questions.txt.

Submitting Your Work

Once you are finished working and have saved all your changes, submit by running tools/submit.

We recommend you do a trial submission in advance of the deadline to allow time to work through any snags. You may submit as many times as you like; we will grade the latest submission. Submitting a stable but unpolished/unfinished version is like an insurance policy. If the unexpected happens and you miss the deadline to submit your final version, the earlier submit will earn points. Without a submission, we cannot grade your work. You can confirm the timestamp of your latest submission in your course gradebook.

Grading

Here is a recap of the work that will be graded on this assignment:

  • questions.txt: answer all of the questions.
  • replay.cc: add enough code to properly replay the three types of log entries.

We will grade your code using the provided sanity check tests and possible additional autograder tests. We will also review your code for style and complexity. Check out our course style guide for tips and guidelines for writing code with good style!

Credits

This assignment, along with the FUSE implementation of the V6 file system, was created by David Mazières.