Assignment 5: Memory-Mapped Encrypted Files

In this assignment you will implement memory-mapped access to data that is stored in encrypted files. Here's how this will work:

  • A region of the virtual address space of a process will be allocated, large enough to hold the contents of a file.
  • The file won't initially be read into memory, and there won't even be physical pages assigned to this region.
  • If the process attempts to read from or write to the file's region in virtual memory, page faults will occur.
  • Your code will catch the page faults, allocate physical pages, and read in the file. Only the pages of the file that are actually accessed will be read into memory. Once a page has been loaded into memory, the process will be able to access data in that page without additional faults.
  • The process can also modify the file's bytes in memory. When this happens, your code will remember which pages are dirty and write those pages back to the file when the memory-mapped file is closed (or "flushed"). Only pages that were actually modified will be written.
  • As an additional twist, the data in the file is stored in encrypted form to thwart eavesdroppers. Information is decrypted when read into memory and re-encrypted when flushed back to disk.

Normally, code like this would be written inside the operating system kernel. For this assignment, we have used the Linux mmap facility to emulate features that are typically available in the kernel (such as page faults), so you can write your code at user level. The goal of this assignment is to teach you about the following concepts:

  • How demand paging works
  • How page protection can be used to emulate hardware features such as dirty bits

Getting Started

Log in to the myth cluster and clone the starter repo with this command:

git clone /afs/ir/class/archive/cs/cs111/cs111.1236/repos/assign5/$USER assign5

This will create a new directory assign5 in your current directory and clone a Git starter repository into that directory. Do your work for the assignment in this directory. The files mcryptfile.hh and mcryptfile.cc contain a skeleton for all of the code you need to write. You will add to the declarations in mcryptfile.hh and fill in the bodies of the methods in mcryptfile.cc to implement the facilities described below. You can also create additional methods or classes as needed to implement your solution.

The directory also contains a Makefile; if you type make, it will compile your code for both problems together with a test program test.cc, producing an executable file test. You can invoke ./test with an argument giving the name of a test to run (invoke it with no arguments to see a list of available tests). The same test program is used for both this assignment and Assignment 6; this assignment will use only the tests up through update_three_files. You can also invoke the command tools/sanitycheck to run a series of basic tests on your solution. Try this now: the starter code should compile but almost all the tests will fail.

As usual, we do not guarantee that the tests we have provided are exhaustive, so passing all of the tests is not necessarily sufficient to ensure a perfect score (CAs may discover other problems in reading through your code).

Assignment Overview

You will implement a class MCryptFile that provides the following methods:

  • MCryptFile(Key key, std::string path)
    Constructs an MCryptFile object that can be used to access data in the file given by path. The file is encrypted. When data is read from the file into memory, it will be decrypted using key; when data is written from memory back to the file, it will be encrypted using key. Key objects can be constructed from strings. If the file doesn't currently exist, a new file will be created. Throws a std::system_error exception if the file cannot be opened or created. Most of the functionality of this constructor (such as throwing exceptions) is actually implemented in the CryptFile superclass constructor described below.

  • char *map(size_t min_size = page_size)
    Maps the associated file into a contiguous region of virtual memory and returns the virtual address of the first byte of that region. Byte i of the decrypted file contents will henceforth be directly accessible at offset i into the region. The min_size argument can be used to create a new file or grow an existing file. The actual size of the virtual region will be the larger of min_size and the original file length; if bytes are written in the region beyond the original file length, the file will be extended when flush is called. If you want to grow a file beyond the current size of the region, invoke unmap to unmap the region, then invoke map to map it again with a larger size. Note: file sizes are always rounded up to page boundaries. This method does not allocate physical memory for the region or read information in from the file; that should happen later, on demand, as pages in the region are accessed.

  • void flush()
    Encrypts all pages that have been modified since the last call to flush and writes them back to the associated file. All pages currently in memory should remain in memory. This operation may grow the size of the encrypted file in the file system.

  • void unmap()
    Flushes any dirty pages and removes the mapping created by map. After this method returns, the caller must no longer use any references into the previously mapped region.

  • char *map_base()
    Returns the address of the first byte of the memory mapped file, or NULL if the associated file is not currently mapped.

  • size_t map_size()
    Returns the current size of the mapped region, or 0 if the associated file is not currently mapped.

  • static void set_memory_size(size_t npages)
    Invoked to specify how many pages should be in the pool of physical memory that is shared by all MCryptFile objects. This method should only be invoked before the first MCryptFile is created; it will have no effect after that. The test infrastructure will invoke this method as appropriate for the tests being run.
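
The size rule described for map can be sketched as follows. This is only an illustration under stated assumptions: round_up_to_page and region_size are hypothetical helper names that do not appear in the starter code, and the page size is passed in as a parameter instead of being read from VMRegion.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical helper: round n up to the next multiple of page_size.
// Assumes page_size is a power of two, as the handout guarantees.
inline std::size_t round_up_to_page(std::size_t n, std::size_t page_size) {
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    return (n + page_size - 1) & ~(page_size - 1);
}

// Hypothetical helper: size of the virtual region that map() would create.
// The region is the larger of min_size and the current file length,
// rounded up to a page boundary.
inline std::size_t region_size(std::size_t min_size, std::size_t file_len,
                               std::size_t page_size) {
    return round_up_to_page(std::max(min_size, file_len), page_size);
}
```

For example, with 4096-byte pages, a 5000-byte file mapped with the default min_size would occupy a two-page (8192-byte) region.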

Supporting Code

We have written several classes for you to use in implementing MCryptFile:

CryptFile

The CryptFile class provides basic mechanisms for reading and writing encrypted files (but not for memory-mapping them). Your MCryptFile class will be a subclass of CryptFile. The CryptFile class has the following methods:

  • CryptFile(Key key, std::string path)
    The API for this method is identical to that for the MCryptFile constructor (see above).

  • size_t file_size()
    Returns the number of bytes in the file (which is the same as the number of bytes required to hold the decrypted file in memory).

  • int aligned_pread(void *dst, size_t len, size_t offset)
    Reads information from the file into memory. More precisely, reads len bytes of data at position offset in the file, decrypts it, and stores the unencrypted information at dst. Both len and offset must be multiples of the AES encryption algorithm's block size (16), which is accessible via CryptFile::blocksize. The block size will not be an issue for this assignment, because you will only be reading and writing full pages aligned on page-size boundaries. Returns the number of bytes read or -1 on error.

  • int aligned_pwrite(const void *src, size_t len, size_t offset)
    Writes information from memory to the associated file. Encrypts len bytes starting at src and writes them to position offset in the associated file. Returns len on success and -1 on error. Both len and offset must be multiples of CryptFile::blocksize.
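
The alignment requirement can be checked with a few lines of arithmetic. The sketch below is not part of CryptFile: kBlockSize stands in for CryptFile::blocksize (16, as stated above), and is_block_aligned and page_offset are hypothetical helpers used only for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Stand-in for CryptFile::blocksize (the handout gives the AES block size as 16).
constexpr std::size_t kBlockSize = 16;

// Hypothetical helper: true if (len, offset) would satisfy the alignment
// requirement of aligned_pread and aligned_pwrite.
constexpr bool is_block_aligned(std::size_t len, std::size_t offset) {
    return len % kBlockSize == 0 && offset % kBlockSize == 0;
}

// Hypothetical helper: file offset of page number `index`.
constexpr std::size_t page_offset(std::size_t index, std::size_t page_size) {
    return index * page_size;
}

// A whole page at a page-aligned offset is block aligned whenever the page
// size is a multiple of the block size.
static_assert(is_block_aligned(4096, page_offset(3, 4096)),
              "full pages satisfy the block-size requirement");
static_assert(!is_block_aligned(100, 0), "100 is not a multiple of 16");
```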

VMRegion

The VMRegion class provides basic mechanisms for mapping pages into a region of virtual memory, taking page faults, and managing permissions. A VMRegion corresponds roughly to a contiguous range of page map entries for one process in an operating system. You will create one VMRegion object for each mapped MCryptFile. The VMRegion class is defined in the header file vm.hh and has the following methods:

  • VMRegion(size_t nbytes, std::function<void(char *)> handler)
    Constructs a VMRegion object and allocates a region of virtual memory in the process that is currently unused. The size of the region will be nbytes; if nbytes isn't a multiple of the page size, the VMRegion will behave as if nbytes were rounded up to the next higher multiple of the page size. If nbytes is zero, it will be rounded up to one full page. The handler parameter specifies a function that will be invoked whenever a page fault occurs in the region. Page faults occur whenever an unmapped page is referenced or an attempt is made to write a page that is currently read-only. handler is invoked with a single argument giving the virtual address that triggered the page fault. You will need to write a handler function as part of your solution (it is not currently defined in mcryptfile.cc).

  • VPage get_base()
    Returns the address of the first page in the virtual region. VPage is a type that refers to the first byte of a virtual page in a VMRegion. It is equivalent to char *, so you can add an offset to it to get the address of a value in the middle of a page. You can also add multiples of the page size to the value returned by get_base to produce VPages for other pages in the virtual region.

  • size_t get_size()
    Returns the total number of bytes in the virtual region.

  • void map(VPage va, PPage pa, Prot prot)
    Sets the mapping for a particular VPage inside a VMRegion, so that accesses to that page will be directed to pa (PPages are obtained using the PhysMem class discussed below). If a different page was previously mapped at va, the old mapping is removed. The prot argument specifies what sort of accesses are allowed; it should be either PROT_NONE to prohibit both loads and stores, PROT_READ to allow loads but not stores, or PROT_READ|PROT_WRITE (bitwise OR of two values) to allow both loads and stores. This function's behavior is equivalent to setting the contents of a page map entry.

  • void unmap(VPage va)
    Removes the mapping for va, if there is one; future references to the page will cause page faults. This function's behavior is roughly equivalent to clearing the present bit in a page map entry.

In addition to these methods, VMRegion also exports a variable page_size, which contains the number of bytes in each page on the current machine. Page sizes are 4096 bytes on the myth cluster and on most x86 laptops, but you should use the page_size variable to ensure portability (some machines, such as Apple silicon Macs, use larger pages); you can assume page_size will always be a power of two.
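
Because page_size is a power of two, an arbitrary address inside a region can be converted to its page with simple arithmetic. The following is a minimal sketch: page_start and page_index are hypothetical helper names, and plain char * is used in place of VPage (which the handout describes as equivalent to char *).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: address of the first byte of the page containing addr.
// Relies on page_size being a power of two.
inline char *page_start(char *addr, std::size_t page_size) {
    auto bits = reinterpret_cast<std::uintptr_t>(addr);
    return reinterpret_cast<char *>(
        bits & ~(static_cast<std::uintptr_t>(page_size) - 1));
}

// Hypothetical helper: index of the page containing addr, counted from the
// first page of the region that starts at base.
inline std::size_t page_index(const char *base, const char *addr,
                              std::size_t page_size) {
    assert(addr >= base);
    return static_cast<std::size_t>(addr - base) / page_size;
}
```

Multiplying the page index by the page size gives the offset of that page within the region, which is also the offset of the corresponding bytes in the decrypted file.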

PhysMem and PPages

The PhysMem class provides a mechanism for allocating and freeing pages of physical memory. It implements PPage objects. A PPage is a token for a physical page, which you can pass to VMRegion::map. In addition, you can use a PPage to access the bytes of the page: a PPage is actually a char * pointer referring to the first byte of the page, and you can add offsets to this to access other values in the page. Unlike VPages, a PPage is always accessible and writable; references to it will never generate page faults. PPages are useful because they allow your MCryptFile implementation to access physical pages that may not currently be mapped, such as when transferring page contents to or from encrypted files. PPages are mapped into virtual memory by the PhysMem class, at virtual addresses different from those in VMRegions. This is an example of aliasing, where the same physical page can appear at multiple virtual addresses.

The PhysMem class is defined in the header file vm.hh and has the following methods:

  • PhysMem(size_t npages)
    Allocates npages physical memory pages, each of which may be mapped into any VMRegion.

  • PPage page_alloc()
    Allocates a page and returns its PPage, or nullptr if there are no free pages.

  • void page_free(PPage p)
    Returns p to the free page pool for this PhysMem. The caller must ensure that this page is not mapped in any VMRegion.

  • size_t npages()
    Returns the total number of pages in this object (free or allocated).

  • size_t nfree()
    Returns the number of pages that are not currently allocated.

  • PPage pool_base()
    Returns the address of the first page in the pool (the pages in the pool occupy a range of contiguous PPage addresses).
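
To make the allocation interface concrete, here is a toy free-list allocator that mimics the methods listed above. ToyPhysMem is a hypothetical stand-in written for this illustration; it is not the PhysMem from vm.hh, its pages cannot be passed to VMRegion::map, and its constructor takes the page size as an extra parameter.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// ToyPhysMem only mimics the interface described above; it hands out slices
// of an ordinary heap buffer.
class ToyPhysMem {
public:
    ToyPhysMem(std::size_t npages, std::size_t page_size)
        : page_size_(page_size), storage_(npages * page_size) {
        // Push pages in reverse so the lowest address is allocated first.
        for (std::size_t i = npages; i > 0; --i)
            free_.push_back(storage_.data() + (i - 1) * page_size_);
    }
    // Returns a free page, or nullptr if the pool is exhausted.
    char *page_alloc() {
        if (free_.empty()) return nullptr;
        char *p = free_.back();
        free_.pop_back();
        return p;
    }
    // Returns p to the free list; p must have come from this pool.
    void page_free(char *p) {
        assert(p >= pool_base() && p < pool_base() + storage_.size());
        free_.push_back(p);
    }
    std::size_t npages() const { return storage_.size() / page_size_; }
    std::size_t nfree() const { return free_.size(); }
    char *pool_base() { return storage_.data(); }

private:
    std::size_t page_size_;
    std::vector<char> storage_;
    std::vector<char *> free_;
};
```

A caller would check the result of page_alloc for nullptr before using the page, which is the same check your MCryptFile code needs to make against the real PhysMem.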

Exercise 1: vm_demo

(You will go through this exercise in section with your CA)

The file vm_demo.cc contains a simple program that shows how to use the VMRegion and PhysMem classes. It creates a VMRegion and a PhysMem. It then allocates one PPage and writes to it. Next, it tries to read the data through the VMRegion. That causes a page fault, and the page fault handler maps the PPage so that it can be accessed via the VMRegion. Finally, the program modifies the page through the VMRegion and reads the result back through the PPage. This example illustrates aliasing, where a single physical page can be accessed through multiple virtual addresses.

Read through the code to familiarize yourself with it, then run it (make; ./vm_demo) and observe its output. Then answer the questions in questions.txt.
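
If you want to see aliasing outside the provided classes, the sketch below maps the same page of a temporary file at two virtual addresses using the standard POSIX calls tmpfile, ftruncate, and mmap. It is not taken from vm_demo.cc, and aliasing_demo is a hypothetical function name; it assumes a Linux system on which temporary files can be created.

```cpp
#include <cstdio>
#include <cstring>
#include <sys/mman.h>
#include <unistd.h>

// Maps one page of a temporary file at two virtual addresses and checks
// that a store through one address is visible through the other.
// Returns true if aliasing was observed, false on any setup failure.
bool aliasing_demo() {
    const long page = sysconf(_SC_PAGESIZE);
    if (page <= 0) return false;
    std::FILE *f = std::tmpfile();
    if (f == nullptr) return false;
    const int fd = fileno(f);
    bool ok = false;
    if (ftruncate(fd, page) == 0) {
        void *a = mmap(nullptr, page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        void *b = mmap(nullptr, page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (a != MAP_FAILED && b != MAP_FAILED) {
            // Write through the first mapping, read through the second.
            std::strcpy(static_cast<char *>(a), "hello");
            ok = (a != b) && std::strcmp(static_cast<char *>(b), "hello") == 0;
        }
        if (a != MAP_FAILED) munmap(a, page);
        if (b != MAP_FAILED) munmap(b, page);
    }
    std::fclose(f);
    return ok;
}
```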

C++ Proficiency: Deleting While Iterating

At some point in this assignment you will need to scan a C++ container and delete some of the entries in it. This is tricky in C++ because erasing elements is not generally safe while iterating. For example, suppose you try to iterate over a std::unordered_map and delete some of its entries using code like this:

std::unordered_map<int, Foo*> foo_map;
for (auto it = foo_map.begin(); it != foo_map.end(); ++it) {
    if (...) {
        foo_map.erase(it);
    }
}

This code is unsafe, because erasing the element invalidates the iterator it; incrementing or dereferencing an invalidated iterator is undefined behavior. However, the following code is safe:

std::unordered_map<int, Foo*> foo_map;
for (auto it = foo_map.begin(); it != foo_map.end(); ) {
    if (...) {
        it = foo_map.erase(it);
    } else {
        ++it;
    }
}

The erase method returns a new iterator that refers to the next element of the map after the deleted one, so it is safe to continue iterating. Notice in this case that the for statement no longer increments it: that happens only if the element isn't deleted.
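
Here is a complete version of the safe pattern with a concrete condition filled in. The function name erase_even_keys and the choice of std::string values are made up for this example.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Removes every entry whose key is even and returns how many were removed.
std::size_t erase_even_keys(std::unordered_map<int, std::string> &m) {
    std::size_t removed = 0;
    for (auto it = m.begin(); it != m.end(); ) {
        if (it->first % 2 == 0) {
            it = m.erase(it);  // erase returns the iterator after the removed entry
            ++removed;
        } else {
            ++it;              // advance only when nothing was erased
        }
    }
    return removed;
}
```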

Implementation Milestones

Milestone 1: Faults Forever

Implement the MCryptFile constructor and map method. You will need to create a VMRegion object in the map method to manage the virtual addresses for this mapped file. To do that, you will have to define a page fault handler function and pass it to the VMRegion constructor as the handler argument. For now, have the handler print out the virtual address of the fault and then return without doing anything else.

In addition, you should implement the map_base and map_size methods. Be sure to change the return value of the map method (it should not return nullptr). The map_size test should now pass.

Now run the read test (./test read). You should see that the file is successfully mapped, but the test will hang because your fault handler doesn't actually make the page accessible; thus page faults will happen repeatedly (as soon as the handler returns, the application retries the faulting instruction, which causes another page fault).
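
One way to give the handler access to per-file state is to pass a lambda that captures this. The sketch below shows only that wiring, using hypothetical stand-ins: ToyRegion imitates the VMRegion constructor signature described earlier but takes no real page faults (a fault is simulated by calling fault()), and ToyFile plays the role of MCryptFile.

```cpp
#include <cstddef>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// ToyRegion is a stand-in for VMRegion: it stores the handler and lets the
// caller simulate a page fault by calling fault().
class ToyRegion {
public:
    ToyRegion(std::size_t nbytes, std::function<void(char *)> handler)
        : nbytes_(nbytes), handler_(std::move(handler)) {}
    void fault(char *addr) { handler_(addr); }
    std::size_t get_size() const { return nbytes_; }

private:
    std::size_t nbytes_;
    std::function<void(char *)> handler_;
};

// ToyFile routes faults to a member function: the lambda captures `this`,
// so the handler can reach the object's own data.
class ToyFile {
public:
    void map(std::size_t nbytes) {
        region_ = std::make_unique<ToyRegion>(
            nbytes, [this](char *addr) { on_fault(addr); });
    }
    ToyRegion *region() { return region_.get(); }
    const std::vector<char *> &faults() const { return faults_; }

private:
    void on_fault(char *addr) { faults_.push_back(addr); }  // just record it
    std::unique_ptr<ToyRegion> region_;
    std::vector<char *> faults_;
};
```

Because the lambda holds a raw this pointer, the object must not be moved or destroyed while the region that owns the handler is still alive.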

Milestone 2: Map Pages

Create enough new functionality to load pages into memory on demand. First, add code to allocate a PhysMem object during the first call to map (don't allocate the PhysMem until the first call to map). A single PhysMem will be shared across all MCryptFiles and used to allocate physical memory pages, just as an operating system uses a single physical memory to allocate pages for all processes. MCryptFile::set_memory_size may have been invoked to specify how large the pool of physical memory should be. If MCryptFile::set_memory_size has not been invoked by the time the PhysMem is created, use 1000 pages for the PhysMem object.
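
The shared, lazily created pool can be sketched with static class members. Everything here is a hypothetical stand-in: ToyPool replaces PhysMem and only remembers its size, ToyMapped replaces MCryptFile, and the rule that set_memory_size is ignored once the pool exists is one reasonable reading of the specification, not the only one.

```cpp
#include <cstddef>
#include <memory>

// ToyPool is a stand-in for PhysMem; it only remembers its size.
struct ToyPool {
    explicit ToyPool(std::size_t n) : npages(n) {}
    std::size_t npages;
};

class ToyMapped {
public:
    // Takes effect only if the shared pool has not been created yet.
    static void set_memory_size(std::size_t npages) {
        if (!pool_) requested_pages_ = npages;
    }
    // Creates the shared pool on first use and returns it on every call.
    static ToyPool &pool() {
        if (!pool_) pool_ = std::make_unique<ToyPool>(requested_pages_);
        return *pool_;
    }

private:
    static std::size_t requested_pages_;
    static std::unique_ptr<ToyPool> pool_;
};

std::size_t ToyMapped::requested_pages_ = 1000;  // default from the handout
std::unique_ptr<ToyPool> ToyMapped::pool_;
```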

Then fill in the page fault handler function to make a new page accessible with the appropriate contents of the file. For now, set the permissions on each page to be PROT_READ|PROT_WRITE. If you run the read test again, you should see that 3 pages are successfully read, but the test will generate an error because pages are not being unmapped.

Milestone 3: Supplemental Page Map, Destructor, and Unmap

Implement the destructor and the unmap method. In order to do this, you will need to unmap all of the VPages that have been mapped for that file and return their PPages to the PhysMem. This will require you to define an additional data structure called a supplemental page map. The supplemental page map will provide information for each VPage, such as whether it is mapped and, if so, the associated PPage. As you work through the assignment you'll discover other information that needs to be stored in the supplemental page map. It's up to you to determine the structure of the supplemental page map; it should be implemented so that you can easily look up the information for a page given its VPage (such as when a page fault occurs for that page).
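
As a starting point for thinking about the design, here is one possible shape for such a structure: a hash map keyed by the virtual page address. PageInfo and ToyPageMap are hypothetical names, plain char * stands in for VPage and PPage, and a real solution will likely need more fields than the single one shown.

```cpp
#include <cstddef>
#include <unordered_map>

// Hypothetical per-page record; real solutions will likely add fields.
struct PageInfo {
    char *ppage = nullptr;  // physical page backing this virtual page
};

// Hypothetical supplemental page map keyed by the virtual page address.
class ToyPageMap {
public:
    // Returns the record for vpage, or nullptr if the page is not mapped.
    PageInfo *find(char *vpage) {
        auto it = pages_.find(vpage);
        return it == pages_.end() ? nullptr : &it->second;
    }
    void insert(char *vpage, char *ppage) { pages_[vpage].ppage = ppage; }
    // Calls release(vpage, ppage) for every entry, then empties the map.
    template <typename F>
    void clear(F release) {
        for (auto &entry : pages_) release(entry.first, entry.second.ppage);
        pages_.clear();
    }
    std::size_t size() const { return pages_.size(); }

private:
    std::unordered_map<char *, PageInfo> pages_;
};
```

A vector indexed by page number is an equally plausible choice; it trades a little memory for lookups that need no hashing.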

Once you have an initial implementation of the supplemental page map, you should be able to implement the destructor and the unmap method. At this point, the read test will pass except for a mismatch in protections (this will be fixed in Milestone 5).

Milestone 4: Flush

Implement the flush method. For now, flush all pages in physical memory that belong to the MCryptFile without considering whether they are dirty. Remember to flush in the unmap method. All the tests should now complete, but you will get errors because pages that aren't dirty are getting written back to the file.

Milestone 5: Tracking Dirty Pages

Make flush more efficient by keeping track of the dirty pages and only writing dirty pages back to the file (clean pages should not be written back). Paging hardware usually provides a "dirty" bit in page map entries, which is set by the hardware when a page is written. Unfortunately, this information is not passed through by the mmap mechanism we are using for this assignment, so you will have to use clever software to emulate a dirty bit for each page. You can use page protections for this: if a page's protection is set to PROT_READ, then a page fault will occur the first time the page is written (and a page fault will not occur unless the page is written). Given this, you should be able to emulate dirty bits for each page. Use the emulated dirty bits to avoid writing back clean pages during flushes.

Page faults may now happen multiple times on the same virtual page, but you should only read each page from the file once (unless the file is unmapped and re-mapped).

Your dirty bit should behave like a dirty bit in a page map, i.e. it should be reset when pages are flushed (and therefore no longer dirty). Since you are notified about writes by page faults, this implies something about the permissions after flush is called.
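
The protection-based scheme can be explored with a small simulation that involves no real page faults. This is a sketch of one possible reading of the description above, not the required design: ToyDirtyTracker, ToyPage, and the Prot enum are hypothetical, and access() imitates the hardware by calling the handler repeatedly until the page's protection permits the access.

```cpp
#include <cstddef>
#include <vector>

enum class Prot { None, Read, ReadWrite };  // stand-ins for the PROT_* values

struct ToyPage {
    Prot prot = Prot::None;
    bool loaded = false;  // contents have been read from the file
    bool dirty = false;   // emulated dirty bit
};

class ToyDirtyTracker {
public:
    explicit ToyDirtyTracker(std::size_t npages) : pages_(npages) {}

    // Simulates a load or store to page i, retrying after each fault.
    void access(std::size_t i, bool is_write) {
        while (!allowed(pages_[i].prot, is_write)) on_fault(i);
    }
    // "Writes back" dirty pages, clears their dirty bits, and makes them
    // read-only again so that the next store faults. Returns pages written.
    std::size_t flush() {
        std::size_t written = 0;
        for (ToyPage &p : pages_) {
            if (!p.dirty) continue;
            ++written;
            p.dirty = false;
            p.prot = Prot::Read;
        }
        return written;
    }
    std::size_t loads() const { return loads_; }
    bool is_dirty(std::size_t i) const { return pages_[i].dirty; }

private:
    static bool allowed(Prot prot, bool is_write) {
        return is_write ? prot == Prot::ReadWrite : prot != Prot::None;
    }
    // Like the real handler, this sees only which page faulted, not whether
    // the access was a load or a store.
    void on_fault(std::size_t i) {
        ToyPage &p = pages_[i];
        if (!p.loaded) {          // first touch: load and map read-only
            p.loaded = true;
            ++loads_;
            p.prot = Prot::Read;
        } else {                  // fault on a loaded page: must be a store
            p.dirty = true;
            p.prot = Prot::ReadWrite;
        }
    }
    std::vector<ToyPage> pages_;
    std::size_t loads_ = 0;
};
```

In this model a store to an untouched page takes two faults (one to load the page, one to mark it dirty), while a page that is only read is loaded once and never written back.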

At this point all of the tests should pass. Congratulations!

Milestone 6: Odds and Ends

If you haven't already done so, implement set_memory_size. Also, go over the Miscellaneous Notes below and implement anything else that is needed.

Miscellaneous Notes

  • For this assignment you may assume that there are enough physical pages to accommodate all of the pages in all of the open MCryptFiles: you need not worry about page replacement. You can exit the program (with an informative message, of course) if PhysMem::page_alloc returns nullptr.

  • Load pages into memory on demand, so that no memory is wasted on pages that are never accessed. You should only read pages from disk when responding to page faults.

  • You may assume that your code is used only in single-threaded environments; you do not need to worry about synchronization for this assignment.

  • Your solution must support multiple MCryptFile objects mapped at the same time, with one VMRegion per MCryptFile. All of the MCryptFiles must share the same PhysMem.

  • If you use gdb to debug your assignment, you will notice that it catches the SIGSEGV signals used to signal page faults and stops the application before it can handle those page faults. If you type continue then the signal will be transmitted to your application so the page fault will be handled. If you get tired of typing continue you can change gdb's behavior with the following command:

    handle SIGSEGV noprint nostop pass
    

    The arguments indicate that, when SIGSEGV signals occur, they should be passed to the application; gdb will not stop the application or print any indication that the signal occurred. You may find other combinations of argument values convenient as well.

Submitting Your Work

Once you are finished working and have saved all your changes, submit by running tools/submit. Make sure that you have answered the questions in questions.txt before submitting.

We recommend you do a trial submission in advance of the deadline to allow time to work through any snags. You may submit as many times as you like; we will grade the latest submission. Submitting a stable but unpolished/unfinished version is like an insurance policy. If the unexpected happens and you miss the deadline to submit your final version, the earlier submission will earn points. Without a submission, we cannot grade your work. You can confirm the timestamp of your latest submission in your course gradebook.

Grading

Here is a recap of the work that will be graded on this assignment:

  • questions.txt: answer all of the questions.
  • mcryptfile.hh and mcryptfile.cc: flesh out the MCryptFile class.

We will grade your code using the provided sanity check tests and possibly additional autograder tests. We will also review your code for additional errors as well as style and complexity. Check out our course style guide for tips and guidelines for writing code with good style!