Memory Resource Management in VMware ESX Server (Waldspurger)

ESX Server sits b/n the OSs (each of which runs in its own virtual machine (VM)) and the hardware.
Each VM runs some unmodified commodity OS.
-- Ballooning: reclaims the pages considered least valuable by the guest OS
-- Idle Memory Tax: achieves efficient mem utilization while maintaining performance isolation guarantees
-- Content-based page sharing: exploits transparent page remapping to eliminate redundancy
-- Hot I/O page remapping: builds on the same transparent page remapping, mainly to reduce copying overheads
--> Goal: efficiently support VM workloads that OVERCOMMIT mem

INTRO:
Each VM is given the illusion of being a dedicated phys machine that is fully protected and isolated from other VMs; VMs encapsulate the entire state of a running system (including both user-level apps & kernel-mode OS svcs).
Indiv servers are underutilized --> consolidate servers as VMs on 1 phys svr.
Want to be able to flexibly overcommit mem, CPU & other resources while still providing resource guarantees to VMs of varying importance.
VMM = software layer; virtualizes hw resources; exports a virtual hw i/f that reflects the underlying actual hw i/f.
ESX: designed to mux hw resources efficiently among VMs
-- manages system hw directly --> provides higher I/O perf & complete control over resource mgmt (vs. a hosted VM arch, where the VMM goes through a host OS for I/O, e.g. by making a read() call to it)
High-level resource management policies compute a target mem allocation for each VM on the basis of specified params & system load.
Lower-level mechs are invoked to reclaim mem from VMs as necessary; a bkgd activity exploits oppties to share identical pages b/n VMs.
MEMORY VIRTUALIZATION:
-- guest OS expects a zero-based phys addr space
-- ESX gives each VM this illusion
machine address: actual hw mem
physical address: sw abstraction that gives the VM the illusion of hw mem
PMAP DS: maintained for each VM to xlate PPNs to MPNs
-- VM issues instructions to update its (guest) OS PT or the TLB
-- these are INTERCEPTED
-- separate shadow page tables contain VPN-to-MPN mappings
   ==> used by the processor
   ==> kept consistent with the mappings in the PMAP
-- HARDWARE TLB caches VPN-to-MPN mappings, so no addt'l OH for "ordinary mem refs"
B/c of the extra level of indirection, ESX can remap a physical page by changing the MPN to which it maps -- and the VM is none the wiser!
Server can also monitor or interpose on guest mem accesses.

RECLAMATION MECHANISMS:
ESX supports overcommitment of mem so that it can achieve a HIGHER DEGREE of SERVER CONSOLIDATION.
Overcommitment == total size configured for all running VMs > total amt of actual machine mem.
Each VM is given the illusion of having a fixed amount of physical memory; this MAX SIZE is a configuration param; it remains constant during the lifetime of a guest OS (where "lifetime" == as long as the OS is up and running).
If mem is not overcommitted, a VM will be alloc'd its MAX SIZE.

PAGE REPLACEMENT ISSUES:
When mem is OVERCOMMITTED, ESX must reclaim space from >= 1 VM.
--> standard approach was for the VMM to page some VM "physical" pages out to swap on disk
--> VMM must choose which VM to revoke from
--> VMM must choose which pages from that VM to revoke
== BUT ALL KNOWLEDGE a/b page use is local to the VM (its guest OS)... so the VMM's resource mgmt decisions are uninformed.
Plus this introduces the double paging prob: VMM chooses page X to replace, swaps that page out, and gives the backing MPN to some other VM; then the guest OS that owns page X also chooses page X to replace (not realizing that page X has already been swapped out by the VMM), so the guest causes page X to be swapped IN by the VMM just so that the guest can swap X out to its own virtual disk --> UGLY!!!
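To make the extra level of indirection from the MEMORY VIRTUALIZATION notes concrete, here is a minimal Python sketch (a toy model, not ESX code): a guest page table (VPN -> PPN), a per-VM pmap (PPN -> MPN), and a shadow page table (VPN -> MPN) that is kept consistent with both. The class and method names (VirtualMachine, guest_map, set_backing, translate) are made up for illustration.

```python
class VirtualMachine:
    """Toy model of one VM's address translation state (hypothetical names)."""

    def __init__(self):
        self.guest_pt = {}   # VPN -> PPN, maintained by the guest OS
        self.pmap = {}       # PPN -> MPN, maintained by the VMM
        self.shadow_pt = {}  # VPN -> MPN, what the processor actually uses

    def guest_map(self, vpn, ppn):
        """Intercepted guest page-table update: record it, refresh the shadow entry."""
        self.guest_pt[vpn] = ppn
        if ppn in self.pmap:
            self.shadow_pt[vpn] = self.pmap[ppn]
        else:
            self.shadow_pt.pop(vpn, None)

    def set_backing(self, ppn, mpn):
        """VMM backs (or re-backs) guest 'physical' page ppn with machine page mpn."""
        self.pmap[ppn] = mpn
        # Keep every shadow entry that refers to this PPN consistent with the pmap.
        for vpn, mapped_ppn in self.guest_pt.items():
            if mapped_ppn == ppn:
                self.shadow_pt[vpn] = mpn

    def translate(self, vpn):
        """What the hardware would resolve: VPN straight to MPN."""
        return self.shadow_pt[vpn]
```

Calling set_backing() a second time for the same PPN changes what translate() returns while the guest's own VPN -> PPN entry stays untouched, which is the "VM is none the wiser" property.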
BALLOONING: coax the guest OS into cooperating in the effort to reclaim some pages
Balloon module is loaded into each guest OS; the module (really a pseudo-device driver) has no external i/f within the guest and communicates with the ESX server over a private channel.
-- ESX server may INFLATE the balloon: the module requests some # of pages from the guest OS, which (if mem is tight) causes the guest OS to choose some pages for eviction; the module then passes the PPNs it was given to ESX, which figures out the MPNs and grabs them for use by other VMs
-- or the ESX server may DEFLATE the balloon, causing the module to deallocate some of the pages the guest OS allocated to it (not sure exactly HOW this works...; how do you deallocate pages? the paper only says that deflating frees up the mem for general use within the guest)
Guest OS should never touch any phys mem it allocs to the balloon module; when the VMM reclaims a ballooned page, it notes in that PPN's PMAP entry that the page is ballooned (the guest OS still thinks the page is alloc'd for use by the driver); so if the guest OS does try to touch this PPN, the access faults and the VMM allocs another machine page to back it.
Balloon drivers for Linux, FreeBSD, Windows poll ESX once per second to obtain the TARGET BALLOON SIZE for their guest OS.
They limit their allocation rates adaptively so as not to freak out the guest OS.
Use standard kernel i/fs to get pages from the guest OS:
-- get_free_page() in Linux
-- MmAllocatePagesForMdl() or MmProbeAndLockPages() in Windows
If guest OSs could support hot-pluggable mem cards, then could insert virtual mem cards into a VM in order to rapidly give it lots more mem (or remove them to take mem away).
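The inflate/deflate loop in the BALLOONING notes can be sketched roughly as follows. This is a toy Python model, not the real driver: Guest stands in for the guest OS page allocator, max_step is an assumed per-poll rate limit, and deflation is modeled as the driver simply freeing its pinned pages back to the guest allocator (an assumption about the mechanism).

```python
class Guest:
    """Stand-in for the guest OS page allocator (hypothetical)."""

    def __init__(self, num_pages):
        self.free = list(range(num_pages))  # free guest PPNs

    def alloc_page(self):
        # A real guest would run its own replacement policy when memory is tight.
        return self.free.pop() if self.free else None

    def free_page(self, ppn):
        self.free.append(ppn)


class BalloonDriver:
    def __init__(self, guest, max_step=4):
        self.guest = guest
        self.max_step = max_step  # assumed cap on pages moved per poll
        self.held = []            # PPNs pinned inside the balloon

    def poll(self, target):
        """Move the balloon toward `target` pages.

        Returns (inflated, deflated): the PPNs to report to the VMM so it can
        reclaim, or give back, the machine pages behind them.
        """
        inflated, deflated = [], []
        while len(self.held) < target and len(inflated) < self.max_step:
            ppn = self.guest.alloc_page()
            if ppn is None:
                break  # guest has nothing left to give right now
            self.held.append(ppn)
            inflated.append(ppn)
        while len(self.held) > target and len(deflated) < self.max_step:
            ppn = self.held.pop()
            self.guest.free_page(ppn)
            deflated.append(ppn)
        return inflated, deflated
```

Because each poll moves at most max_step pages, a large change in the target is reached over several polls rather than all at once, which is one simple way to avoid shocking the guest.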
DBENCH benchmark: ballooning OH is kept to a minimum, so a system with 128 MB performs similarly to one configured with 256 MB that is ballooned down to only 128 MB available; the remaining OH is due to guest OS DSs that are sized on the basis of the amt of phys mem; so when phys mem == 256 MB (even though only 128 MB is actually usable), the kernel uses more mem to manage the larger mem size.
BUT: (1) balloon driver may be uninstalled, disabled or unavailable (e.g. while the guest OS is booting), or (2) it may be unable to reclaim mem quickly enough.

DEMAND PAGING:
When ballooning is not possible or insufficient, the system falls back to a paging mechanism; mem is reclaimed by paging out to an ESX server swap area on disk (the guest OS is none the wiser).
ESX swap daemon keeps track of target swap levels for each VM
-- manages selection of candidate pages
-- coordinates asynch page-outs to the swap area on disk
Uses a randomized page replacement policy, to avoid bad interactions with the guest's own paging such as the double paging prob described earlier.
Paging is expected to be uncommon.

SHARING MEMORY:
Several VMs might be running the same guest OS, the same apps or have the same components loaded --> lots of oppty for sharing mem!
--> so server workloads running in VMs can consume less mem than they would otherwise need
--> supports more overcommitment
Transparent Page Sharing (DISCO): once copies (of pages used by > 1 VM) are identified, multiple guest "physical" pages are mapped to the SAME MACHINE PAGE and marked Copy-On-Write. (Writing to a shared page causes a fault that generates a private copy.)
To find the copies, DISCO had to modify the guest OSs
--> i.e. modify guest OSs to identify redundant copies as they're created
--> also changes to APIs
In order to not have to modify the guest OSs, use...
---------------------------
Content-Based Page Sharing: identify page copies based on their contents
---------------------------
(1) no need to modify/hook/understand guest OS code
(2) can identify MORE oppties for sharing (all potentially shareable pages can be identified by their contents, even pages that DISCO might not have recognized as shareable)
COST: work must be done to SCAN for sharing opportunities
Hashing is used to identify pages with potentially-identical contents.
Hash table contains entries for other pages that have already been marked COW; for a new page, take its hash and use that as the lookup key into the table.
-- if a hash match is found, do a full comparison of the new page's contents with the COW page's
-- if the contents match, map the new page to the existing MPN
   --> any subsequent attempt to write to *any* page mapped to that COW page will result in a fault, which transparently creates a private copy of the page for the writer
-- if no match, make a HINT ENTRY for this new page; if any future page's hash matches this HINT page's, then rehash the hint page to see if its hash is still the same; if it is, compare the two pages, and if they are the same, create a COW entry & share the page; if the hash (of the orig page) has changed, remove the HINT ENTRY of the orig page
   ==> but do we create a hint entry for the matching page??
Scanning for copies (building up COW and hint entries):
-- scan pages incrementally (sequentially, randomly or heuristically)
   ==> ESX scans pages RANDOMLY
   ==> the longer a page is alive, the more likely it is to have been selected for scanning
-- can scan only during idle cycles to limit CPU overhead
-- can configure max per-VM and system-wide scanning rates
SYSTEM ALWAYS ATTEMPTS TO SHARE A PAGE BEFORE PAGING THAT PAGE OUT TO DISK.
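The hash / hint / COW flow in the content-based page sharing notes above can be sketched as below. This is a minimal Python model under stated assumptions, not the ESX implementation: page_hash is a stand-in (SHA-1 truncated to 8 bytes), PageSharer and its methods are hypothetical names, and when a hint turns out to be stale the sketch simply replaces it with a hint for the newly scanned page (the notes leave that question open).

```python
import hashlib


def page_hash(data):
    # Stand-in 64-bit hash (truncated SHA-1); not the hash ESX uses.
    return hashlib.sha1(data).digest()[:8]


class PageSharer:
    def __init__(self):
        self.memory = {}   # MPN -> page contents (bytes)
        self.pmap = {}     # (vm, ppn) -> MPN
        self.shared = {}   # hash -> [MPN, refcount]  (COW frames)
        self.hints = {}    # hash -> (vm, ppn)        (hint frames)
        self.next_mpn = 0

    def _alloc(self, data):
        mpn, self.next_mpn = self.next_mpn, self.next_mpn + 1
        self.memory[mpn] = data
        return mpn

    def add_page(self, vm, ppn, data):
        self.pmap[(vm, ppn)] = self._alloc(data)

    def _share(self, vm, ppn, h):
        # Remap the guest page onto the shared MPN and reclaim its old copy.
        old = self.pmap[(vm, ppn)]
        frame = self.shared[h]
        self.pmap[(vm, ppn)] = frame[0]
        frame[1] += 1
        del self.memory[old]

    def scan(self, vm, ppn):
        """Try to share one page; return True if it now maps to a COW page."""
        mpn = self.pmap[(vm, ppn)]
        data = self.memory[mpn]
        h = page_hash(data)
        frame = self.shared.get(h)
        if frame is not None:
            if frame[0] == mpn:
                return True                 # already shared
            if self.memory[frame[0]] != data:
                return False                # false match: ineligible
            self._share(vm, ppn, h)
            return True
        hint = self.hints.get(h)
        if hint is not None and hint != (vm, ppn):
            hint_mpn = self.pmap[hint]
            if page_hash(self.memory[hint_mpn]) == h:   # hint still valid
                if self.memory[hint_mpn] != data:
                    return False
                del self.hints[h]
                self.shared[h] = [hint_mpn, 1]          # promote hint to COW
                self._share(vm, ppn, h)
                return True
            # Stale hint: falls through and is overwritten below (assumption).
        self.hints[h] = (vm, ppn)
        return False

    def write(self, vm, ppn, data):
        mpn = self.pmap[(vm, ppn)]
        h = page_hash(self.memory[mpn])
        frame = self.shared.get(h)
        if frame is not None and frame[0] == mpn:
            # COW fault: the writer gets a private copy.
            frame[1] -= 1
            if frame[1] == 0:
                del self.shared[h]
                del self.memory[mpn]
            self.pmap[(vm, ppn)] = self._alloc(data)
        else:
            self.memory[mpn] = data
```

Scanning three identical pages from three VMs leaves a single machine page behind; a later write by one of the VMs gives that VM a private copy while the other two keep sharing.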
So the first entry in the hash table for a given hash will be a hint entry; then if a second page matches that hint entry, we rehash the orig page, and if its hash hasn't changed, do a full page comparison b/n the orig page & the 2nd page; if the two pages match completely, we convert that hint entry to a COW entry.
The scanner works through pages incrementally (during idle CPU cycles only), either creating hint entries for them or sharing them with existing COW pages.

ONE GLOBAL HASH TABLE: contains frames for all scanned pages (either COW or hint); each frame is encoded in 16 bytes.
Shared frame:
-- hash value (8 B)
-- MPN for shared page (4 B)
-- ref count (2 B): # of pages sharing this page
-- link for chaining (2 B)
Hint frame:
-- truncated hash value (4 B)
-- PPN (4 B)
-- VM ID (4 B)
-- MPN (4 B)
Ref count: keeps track of the # of guest pages mapped to this shared page
-- separate overflow table stores extended frames for ref counts > 2^16 - 1
-- e.g. the zero page (filled with zeros) is shared w/ a large ref count
If a page has the same hash value as a shared page but different contents (a false match), then we consider it INELIGIBLE for sharing.
Most sharing (for the experiment with various #s of VMs running the same OS and same apps) is due to REDUNDANT CODE and READ-ONLY data pages.
CPU OH due to page sharing was negligible; throughput was roughly the same with page sharing enabled or disabled, and in fact slightly higher with page sharing enabled, possibly b/c (1) page sharing improves mem locality and (2) may therefore increase hit rates in physically-indexed caches.
How page sharing MAY improve memory locality:
-- w/ page sharing, theoretically have fewer dupes of pages in the cache, so more room for other pages, so better hit rate...

SHARES vs. WORKING SETs:
Balance b/n maximizing overall system mem performance AND enforcing guarantees of service (some proportional-share allocation system).
A param controls the relative importance of those two goals -- the idle memory tax rate; if this == 0, then have pure proportional share; as it approaches 1 (i.e. 100%), have pure "maximize sys mem perf" (all of a client's idle mem may be reclaimed).
Resource rights are encapsulated as SHARES, which are owned by CLIENTs; a client is guaranteed a minimum resource fraction equal to its fraction of the total shares in the system.
Client allocs degrade gracefully in overload situations, and clients benefit proportionally when there are extra resources b/c some allocs are underutilized.

MIN-FUNDING REVOCATION ALGORITHM:
When one client demands more space, a replacement algo selects a victim client that relinquishes some of its previously-alloc'd space.
Mem is revoked from the client that owns the fewest shares per allocated page (whichever client is getting the most bang for his buck has to give some of that BANG up).

RECLAIMING IDLE MEMORY:
Limitation of pure proportional-share algos is that they don't incorporate any info about ACTIVE MEMORY USAGE or working sets; idle clients with many shares can hoard mem unproductively.
IDLE MEMORY TAX: charge a client more for an idle page THAN for a page that the client is actively using; when mem is scarce, pages will be reclaimed preferentially from clients that are not actively using their full allocs (perhaps by ballooning, such that the CLIENT can ID the inactive pages and boot them out).
TAX RATE: specifies the maximum fraction of idle pages that may be reclaimed from a client; if the tax rate is 75% (the default), then we can take up to 3/4 of a client's idle pages (so we need a way of tracking how much of its mem each client is actively using).

MEASURING IDLE MEMORY:
Don't need to individually identify active & idle pages; just use statistical sampling; each VM is sampled independently using a configurable sampling period defined in units of VM exec time.
N pages are selected @ random each period; the cached mappings associated with each sampled page's PPN (HW TLB entries, ...) are invalidated; the next guest access to such a page is intercepted and the touched-page count T is incremented.
Estimate of the fraction of mem actively accessed == T / N.
Estimates are smoothed across multiple sampling periods:
-- slow moving average: produces a smooth, stable estimate
-- fast moving average: adapts quickly to working set changes
-- plus one other thing: a version of the fast average that also incorporates counts from the current (not yet completed) sampling period
Server uses the max of these three values to estimate the amt of mem being actively used by the guest.
So the estimate responds quickly when a VM abruptly starts using more mem, but decays only slowly when usage drops.

PARAMS (configurable) -- to control alloc of mem TO EACH VM:
(1) min size: guaranteed lower bound on the amt of mem alloc'd to the VM
(2) max size: amt of phys mem config'd for use by the guest OS; if mem is not overcommitted, the VM will be alloc'd this amt
(3) memory shares: entitle a VM to a fraction of physical mem based on a proportional-share alloc policy

ADMISSION CONTROL: ensures that sufficient unreserved mem and server swap space are available before a VM is allowed to power ON.
Machine mem must be reserved for the guaranteed min size plus the OH required for virtualization (pmap DS, etc. -- usually a/b 32 MB).
Disk swap space must be reserved for the remaining VM mem (max - min).

DYNAMIC REALLOC: ESX recomputes mem allocs dynamically in response to changes to system-wide or per-VM alloc params, the addition or removal of a VM from the sys, & changes in the amt of free mem that cross thresholds.
Rebalancing is also performed periodically to reflect changes in the idle mem estimates for each VM.
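To tie the IDLE MEMORY TAX, MIN-FUNDING REVOCATION and T / N notes together, here is a small Python sketch. It uses the adjusted shares-per-page ratio from the paper's idle memory tax section, rho = S / (P * (f + k * (1 - f))) with idle page cost k = 1 / (1 - tax_rate), where S is the client's shares, P its allocated pages and f its active fraction. The function names and the clients dict layout are made up for illustration, and the smoothing across sampling periods is left out.

```python
def active_fraction(touched, sampled):
    """Estimate f = T / N from one sampling period."""
    return touched / sampled


def adjusted_shares_per_page(shares, pages, f, tax_rate):
    """rho = S / (P * (f + k * (1 - f))), with k = 1 / (1 - tax_rate)."""
    if not 0.0 <= tax_rate < 1.0:
        raise ValueError("tax_rate must be in [0, 1)")
    k = 1.0 / (1.0 - tax_rate)  # cost of an idle page relative to an active one
    return shares / (pages * (f + k * (1.0 - f)))


def pick_victim(clients, tax_rate=0.75):
    """clients: name -> (shares, pages, f). Revoke from the lowest rho."""
    return min(
        clients,
        key=lambda name: adjusted_shares_per_page(*clients[name], tax_rate),
    )
```

With tax_rate = 0 this reduces to plain min-funding revocation (fewest shares per page loses); with a higher tax rate, a client with many shares but mostly idle pages can become the victim instead of a busy client with fewer shares.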
4 thresholds define reclamation states:
(1) HIGH == 6% free mem; no reclamation performed
(2) SOFT == 4% free mem; uses ballooning to reclaim mem; only resorts to paging where ballooning is not possible (e.g. disabled on a VM)
(3) HARD == 2% free mem; uses paging to forcibly reclaim mem
(4) LOW == 1% free mem; continues to reclaim via paging; also blocks execution of all VMs that are ABOVE THEIR TARGET ALLOCATIONS
In all reclamation states, ESX computes target allocs for VMs so as to drive the aggregate amt of free space above the HIGH threshold.
System drops to a lower state when free mem falls below the lower threshold; after reclaiming mem, it transitions back to the next higher state ONLY after significantly exceeding the higher threshold (hysteresis, so the state doesn't flap).
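The threshold states above, including the "only after significantly exceeding" rule, can be sketched as a small state function in Python. The threshold percentages come from the notes; everything else is an assumption for illustration: the margin value that stands in for "significantly", dropping several states in one step when free mem is very low, and climbing at most one state per call.

```python
# Free-memory thresholds in percent of system memory (from the notes).
THRESHOLDS = {"high": 6.0, "soft": 4.0, "hard": 2.0, "low": 1.0}
ORDER = ["high", "soft", "hard", "low"]  # from least to most aggressive

ACTIONS = {
    "high": "no reclamation",
    "soft": "balloon (page only where ballooning is not possible)",
    "hard": "page",
    "low": "page and block VMs above their target allocations",
}


def next_state(state, free_pct, margin=1.0):
    """Return the reclamation state after observing free_pct percent free memory.

    margin is an assumed hysteresis gap: free memory must exceed the next
    higher state's threshold by this much before the system moves back up.
    """
    start = ORDER.index(state)
    i = start
    # Drop to a lower state whenever free memory is below that state's threshold.
    while i + 1 < len(ORDER) and free_pct < THRESHOLDS[ORDER[i + 1]]:
        i += 1
    # Climb one state only if we did not drop and clearly exceed the higher threshold.
    if i == start and i > 0 and free_pct > THRESHOLDS[ORDER[i - 1]] + margin:
        i -= 1
    return ORDER[i]
```

For example, a system in the soft state stays there at 6.2% free and only returns to high once free mem clears 6% plus the margin, so small fluctuations around a threshold do not cause repeated state changes.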