Some notes Ch. 2 : Data Storage

- DBMS can deal with v. large amts of data efficiently
- how does a system store and manage a lot of data?
- what DSs best support efficient manipulation of this data?
- technology for storing a lot of data
- devices used to store info
- memory hierarchy (determines the efficiency of algos that move data b/n
  main mem and storage)
- how to lower time to read or write data from disk
- how to improve reliability of disks
- what to do about intermittent read/write errors or disk crashes

=======================
A. The Memory Hierarchy
=======================

Comp usually has different components in which data can be stored
- these components differ in size and also in access speed; also differ in
  cost per byte
- devs with smallest capacities have fastest access speeds and highest cost
  per byte

---------
(1) Cache: is a chip or part of the processor's chip
---------
- can hold data or machine instructions
- data are a copy of portions of main mem
- may change values in cache w/o immediately changing value in main mem
- unit of xfer b/n cache and main mem is usually some small # of bytes
- cache itself may be divided into two levels:
  -- on-board cache: found on same chip as processor itself
  -- level-2 cache: found on another chip
- when a machine execs an instruction, it looks for both the instructions
  and the data used by those instructions in the cache; if it doesn't find
  them there, it looks in main mem and copies the instructions or data from
  there to the cache
- when adding to cache may need to remove something else first; if remove
  something that has changed while it's been in the cache, need to update
  the new value to main mem before blowing the cached value away
- if just have one CPU and change cache, don't need to update main mem with
  changes til flush that value out of the cache
- if have multiple CPUs though may need to have cache updates "write through"
  which is to say that all changes to cache are pushed to main mem
  immediately
- 2000: typical caches have capacities of up to 1 MB
- data can be moved b/n cache and CPU at around 10 ns per xfer
- moving instructions/data b/n cache and main mem takes around 100 ns per xfer

---------------
(2) Main Memory
---------------
- everything that happens in a comp (execution of instructions, alteration
  of data) == working on info that exists in main mem
- 1999: machines config'd with around 100 MB of main mem (RAM)
- may have machines with 10 GB of main mem
- main mems are random access: one can obtain any byte of main mem in the
  same amount of time; access times around 10 - 100 ns

------------------
(3) Virtual Memory
------------------
- when we write progs, the data we use occupies virtual mem space
- instructions occupy an addy space of their own
- many machines use 32-bit addy space; so 2^32 ~= 4 Billion mem locs
- since each byte requires its own addy then a typical virtual mem would be
  4 Billion bytes == 4 GB
- virtual mem space >> main mem space
- so most of content of virtual mem is actually on disk
- disk divided logically into blocks (block size: 4 KB to 56 KB)
- virtual mem moved b/n disk and main mem in blocks (i.e. pages in main mem)
- progs and apps refer to data via virtual memory addys
- main-memory DBMSs also manage their data via virtual mem and rely on the
  OS to bring needed data into mem via paging; this works if data is small
  enough to remain in main mem w/o being swapped out
- so if a DBMS only needs to keep <= (size of main mem) in memory at once,
  then this would work
- most DBMSs need to keep much more data though (which contrasts them to
  typical apps); therefore most DBMSs manage their data directly on disk

---------------------
(4) Secondary Storage: much slower, much bigger vs. main mem
---------------------
- random access
- often disk used as 2ndary storage; usually mag disk; may also be optical
  or magneto-optical disk
  -- optical, magneto-optical may be cheaper but may not support writing
     data to the disk (therefore likely to be used for archival data that
     doesn't need to be changed)
- disk == support for both virtual mem and the file system
- some disk blocks used to hold pages of an app's virtual mem
- other disk blocks used to hold parts of files
- files moved b/n disk and main mem in blocks (under control of FS or DBMS)
- moving block from disk to main mem: disk read
- moving block from main mem to disk: disk write
- disk I/O: disk reads and disk writes
- certain parts of main mem may be used to hold block-sized pieces of files
  (those parts of main mem are said to BUFFER those files)
- may open a file for reading and when do so OS reserves 1 block for the
  file then reads in first 4 KB (e.g.) of file into that block then after
  that 4 KB consumed reads in next 4 KB of the file into the same block,
  ... continuing til the entire file is read or the file is closed
- a DBMS will manage disk blocks itself (rather than calling OS file manager
  to move blocks to and from memory)
- takes 10 - 30 ms to read or write a block on disk
- want the disk block we need to be in main mem buff when we need it
- 1999: single disk units can hold 1 - 10 GB
- machines can use several disk units; so a machine might have 100 GB of disk
- 2ndary mem: 10^5 times slower vs. main mem but has 100 times the capacity
- 2ndary mem: also much cheaper vs. main mem
  --> disks: $.05 to $.10 per MB
  --> main mem: $1.00 to $2.00 per MB

--------------------
(5) Tertiary Storage: what if 100 GB ain't enuf? designed to hold terabytes
--------------------
- may actually have 10^15 bytes (petabytes of data) or 10^12 B (terabytes)
  -- retail stores; satellite images
- tertiary storage: hold terabytes of data
- higher read/write times than 2ndary storage but cheaper and higher capacity
- widely varying access times
  -- depends on how close to a read/write point the desired data is

Kinds of tertiary storage devs:
-------------------------------
(a) ad-hoc tape storage:
- put data on tape reels or cassettes; store cassettes in racks;
- when data needed, human operator locates & mounts tape on a tape reader
- wind tape to correct location then read from it into 2ndary storage
- writing uses same process to find location then copies from disk

(b) optical-disk juke boxes:
- juke box == racks of CD-ROMs
- read bits via shining laser on a spot and seeing whether light is reflected
- robotic arm can extract any one CD-ROM and move it to a CD reader
- then read contents of CD into 2ndary mem
- can't write onto CD w/o using special equipment

(c) tape silos
- silo == room-sized device; holds racks of tapes
- tapes accessed via robotic arms which bring a tape to one of many tape
  readers
- computer controlled inventory and automated tape-retrieval
- at least an order of magnitude faster vs. (a)

Capacity of a tape cassette: >= 50 GB; so tape silos can hold many TBs;
CDs can hold 2/3 of a GB; next-gen CDs can hold 2.5 GBs; so can have CD-ROM
jukeboxes holding multi-TBs

Time to access data from tertiary storage: [seconds, minutes]
- robotic arm in a jukebox or silo: several seconds for tape retrieval
- human operation: several minutes to locate & retrieve desired tape
- once a CD is loaded, any part can be accessed in a fraction of a second
- for tapes, may take longer b/c have to move sequentially to the correct
  location

Access times: 1000 times slower vs. 2ndary mem access (seconds vs.
milliseconds)

------------------------------------
(6) Volatile and Nonvolatile storage
------------------------------------
magnetic materials: hold their magnetism in absence of power
- so mag disks, mag tapes are nonvolatile
optical devices hold black/white dots even in absence of power
- so CDs nonvolatile
All 2ndary and Tertiary storage devs nonvolatile

Main mem: volatile; mem chip can be designed with simpler circuits if the
value of the bit is allowed to degrade over time; simplicity --> cheapness

DRAM: must have entire contents read and rewritten periodically; so if
powered off then refresh doesn't occur

DBMS: that runs on a machine with volatile main mem must back up every
change to disk

Could alternatively use non-volatile mem (flash mem);
Could build a RAM disk from conventional mem chips by providing a battery
backup to the main power supply (so if have to move to battery power then at
that point copy contents of RAM to disk)

========
B. Disks
========

----------------------
(1) Mechanics of Disks
----------------------
pieces of a disk drive which move: disk assembly, head assembly
- disk assembly: consists of one or more circular platters (each w/diam 3.5
  inches); these rotate around a central spindle
  -- upper and lower surfaces of platter covered with mag material which
     stores 1s, 0s
  -- 0: orient magnetism in small area in one direction
  -- 1: orient magnetism in small area in other direction
  -- these locations (small areas) organized into tracks
- track: concentric circle containing bits;
- tracks organized into sectors
- sector: an indivisible unit; a portion of a track;
- each track consists of same # of sectors
- might read several sectors -- each of which are located at the same
  relative spot on platters which share the same spindle -- at once
- if some location is busted, then the entire sector containing this portion
  cannot be used
- gaps which in total equal about 10% of the track used to identify
  beginnings of sectors
- blocks may consist of
  one or more sectors (blocks are logical units)
- disk head assembly: one head for each surface; all heads point to same
  relative spot on all disks at the same time
  -- i.e. each points to same sector on same track on its respective disk
- the disk heads shouldn't actually touch the surface (else we get head crash)
- head reads the magnetism passing underneath it
- head can also alter the magnetism to write to the disk

-------------------
(2) Disk Controller: controls one or more disk drives; is a small processor
-------------------
- controls mechanical actuator that moves the head assembly
  -- positions heads at a particular radius (track selection)
     --> so all heads will point to same track simultaneously (each to
         track # X on its disk)
     --> all of these tracks taken together: cylinder
  -- at this radius 1 track from each surface will be under the head for
     that surface and thus will be readable and writeable
- selects surface from which to read or write; selects sector from the track
  pointed to by the desired head
  -- rotate disk so that correct sector from selected track is under head
  -- then controller knows when desired sector is under head
  -- rotating spindle rotates all connected disks at once
- transfers bits read from desired sector to comp's main mem or transfers
  bits from main mem to intended disk sector

--------------------------------
(3) Disk Storage Characteristics
--------------------------------
(a) rotation speed of disk assembly: e.g. 5400 RPM == one rotation every 11 ms
(b) number of platters per unit: e.g. 5 platters == 10 surfaces;
(c) number of tracks per surface: e.g.
    10,000
(d) number of bytes per track: 12 - 500 sectors/track; 512 - 4096 B/sector
commonly: 10^5 bytes/track * 10,000 tracks/surface * 10 surfaces
          == 10^10 B = 10 GB

-----------
Example 2.1: Megatron 747 disk
-----------
- 4 platters and 8 surfaces total
- 2^13 tracks/surface
- 2^8 sectors/track (average)
- 2^9 bytes/sector
- blocks are 2^12 = 4096 bytes
  --> so 1 block covers 8 sectors b/c 2^12/2^9 == 2^3

Capacity = (8 surfaces) * (2^13 tracks/surface) * (2^8 sectors/track)
           * (2^9 bytes/sector)
         = 2^3 * 2^13 * 2^8 * 2^9 == 2^33 bytes

One track holds: (2^8 * 2^9) == 2^17 bytes

The disk has diameter 3.5 inches; so the length of the outermost track is:
3.5 * pi
- so outermost track is: 10.99 inches ~= 11 inches
- and 90% of that distance is able to hold the track's data; so that's
  11 * .9 --> 9.9 inches
- and in those 9.9 inches, we hold 2^17 bytes
- so the density is:
  ( 2^17 bytes ) * ( 2^3 bits/byte ) / 9.9 inches ~= 105,916 bits/inch
  --> so approximate density is 100k bits / inch
- the innermost track is: 1.5 inches * pi == 4.71 inches
- and 90% of that distance is able to hold track's data:
  4.71 * .9 ~= 4.24 inches
- and in those 4.24 inches, we hold 2^17 bytes
- so the density is:
  ( 2^17 bytes ) * ( 2^3 bits/byte ) / 4.24 inches ~= 247,305 bits/inch
  --> so approximate density is 250k bits / inch

-------------------------------
(4) Disk access characteristics
-------------------------------
Blocks are read or written when:
- heads positioned at cylinder containing track on which block is located
  and
- sectors containing block move under disk head (rotates disk assembly)

latency: time b/n command to read a block and when block contents available
in main mem
- time by processor and disk controller to process the request
  -- is a fraction of a millisecond; neglecting this
  -- also neglecting time due to contention for disk controller
- time to position head assembly at proper cylinder [seek time] (selecting
  the correct track)
  -- time to start + time to move + time to
     stop
  -- a few milliseconds to 10 - 40 milliseconds
  -- often use average seek time
- time for disk to rotate (so first of sectors containing block reaches the
  head) (selecting the correct sector) --> rotational latency
  -- on average desired sector will be 1/2 way around
  -- on average time to rotate full way around is 10 ms
  -- so on average 5 ms
- time to transfer: for a 4096-B block is < .5 ms
  -- typical disk has 100,000 B/track and rotates once every 10 ms
     --> 10 MB/s
  -- time to rotate sectors of the block past the head plus any gaps b/n
     them
  -- is a function of read speed

-----------
Example 2.3:
-----------
- blocks are 4096 bytes
- 3840 RPM --> (3840 revs/min) * (1 min/60 s) == 64 revs/s
  1 s / 64 revs == .015625 seconds for 1 rev --> 15.6 ms per rev
- time over useful data: (15.6 ms) * .9 ~= 14.04 ms
- time over gaps: (15.6 ms) * .1 ~= 1.56 ms
- moving between cylinders: ( 1 + n/500 ) ms where n == # of cylinders
  traveled
  so to move 1 cylinder requires time: (1 + 1/500) = 1.002 ms
  so to move across all 8192 cylinders (n == 8191 cylinders traveled)
  requires time: (1 + 8191/500) = 17.382 ms

- min time to read one block = seek + rot_late + xfer
  assume seek is 0; rot is 0; so is just xfer time
  need to get 8 sectors; so we have 7 gaps to travel over + 8 reads
  there are 256 sectors and 256 gaps around the circle
  the gaps cover 1/10 * 360 degrees == 36 degrees
  the sectors cover 9/10 * 360 degrees == 324 degrees
  so the total degrees we need to read are:
    7/256 * 36 + 8/256 * 324 == 11.11 degrees
  so to get the fraction of a rotation needed, we take: 11.11/360
  which equals: 0.03086
  and then we multiply that by the time for a full revolution
    0.03086 * 15.6 ms == 0.4814 ms ~= 0.5 ms

- max time to read one block
  17.4 ms + 15.6 ms + 0.5 ms == 33.5 ms

- avg time to read one block
  average # of cylinders to move is
  - take each of possible starting positions
  - add up distances to every other possible position
  - sum up total
  - divide it by # of combos we have
    (# starting positions) * (# moving-to positions)
  --> get that on average
      we'll move 2730 cylinders
  so seek is: (1 + 2730/500) ~= 6.46 ms
  then total avg time is: 6.46 ms + 15.6/2 ms + .5 ms == 14.76 ms

------------------
(5) Writing blocks: just like reading a block
------------------
- if want to verify that the block was written correctly then have to add in
  another rotation and read each sector back to make sure that what was
  intended to be written was actually written

--------------------
(6) Modifying blocks: cannot modify a block on disk directly
--------------------
- read block into main mem
- change that location as desired
- write new contents of block back out to disk
- maybe verify that write was done correctly

======================================
C. Using Secondary Storage Effectively
======================================

RAM model: assume that data is in main mem; access to any item of data takes
same amt of time

for DBMS, cannot make this assumption; instead assume data does not fit into
main mem
- so when designing algos must assume 2ndary, tertiary storage used
- best algos assuming RAM model not the same as best algos w/o that
  assumption

Here considering mostly interaction b/n main and 2ndary mem
- want algos that limit disk accesses
- instead prefer to exec more instructions on items in main mem

Cannot design algos w/o taking underlying hw into consideration -- loss of
abstraction!
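
Sketch (mine, not from the book): since the algos in this section have to respect the hw numbers, here is the Example 2.3 read-time arithmetic from section B redone in Python. Assumes the Megatron 747 figures recorded in these notes; the function names are made up for illustration, and the 2730-cylinder average seek distance is taken from the notes rather than derived. Unrounded results come out slightly below the hand-rounded 0.5 / 33.5 / 14.76 ms.

```python
# Parameters as recorded in the notes (Examples 2.1 and 2.3).
RPM = 3840
CYLINDERS = 8192
SECTORS_PER_TRACK = 256
SECTORS_PER_BLOCK = 8
GAP_FRACTION = 0.10           # ~10% of each track is gaps
AVG_SEEK_CYLINDERS = 2730     # average distance quoted in the notes


def rotation_ms():
    """Time for one full revolution, in ms."""
    return 60_000 / RPM


def seek_ms(n):
    """Seek rule from the notes: 1 + n/500 ms for n cylinders traveled."""
    return 0.0 if n == 0 else 1 + n / 500


def transfer_ms():
    """Time for 8 sectors plus the 7 gaps between them to pass the head."""
    gaps = (SECTORS_PER_BLOCK - 1) / SECTORS_PER_TRACK * GAP_FRACTION
    data = SECTORS_PER_BLOCK / SECTORS_PER_TRACK * (1 - GAP_FRACTION)
    return (gaps + data) * rotation_ms()


def min_read_ms():
    # Best case: head already on the cylinder, block about to arrive.
    return transfer_ms()


def max_read_ms():
    # Worst case: full-stroke seek plus a full rotation before the block.
    return seek_ms(CYLINDERS - 1) + rotation_ms() + transfer_ms()


def avg_read_ms():
    # Average case: average seek plus half a rotation.
    return seek_ms(AVG_SEEK_CYLINDERS) + rotation_ms() / 2 + transfer_ms()


if __name__ == "__main__":
    print(f"min {min_read_ms():.2f} ms")
    print(f"max {max_read_ms():.2f} ms")
    print(f"avg {avg_read_ms():.2f} ms")
```

Even the best case is a sizeable fraction of a millisecond, vs. the 10 - 100 ns main mem access time noted in section A.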
--------------------------------
(1) The I/O model of computation
--------------------------------
Imagine: 1 comp running a DBMS; some # of users; 1 CPU, 1 disk controller,
1 disk
- DB too large to fit into main mem;
- average time to read / write a block is 15 ms
- many users; disk controller will have a queue of requests (FCFS)
- each request for a given user will appear random (since disk controller
  alternating b/n users in servicing them)

Dominance of I/O cost:
- if block needs to be moved b/n disk and main mem, then cost of this
  movement >> cost of whatever ops performed on data
- ergo approximate time needed by algo via # of block accesses
  -- want to minimize this; by "block accesses" we mean mvmt b/n disk &
     main mem

E.g. suppose have a DB w/relation R and query asks for a tuple of R having
some key value k
- want to have an index on R which identifies which disk block(s) associated
  with which key values
- don't care where in the block the tuple precisely appears
  -- b/c cost of getting block into main mem much more expensive than cost
     of searching block for k

-------------------------------------
(2) Sorting data in secondary storage
-------------------------------------
How to sort when data being sorted >> size of main mem?

Sample problem:
- relation R which has 10,000,000 tuples (each of which is 100 B);
- each tuple represented by a record with several fields, one of which is
  the key (where that key will be the basis on which sorting performed)
- want to order records in increasing order of sorting keys
- assume sort keys unique though they are not necessarily so

Relation therefore occupies 1 Gig.
Machine has 50 MB of main mem and one disk; disk blocks 4096 B; can put 40
tuples on a single block;
So main mem can hold 12,800 blocks --> 512,000 tuples
relation occupies: 250,000 blocks --> 10M tuples

If data did fit into main mem, can use quicksort, ...; also would sort only
the key fields and those fields would have pointers to the full records;
When keys in sorted order, then use pointers to move records

If have to use 2ndary mem, though: move each block b/n main mem and 2ndary
mem some small # of times; making 'passes'
- in one pass, every record read into main mem and written out to disk once

--------------
(3) Merge sort: O( n * lg(n) )
--------------
Suppose we have 2 sorted lists, each has 4 records. Compare head of each
list and remove smallest item; repeat til both lists are empty.
Time to merge is linear in the combined sizes of the component lists

Basis: if there is a list of one element to be sorted, do nothing
Inductive step: if the list has more than one element, divide the list
arbitrarily into two lists of equal (or nearly equal) length; recursively
sort the two sublists; then merge the resulting sorted lists into one
sorted list

-----------------------------------
(4) Two-phase, multi-way merge sort
-----------------------------------
Phase 1: sort main-mem-sized pieces of data
- every record is part of a sorted list
- the sorted list just fits in available main mem
Phase 2: merge all sorted sublists into a single sorted list

Will use a hybrid: get a main memory fill of data; sort it using quicksort;
repeat for as many main-memory fills as have
1. fill all of main mem w/blocks from relation
2. sort the records in main mem (using quicksort, e.g.)
3. write sorted records from main mem onto new blocks of 2ndary mem
   - is one sorted sublist
4. still need to merge...
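
Sketch (mine, not from the book) of the two phases. Assumptions: plain Python lists stand in for disk blocks and main mem, `memory_capacity` is a hypothetical parameter giving the # of records that fit in one main-mem fill, and the in-memory sort is Python's built-in `list.sort` (Timsort) swapped in for quicksort. Phase 2 uses the standard library's `heapq.merge`, which repeatedly picks the smallest head element among all sorted sublists.

```python
import heapq


def make_sorted_runs(records, memory_capacity, key=lambda r: r):
    """Phase 1: cut the input into memory-sized fills and sort each one."""
    runs = []
    for start in range(0, len(records), memory_capacity):
        fill = records[start:start + memory_capacity]  # step 1: fill main mem
        fill.sort(key=key)                             # step 2: sort in main mem
        runs.append(fill)                              # step 3: "write" one sorted sublist
    return runs


def merge_runs(runs, key=lambda r: r):
    """Phase 2: multi-way merge of all sorted sublists in a single pass."""
    return list(heapq.merge(*runs, key=key))


def two_phase_sort(records, memory_capacity, key=lambda r: r):
    return merge_runs(make_sorted_runs(records, memory_capacity, key), key)


if __name__ == "__main__":
    data = [5, 3, 8, 1, 9, 2, 7]
    print(make_sorted_runs(data, 3))   # three sorted sublists
    print(two_phase_sort(data, 3))     # fully sorted list
```

A real implementation would stream blocks to and from disk instead of holding every sublist in memory; this only shows the control flow.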
With example above, will have 20 sorted sublists;
- time: read 250k blocks once; wrote 250k blocks once
- assume blocks stored at random on disk; so each block read/write takes
  15 ms
- so first phase takes 15 ms * 500k = 7500 seconds == 125 mins

The merging part:
- could merge them in pairs; but then every block gets read and written
  about log2(n) times, i.e. 2 * log2(n) disk I/Os per block, where
  n = # of sorted sublists
- better: read first block of each sorted sublist into main mem buff
- recall we have 20 sorted sublists; so can easily fit 20 blocks into MM
- use a buffer to hold the output block which contains as many of the first
  elements of the completed sorted list as possible

Merge:
1. find smallest key among first elements of all lists
2. move smallest element to first available position of output block
3. if output block is full, write it to disk and reinit the same buff in
   main mem
4. if block from which smallest element was just taken is empty, read next
   block from its same sorted sublist into that newly-empty buff; if no
   blocks remain from that sorted sublist, leave that buff empty; don't look
   at that buff anymore

Blocks read in an unpredictable order; every block holding records from one
of the sorted lists is read from disk exactly once; so 250k reads; each
record put in an output block; each of these blocks written to disk; so need
250k writes too

So second phase takes 125 mins. too

------------------------------------------------------
(5) Extension of multi-way merging to larger relations
------------------------------------------------------
How large can the sets of records to be sorted be?
- block size is B bytes
- main mem avail for buffering blocks is M bytes
- records take R bytes
- # of buffs in main mem: M/B
- in 2nd phase, need to reserve 1 block as output block so can read from
  (M/B) - 1 sorted sublists
  -- so that's how many sorted sublists we can create from phase 1
- # of times fill main mem with records to be sorted: (M/B) - 1
- each time we fill main mem, we sort M/R records (approximately)
- so total # of records can sort: ( (M/B) - 1 ) * (M/R) ~= M^2/(R*B)

E.g. M = 50 * 1024 * 1024
     B = 4096
     R = 100
     --> (12,800 - 1) * 524,288 ~= 6.7 billion records

If need to sort more records than this, can add a third pass.
- call above algo as a routine up to (M/B) - 1 times
- so able to sort: (M^2)/(R*B) * ( (M/B) - 1 ) ~= M^3/(R*B^2)

=============================================
D. Improving access time of secondary storage
=============================================

Earlier: assumed data stored on a single disk and blocks chosen at random
from possible locations on the disk
- but how about if we put blocks contiguously to amortize cost of seek and
  rot latency over a larger # of blocks
- read more blocks sequentially
- is w.r.t. phase 1; for phase 2, we'll be reading one block from each of
  the sorted sublists (where each sorted sublist is itself written
  contiguously) so this would seem to force random access
- so big picture is want to put the blocks that we're going to read into
  main mem (during phase 1) contiguously on disk
- that's fine for sorting but how can we reduce # of random accesses when
  system load consists of tons of unrelated (random) queries?

Strategies:
(a) put blocks that are accessed together on the same cylinder
    - "blocks accessed together" (?)
      -- perhaps those for same relation
(b) divide data among several small disks;
    - then have different head assemblies going after parts of blocks at the
      same time; also divide data among several small disks on same head
      assembly so can read from same cylinder (using all disk heads on a
      spindle) simultaneously
(c) mirror a disk: make >= 2 copies of a disk;
(d) use a disk scheduling algo to select order in which several requested
    blocks will be read or written
(e) prefetch blocks into main mem

-- what happens if system has to simultaneously support a bunch of queries
   and a bunch of transactions (e.g. airline reservation system)

--------------------------------
(1) Organizing data by cylinders
--------------------------------
seek time ~= 1/2 of average time to access a block

want to store data that is likely to be accessed together on a single
cylinder

if a single cylinder doesn't provide enuf room, use adjacent cylinders

can read all blocks on a single track or cylinder consecutively and pay only
one seek time and one rotational latency

Example:
  average block xfer time: 0.5 ms
  average seek time: 6.5 ms
  average rot latency: 7.8 ms
  loaded main mem 20 times with 12,800 blocks each time;
  8192 cylinders; each stores ~ 1 MB
  we have 10,000,000 tuples at 100 B each --> 1 GB == 1000 MB; so must store
  data initially on 1000 cylinders
  main mem can hold: 50 cylinders
  can read one cylinder with one seek time; must move heads 49 times to go
  to adjacent cylinders (1 ms each)

Time to fill main mem one time:
  6.5 ms (one seek)
  49 ms (49 one-cylinder seeks)
  6.4 s (xfer of 12,800 blocks)

Ignore everything but xfer time --> 20 fills * 6.4 s/fill == 128 s
Writing takes same amt of time; so total time is about 256 s ~= 4.3 mins
(vs. 125 mins from before when we assumed randomly distributed blocks)

Storage by cylinders doesn't help with 2nd phase...
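
Sketch (mine) of the phase 1 arithmetic for the cylinder layout, next to the random-block estimate from section C. Assumes the example's numbers exactly as given and the notes' simplification that writing costs the same as reading; the constant and function names are made up for illustration. Keeping the seek terms (instead of dropping them) barely moves the result: still near 4.3 mins.

```python
# Figures from the example in the notes.
BLOCK_XFER_MS = 0.5
AVG_SEEK_MS = 6.5
ADJACENT_SEEK_MS = 1.0
BLOCKS_PER_FILL = 12_800
CYLINDERS_PER_FILL = 50
FILLS = 20
TOTAL_BLOCKS = 250_000
RANDOM_BLOCK_IO_MS = 15.0


def fill_time_ms():
    """One main-mem fill: one average seek, 49 short seeks, then pure transfer."""
    return (AVG_SEEK_MS
            + (CYLINDERS_PER_FILL - 1) * ADJACENT_SEEK_MS
            + BLOCKS_PER_FILL * BLOCK_XFER_MS)


def phase1_cylinder_minutes():
    """Read every fill, then write it back at the same cost (notes' assumption)."""
    return 2 * FILLS * fill_time_ms() / 60_000


def phase1_random_minutes():
    """Earlier estimate: every block read once and written once at 15 ms each."""
    return 2 * TOTAL_BLOCKS * RANDOM_BLOCK_IO_MS / 60_000


if __name__ == "__main__":
    print(f"one fill:        {fill_time_ms() / 1000:.2f} s")
    print(f"cylinder layout: {phase1_cylinder_minutes():.1f} mins")
    print(f"random layout:   {phase1_random_minutes():.0f} mins")
```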
------------------------
(2) Using multiple disks
------------------------
Replace one disk (with many heads pointing to same track & sector) with many
disks - each of which having independent head assemblies.

Best case for phase 2 would be if we could have as many different head
assemblies as we have sorted sublists (plus one other for writing the output
buffer). And also if there were no interference from other disk read/write
requests...

As long as disk controller, bus and main mem can handle the aggregate
throughput of several independent disks, perfect case would divide time by
# of independent disks

Example: Imagine instead of 1 disk assembly we have 4 disk assemblies, each
of which has only 1 platter so 2 surfaces.
- divide records among 4 disks
- fill 1/4 of main mem from each disk during phase 1 (presumably striping
  the data for 1 main mem fill across 4 disks)
- so xfer time (which dominated after we organized data cylindrically) would
  be 1/4 of what it was --> 1.6 seconds vs. 6.4 seconds
- same speedup in writing data out to disk if split each sorted sublist over
  the four disks
- so total process takes 1 minute instead of 4 minutes

As regards phase 2: can increase speed of reading by 4 since can read from
front of four sublists simultaneously (assuming careful data layout); also
instead of waiting for a new block (which had been exhausted) to be fully
loaded into mem, can resume comparisons among 20 elements when the first
element of the being-loaded block gets into mem; blocks still must be read
randomly since can't predict which blocks will exhaust in what order; if
need to read two blocks which are both located on same disk, then have to
wait for first block to be read entirely then 2nd block can begin to be read

writing phase: use 4 output buffs; fill each in turn; each buff is written
to one disk filling cylinders; so we can fill one buffer while the other
three are being written out (to disk)
- so fill a buff completely, then fill next buff, then
  ...

but time to write is limited by time to read; and time to write sped up by
factor of about 2; so overall speedup is roughly factor of 2

-------------------
(3) Mirroring disks: 2 or more disks which hold identical data
-------------------
Increases robustness: since if one disk has head crash, can read from other

Can also speed up data access; alleviates problem mentioned in (2) where
could have two reads needing to go to the same disk; w/mirroring, can divert
one of the reads to the other (identical) disk thus speeding up reads

if have n copies of a disk, can read n blocks in parallel

plus different disks will have different seek times to the desired data; so
if we have two copies of same disk, can preferentially choose the disk with
the shorter seek time and read from that one

doesn't speed up writing; doesn't slow down writing either, tho

can write to all n disks in parallel (so only costs time of one disk - plus
opportunity cost of disks-being-written-to not being available); may be
slight variations in writing times due to different rotational latencies

-----------------------------------------
(4) Disk scheduling and the elevator algo
-----------------------------------------
speed up disk accesses by having disk controller make smart decision about
ordering of service of those disk requests; only useful when order in which
reads/writes proceed need not be sequential;

elevator algo: think of disk head sweeping from innermost track to outermost
track (cylinder) and then back; as heads pass a cylinder, stop if there are
one or more requests for blocks on that cylinder; when heads reach a
position where there are no more requests ahead of them (in the direction of
travel), they reverse course

advantage of elevator algo increases as # of requests increases; when have
about as many waiting requests as have cylinders, each seek will cover just
a few cylinders --> approaching min; if there are more requests than
cylinders, then can optimize ordering of requests to
minimize rot latency

but as # of requests gets too big, our delay increases as well (since we're
servicing many requests from when we start at the innermost cylinder til we
get to the outermost cylinder); avg delay increases;

-----------------------------------------
(5) Prefetching and large-scale buffering
-----------------------------------------
prefetching aka "double buffering"

for some apps, can predict order in which they will request blocks from disk

if so, can load those into mem before they're needed;
- so can better schedule the disk
- and can hide the disk latency

could have been used for two-phase, multi-way merge sort:
- devote two block buffs to each sorted sublist;
- bring both in; then generally fill one buff while records from the other
  are being used; when a buff exhausted, use other one (which is full and
  waiting); rinse, repeat.

So store a sorted sublist on a cylinder (or on consecutive cylinders as
needed); so consecutive blocks of sorted sublist exist as contiguous blocks
in a track; then read a whole track or cylinder whenever we need some
records from a given list (filling up a buff of next records needed)

Ex 2.14: have room in main mem for two track-sized buffs for each of the
sorted sublists; this takes about 5 MB; instead use 40 buffs of 1 MB each;
each buff contains 1 MB of records from some sorted sublist; each sorted
sublist gets 2 MB of buff space total; each buff can hold about 1 cyl worth
of data

so since we're using cylinder sized buffs, only need one seek in order to
get 1 MB worth of data (amortizing cost of seek);

can delay writing of buffered blocks as long as don't need to reclaim that
buff space immediately; if use large output buffs then seek time and rot
latency become negligible as compared to xfer time;

fill one output buff with 1 MB worth of records; then while filling other
write out the full buff and in so doing spread it out across a cylinder;

so reading time: 2.15 mins & writing time: 2.15 mins
so total for phase 2 is 4.3 mins (instead of
1 hour or worse 2 hours)

================
E. Disk failures
================

Types:
-----
intermittent failures: most common form; attempt to read or write a sector
unsuccessful; try again and is OK

media decay: a bit or bits are permanently corrupted; impossible to read a
sector correctly;

write failure: attempt to write a sector but can't write and can't get
previous contents of that sector;

disk crash: entire disk becomes unreadable suddenly and permanently

how to detect failures? use parity checks;
how to recover? use stable storage

-------------------------
(1) Intermittent failures
-------------------------
keep redundant bits which tell us:
(a) whether sector we're reading is correct or not and also
(b) whether sector we just wrote was written correctly

reading function returns a pair: (w,s)
- w == data in the sector that is read
- s == status bit; was data read successfully?

for an intermittent failure, may repeatedly get a status bad msg but if try
read enough times (limit is 100) then eventually will get status OK msg
(could of course get false positives but can add redundancy to reduce the
probability of this to as small as desired)

how to be sure a write proceeded OK?
naive: write then read what have written and compare to source
better: write then attempt to read to see if status is OK
- if so, assume write went OK
- else, assume write got screwed up and retry write then recheck status...
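
Sketch (mine) of the retry logic above. `read_sector` and `write_sector` are hypothetical callables standing in for whatever the disk controller exposes; `read_sector()` is assumed to return the (w, s) pair described above, with s True meaning status OK. The limit of 100 tries comes from the notes.

```python
def read_with_retry(read_sector, max_tries=100):
    """Retry an intermittently failing read until the status bit says OK."""
    for _ in range(max_tries):
        data, ok = read_sector()
        if ok:
            return data, True
    return None, False  # gave up: treat as a permanent failure


def write_with_check(write_sector, read_sector, data, max_tries=100):
    """The 'better' policy: write, then read back and trust the status bit
    instead of comparing the returned data to the source."""
    for _ in range(max_tries):
        write_sector(data)
        _, ok = read_sector()
        if ok:
            return True
    return False  # sector never came back OK


if __name__ == "__main__":
    attempts = {"n": 0}

    def flaky_read():
        # Simulated sector that reports bad status on the first 3 reads.
        attempts["n"] += 1
        return b"payload", attempts["n"] > 3

    print(read_with_retry(flaky_read), "after", attempts["n"], "tries")
```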

-------------
(2) Checksums: how one knows whether a read was OK
-------------
each sector has some addt'l bits (checksum); this value is set for the sector
on the basis of the data bits stored in that sector;
if when we read we get a different checksum than the one stored, report bad
status msg; else report OK status msg;
small probability that a corrupted sector still has a matching checksum (false
OK); increase # of checksum bits to decrease this likelihood

simple form of checksum based on parity:
  - if there's an odd number of 1s, we say the bits have odd parity and set
    the parity bit to 1, giving us total parity of 0, or even parity;
  - if there's an even number of 1s, then we set the parity bit to 0, giving
    us even parity;
  - when writing a sector, disk controller computes the parity bit and appends
    it to the sequence of bits written in the sector (ultimately giving each
    sector even parity)
  - any 1-bit error in reading or writing the bits and their parity bit
    results in the bits having odd parity; easy for disk controller to see
    this error
  - if more than one bit is corrupted, 50% chance that the # of 1 bits will be
    even and the error won't be detected;
  - if keep more parity bits, then decrease this probability; if keep n
    independent bits as checksum, then chance of missing an error is 1/(2^n)

------------------
(3) Stable Storage
------------------
Parity helpful for detection; useless for correction; enter redundancy

may not even have atomicity: may be overwriting the previous contents of a
sector and then can neither read the new contents nor retrieve the old
contents!
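
sketch of the single parity-bit checksum from (2) (my own illustration; the
function names and the 7-bit example are made up). It shows a 1-bit error
being caught and a 2-bit error slipping through.

```python
def parity_bit(data_bits):
    """Bit that makes the total number of 1s (data + parity) even."""
    return sum(data_bits) % 2


def write_sector(data_bits):
    """Append the parity bit, as the disk controller would on a write."""
    return list(data_bits) + [parity_bit(data_bits)]


def status_ok(stored_bits):
    """A read reports OK only when the stored bits have even parity."""
    return sum(stored_bits) % 2 == 0


stored = write_sector([0, 1, 1, 0, 1, 1, 1])  # five 1s, so parity bit is 1

one_bit_error = list(stored)
one_bit_error[0] ^= 1  # flip one bit: parity becomes odd, error detected

two_bit_error = list(stored)
two_bit_error[0] ^= 1
two_bit_error[1] ^= 1  # flip two bits: parity is even again, error missed
```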

stable storage: is a policy;
  - sectors are paired; each pair represents some sector's contents
  - two copies: the left copy of X, X_L, and the right copy, X_R
  - assume that if read returns (w,good) for X_L or X_R, then read was OK
  - writing policy:
    - write value of X into X_L; check that status is OK
      -- if not, repeat write;
      -- if try write a bunch of times and still don't get OK, then assume
         X_L is a bad sector and pull another one off of a list of free
         sectors and use THAT as new X_L
    - repeat same process for X_R
  - reading policy:
    - read X_L; if "bad" returned, try again
    - if try a bunch of times and still get "bad" then fall back to X_R

-------------------------------------------------
(4) Error-handling capabilities of stable storage
-------------------------------------------------
(a) media failures: if store X in X_L and X_R successfully and then one of
    those becomes permanently unreadable, can read X from the other; if both
    failed, are screwed but that's pretty unlikely; if one fails, then will
    replace it next time we try to write one of the X's

(b) write failures:
  - suppose that as we are trying to write X, there's a system failure
  - X may have been in main mem so it's lost
  - plus may not have completed write to X_L (so sector partially written)
  - when sys comes back up, will be able to get the old value of X or the new
    value of X (since only one of the sector writes was interrupted -- since
    we perform them sequentially?)
    (i) maybe failure occurred as we were writing X_L
      -- will find X_L status == "bad"
      -- X_R status will be OK, though;
      -- roll back via copying X_R to X_L
    (ii) failure occurred after we wrote X_L
      -- will find X_L has status OK
      -- read new value of X from X_L
      -- IF X_R has status "bad" then copy X_L into X_R
  - what about if X_R status is OK but X_L data != X_R data; should copy X_L's
    value into X_R then too, right?

=============================
F. Recovery from disk crashes: what to do about head crashes (data permanently destroyed)
=============================
if data not backed up, nothing can be done;
strategies to back data up: mirroring, duplicating sectors, more parity
stuff, ...
  - RAID: Redundant Arrays of Inexpensive/Independent Disks

Will discuss RAID levels 4,5,6; these also handle the failures discussed in (E)

-------------------------------
(1) The failure model for disks
-------------------------------
Statistics of failures:
  - mean time to failure (MTTF): length of time by which 1/2 of disks will
    have failed catastrophically; modern disks MTTF == 10 years
    -- assume failures occur linearly
    -- 5% will fail in first year
    -- 5% will fail in 2nd year, ...
    -- in reality, have some # of failures at beginning then not many for
       awhile then steady increase in failures
  - of course may have disk crash w/o data loss; how? via adding redundancy

---------------------------------------
(2) Mirroring as a redundancy technique: RAID level 1
---------------------------------------
Simplest scheme; 1 disk is data disk; have copy of that as redundant disk

Only way can have data loss is if second disk crashes while first one is being
repaired;

-----------------
(3) Parity blocks
-----------------
too expensive to use RAID level 1 usually; use RAID level 4: one redundant
disk no matter how many data disks
  - number the blocks on each of the data disks from 1 to n
  - in redundant disk, i'th block contains parity bits for all of the data
    disks' i'th blocks
    --> so 1st block of redundant disk contains parity checks for the 1st
        block of every other disk
    --> so 1st bit of 1st block of every data disk is added up and if have odd
        number of 1s in this group, then 1st bit of 1st block of redundant
        disk is 1; else if have even # of 1s in this group, then 1st bit of
        1st block of redundant disk is 0
    --> add up 2nd bit of 1st block of all data disks; compute its parity bit
        and put that in bit position 2 of block 1 of the redundant disk, ...
  - reading:
    -- if get two read requests for the same disk, could service one of those
       requests from that disk and at the same time service the other request
       by going to the other data disks and the redundant disk and
       reconstructing the desired block via the parity info
  - writing: when write a new block of a data disk, have to update the
    corresponding block in the redundant disk;
    -- look at old block
    -- look at new block
    -- XOR them
    -- any location where old block had a 1 bit and so does new block, XOR of
       these two bits will be 0 (no change to parity)
    -- any location where old block had a 0 bit and so does new block, XOR of
       these two bits will be 0 (no change to parity)
    -- else old block had a 1 where new block has a 0 or vice versa; only
       update those parity bits (the ones for which the XOR is 1)
    (1) read old value of data block being changed
    (2) read corresponding block of redundant disk
    (3) write new data block
    (4) recalculate and write block of redundant disk
        new_parity_block == ( old_block XOR new_block ) XOR old_parity_block
    -- used 4 disk I/Os here, no matter how many addt'l disks we add (as
       contrasted to naive approach which requires a number of disk I/Os
       that's linear in the number of disks)
  - failure recovery:
    - if redundant disk crashes, swap in new disk; recalculate parity bits
    - if data disk fails, swap in new disk, recompute data from other disks
    - so will need to read all data on all viable disks and write single disk
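
sketch of the RAID level 4 parity arithmetic above (my own illustration; the
helper names and the 1-byte "blocks" are made up, and real blocks are of
course much larger). It covers computing the redundant block, the 4-I/O
update rule, and rebuilding a lost data block.

```python
from functools import reduce


def xor_blocks(*blocks):
    """Bitwise XOR (modulo-2 sum) of equal-length byte blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def parity_block(data_blocks):
    """i'th block of the redundant disk from the i'th blocks of the data disks."""
    return xor_blocks(*data_blocks)


def updated_parity(old_block, new_block, old_parity):
    """new_parity_block == (old_block XOR new_block) XOR old_parity_block"""
    return xor_blocks(xor_blocks(old_block, new_block), old_parity)


def recover_block(surviving_blocks):
    """Rebuild a lost block from the other data blocks plus the parity block."""
    return xor_blocks(*surviving_blocks)


# Illustrative 1-byte blocks, one per data disk.
disks = [bytes([0b11110000]), bytes([0b10101010]), bytes([0b00111000])]
parity = parity_block(disks)  # 0b01100010

# Overwrite the block on disk 2 without touching disks 1 and 3.
new_block = bytes([0b11001100])
new_parity = updated_parity(disks[1], new_block, parity)  # 0b00000100

# Pretend disk 2 crashed before the overwrite: rebuild its old block.
rebuilt = recover_block([disks[0], disks[2], parity])  # 0b10101010
```

note: recover_block is the same XOR as parity_block, which is why a failed
data disk and a failed redundant disk are rebuilt the same way.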