EuroSys
The European Professional Society on Computer Systems

European chapter of the
Special Interest Group on Operating Systems (SIGOPS)
of the Association for Computing Machinery (ACM)
Home Join or renew membership EuroSys for Students EuroSys for Faculty Job Offers Activities Systems Directory Systems Events and Blog Eastern Europe Initiative Member Area Member News Officers and Volunteers Useful links Press releases

Posts Tagged ‘OSDI 2008’

OSDI 2008: Day 2

Wednesday, December 10th, 2008

File Systems

SQCK: A Declarative File System Checker

  • File systems are becoming larger and must be reliable, but they still get corrupted (hardware errors, firmware bugs, FS bugs, etc.), and require immediate repair to get out of an inconsistent state (which can end up causing even more damage).
  • Who should repair this? Is journalling the solution to every problem? Write-ahead log only prevents write inconsistencies due to system crashes, but this doesn’t cover all problems. Should the FS repair itself online? There isn’t enough machinery in the FS to do this: all it can do is, e.g. mount itself read-only. So we need a utility, and we have one: fsck. A trusted utility which is the last line of defense for bringing back consistency. XFS says it never needs fsck, but it uses it in the end, due to these other problems. All file systems have their own variety of fsck.
  • Fsck is complex: it has a big task, must read all FS metadata and turn any corrupt image to a consistent image. e.g. is a data-block being shared by two inodes? Complexity arises from them being implemented in C, which is hard to reason about. Ext2 fsck does 150 checks in 16kloc; XFS does 340 checks in 22klock. Looks like hundreds of if-check statements. But fsck is untouchable: you can’t modify it because it is (i) crucial and (ii) changing it might introduce new problems.
  • Are current checkers really reliable? If not, how should we build robust checkers?
  • Analysed e2fsck, which does inconsistent repair (sometimes), making the file system unreadable in some cases. So fsck makes an incorrect repair, making the FS more inconsistent. Sometimes it is consistent but not “correct”, deleting valid directory entries and loses a huge number of files.
  • Lesson: complexity is the enemy of reliability. A big task with a bad design leads to complexity and unreliability. Want a high-level approach to have simplicty. So use a SQL-based declarative language to write checks, which means fewer lines of code. Evaluated the simplicity (150 queries vs. 16kloc) and reliability. Also showed flexibility and reasonable performance.
  • e2fsck must cross-check all ext2 metadata: an indirect pointer must not point to the superblock; a subdir should only be accessible from one directory. Injected a single corruption at a time and tried to analyse ho e2fsck repairs this single corruption. Only corrupted on-disk pointers (make indirect pointer point to superblock; make directory entry point to another directory). Usually a corrupt pointer is simply cleared to zero.
  • Inode has an indirect pointer that should point to an indirect block. But if it points to a superblock. First e2fsck finds that the point is invalid, and zeroes. But if it does point to an indirect block, it has to check the content of the indirect block (checking if any entry points outside the FS range), and garbage pointers are zeroed. At least, this is the ideal.
  • So in e2fsck, if the inode indirect pointer points to the superblock, it will assume that the superblock is a corrupt indirect block and zero some of the entries in the superblock. Then it will check the indirect pointer, find that it is pointing to the superblock, and zero it. The incorrect ordering fatally corrupts the FS.
  • Can also have consistent but incorrect repair. We have information about the .. pointer of directories, so we can identify when two directories point to the same subdirectory, and can resolve this. But e2fsck actually chooses the directory with a lower inode number, and then rewrites the .. pointer. e2fsck doesn’t use all available information.
  • Also policy-inconsistent and insecure problems (in paper).
  • These are not simple implementation bugs.
  • Fsck must do hundreds of checks, and some complex cross-checks, between single and multiple instances of the same and different structures. But all of these must be ordered correctly.
  • In SQCK, use a declarative query language (i.e. SQL) because it makes high-level intent clear. And it is fit for cross-checking massive amounts of information. Simplified e2fsck into 150 queries. Each check/query is easy to understand, and it is flexible, enabling the plugging in and removal of differnt queries.
  • Using SQCK, take a FS image, load all of its metadata into temporary database tables, created fresh when SQCK starts running. e.g. inodeTable, dirEntryTable. Checks and repairs run on the database tables. Finally flush any modifications and delete tables.
  • Example of cross-checking a single instance of a structure: find a block bitmap that is not located within its block group. Basically a range check on whether the block bitmap is within the first and last block of the block group. Simple SELECT … NOT BETWEEN … query.
  • Example of finding false parents of subdirectories. The e2fsck code is wrong here. Join on parents and children where there are two parents with different inodes.
  • Declarative repairs must also be performed. Some are just update queries, where a table is updated with a new value in some field (and a dirty bit is set). But a repair might also be a series of queries, which can be combined with C code, though the actual work is done in the SQL.
  • 150 queries, 1100 lines of SQL statements. The new checker passes hundreds of corruption scenarios. Easy to add new checks and repairs, to get different versions of e2fsck. Introduce some performance optimisations: total runtime is slightly longer (up to 50% more than e2fsck), based on MySQL prototype. Future optimisations include hierarchical checks and concurrent queries.
  • Question: in paper, the MySQL is held in a 0.5GB ramdisk; given an FS of some size, how big need the ramdisk? Half-full terabyte-sized disk needs a 0.5GB metadata DB.
  • Question: what about the complexity of the scanner-loader which has to understand the entire FS? e2fsck also includes 14kloc which do this scanning, not counted in the comparisons.
  • Question: what about ordering the SQL statements to ensure that correct repairs? Haven’t looked at automating the ordering, but it is easier to reimplement different orders.
  • The times include only the times for loading the DB, scanning and performing the checks, assuming correct FS.
  • Question: could flushing introduce errors? e2fsck doesn’t use journalling to introduce repairs, so a crash of e2fsck would cause a (different) inconsistent state. But could make the DB flusher use a journalling approach quite easily.
  • Question: wouldn’t it be better to declare what the FS should look like, rather than how to check it? Started that way, making our own declarative language, but thought it would be less work to use SQL, which has been honed by the DB community.

Transactional Flash

  • Want to replace disks with solid-state devices, but they are quite different. However today they are used as traditional disk replacements, but this misses an opportunity to export new abstractions that better use the properties of SSDs.
  • TxFlash provides transactional guarantees to higher software layers. WriteAtomic abstraction updates multiple pages atomically. This can simplify the software layer of transactional systems. Today use CoW or logging to provide these guarantees, but TxFlash can move this to hardware.
  • Could use a traditional log-based protocol, but this paper introduces a new protocol: cyclic commit.
  • Key results: implemented TxFlash on a simulator and an emulator running on a real SSD. Want to understand how cyclic commit performs compared to traditional commit. Improves by 95% for small transactions. Also want to know the overhead of transactional Flash. Overhead is <1%: essentially free. And what is the end-to-end performance improvement. The running time is reduced by up to 65% for I/O intensive workloads.
  • Architecture of an SSD: a controller connected to a set of Flash packages. No moving parts, giving fast random accesses. The memory is non-overwrite in nature: must first erase before writing, and cost of erasure is high, so better to use a log-structured file system. Each page has a 4KB data part and 128-byte metadata (checksum, ID, version number, etc.). Traditionally export read and write. TxFlash exports WriteAtomic and Abort.
  • Programming model consists of only writes, and no reads. Straightforward to extend to a more general transactional construct.
  • Suitable for transactional support: non-overwrite pages mean minimal extra overhead in tx-support. Also, CoW and logging lead to fragmentation but Flash has fast random access. Plus, since SSDs are quite new, there is an opportunity for new abstractions.
  • Commit protocol is used to assure atomicity of multiple updates. Must work correctly in the presence of transaction aborts and multiple failures. Need a recovery module (run when system starts to regain consistency) and garbage collector (to remove old versions of pages).
  • Traditional log-based protocol will first write data that must be committed, and then write a log entry to say that the data has been written. Recovery is simple (scan the log to replay transactions if need be) and the garbage collector is simple (just copy valid page versions to fixed locations). Problem is the space overhead with the per-commit block for each transaction. Also a performance overhead of write ordering, which reduces parallelism. Overheads become large for small transactions.
  • T1 updates pages A, B, C. The TxFlash device initialises the pages in memory and stores a point to the next logical page of the transaction (from A to B, B to C and C to A). All pages are written in parallel which means we have a cycle and therefore have a successful commit. If we have a broken cycle, this doesn’t quite imply that the transaction aborted. Could have a point from A to B and B to C, but no version of C (and pointer to C). But what if A, B, C committed, then C, E, F was committed? The previous version of C is obsolete, but this means that the A, B, original-C cycle is broken, which looks like the abort case.
  • Simple Cyclic Commit: if a page version exists, any previous version of the same page present or referenced in the storage must be committed. Aborted pages and their references are erased before writing a newer version. Resolves the broken cycle ambiguity: the presence of C’ means that C has been committed, and if the original version of C is committed, then T1 is committed. In the abort case, the edge from B to C must be deleted.
  • Recovery and garbage collection is easy to understand and implement, and does not add extra code complexity. The recovery module actually performs better.
  • Weakness: need to erase all aborted versions before creating a new version for a page, which is costly under a high rate of transaction aborts.
  • Back-pointer cyclic commit improves this (details in the paper).
  • Implemented on a simulator and an SSD-based emulator. TxFlash simulator extended from previous work on SSD. The emulator is a pseudo-device driver which exports the WriteAtomic() API. TxExt3 is a modified ext3 file system that uses WriteAtomic.
  • Transactions per second improves with cyclic commit (simple and back pointer versions perform similarly): improvement better when small transactions, but approximately equal as tx-size tends to infinity.
  • No appreciable overhead for IOzone, Linux-build, Maildir benchmarks; and very small overhead for TxFlash on TPC-B due to metadata. Using traditional commit, Maildir and TPC-B show larger overhead. (Compared to a regular SSD, non-transactional.)
  • End-to-end impact: compared to regular ext3 on an SSD, TxExt3 on TxFlash improves run time by up to 65%, but for Linux-build the impact is negligible (compute bound).
  • TxExt3’s JBD is less than 50% the size of ext3’s JBD (journalling module). Most reduction from removing journalling and revoke modules.
  • Advances in hardware devices can lead to new abstractions and advances in the software layer. SSDs are suitable for transactional constructs with low overhead and code complexity. TxFlash simplifies software and improves performance of end applications. Cyclic commit is an alternative to traditional commit for systems with small transactions.
  • Question: concerned about the deployment incentives that would enable this to be deployed in the real world (an OS would require the code to support transactions on legacy devices anyway, so the simplicity argument doesn’t hold)? Some specialised OSs (like those for the Xbox on game controllers, or phones) don’t want to pay cost of implementing all this in software, so TxFlash might apply here.
  • Question: have you given thought to what mechanisms might be used to identify transactions over existing interfaces? Or could you export a fuller view of Flash (including metadata) up to a host? Insufficient to export the metadata to a higher-level software layer, so either you export all of the control, or do it at the device level and save the complexity from the software.
  • Question: not sufficient just to make sure there are no versions of the aborted page, but there have to be no pointers. How do you find all the pointers? Maintain metadata structures in the system which let you know the last aborted transaction for a particular page. This is sufficient because it is the only thing you have to erase when committing a new version.
  • Question: the microbenchmarks suggest that the best improvement is 2x, but the macrobenchmarks suggest a 3x improvement. How come? Benefits arise from not doing additional checkpointing (moving data from the log to the actual location). Also from reducing the write ordering problem. Microbenchmarks only show the benefit from eliminating write ordering, or separately for removing the checkpointing. The end-to-end benchmarks showed both.
  • Question: the SSD has software to process the FTL. Would you want to move things to the FTL to improve energy costs or write performance? If you remove code from the FTL, the job of the FTL (like remapping, cleaning, load levelling) has to be moved to the software layer. Doing this might not be a great idea, because not all Flash devices would want this complexity moved to software (especially for resource-constrained devices). Simply return the wrong parents and make the appropriate repairs.

Avoiding File System Micromanagement with Range Writes

  • Today’s file system views a disk as a linear array of blocks. This gives a simple-to-use interface but impacts performance because it hides some details.
  • High-level goal is to place a goal near other related blocks, but it can only make a low-level command to write a block to a specific address. Classic example of micromanagement.
  • Cost of this: unnecessary positioning time, because the FS knows what is free, but the disk knows what is close to the current position.
  • Previous solutions: block remapping, whereby disks select the closest block on the fly and maps it to the requested block, requiring metadata. An object-based interface deals with writing objects to disk but also requires lots of metadata.
  • Want FS to specify a set of options for blocks to write to, and lets the disk select the best from among these. The result is an efficient write with minimal change.
  • How do we specify range writes? Normally would specify an address, the data and its length. Simple way would be a pair: start address and end address. More sophisticated would be to specify a list of pairs, or a large range with an exclusion bitmap. Need to get back from the disk a status (successful?) and the address to which it was written.
  • Case of overlapping ranges for two range writes: the disk must note that it has selected a block from the overlap and ensure that that block is masked out from subsequent writes. Need a write barrier to allow the disk to forget which blocks were written and flush its metadata structures.
  • Disk scheduler should consider all options within each range write, but select only one. Traditional scheduling has a queue which is scanned and blocks are selected to some policy. Expand-and-cancel scheduling changes a range write into several possible writes in a regular disk scheduling queue (then cancels all other options when one is selected), which is easily implemented in standard disk scheduler. Added 1000 loc to Disksim, for a particular HP disk (relatively old, but still applicable).
  • Do multiple requests solve the positioning time problem? Use an SPTF scheduler with workload based on random writes and writes spanning across block group. As the queue size grows, total time gets less. The seek time reduces but not as much as the rotation time. The problem remains with multple requests, and is greater at smaller queue size.
  • Evaluated scheduler with range writes, with random writes, range spanning a track and writes spanning across block group. Now the rotation time is reduced for all queue sizes.
  • What is the benefit of blockgroup-sized range writes? The seek and rotation times are significantly reduced for all queue sizes.
  • How to incorporate range writes into a filesystem? Need to preserve sequentiality, determine proper ranges and handling delayed address notification.
  • Loss of sequentiality may impact read performance. Solution is to issue larger writes.
  • How to determine the size of the range? Larger range implies more benefit. Solution is to report largest range that best matches layout policy.
  • How to handle delayed address notification? The FS knows the address after the write completion. So there are ordering constraints: data block must be written before inode. So only use range writes for data blocks, not metadata, e.g.
  • Evaluated on an ext2 FS simulator on a modified disksim. For first data block of file, give the disk some flexibility. The next blocks are, as far as possible, written sequentially.
  • Evaluated on a synthetic workload of 1000 4kb files on an empty FS. Slightly better performance for the single process case, but much better when multiple processes are writing files at once. The layout that results is similar in both cases (so the high level policy is the same, but we get better performance).
  • Also evaluated on untar, PostMark and Andrew benchmarks. 16–35% improvement on untar and PostMark. Andrew has more reads, so the performance is very slightly worse.
  • Journalling system case study. No seek delay but some rotation delay. See a sequential workload. Approach is to convert a write-ahead log to a write-ahead region. Trick is to write to next rotationally closest location. Add a software layer between ext3 journalling layer and disk, which uses a disk performance model, which spreads out the transaction across the disk and reduces rotational delay.
  • Mounted ext3 journal on Bark (software layer) as a pseudo-device, and ran TPC-B on it. Improves performance in both cached and non-cached cases. Also improves the distribution of write times: 99% of requests have write times less than 2ms (when rotation time is 6ms).
  • Want to look at RAID integration, flash SSD integration, and Range Reads.
  • Question: change the disk interface and change the file-system to improve write latency… shouldn’t I buy a cheap disk and put some Flash in front of it? Existing disks could benefit from this approach, and this could improve the situation on Flash as well.
  • Question: what is the right way to handle dependencies between data writes when you need to know the location of preceding writes? Have seen this problem already in the file-system, so higher levels should take care of this ordering.
  • Question: if you ever do overwrites, you’ll need to do a second write? Don’t address this, but it is an interesting area for future work. Were all evaluations done using empty file systems, which have fortuitous fragmentation properties? Evaluated with half-full systems in the paper.
  • Question: range writes might not be helpful on Flash in front of the disk, because rotational latency becomes less important and you should maybe even use slower-rotating disks for better power consumption.
  • Question: is it true that you’ll get more fragmentation? Haven’t really looked at this.

Programming Language Techniques

Binary Translation Using Peephole Superoptimizers

  • Binary translation is ability to run software for one ISA on hardware implementing a different ISA. Applications include portability, virtualisation, backward- and forward-compatibility, on-chip binary translation and JVMs. Could exist at different layers of the stack: in a hypervisor, in the OS kernel (e.g. Rosetta) or as a user-level application (like a JVM).
  • Difficulties are performance, complexity of ISAs, retargetability (want to target different source or destination ISAs) and OS compatibility. Concentrate on performance, complexity and retargetability.
  • Technique is peephole superoptimisation.
  • Superoptimisation is a term from 1987 to describe a code generator that does brute-force search to find the optimal code for a translation. Enumerate all sequences up to a certain length, then compare each enumerated sequence with target function for equivalence. Originally scaled up to 12 instructions in a sequence. To support all instructions in an ISA, you can typically do 3 or 4 instructions.
  • Peephole superoptimisation is using a superoptimisation to infer peephole optimisations. This is pattern matching, when you search for a pattern and have a replacement template.
  • Get the instruction sequences and scan for sequences that can potentially be optimised. Canonicalise and store them.  Then do brute force optimisation on each of the target sequences, search for optimisations in a brute-force manner and check for equivalence.
  • To test for equivalence (in: two sequences, out: are they equivalence). First do an execution test on random input to see if they result in identical outputs. If you pass this, use a satisfiability solver to check equivalence. Usually fail the first test, so this is efficient.
  • Finally generate a table of peephole optimisations.
  • Approach is to use lots of peephole transformations to translate code from one ISA to another. Need a PowerPC to x86 register map (e.g. r1 -> eax; r2 -> ecx) for each peephole optimisation. This is the main difference between a peephole optimiser and a binary translation, and needs to be chosen wisely. The best code may require changing the register map from one code point to another. The choice of the map affects the choice of instructions and vice-versa.
  • Assume a cost model for instructions (register access versus memory access). Also capture the cost of switching registers to and from memory (because more registers on one ISA).
  • Could use a greedy strategy for register map selection, which minimises switching cost for each subsequent instruction. But this doesn’t give the optimal solution, in terms of instruction costs (and overall). So use dynamic programming to attempt to reach a near-optimal solution. It accounts for translations spanning multiple instructions and simultaneously perform instruction selection and register mapping.
  • Implemented a static user-level translator from ELF 32-bit PPC/Linux binary to ELF 32-bit x86/Linux binary. Translate most (but not all) system calls. Use PPC emulator for execution tets, and zChaff SAT solver for binary test.
  • Implementation issues: endianness, so need to convert all memory writes to big-ending (source) and all memory reads to little-endian (destination). Also compiler optimisations: PPC optimiser staggers data-dependent instructions to reduce pipeline stalls, but the solution is to cluster data-dependent instructions in basic block before translation.
  • Evaluated with the same version of gcc, using soft-float library and statically-linked input executables. Did microbenchmarks and ran SPEC CINT2000. Compared against natively-compiled code and against other binary translators (QEMU and Apple’s Rosetta).
  • 750 rules in the peephole table, but cost of constructing that is amortised across every run. Used same optimisation settings on both architectures.
  • Microbenchmarks include sort algorithms, binary search, Towers of Hanoi. Minimum performance is 61% of native; best performance is 319% of native (for fibonacci, under an unrealistic optimisation). Sometimes gcc generates better code for PPC, due to the preponderance of registers, and the register mapping algorithm does a good job of remapping on the CISC architecture.
  • On SPEC CINT2000, minimum performance is 42% of native, median is 67%, and best is 167% (no optimisation on twolf, using soft-float). Some translations failed under compiler optimisations.
  • On average, 3% faster than Rosetta for -O0; 12% faster for -O2. Included cost of translation when running on QEMU and Rosetta, their system does static translation. Takes 2–6 minutes to translate a 650KB executable with ~100K instructions. Could reduce this to <10s, because for 98K instructions on which 0.01% of the time is spent, we could use any register map.
  • Future work to look at JIT compilation and machine virtualisation.
  • Question: what pattern matching algorithm do you use, and would optimising this improve performance? Use something very simple, and an improvement could be made.
  • Question: is peephole superoptimiser thread-safe? It’s a code generator, so this doesn’t really mean anything. Designed in such a way that the translation should work on multi-threaded applications, but haven’t tested this.
  • Question: what about proper handling of synchronisation between two platforms, such as moving a register to a memory location and vice versa, so do you look at this? Have not been pulling memory locations into register (because x86 has fewer registers), so have not looked at this.
  • Question: how much of the optimisation techniques would work in a dynamic translator? The reason for using a static translator was for simplicity and the separation of translation and run time. But no reason why this couldn’t work in the dynamic case.
  • Question: can you really handle equivalence testing in the case of aliasing? The execution test could suffer problems with aliasing, but the satisfiability solver should catch this. See previous ASPLOS paper.

R2: An Application-Level Kernel for Record and Replay

  • Debugging tool for recording a run of execution and replaying it for debugging. Because it is difficult to reproduce some bugs when re-executing. It’s hard to apply comprehensive analysis with no interference at runtime (e.g. triggering timeout logic).
  • In a replay, exactly the same output will appear, and the same bug will arise. R2 captures nondeterminism, and makes it happen the same during the replay.
  • Have seen virtual machine approaches, which replays the application and the OS, but this is difficult to deploy. Also a library-based approach which replays the application only, but cannot replay some system applications (like asynchronous I/O).
  • Library-based tools inject a library into the application which do syscall interposition, and writes the results to a replay log. On replay, the interposition switches to consulting the log instead. But this doesn’t record all non-deterministic instructions (e.g. spinlock assembly code) or functions with unclear semantics (like socketcall and ioctl), so you have to write application specific stubs. Can also be too heavyweight, generating large logs for I/O intensive applications.
  • R2 allows developers to select functions that can be easy-and efficiently replayed. Abstract assembly code to higher level functions, and identifying calling functions with higher level semantics (recv rather than socketcall).
  • Selecting a replayable set of functions with the goal of capturing nondeterminism. Need to make a cut through the call graph. Need to track data dependencies between functions (e.g. if you call socket(), need to use the replayed value in read()). Possible to make several incorrect cuts, but came to a set of rules that developers can follow. Rule 1: any non-determinism should be below the interposed interface. Rule 2: all instances of unrecorded reads and writes to a variable should be either all below or all above the interface. Can be difficult to follow these rules in a complex call graph. Module APIs are good candidates, as they encapsulate internal state well.
  • Next challenge is to keep a deterministic memory footprint. Need the same memory addresses in both record and replay to ensure that different control flow isn’t followed. Need to make sure malloc returns the same addresses in both executions. Tool and application in same address space, and the tool runs differently during replay, so get different malloc behaviour.
  • Split space approach: system space (libraries and R2) and replay space (app). Split the user memory address space into two memory pools: native memory pool and deterministic memory pool. malloc in each case is appropriately mapped to each of these. R2 allows code annotation to ensure deterministic memory addresses.
  • Another challenge: deterministic execution order in multi-threaded applications. Especially in conjunction with dynamic memory allocation where this is kept deterministic (replay writing into a buffer before it is allocated). Need to maintain an inter-thread happens-before annotation: callback annotation for functions and sync annotation for synchronisation primitives.
  • Three categories of annotations: data transfer, execution order and optimisation.
  • Annotated prototypes converted via a PHP code template to a C++ code snippet.
  • R2 implemented on Windows in 20 kloc. Annotated the Win32, MPI and Sqlite interfaces. Executed on challenging system applications, including a web server, database, distributed system, virtual machine and network client.
  • Evaluated annotation effort, overall performance and effectiveness of customised interface.
  • One week to annotate the first 500 Win32 prototypes. 2 days to annotate MPI and SQLite. Cost of annotation is amortised.
  • 1% overhead of the stub code, 9% overhead when recording. Log size is 300x greater than the SQLite database looked at. Solution is to choose an interface with less I/O (use SQLite query interface rather than filesystem interface). Log size is now comparable to the database.
  • Space split also helps other in-process tools, e.g. in a model checker. Annotation and code generation also used in other projects.
  • Question: the interface choosing rules seem useful, but a human would screw it up, so is there any static checking or automatic for choosing the interface cut? Assumption is that you can use a modular API to select these interface. Latest work is to use compiler techniques to select these. Also to minimse log size.
  • Question: could the tool be used in production for more complex workloads, since the overhead for the microbenchmark is helped by caching? Have seen 10–20% overhead for larger benchmarks. Also, a problem is that the log size may become very large, but they have looked at checkpointing techniques.
  • Question: could you describe the kind of bugs that R2 does not replay faithfully? One problem is that you assume that, as an optimisation, the local disk operation is deterministic for read-only behaviour, but if during replay the file is missing, then you fail. What about data race problem? Assume that programs are data race free, and these may cause some problems to fail. Can always detect these problems, but not always reproduce them.
  • Question: do you need to reannotate new versions of applications? No, just annotate APIs (like Win32) but you have to annotate some interfaces by yourself.

KLEE: Unassisted and Automatic Generation of High-Coverage Tests for Complex Systems Programs

  • Awarded a best paper award.
  • Hard to write correct systems code, it has tricky control flow, complex dependencies and abusive use of pointer operations. Also lots of environmental dependencies, which the code has to anticipate. Also a possibility of malicious users.
  • KLEE is designed to check such complex systems code. Fully automatic symbolic execution tool, exploring a large number of paths in a program. Finds many bugs, and generates test cases for exploring tasks, which gets over 90% line coverage (for Unix utilities).
  • KLEE runs a program on unconstrained symbolic input. As the program runs, constraints on the symbolic values are generated, and symbolic execution is forked (adopting constraints) when branches are hit. Can then generate a test case for each path that may be followed.
  • KLEE starts with C code, feed it into LLVM compiler to generate LLVM bytecode. KLEE takes the symbolic environment and the LLVM bytecode to execute the code, interacting with the STP constraint solver. Outputs a collection of test cases.
  • Real programs have an exponential number of paths. Need to use search heuristics. Coverage-optimised search selects path closest to an uncovered instruction and favours paths that recently hit new code (as they are likely to hit new code in the future). Also random path search (in the paper).
  • Constraint solving is an inherently hard problem (NP-complete) and needs to be invoked at every branch. Two simple, effective optimisations (15x speedup).
  • Eliminate irrelevant constraints before sending them to the constraint solver, which works because each branch depends on a small number of variables. So eliminate the constraints which have no effect on the branch.
  • Also cache the results of the constraint solver for later branches. With a static set of branches, there are lots of similar constraint sets. If we see a subset of some constraint set, eliminating constraint does not invalidate a solution, so no need to re-solve. Adding constraints often doesn’t invalidate solutions as well.
  • Problem of interacting with the environment (e.g. reading data from the file system or the network or the end user). If every argument is concrete, forward a syscall to the OS. If they are symbolic, provide a model that can handle symbolic inputs: want to explore all possible legal interactions with the environment.
  • Easy to write these models. A symbolic file is represented by a symbolic buffer and an offset initialised to zero. Reads become memcpy()s. Can just write plain C code that is run by KLEE, which allows users to extend or replace the environment with no knowledge of KLEE internals. Currently implement files, command line arguments, pipes, etc.
  • Looked at the GNU coreutils application suite. 89 stand-alone applications (v6.10). Many functions, different authors, major environment interaction, and heavily-tested, mature code. Challenge to find bugs in this code. Most are between 3000–4000 loc. Some up to 10000 loc.
  • Methodology: run KLEE fully automatically on each application for one hour. Run the test cases on uninstrumented utility and measure line coverage using gcov. Coverage measurements are not inflated by potential bugs in KLEE.
  • Minimum coverage is ~63%, overall is 84%, average is 91% and median is 95%. 16 applications were 100% covered.
  • Coreutils also has an effective regression testing suite. Manual testing achieves an average of 68% coverage. In comparison, KLEE loses on 9 applications, and won on 80 of them.
  • Looked also at the Busybox suite for embedded devices (91% overall, average 94%). Lost on a single application and won on everything else. (Regression suite got 44% coverage with manual effort.)
  • Found some bugs in coreutils. 10 crash bugs (more than were found in the last three years, combined). KLEE also generates command lines which expose these crashes. Showed ten sample inputs which would cause segfaults.
  • In 453 applications, with 430 kloc, found 453 bugs (also looked at Minix and the HiStar kernel), and many of these have been reported. Reported 56 bugs in total.
  • KLEE is not limited to finding crash bugs, but can also find higher-level correctness bugs. Can prove asserts on a per-path basis. Constraints in KLEE are complete and bit-accurate. An assert is just a branch, so KLEE can look at the feasibility of each side of the branch. If the assertion failure path is infeasible, then the code is correct.
  • Can also use this to cross-check applications: two functions which implement the same interface, then assert(f(x) == f’(x)). If KLEE terminates without an error, the paths are equivalent. Otherwise, a mismatch is found. Were able to test coreutils and Busybox utilities with conformance with IEEE Std. 1003.1. Many paths were equivalent, but some had output mismatches.
  • Lots of important related work in many research groups. Distinguished KLEE by testing hundreds of benchmarks, and showed extensive coverage. Also the idea of cross-checking is new, as is support for a symbolic environment.
  • Question: about the long term goals, what are the scalability issues with looking at applications that are larger than the ones tested in the paper? A very large program thrown at KLEE will suffer from exponential explosion of paths, and difficult constraint solving. But there is not necessarily a correlation between the size of an application and the difficulty for KLEE of solving it. Can write a 5-line program that really blows up. The comparative advantage over manual testing may even improve.
  • Driver for something like HiStar is necessary, but can be inferred automatically (e.g. symbolic environment of file system can be taken from the contents of the actual file system).
  • Question: how do you generate a database for testing a database server? Interactions with a DB are symbolic. Previous work marked the FS as symbolic.
  • Question: where is the bottleneck as you scale up the size and complexity of the checked system? Constraint solver or number of states? Working close with constraint solving groups to better characterise constraint solving load. Some benign looking problems turn out to take a huge amount of time. KLEE uses an implementation of CoW at the object level to help with states. For coreutils, can have 100000s of concurrent states.
  • Question: did LLVM buy you anything over another symbolic execution technique? Used to do source-to-source translation. LLVM has useful tools and is a real-world compiler which can be helpful with real code.
  • Question: would running KLEE for 24 hours have beaten manual testing for the cases that a one-hour run didn’t? Some interactions with the environment are not effectively supported. Done at the syscall level, but might have been better to do at the libc level (but syscall interface is smaller and easier). Some problems with the constraint solver meant that running for 24 hours wouldn’t nearly be enough, due to exponential explosion in time. Ongoing work to improve this.
  • Question: what about proper checks vs. last-minute checks (in bulletproof code)? Can write checks in C code which gives the developer flexibility.
  • Question: what is the length of the error path for the 56 bugs? For many of them it varies, but 100s of branches must be hit in order to hit the bug. Very complex to find and reason about.
  • Question: did you try to apply tools in the related work to see if they would uncover the same bugs? No. Many of them are not available. Others are not directly comparable, as they aren’t symbolic execution tools.
  • Question: what does KLEE mean? Don’t know! Daniel came up with the name: perhaps a Dutch work. (van Renesse: apparently not a Dutch word.)

Security

Hardware Enforcement of Application Security Policies Using Tagged Memory

  • Software security is in a crisis. No good abstractions to enforce application security policies, so application software has to do it itself, leading to TCB bloat. Large software systems have millions of lines of code in the TCB, which makes it difficult to eliminate bugs or prove correctness. Linux has gone from 170000 lines to over 6 million in the latest version.
  • Virtualisation often used to minimise TCB, which is good for partitioning applications, but not totally adequate.
  • Current hardware is also inadequate. Kernel data structures require fine-grained protection (e.g. process data structures). Some struct members may be globally readable, but not writable. Or permissions should only be given to owners. One possibility is page-alignment, but this is complex and expensive. Paging also does not associate policy with physical resources (problem of aliasing).
  • Proposal is tagged memory. Associate each word of memory with a 32-bit tag, and map tags to RWX permission for protection domain. Gives fine-grained access control, and simplified security enforcement. Helps even with a compromised OS. Policies are tied to physical resources. Allows for a smaller TCB.
  • Evaluated by building a full system on an FPGA.
  • In proposed system, kernel is shrunk to a trusted security monitor (TCB for entire system), responsible for labelling all resources. One or more kernels providing abstractions but no longer fully trusted. System resembles a distributed system.
  • Prototype: LoStar, based on Loki + HiStar. HiStar places all OS abstractions on top of a few types of objects, which can be labelled and control r/w access. Security monitor translates these labels into hardware tags and controls access to shared memory using tags. Loki is the hardware layer, with a tag permission cache in hardware, populated by the monitor.
  • File has an inode, which has some label in the kernel, and a mapping to the pages which comprise it. The monitor maps tags to labels, and the tag permissions are enforced in hardware.
  • Each user’s thread has its own kernel instance, and a tag permission cache. When you try to access something with a tag that isn’t cache, a trap occurs, the kernel looks at the label, and, if appropriate, gives access by loading permissions into the cache. Threads and memory objects have tags associated with them, so can’t directly modify someone else’s runtime state. Monitor provides a yield-to primitive for thread collaboration. Monitor has no scheduler; scheduling done in the kernel, and transitioning between kernel instances is possible. Cannot spoof objects either.
  • For collaboration, could grant access to someone else, and then try to corrupt it. But when a grant is accepted, the tags are changed to belong to the grantee (and this is disallowed).
  • Kernel objects are also tagged. If trying to delete references to a kernel object, because the fields (such as the refcount) are tagged in a fine-grained manner. References can be created by non-owner, and non-owner references can be deleted, but the refcount cannot be written directly (monitor does this).
  • Garbage collector runs in “idle garbage collection” domain, outside the TCB.
  • Can have a hashtable of objects which might individually be modifiable by owners, but not the linked list pointers (so cannot delete someone else’s object).
  • Prototype based on a full-system FPGA. Runs on a modified SPARC processor with an in-order 7-stage pipeline. Loki logic overhead is 19%, which would be a much lower fraction in more complex CPUs. No overhead in clock frequency.
  • Instruction and data caches had to be extended to add cache bits. Permission cache had to be added, accessed on instruction fetches and on loads and stores. Permission checks happen at the trap stage. The pipeline isn’t deepened, which is why there is no clock frequency overhead.
  • Tags tend to exhibit spatial locality. Could use page-level tags and where necessary switch down to word-level tags. Page tag array which can either have a tag in it, or a pointer to an array of word-level tags.
  • Monitor updates tags and permission cache entries. Traditionally have user and kernel modes. Add a monitor mode (ring -1) which has access to all physical memory, and only monitor mode can read and write tags. (Kernel reads tags only; User doesn’t see tags.) This helps to remove the OS from the TCB.
  • TCB reduced from 11600LOC to 5200LOC. Currently based on a RAM disk, but need to extend tags to the hard disk if it is to work with disks (could use a small amount of Flash to store tags).
  • Performance impact is negligible.
  • Significant number of tags are word-level, so fine-grained tags are necessary. Need many tags to protect users, processes, file descriptors.
  • Loki port to a Xilinx XUP costs $300 for academics, $1500 for industry, and the full RTL and HiStar-SPARC distribution is available.
  • Question: can you contrast Loki fine-grained memory protection with Mondriaan? Can you implement LoStar on top of Mondriaan? MMP provides independent protection domains at the hardware level. But LoStar was based on extending application security policies all the way to hardware. Mondriaan relies on the correctness of the translation mechanisms, whereas Loki is orthogonal. Loki and HiStar were designed to work together so they probably play better together. Builds on MMP.
  • Question: does the garbage collector have implicit access to all kernel data structures? It would have to call into the monitor to do this, but tags are used to protect this.

Device Driver Safety Through a Reference Validation Mechanism

  • Cornell is building a new OS called the Nexus. Addressing a key problem in constructing the trusted microkernel.
  • Drivers take up the majority of OS code, but they shouldn’t be trusted. They have a high fault rate and are written by different people from the core kernel authors. They run in the kernel and have privileged access to hardware. Loads to possibilities for trojans, privilege escalation, bugs, etc.
  • Nexus will have secure drivers. Want a small and auditable TCB, which is incompatible with having drivers in the TCB. Too big, too complex, too changeable to audit.
  • Not enough to move drivers to user-space. Only prevents direct attacks by the drivers on the kernel. Devices can also compromise the kernel integrity (using DMA, using interrupts (DoS)). One approach would be to use an IOMMU to stop the device DMAing to the kernel. What if we had a reference monitor in software, which would be more flexible?
  • Nexus puts a reference validation mechanism, acting as a filter, in the kernel, which monitors references by the drivers in user-space. Want to factor out device-specific code from the reference validator. The validation mechanism is parameterised by a device-specific reference monitor, which covers anything device-specific.
  • Device Driver Reference Monitors (DDRMs) can prevent illegal reads/writes by drivers (like an IOMMU), priority elevation, processor starvation, physical damage of the device by the driver, resource exhaustion of the device.
  • I/O model. Nexus is built to run on x86 and use the PCI bus (nothing about the architecture is specific to these). Allowable operations are I/O ports, MMIO, PCI configuration registers, interrupts, shared interrupts, DMA and deep data structures.
  • Have a language for writing specifications, which expresses allowable sequences of operations. A stateful filter, because some operations are illegal sometimes and legal at others. So the reference monitor is a state machine. Each legal operation is a transition in the state machine. Specify variables, alias names and state transitions.
  • Don’t enumerate every state in the state machine, instead encapsulate them in the state variables. Some are device-specific, and some are predefined by the DDRM (e.g. the memory that the driver has access to may be a state variable, but defined at runtime).
  • A transition is a combination of an event and state. Predicate (boolean expression over operation and state variables) followed by actions (modification of state variables). Written in a safe, declarative language (easier to reason about than C).
  • An illegal transition kills the driver, treated much like a segfault. No reason for a good driver to violate the specification: if it does, it isn’t a good driver. The DDRM runs a trusted reset routine from its specification to put the device in a harmless state (otherwise could cause havoc with the system).
  • Drivers are 99% source compatible with Linux. All operations that a Linux driver expects became syscalls in the Nexus kernel. Added traps and macros to capture the actual I/O. Unmonitored DMA mapped directly into driver memory: works for contents of network packets and contents of disk blocks. (Don’t map the data structures which are security-sensitive.) Added an untrusted disable interrupts operation, which only defers its own interrupts, and pauses the execution of all that driver’s other threads. Weaker than sti, but good enough for most drivers (only want to stop reentrancy). This has the bonus of lowering interrupt latency for other drivers.
  • Written drivers for i810 audio, e1000 Ethernet card, USB controller, a USB mouse and a USB disk.
  • Measured TCB, performance and robustness against attacks.
  • Between 0.4–1.3% of lines of code changed in each driver. Added lines as glue code. Specifications are 150–500 loc, which are at lesat an order of magnitude smaller than the drivers themselves. No need for a spec for the USB mouse and disk (the USB controller spec is all that needs to be monitored).
  • Performance evaluation with gigabit throughput test, audio and network responsiveness tests, ping latency, and end-to-end CPU utilisation during video streaming. Compared a Linux kernel, Nexus kernel, Nexus unsafe, Nexus with an empty spec, and Nexus safe (with complete specs and checks).
  • Very little impact on receive throughput; some degradation for small packet sizes when sending.
  • Audio latency adds 46.5 microseconds. But the driver itself has hundreds of milliseconds to respond to these interrupts.
  • CPU latency from 0.8% on Linux to 2% on safe Nexus for streaming video at 1Mbps.
  • Also measured robustness to attacks using perturbation testing (probabilistically modifying driver operations, changing register addresses, length, values read/written). And made a suite of malicious drivers for each execution environment.
  • Perturbation testing: in no case did they crash the kernel or damage the hardware. In a few cases the driver confused itself. With a null specification, after 31 tries, it damaged the hardware, and they stopped retrying this.
  • Safe kernel avoids direct kernel r/w by driver, livelock due to interrupts, DMA kernel crash, DMA kernel read/write, DMA kernel code injection and DMA write device memory.
  • Question: performance data only looked at throughput, which is meaningless because of the CPU utilisation increase. What was the impact of this? The CPU impact was roughly a doubling, captured by the end-to-end benchmark.
  • Question: how would more demanding devices like FireWire or high-speed USB cope, where jitter might be more catastrophic? The latency didn’t have any effect. Did USB 1.1 for this paper and have an experimental USB 2.0, both of which are functionally correct despite a change in latency. But it only degrades performance, never correctness (so no dropped frames when copying video from a camera).
  • Question: who would write DDRMs? Devices manufacturers, or could be open source, or could get a third performance.
  • Question: is having the reference monitor in the kernel necessary for performance? Has to be trusted anyway, so didn’t see a benefit to putting it in user-space, but it could be.
  • Question: were state machines developed by looking at the driver or by looking at what the hardware allows? Looked at both the drivers that we were trying to port, and the specifications of the hardware.
  • Question: how common is damaging the hardware with a malicious driver? e.g. Ask a device to overclock itself, or overwrite flash and abort it halfway which could cause it never to reboot.
  • Question: could you compare the security of the Nexus driver model to the Windows user-mode device model which also runs USB devices in usermode? They are quite similar, but Windows applies it only to devices that don’t do direct I/O.
  • Question: might an IOMMU be useful to enable lazy trapping (like how shadow page tables are used in VMMs) and validation in advance? Problem is that lots of accesses are individual commands, so it is difficult to validate them in advance.
  • Question: did you consider reusing an existing specification language for drivers like “devil”(?) or the MS Singularity language? Took inspiration from the languages but didn’t reuse any of the specifications that were written for them.

Digging for Data Structures

  • Looking at polymorphic viruses. AV programs can’t detect them. You’re running a firewall, you’re still going to get infected, and you’re still going to have to wipe your machines.
  • Signature checkers are basically grep. Can obfuscate by encryption/packing, polymorphism or putting garbage in there. Most of these aren’t even widely used yet!
  • We’d like an AV technique where obfuscation would destroy the human element: we’d be able to tell that code was obfuscated. Reckon that the commonality will be in the data structures (and these won’t be obfuscated).
  • So detect programs based on their data strctures. Emphasis on the field types, not the actual content. Encrypting memory might hide the contents, but the patterns should still be evident.
  • Also need to infer data structures without debugging symbols.
  • Problem: image looks random. We know where the heap is, so we can reasonably reliably infer heap pointers by looking at four-byte intervals. There is a probabilstic mapping between block and atomic types: small integers are usually data, for example.
  • Compute given a value, what is the block type. Is it a string, pointer, zero? No, then it’s probably an integer. Can also divide the heap up into collections of blocks that appear to have the same signature.
  • Can also infer the types of pointers from the memory that is found at those locations.
  • Need some mathematics to determine the number of classes, and whether to put a given object in a given class. Use a standard unsupervised Bayesian classifier, details in the paper.
  • System called Laika, implemented in about 5000 lines of Lisp.
  • Computationally expensive problem. But only 30% of objects contain pointers, there are a large number of strings and typed pointers are necessary. Also some problems with unions (between integers and pointers), and tail accumulator arrays (where the last element is a variable size).
  • Ran programs in GDB, trapping on malloc. 7 real test programs (things like xpdf). Measured probability of correct classification. Want to optimise P(real|laika) and P(laika|real), both around 70% which can improve slightly with some malloc information.
  • Laika virus detector can be written in 50 lines of code. Assume Laika correctly clusters types into classes. Take two programs and weld them together. Measure how mixed each class is and take the weighted average. A mixture ratio close to 50% means that a lot of data structures are shared. If the mixture ratio is less than some threshold (tested on Kraken), then we have a match. Estimated accuracy is 99.4–99.9%. Always correctly classified a program as itself over 150 tests. Expected number of errors was 0.33. Does much better than ClamAV (83–85% detection).
  • How might virus authors adapt to this system? They obviously will, as generic virus detection is undecidable. Mixture ratio is a simple first cut, but both sides can probably do better. Laika synergises well with other approaches, so there’s defence in depth.
  • Simplest attack: memory encryption (but all programs use data structures, so some patterns will still be evident)… XOR all reads and writes with a key. Could shuffle field orders (only removes 50% of the information). Could mimic structures from a benign program (like Firefox, say), but other heuristics should reveal that a program doing this isn’t Firefox.
  • High-level structure requires more structure, and very simple programs don’t have it. But as these botnets become more sophisticated, these will require more structure. Also there is computational expense.
  • Best related work is all from Wisconsin.
  • Question: what are the challenges involved with getting a consistent snapshot of memory? (Some malware is careful not to do all of its evil at once, and how do you figure out when that is?) Just ran for five minutes and looked at the data structures… and did pretty well.
  • Question: class of non-malicious applications that make extensive use of obfuscation for protecting IP, so can you differentiate these? Could sign binaries.
  • Question: how does using different languages and compilers affect this? Well, it would fail on a Lisp virus. On Java, you have a lot of metadata from all the classes. Just looked at C++.
  • Question: attacker code might try to detect the memory-detection sandbox, and try to confuse you, so can you sidestep this? We rely on a working sandbox, and it did for the three viruses considered.
  • Question: what if you detect that several programs share the same data structures, such as the libc ones? The main trick is to train based on actual viruses, and compare to these.
  • Question: should you forget about the bad guys, and use this technique to understand programs that might be buggy? A PoPL paper on doing just this, which assumes knowing the structure definitions. So this would only be useful when you don’t know your data structure definitions. (Might want to do something like failure-oblivious computing so you can fake it?) Might work.
  • Question: will Laika get confused from strings that are encoded in different languages, using UTF-8? Detect UTF-8, and just called it ASCII.
  • Question: what if you serialise and encrypt an object before putting it on a heap, and only keep it in plaintext on the stack? The trick is to be careful about the kind of detection used. This behaviour would be very conspicuous, and Laika would get confused.
  • Question: do you have to run Laika against everything for five minutes, and compare against every possible signature? Run once, take a snapshot, then compare against all signatures. Polynomial time in the number of classes and the number of objects.

Dealing with Concurrency Bugs

Finding and Reproducing Heisenbugs in Concurrent Programs

  • Concurrent executions are highly non-deterministic, and rare thread interleavings can result in Heisenbugs. These are difficult to find, reproduce and debug. Just observing the bug can “fix” it. The likeliness of interleaving changes when adding printfs. In practice, developers can spend weeks on finding and fixing these.
  • CHESS is a specialised user-mode scheduler, attaches to your program and controls all scheduling non-determinism. Guarantees that every program run takes a different thread interleaving, but then lets you reproduce any particular run.
  • Motivating example of a “thread-safe” bank account class, but with a piece of incorrect synchronisation. Can run the program with “stress”: additional threads that do random stuff.
  • Demonstration and of course the bug didn’t turn up: illustrates the point about Heisenbugs.
  • CHESS is a collection of .NET or Win32 wrappers which trap all synchronisation-related calls, and puts the program under the control of CHESS. Every run takes a different interleaving, and the interleaving can be reproduced for every run. Developing a version of CHESS for the NT kernel.
  • Goal: any error found by CHESS is one that is possible in the wild (no false positives, soundness of CHESS, CHESS does not introduce any new behaviours). Don’t want people chasing CHESS bugs! Also want any error that is possible in the wild should be found by CHESS: completeness, no false negatives, but very difficult. Need to capture all sources of non-determinism, and explore that, which leads to a state-space explosion. Goal is to beat stress-testing.
  • CHESS is a scheduler. Its goal is not to find bugs (need to introduce assertions to do that). CHESS can find deadlocks and livelocks. It drives the program along an interleaving of choice.
  • CHESS introduces calls to the CHESS scheduler at potential preemption points. Inserted before every synchronisation opeation. Could insert a random sleep at schedule points. This does not introduce new behaviour, simply modelling possible preemptions, and guarantees starvation-freedom.
  • Two main improvements over random sleeping. First, capture the happens-before graph. Any delays that result in the same happens-before graph are equivalent, so only need to explore one of them. Second, understand the synchronisation semantics, so avoid exploring delays that are impossible and identify when threads can make progress. Does this by maintaining a run queue and a wait queue (mimicking the OS scheduler state).
  • Emulate execution on a uniprocessor, and enable only one thread at a time. This controls the order of data races. Reducing the parallelism is the price we pay to get determinism. What about multiprocessor-only bugs? Any concurrent execution can be emulated on a uniprocessor. So we introduce schedule points before synchronisations, volatile accesses and interlocked operations. But can also introduce these points before every memory access, to get even more accuracy: capture all sequentially-consistent executions. Even have a non-sequentially-consisten execution model, which understands the x86 memory model intimately.
  • What about input nondeterminism? Rely on the user to provide a test harness that makes the input deterministic (can use techniques like KLEE and R2 to do this).
  • Translate all the .NET/Win32 synchronisations (often complex and under-specified) into CHESS scheduler abstractions. Enables handling multiple platforms. e.g. Asynchronous read file is like forking off a child task.
  • Problem of state-space explosion O(n^nk) interleavings (k scheduling points, n threads). CHESS only inserts preemptions at a small number of places, which gives a state-space that is polynomial in k.
  • Found a Heisenbug in  MS Concurrency Coordination Runtime. CHESS found it in a minute. Found many other concurrency bugs, and other people have used the tool to find bugs themselves.
  • Current status: working to get it shipped as an Add-on to Visual Studio. But there is a command-line version available on the website. Now looking at kernel-mode CHESS, search heuristics, and bugs in IE, Windows Shell and graphics libraries.
  • Systematic exploration of scheduling nondeterminism can be more effective than stress-testing.
  • When designing APIs and abstractions, please think about the plight of the programmer who has to program on top of them. Minimise the nondeterminism, and provide good specifications!
  • Question: have you thought about other sources of nondeterminism, such as rand() and time? Time is handled, so e.g. a timeout would happen in a CHESS execution. But gettimeofday() isn’t considered. You would be better off using symbolic execution with something like KLEE. Random numbers are determinised, but the choices are not explored.
  • Question: how do you handle home-grown spinlocks? We do handle spinlocks, as part of guaranteeing starvation-freedom. The backoff in a spinlock, and hyperthread yielding, is handled by CHESS.
  • Question: how do you handle loops and recursions? It is a dynamic analysis tool, and we assume that the user has given a test program that terminates, and it will just run it.
  • Question: how large are the tests that you use, in terms of lines of code? Run on things like Windows Explorer. Usually take existing stress tests, and fork fewer threads. What about other model checkers? Haven’t looked at them, but think CHESS is the only model checker that could deal with things that large.

Gadara: Dynamic Deadlock Avoidance for Multithreaded Programs

  • Parallel programming is tough. It’s hard to parallelise algorithms and data structures. In this talk, looking at deadlocks. Might need to look at a large part of a program to understand why a deadlock occurs. Just composing two bits of deadlock-free code is not guarateend to be deadlock-free. And multicore is making parallel programming inevitable.
  • Goal is to avoid circular mutex wait deadlocks by postponing lock acquisitions.
  • Using discrete control theory to perform a source-to-source translation that inserts extra code to avoid deadlocks.
  • Architecture: start with a C program, get control flow graph, translate it into a Petri net (formal model from discrete control theory), then do control logic synthesis to get control logic, which is compiled together with the original source to get an instrumented executable. The control logic observes and controls interactions between threads.
  • Look at the example of the dining philosophers problem. Random fork-grabbing gives a simple control flow graph with four basic blocks.
  • Petri net is a bipartite graph with two kinds of node. Circle is a place and bar is a transition. Arc connect places to transitions and transitions to places. Tokens are passed through the Petri net. When a transition fires, it takes a token from one input and puts it in every output. Can model pthread_create(), lock acquisition and lock release. A token is a thread of execution or the availability of a lock. A place is a basic block or statement. A transition is a jump or lock acquisition/release.
  • Simple translation from control flow graph to Petri net, augmented with the identity of locks.
  • Showed a huge Petri net for OpenLDAP, but need automated tools.
  • Siphon (deadlock) analysis. Siphon = set of places whose input transitions are a subset of their output transitions. Once empty, a siphon is empty forever. Proven that a deadlocked Petri net has at least one empty siphon, and structural analysis makes it easy to find them. No state-space explosion.
  • Deadlock freedom means no empty siphon, so the total number of tokens in a siphon should always be >= 1. Specify this as linear inequalities. Control logic adds a new control place which guarantees that the linear inequality is always satisfied. Control places are local, independent, concurrent and easy to implement. Adding control places may add new siphons, but this can be prevented by iterating on the Petri net.
  • A control places is implemented by a token variable and a lock/condition variable pair. Control transition checks and updates the tokens in the control places, using a “gadara lock”. Observe transition updates the tokens in control places, using a “gadara replenish” operation.
  • Most lock/unlock operations are unaffected, so no overhead.
  • Gives provably deadlock free code!
  • Moves most of the computationally expensive analysis offline, and includes only modest overhead at runtime for the control places.
  • Did a case study on OpenLDAP. Showed a deadlock bug, and how it results in a siphon in the Petri net. (And how the siphon can be fixed with a control place, and how the code is modified.)
  • Gives maximal permissiveness and maximal concurrency. Don’t want to block any thread or transition unless it will lead to a deadlock state in the worst case.
  • Bug was identified and fixed manually, but newly-added code reintroduced the bug. A second fix required a complete rewrite. Gadara would have had deadlock-free code at all points.
  • Challenges: need function annotations to provide some form of data flow analysis (does calling a function result in acquiring a lock?). Type analysis helps with dynamic locks to show how many locks and which lock instances are acquired. Gadara chooses to be conservative when the model is imperfect. What about uncontrollable programs with blocking syscalls and IPC? These may give incurable deadlocks, but Gadara warns about these. Also C features such as function pointers. Don’t care about function pointers outside critical sections, only those within critical sections. For the rest, use annotations or type signatures to conservatively assume the set of functions that might be used.
  • Evaluated on OpenLDAP, PUBSUB benchmark and BIND. Tool has a 11% overhead with an add/delete workload on OpenLDAP.
  • OpenLDAP: 1795 functions, 495 after pruning irrelevant functions. 41 lock types, 25 of 28 ambiguous functions were annotated in an hour. Four siphons found, including one confirmed deadlock from the bug database, one false positive (required data flow analysis) and two previously unknown deadlock bugs. Performance overhead only when the workload exercises deadlock-prone paths.
  • Gadara eliminates all deadlocks and is provably correct. Only modest runtime overhead because it is maximally permissive and calculations are performed offline (and most locks are not deadlock-prone).
  • Many existing approaches, but all differ because only Gadara uses control theory. DCT is a principled foundation for dynamic deadlock avoidance. Works on real programs, and useful in several scenarios (rapid development, corner-cases in mature code and end-user bug fixing).
  • Question: assuming a program with recursion, and the procedure has a summary of what happens to tokens, is it possible to use this information for the synthesis? (Taken offline.)
  • Question: why not just fix the deadlocks if you know where they are and you know how to fix them? Gadara may have some false positives. Admittedly you could get better performance by doing this, but Gadara is automated.
  • Question: did you consider just adding a simple lock acquire and release before and after the siphon? In some cases this would work, but for e.g. 5 dining philosophers, the control logic is more involved.
  • Question: what would be the scalability if you weren’t limited by the hardware (2 cores, limited memory)? Similar result in the paper for the 16-core case.

Deadlock Immunity: Enabling Systems to Defend Against Deadlocks

  • A real system can be in one of two states, running or failed. When it hits a bug, it fails. Eventually it will come back, another bug will strike and take the system down. But when the first bug occurs again, the system will still fail. It doesn’t learn from past failures.
  • Give it an immune system. On seeing a bug, save an antibody to match the bug, and when you see that bug again, the antibody will neutralise it and let the system continue as if nothing had occurred. Like immunity against a pathogen.
  • Dimmunix gives immunity against deadlock bugs for Java and C++ programs.
  • Run a program, learn the executions that lead to deadlock, save their fingerprints in persistent history and use them to identify deadlock cases in future so as to avoid them.
  • Say you use CHESS and Gadara to rid your web browser of any deadlock, but then you load a plugin that hasn’t been so vetted, and it introduces a deadlock bug. Software vendor can distribute the antibody to end-users, or end-users can run Dimmunix to defend against deadlocks in some buggy code.
  • Lock inversion leads to a deadlock bug. Look at the stack trace which leads to the lock inversion, and abstract away parameters. The call flow is the signature that we will save (a deadlock pattern). Actually we want the list of EIPs from the call stack to identify particular call sites.
  • A deadlock bug leads to a single deadlock pattern, and a deadlock pattern can lead to several hangs. (Only in two cases did they see >1 patterns for a single bugs.)
  • Dimmunic intercepts calls to lock/unlock, detects the deadlocks, saves the signatures and avoids executions that match the patterns. The history of patterns is the immune system.
  • Versions for Java and POSIX threads. Java: instrument the bytecode. PThreads: shim library to add instrumentation. This means we need no assistance from programmers (no annotations) and users, and no need for source code. Works for closed source and commercial software as well.
  • Use a resource allocation graph (RAG) to detect deadlocks. Hold edge from lock to thread holding it. Allow edge from thread to lock that it is blocked on. A deadlock will be a cycle in this graph. But once you hit this cycle, it’s too late to do anything about it, but Dimmunix will save the pattern that lead to it. Add a label to the hold edge which indicates the call stack of the thread when it took the lock. These are collected to form the signature, which is saved in the history.
  • Now you’re hung, so you’ll have to take steps to restart the threads or the program.
  • Next you do avoidance. When there is an attempt to take a lock, see if there are any matches in the history. A partial match is okay, but if you have a complete match (all stacks are present), you do deadlock avoidance, by suspending the last thread that attempted to take a lock (in the signature). So Dimmunix slightly tweaks the thread schedule to avoid deadlock patterns arising.
  • Does it work? Took a number of C/C++/Java systems. Largest was MySQL (million lines of code). Went through their bug databases, looking for user-reported deadlock bugs. Managed to reproduce about 12 of them. Ran the test cases 100 times and they deadlocked 100 times. With Dimmunix, ran 100 times and only deadlocked the first time, worked every other time.
  • What about performance? Dimmunix does a lot of work, and if it is in the critical path, it would slow down the problem. Aimed to make it as asynchronous as possible, moving a lot of work into a separate thread, and used lock-free data structures. Paper has implementation details. Measured overhead on a lock-intensive microbenchmark, looked at lock operations per second. Dimmunix had very little overhead compared to baseline, though it increased at 512 and 1024 threads. Generated a history of 64 synthetic deadlocks. Dimmunix introduced thread suspends, which account for the overhead. 0.6–4.5% overhead for up to 1024 threads.
  • Need to match signatures on the critical path. Therefore hash the call stacks and index into precomputed in-memory tables. Keep a thread-local cache of data structures to reduce contention. Common case is no matches, which is efficient. By varying the number of signatures in history up to 256, the lock operations per second overhead doesn’t really change. (And you wouldn’t expect to have that many signatures anyway.)
  • Suspends can induce starvations because you’re yielding waiting for a blocked thread to make progress. (Haven’t seen this in reality, but can construct programs that do this.) So introduce yield edges to the resource allocation graph from the suspended thread to the one that caused the suspension (by holding a lock): label on edge is the label from the corresponding hold edge. This leads to a cycle in induced starvation cases, and can be detected in just the same way.
  • Dimmunix can have false positives, if you try to avoid a deadlock that would not have occurred. They don’t impact safety, but can impact performance (sometimes positively, by reducing contention). Trick is to adjust the precision of matching the call stack to generalise the pattern: how many call frames should you choose. Use a heuristic what-if analysis post-factum. If it sees no inversion then you have a false positive, and you eventually increase the precision. Dimmunix focuses on control flow and doesn’t look at input dependencies: no concrete program state in the signatures, so you get false positives but also get general, portable signatures.
  • Someone has to experience the first occurrence but can then immunise everyone. It cannot affect deadlock-free programs and cannot induce incorrect outputs in (non-real-time) programs. It must be aware of all synchronisation mechanisms (so homegrown spinlocks will not be identified).
  • Complements any approach that is aimed at reducing deadlocks (including Gadara, and it’s still useful because of bugs outside the code that you have checked like the web browser plugin case), including static analysis, model checking and transactional memory.
  • Question: could you extend Dimmunix to provide a trace for programmers to debug their code? One of my students is working on that: how to reproduce a deadlock given a signature. Trying to make it work for programs where model checking is still not applicable (millions of lines of code).
  • Question: what about when your immune system misbehaves? Say you have some functionality in your program in which all execution paths are deadlock-prone: the conservative approach won’t let you make any progress. One possibility is to log the signature every time you do an avoidance and use a pop-up blocker-like interface to tell the user that a deadlock was avoided, and you can prevent it avoiding it in future. The other possibility is to put an upper bound on the yielding.
  • Question: can this idea be generalised to other symptoms like segfaults? Another student is looking at this idea, applying failure immunity to other situations, like resource leaks, and we hope to be able to generalise it.
  • Question: seems like microbenchmark stopped scaling at a certain point? The graph flattens out and there is a limit to how many lock acquisitions you can do per second. Also the curve seems to be dropping off, but this is amplified by the logarithmic x-axis, so it’s really very slight.
  • Question: how many deadlock bugs that you find can actually be used as a vaccine (presumably with MySQL it would have to be independent of the application running on top)? But the patterns are specific to MySQL, and any application on top could use them.
  • Question: do you have any results with more frequent locking behaviour? Some workloads with 300000 locking operations on that kind of machines? As you vary the interval between lock acquisitions, the overhead is proportional to the number of instructions that you are placing in the critical path. In the scenario where you keep a lock for 1us then wait 1ms and acquire it again, we were trying to be more representative of a real system with I/O, or something like that. The overhead becomes signifcant when the interval is down at the microsecond level.

OSDI 2008: Day 1

Wednesday, December 10th, 2008

Opening Remarks

  • 193 submissions with three reviews each, through a two-level PC (16 heavy load and 17 light load). Three rounds of reviews.
  • At PC meeting, each paper had reviews from at least 5 heavy-load PC members.
  • 26 papers of which all were shepherded
  • Record attendance: 466 participants, 113 student grant awardees (some of whom stood up when asked).
  • 6 co-located workshops.
  • PC drawn half and half from industry and academia.
  • Used HotCRP and Banal for managing submissions.

Cloud Computing

DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language

  • Awarded a best paper award.
  • Problem is how to write distributed data-parallel programs for a compute cluster. Want to provide programmers with the illusion of programming for a single, sequential machine. Want it to scale up to multi-core and distributed resources. Want to use familiar languages and tools.
  • DryadLINQ programs written in Visual Studio, a plan is generated by DryadLINQ, and the program is executed in the cloud.
  • LINQ = Language INtegrated Query, available in current versions of Visual Studio. Provides traditional relational operators to work on datasets. Integrated into .NET languages (can invoke arbitrary .NET functions). Data elements are strongly-typed .NET objects. Extensible by adding custom operators and execution engines. DryadLINQ exploits this last fact.
  • DryadLINQ is one of many LINQ providers (also LINQ-to-object which is sequential, LINQ-to-SQL which uses a SQL database, and PLINQ which exploits multicore). The same program can run on any of these and scaled from single core up to cluster-level.
  • Dryad manages the execution: a general purpose distributed execution engine, in use for several years within Microsoft. A general DAG, with edges representing files, TCP, FIFO, etc. for communication between vertices (representing execution steps). Provides a vertex scheduler and fault tolerance.
  • Simple example: word count. SelectMany operator can take one dataset and return potentially many objects for each element (e.g. words from a document) — like the map function in MapReduce. Also GroupBy (regular SQL interpretation: combines identical keys into one dataset each) and Select (run once for each element).
  • DryadLINQ generates a Dryad execution graph, and code for executing the vertices and managing serialisation between the vertices. Output is stored in the cluster, but the client can obtain it through a DryadTable interface.
  • First step: generate distributed execution plan. Similar to optimisation in a parallel DB. Static optimisations include pipelining and eager aggregation (e.g. MapReduce combiner). Also dynamic optimisations: data-dependent partitioning (range or hash-based), dynamic aggregation, etc.
  • Also code generation: vertex code, serialisation code, callback code (to enable runtime optimisations).
  • LINQ query is separate from its local context, which means that all referenced objects in the query must be serialised and shipped to the cluster machines, as must any referenced DLLs.
  • Reduce is implemented as a sort, group-by and reduce function (local count, in the word count case).
  • DryadLINQ operations are converted a graph of vertices, and then vertices are multiplied so that they may be executed in parallel.
  • More interesting example: PageRank. Written in about 10 lines of code in Dryad LINQ. First join pages with ranks and propagate updates. Then re-accumulate the score for each page. Can see that program invokes DryadLINQ and DryadLINQ invokes program functions.
  • Note that the output of one round is already partitioned, so there is no need for data shuffling on joining in subsequent rounds. A shared memory FIFO is able to pass ranks from one iteration to the next, so no need to go via stable storage.
  • Execution can be combined with PLINQ on a single machine. And with LINQ-to-SQL which can allow retrieving data from a SQL database (and some functionality can be pushed into the database). Also runs locally for debugging purposes.
  • Currently works with any LINQ-enabled language, multiple storage systems. Internally released within Microsoft, and now released to some academic partners.
  • Learned that deep language integration worked out well: allowing programmer to use his favourite languages, tools and libraries. The useful enablers were LINQ extensibility, and .NET reflection and dynamic code generation; also the generality of Dryad (DAG model and runtime callbacks).
  • Future goal is to use a cluster as if it is a single computer. Firstly, what can be written with DryadLINQ. How can performance and usability be improved? How should concurrent jobs be scheduled? How can caching and incremental computation be exploited? What can be done with static checking?
  • Question: what control does the programmer have over the compilation process? Answer is to provide a policy to the Dryad execution runtime. At present, DryadLINQ hasn’t really dealt with this problem.
  • Question: what about continuous queries, instead of batch jobs? Dryad is a batch-processing system, so DryadLINQ inherits this property: would be an interesting research area to look at how to extend Dryad this way.
  • Question: what are the syntax and semantics of the language (like the relational algebra for SQL)? No explicitly parallel instructions are provided: the code is written in the relational algebra, which is highly parallelisable, and the execution plan is generated from these automatically. C# 3.0 has a syntax and semantics specified for LINQ, and these are inherited. Recursion is possible.
  • Question: systems like MapReduce have made people unaware of the sheer cost of a computation (which might use millions of dollars worth of resources) has any headway been made on how to give the developer an intuition about the cost of a computation? At the moment, we are still learning about this: it’s like a new toy. But there has been work into looking at optimisations, which could reduce this cost.
  • Question: have you talked to the ML and data mining people about how to model the performance and cost of DryadLINQ? ML and data mining researchers are using the tool right now, but understanding the performance is ongoing work.

Everest: Scaling Down Peak Loads Through I/O Off-Loading

  • Problem is that of I/O peaks on servers: that which is short, unexpected and high. These are typically uncorrelated across servers, and across volumes on a single server, and can lead to high response times.
  • Example is a trace from a production Exchange server (for 5000 MS employees). 7.2TB across 8 volumes, with hardware RAID, NVRAM and over 100 spindles. Despite all this provisioning, it frequently hits peak load: and the response time and load are 20x the average. Peaks occur but are very bursty.
  • Write off-loading inserts an Everest client in front of volumes that are being managed. Everest stores are placed on other machines in the cluster, which can export spare capacity. Under light load, Everest simply passes through reads and writes. When load gets higher, the writes are sent opportunistically to another Everest store that has low load. The base volume sees lower load and the response time improves. When the load recedes, the base volume now gets reads and writes passed through, and starts to reclaim blocks from the stores to which they were off-loaded. Each node would typically have a client, and a small portion of space set-aside as a store.
  • Peaks are uncorrelated across volumes, so less-loaded stores are typically available. Peaks must involve some writes, so off-loading the writes means reads see less contentions. And we see few foreground reads on off-loaded data: if they are frequently written, they are probably in the buffer cache.
  • Challenges. Wanted to have flexibility to send any write anywhere (always the least-loaded store to maximise load balancing). Wanted reads always to return latest version. State must be consistent and recoverable: plugging in at the block layer means applications above expect persistance — need to track latest and stale versions (for garbage collection). Want no meta-data writes to the base volume (because it would presumably be under stress).
  • Recoverable soft state is the meta-data to track off-loads. A map from block ID to its location and current version. Also need this map for stale versions. This meta-data cached in memory. Each off-loaded write has a header with the range of blocks used and the location. Meta-data is persisted on stores, so no synchronous writes to base volume: the base volume only needs to know the set of stores, which is small and infrequently changing. For a client to recover, it contacts all stores and gets the relevant meta-data; for a store to recover, it need only scan disk.
  • Stores are optimised for short-term, write-optimised storage. Store layed out as a simple, circular log, in a small file or partition on an existing volume. Not log-structured, data will be reclaimed so no cleaner. The store monitors underlying volume load.
  • Reclamation is a background task: read any blocks from a store that have been off-loaded, then write them to disk locally and delete the range. In practice, multiple concurrent reclamation threads are used.
  • Correctness invariants and consistency are maintained. I/O on an off-loaded range is always off-loaded. Reads are sent to the correct location, but writes do so to ensure that the latest version is always recoverable. Version deletion is only allowed if the new version is stable and written to a store, or if the data has been reclaimed and all older versions are deleted.
  • Evaluation based on replaying the Exchange server trace (24 hours long). Chose time segments with peaks, and extended them forward to see the effect of reclamation. Local server had 14 disks and 2TB of storage, so used a 3-volume subset of the Exchange trace. Chose max, min and median loaded volumes. A store was created on each volume, with 3% of disk space reserved. Clients were symmetric (could off-load to the other two volumes). Response time is significantly reduced for both reads and writes, both for mean and 99th percentile (numbers are 10 times higher).
  • Substantial improvement in latency for a real I/O server workload. Cannot tell end-to-end effects from an I/O trace however, so ran an OLTP benchmark with a SQL server binary that used an Everest client for I/O. 10′ warmup for caches, followed by 10′ measurement run. Transactions per minute improved by 3x by write off-loading, with only one extra disk. Just using log-structured FS gives 2x improvement, using 2-disks striped also gives about 2x improvement. By striping and log-structuring, you get the same 3x improvement as write off-loading (but it is easier to manage write off-loading).
  • Not a panacea: only works for short-term peaks, relying on burst behaviour. Cannot improve performance 24/7, because idleness is necessary for reclamation to work: in the long term this could lead to contention. Also presume that data is reclaimed before the store is filled up.
  • Question: synchronisation problem between updating the metadata on the client and persistent data on the server? Wait for the data to become persistent, and then can be sure that we have the version.
  • Question: can a sys-admin specify certain data ranges that should not be off-loaded because you might want to read them back straight away? Can be configured on a per-volume level, or a per-file level. Conceptually possible to do at either granularity.
  • Question: any analysis of how much correlation between volumes can be adapted to before the system breaks down? Not yet, and it’s not clear what kind of synthetic tests could be run to vary correlation.
  • Question: need to do anything special to support journalling file system, where the order of writes is important? Assume that if concurrent I/Os are sent in, then you are agnostic about the ordering, and write barriers could be put in at a higher level for serialisation.
  • Question: could larger caches achieve the same benefits? They’d have to be very large caches, because it deals with bursts of gigabytes.
  • Question: does the interposition of hardware RAID introduce any additional failure modes? The new dependency is between the client and the stores to which the writes are off-loaded, and these may or may not have hardware RAID.

Improving MapReduce Performance in Heterogeneous Environments

  • MapReduce is becoming popular in industry: even outside the usual suspects, due to open-source Hadoop. Promise is to allow running large scale jobs on large clusters of machines (20PB/day at Google).
  • Also see utility computing becoming more popular, especially Amazon EC2, which lets you rent machines at a cost of 10 cents/VM/hour. Good for letting you try out a system at scale before paying for physical resources. But performance is not as predictable as if you would be running it in your own data centre. Lets you run experiments on 1000 machines for a price of $100 per hour.
  • Main result is that Hadoop is challenged by performance heterogeneity between nodes, due to contention. This breaks task scheduler assumptions. The new scheduler, LATE, reduces response time by a factor of 2.
  • MapReduce splits computations into independent parallel tasks. Hides the complexity of fault tolerance at this scale. One FT mechanism is responsible for the problems we see.
  • If a task fails, re-run it on a different node. If a node is very slow (a straggler), you run backup tasks. Stragglers could be due to a failing disk (recoverable read errors), or other reasons.
  • Backup task races with the original to see which wins, and when one finishes, the others are killed. Very easy to decide who the stragglers are in a homogeneous environment. But less obvious in the heterogeneous case.
  • Heterogeneity. VM is good at isolating the CPU and memory access, but I/O bandwidth is inherently shared, so disk and network contention means that everyone gets an equal share. The performance differs by 2.5x between machines that run with a single VM, and machines that share with others.
  • Backup tasks are run after all primary tasks are started, and nodes begin to become free. Tasks report a progress score from 0 to 1, and a backup task is launched if a task has progress more than a fixed threshold below the average.
  • This can lead to starting too many backups, which thrashes shared network bandwidth. It also might mean that (under heterogeneity), the wrong tasks are backed up, which stalls the backing up of the tasks which are actually stragglers. And backups could be started on the slow nodes. And if tasks start at different times, this can break (since tasks started later will obviously have less progress). For example, ~80% of reduce tasks arer backed up, most losing to the original tasks.
  • Idea (patented) is to use the progress rate instead of the progress values. This improves things, but can still select the wrong tasks as backups.
  • Key idea is to back up the task with the largest estimated finish time: LATE = Longest Approximate Time to End. Look forward instead of looking backward. On top of this, put in some sanity checks, capping the number of backup tasks, launching backups on fast nodes and only backing up tasks that are “sufficiently slow”.
  • To estimate the finish time, estimate progress rate as progress score over execution time. Estimated time left is the amount of progress left to make over the progress rate. Put at 10% cap on backups and use the 25th percentile to select slow nodes. These values were validated by sensitivity analysis.
  • Evaluated on EC2 (200–250 nodes) and a small local testbed. Self-contention generated by VM placement. Stragglers generated by background processes.
  • Ran a sort on EC2 with stragglers. Gain a speedup of 58% over native Hadoop, and 220% speedup over no-backup case (on average). 93% maximum speedup over native.
  • Heterogeneity is a growing challenge for parallel applications. Since MapReduce abstracts away the problems of programming at a large scale, it would be nice to abstract away heterogeneity as well.
  • LATE contributed back to Hadoop.
  • Question: surely the only variable that matters are the configurations of the machines on which the job are run, so why don’t you just use a greedy local strategy? More sophisticated strategies are possible.
  • Question: what about a small cluster where you have more tasks than nodes? First launch all the primary copies before considering backups.
  • Question: what if the individual tasks are heterogeneous? Currently assuming that tasks take the same amount of time, and that Hadoop is taking care of that. Would need to normalise by the end of the task, and prioritised based on expected gain.
  • Question: any advice for people who build utility computing services? Two things would have helped: would be good to have more monitoring and visibility into the system; and also to know the network and rack topology (to aim to place tasks close to the input data).

OS Architecture

Corey: An Operating System for Many Cores

  • About new OS interfaces to help applications scale with number of cores. Applications spend time in the kernel: e.g. Apache doing directory lookups and TCP processing. Even MapReduce spends time in the kernel. Want this time to scale with the number of cores.
  • Problem is contention on shared data structures in the kernel. Need serialisation and copying between caches. OS semantics requires these to be shared, unfortunately.
  • Current practice is to redesign kernel subsystems with finer-grained locking, RCU, etc., which has been lots of work for kernel developers.
  • Solution is to change the OS interface: applications don’t always need to share all this data. Want to allow applications to control how this is shared.
  • Three new interfaces: shares, address ranges and (not discussed) kernel cores.
  • Shares. Kernel must map an application-visible reference into the address of a kernel object: e.g. file descriptors. Measured the cost of using the FD table on a 16-core Opteron. Would expect or hope that dup()/close() operations (on different FDs) would scale linearly with the number of cores. But the drop in throughput from one to two cores is almost an order of magnitude. Throughput drops because of the high latency of retrieving data from a remote cache. On one core, it’s in L1; on an additional core, it has to go to the remote L1, which takes 40x as long. Also, the shared FD table is a bottleneck as accesses are serialised.
  • Can performance be improved? Some of this sharing is unnecessary, and could improve performance for these applications. How to figure out when it is and isn’t appropriate to share these structures? Solution is shares.
  • Shares let applications allocate a global table for FDs that are shared amongst cores, and allows FD allocation either in the shared or the global table.
  • Shares were added to Linux. FD system calls (sys_open, sys_dup, etc.) were extended to take a share ID argument. Can have threads with their own private FD table (using share_alloc() call), which gives the linear scaling we expected.
  • Few modifications to the kernel and applications required to get scalability.
  • Address ranges. Multiprocessor shared-memory application can be written to use a shared address space, or shared memory with mmap(). With a shared address struct, contention on mm_struct, even when modifying private mappings (e.g. cores growing their own stack). With private address spaces, one mm_struct per core, so no contention on updating private mappings. But, first time a page is touched, a soft page fault happens, which allocates physical memory if necessary: this will happen once per core with private address spaces, because there is no PTE sharing.
  • Neither option accurately represents how kernel data structures are being used.
  • Address ranges avoid contention while sharing PTEs. Each core has a private, root address range. For shared memory, you can add a child address range to the root, and map it into each root. Contention only when both manipulating the shared AR. Page tables corresponding to address ranges can also be shared. Therefore the soft page fault only happens once.
  • Also good for complex memory sharing patterns: not just shared or private.
  • Building an inverted index with MapReduce. Each core gets a partition of the input, builds up buckets with the locations of each term (map task). Reduce task on each core generates the inverted index. Goal is to have no contention when growing the address space using map (due to buckets growing). PTEs should be shared between map and reduce tasks.
  • Each map task’s bucket array is a different address range. The PTEs are then shared with all the reduce tasks, which means no additional page faults.
  • Address ranges implemented in Corey, with a low-level kernel interface for mapping memory and address ranges. Performance evaluated by building an inverted index of a 1Gb file. Throughput improves better on Corey than Linux after 8 cores, then Corey is 25% better than Linux on 16 cores.
  • Built on research on NUMA operating systems and multicore.
  • In future, finish Linux interface changes, to experiment with larger workloads, and answer how much of the interface needs to be changed. Also want to look at how to use caches better to reduce the cost of manipulating shared data. How can the large aggregate cache capacity be exploited.
  • Also see kernel cores, in the paper.
  • Question: memory subsystems at the moment are dumb, but what if you had an ALU in your memory subsystem which could do atomic operations for you? Probably a lot.
  • Question: were the shares evaluated using more-realistic workloads? In Corey, shares are used for everything, and there was lots of contention on these shared data structure. Some Apache benchmarks scaling up to 16 cores were showing lots of FD table operations.
  • Question: how could the ideas of reducing sharing be applied to other systems like databases or libraries? Haven’t thought about it.
  • Question: as the number of cores increases, wouldn’t any sharing kill you? In HPC systems, message-passing is used, so wouldn’t this be a better approach? The focus is on system level applications (not HPC workloads), but interested to look at how difficult it would be to implement applications without shared memory. Seems like you have sharing globally across the whole chip, or keeping things private, so as that chip gets bigger, is having global sharing a scalable model? The Corey model allows semi-private shares, as well as global or private.
  • Question: when there are private copies of things, there are issues with consistency, when doing something like a global unmap, aren’t there? Is there an operation to globally unmap a page? Not dealing with private copies, rather private address spaces, so that operation wouldn’t exist.
  • Question: reminded of early NUMA work, but is it possible for the OS to automatically pull out the locality? A topic for future work, but it is in some ways to tell the OS exactly what you want, and then get it. But it might be possible to extract some benefit.

CuriOS: Improving Reliability through Operating System Structure

  • Looking at faults in systems: hardware (stuck-at faults, bit-flips, unresponsive devices, soft error rate a growing problem) and software faults (race conditions, invalid pointers, uninitialised variables) with something like 20–30 bugs per 1000LOC.
  • Could manage OS errors by writing OS in a type-safe language, or prove using a formal model. The other approach would be to let an error happen and then recover from it (microkernel approach).
  • Restart-based recovery: when a microkernel server fails, it dies and is recreated — hopefully everything is fine. Tested this theory on Minix3 and L4/Iguana with fault injection. Minix3 has inter-component isolation, and has a reincarnation server to restart failed services. Injected faults in the network service, which can restart, but all existing network connections fail. On L4, killed the timer service, but all OS components registered to receive periodic interrupts stop working. On Chorus, restarted the file system service, and all file handles became invalid and the system is unusable. On EROS, there is no protection between OS components, but it does periodic checkpointing of complete system state, so can recover by loading from a checkpoint. This works, but there is overhead with checkpointing.
  • Problem is that there is server state that is lost when the server is restarted. The fresh state doesn’t reflect the existing connections, handles, registrations etc.
  • Restart-recovery works for stateless components. To get client-transparency in stateful components, you need: transparency of addressing, client suspension during the recovery, client-related state persistence, and client-related state isolation (errors in the service mustn’t corrupt it).
  • No microkernel tested provides all four of these. None provides isolation of client-related state.
  • CuriOS partitions server state per client and moves client-based information in separate regions of memory (notionally with the client), persisted across restarts. “Server state regions”: isolated from the clients and the server.
  • When a client interacts with the server, the client info is mapped temporarily into the server’s address space. On response, it gets unmapped. This gives error containment: it can only affect the client which is currently interacting with the service.
  • CuriOS is object-oriented, written in C++, and uses hardware-based protection to isolate the components. CuiK kernel runs in privileged mode, all else in unpriviliged mode.
  • Protected objects are the unit of isolation (process), with restricted privileges, and private page tables, heap and stack. Implemented in C++ using inheritance.
  • Single address space OS with different access permissions for different components (protected objects).
  • Server state management: SSR created per client-server binding, destroyed during unbinding, and mapped using PTEs. Service is restarted if an internal exception is raised. Can access SSRs during recover to recreate internal state, and check SSRs for consistency. If multiple restart-retry attempts fail, need to escalate the exception to the client. Might sacrifice one client (e.g. that gave a malformed request) to keep the rest of the system running.
  • Gives all four necessary properties for restart-recovery of stateful services.
  • Periodic timer manager provides notifications to clients. In SSR, store timer period and start time. Server stores list of pending notifications. On restart, server recreates its queue by looking at all of the SSRs.
  • Network service based on LWIP, converted to C++. All failed assertions (checking data structure consistency) throw exceptions, and state management code modified to use SSRs. Each socket object is a client, and the server is a TCP service protected objects. After crash, the IP configuration is reloaded from configuration: in the worst case you might miss a couple of packets, but TCP will obviously recover from that.
  • Implemented for mobile devices on ARM (due to cellphone company funding). Run on TI hardware, and QEMU. Did fault injection using QEMU, with memory aborts and register bit-flips. System is recovered if it can run processes and access its file system. Also recovered if at most one client was affected by the recovery. Get over 87% recovery. 100% recovery for memory aborts, but less for register bit-flips with error propagation.
  • Performance evaluation, measured number of additional instructions (microbenchmark on a protected call) and the time overhead. ~4000 extra instructions and 200–400us time overhead.
  • Memory overhead due to additional page tables per protected objects, but memory is cheap (could use Mondriaan memory protection).
  • CuriOS built upon ChoicesOS code base. The services each required 50–100 extra lines of code.
  • Focus was on error confinement and reducing error propagation. Error detection was orthogonal, and a better scheme would give better recovery.
  • Question: how does exception escalation work? Only propagated to a single client, which is allowed to fail.
  • Question: is it always safe to reuse client data from before restart, after the restart? Any inconsistency should be resolved by the server on restart, when it has access to all the SSRs.
  • Question: SSR management is implemented as a kernel service, but could it be a user-level service, and how would you handle failures of the SSR manager? Assume the SSR manager is a small service, so it won’t fail (and hence it would be fatal). Could be implemented at a user-level, as it is just a mapping store.
  • Question: how many of the evaluated bit-flips ended up corrupting memory? They ended up causing bad memory accesses, which is how they were detected. Some of the bit-flips ended up being no-ops. About 70% of the bit-flips don’t manifest as any error. In the evaluation, only the visible ones were counted.
  • Question: type-safe OS would not help in many of these cases (myth: busted). Is it overly weak to rely on a component detecting its own faults? Just using a fault detector that we got for free. Should the fault detector be isolated itself? Yes, it could be. Detection is basically an orthogonal problem, and we’re only talking about recovery.

Redline: First Class Support for Interactivity in Commodity Operating Systems

  • Want better responsiveness. Applications are increasingly interactive (GUIs, multimedia). However, if you have a concurrent kernel compilation while watching a video, you get massive jitter in the frames per second. Redline only drops a couple of frames, and only occasionally.
  • Problem is resource management. Resource managers don’t take response time into account. e.g. LRU page replacement is bad for an interactive process, which doesn’t use memory often. Also CPU, memory and I/O contention.
  • Interactive systems are between real-time systems and general purpose systems. Some slow-down can be tolerated, as long as the system reacts quickly. Relatively low performance guarantee but better resource isolation (than a general purpose system).
  • Want to maintain system responsiveness even when there is heavy overload.
  • Redline does this by coordinating multiple resources (CPU, memory and disk I/O) with a lightweight specification of which processes are interactive. Uses adaptive admission and load control, using recent history as feedback.
  • A specification first passes through admission control, and then hits the CPU scheduler, I/O management and memory management. Feedback then from these units through the load manager.
  • Specification management by associating binaries with a few parameters. A single mouse click can touch many processes (X, kernel threads, daemons, user-selected application). Anything that might directly or indirectly impact the response time is included in Redline’s reckoning. A specification is a pathname, a binary interactive type, a CPU reservation (ratio of time reserved), memory protection period, disk I/O priority and inheritable/revocable flag. Revocable means that an interactive application can be made best effort under load.
  • Currently, CPU bandwidth allocated in a relative manner. If you launch lots of processes, priority boost becomes less useful. Redline allows reservation of CPU bandwidth, then uses a earliest deadline first scheduler. Interactive tasks go into EDF and proportional-share runqueues. So after the reservation is used, you can still use proportional sharing. EDF requests are processed first, before proportional shares.
  • Redline tries to be permissive, accepting the task first, and solving overload when necessary. The admission test is permissive, based on actual load and specifications. Overload is managed by revocation (which can be reversed under better conditions).
  • In standard virtual memory, interactive apps are vulnerable because they don’t touch pages very fast, so they are vulnerable to LRU, so the OS will shrink the working set size, leading to more pages being evicted, more page faults and therefore an even lower memory reference speed: a vicious circle.
  • Therefore Redline protects the working set size of interactive applications. Pages are protected for some period, by default 30 minutes. If the working set can’t be held in memory, revoke the interactivity.
  • Applications that occasionally allocate memory will run into page reclamation, affecting response time, so a small reservation is made for applications that need this: fast small memory requests (but can’t take too much). If best-effort applications request memory too much, a speed bump is inserted, to slow memory reference.
  • Disk I/O management stretches through page cache, journalling file system and the block device layer. The page cache caches dirty data that goes to disk. A journalling file system writes data first, followed by metadata. If an editor saves a small file, it will block because the pages must be laundered; fsync() will give a long delay before committing (20–30s). Solution is to keep different dirty thresholds for each type of process. Best-effort tasks must launder pages earlier which prevents large compound transactions.
  • Evaluated a prototype in Linux 2.6.20.5 with CFS patch. Specifications for 100+ applications, and 70+ kernel threads.
  • Looked at a fork-bomb. Both a best-effort or interactive fork-bomb will barely introduce jitter into a running video. Linux CFS will barely show any frames.
  • Same for a malloc-bomb: Redline doesn’t do quite so well (still much better than CFS), and shows a couple of large jitters, lasting a couple of seconds each (worse when the bomber is interactive).
  • Also for heavy direct I/O, Redline does very well.
  • Question: is the application at the right granularity to address this: some applications might have both background (best-effort) and foreground (interactive) tasks? These applications would be split into threads, so could do this at the thread level.
  • Question: what is the effect on the end-user of rejecting requests at the specification level? Will the computer stop doing what you want it to do? Or should I restart my laptop to get it back to a usable state? Redline never kills anything. It’ll never be worse than a normal Linux system. No tasks rejected, but it might not get the guarantee that you hoped for.
  • Question: why were different specifications used for each resource when in a unified model, you would expect to use the same thing for all kinds of resource? Different kinds of resources need different kinds of reservations. Redline attempts to have sensible defaults (memory protection and I/O was not varied in the shown experiments).
  • Question: though a mouse click touches many processes, how does this priority propagate through the system? init is specified as an interactive application, allowing all applications to be interactive for a short period of time. So after a process has forked, you don’t know what it will try to execute. So it gets to be interactive for a millisecond. When the exec is made, the specification is consulted.
  • Question: lots of opportunities to starve best-effort processes, so how would this compare to the Real-Time Linux scheduler (since CFS is supposed to be more fair)? CFS is based on fair queuing, but doesn’t guarantee how much resource will be received. So we need some way of isolating this: some starvation might be necessary.

Monitoring

Network Imprecision: A New Consistency Metric for Scalable Monitoring

  • Challenge is how to safeguard accuracy during network failures. Monitoring is important.
  • Motivating example: detecting DDoS attacks from PlanetLab: want to scale to a large number of nodes and high data volume. Want to detect attacks in real-time. And need accuracy despite failures: the key challenge addressed by this talk.
  • Example: computing traffic rate to each destination IP from all PL nodes. Half of the reports deviate by >30% from the true value. 20% of the reports deviate by >65%.
  • Network imprecision bounds the inaccuracy/uncertainty due to failures. Converts best-effort results into reliable results. It will tell applications whether or not to trust the accuracy of the result. By quantifying the system stability, it’s possible to improve accuracy by 5–10x. Implemented efficiently even with active probing of nodes, by explointing overlay symmetry, making it O(N).
  • Current techniques for scalable monitoring. Aggregation trees spanning all nodes in the system (using DHTs) can be used to summarise information. If an internal node fails, the entire subtree is lost; reconfiguration can cause double-counting. So improving scalability risks failure amplification.
  • Or could do arithmetic filtering. But if a subtree doesn’t send updates, is this because it hasn’t significantly changed, or it’s suddenly become unreachable?
  • Or could do temporal batching, by collecting individual updates at internal nodes, and combining them into a single report. But a short disruption can lead to a large blocking period across a large part of the tree.
  • Want to find out the top 100 destination IPs receiving highest traffic from PlanetLab. As the batching interval increases from 15–60 seconds, the load reduces by 5x, but there are diminishing returns. As the arithmetic filtering (deviation from average) becomes more liberal, there are 10–30x reductions in the probing bandwidth.
  • These scalability techniques imperil accuracy.
  • Do we face the CAP dilemma? Need to accept that the system is unreliable, due to node failures, link failures and overlay reconfigurations. Instead, quantify system stability. If the system is stable (all nodes are up), we can trust the monitoring system absolutely. Otherwise, the result should be distrusted. But in large-scale systems, stability is not binary.
  • Many sources of instability that manifest themselves as missing/delayed updates, and double-counted updates.
  • Missing/delayed inputs: quantify N_reachable (lower bound on number of nodes whose recent inputs are guaranteed to be in the result) and N_all (estimated number of nodes in the system). So now we know how much of the answer comprises stale inputs.
  • Double-counted updates. Define N_dup, an upper bound on the number of nodes whose inputs may be double counted. So now we know how much of the answer is impacted by overlay reconfigurations.
  • Together, these metrics are useful to expose the impact of disruptions on monitoring accuracy.
  • NI is application-independent, inexpensive (efficient to compute) and flexible: i.e. it is a mechanism. Applications can set the policy themselves (by filtering inconsistent results, performing redundant aggregation or on-demand reaggregation). No one-size-fits-all.
  • By applying NI-based filtering, can get 80% of reports to have <15% error. Do this by choosing results where NI is less than a particular value. NI is the proportion of unreachable nodes plus the proportion of duplicated results. So the technique here is to discard results when imprecision is high.
  • Another technique is redundant aggregation, using e.g. k trees, and then pick the best result. One of the trees is likely to route round disruption and hence give a more precise result. Can measure the value of k which gives the best improvement in NI (k = 4 gives a 5x improvement in accuracy). By combining the two techniques (filtering and k-aggregation), can get a 10x improvement in accuracy.
  • Implementing NI is simple: just a count aggregate. However it is difficult to implement these efficiently. Requires active probing of each edge in each tree, which would naïvely require a lot of messages.
  • Insight is that DHT-trees form a butterfly network. Common calculations can be reused across different trees, which reduces load from O(N * d) messages/node to O(d * log N) messages per node. For N = 1024, this requires about 5 messages/node/second.
  • Question: an extension to quality of information from the DB community. Wouldn’t it be nice if I could ask a monitoring system for a particular level of accuracy or NI or whatever, and could you construct me a system that gives me that level of accuracy? The approach has been to separate the measurement mechanism from the policy that sets the accuracy bounds, but you can’t guarantee an accuracy bound due to the CAP problem.
  • Question: looks like aggregation techniques from sensor-networks, so why not look at order- and duplicate-insensitive aggregation functions from that literature? These are complementary because in principle these try to minimise the impact of disruptions on accuracy. They are a policy that can be applied in this framework.

Lightweight, High-Resolution Monitoring for Troubleshooting Production Systems

  • System called Chopstix. Motivation is that there will always be bugs in production systems, aggravated by increases in system complexity, despite analysis and testing frameworks (tools are far from perfect). Need to be able to deal with them when they do happen.
  • There are easy and hard bugs. Easy bugs come with a precise description of the circumstances in which they happen. (e.g. trying to do a ping and finding that an address family is not supported by protocol.) The hard kind would be when the ping time is all over the place.
  • Hard bugs are hard to characterise, reproduce, trace to a root cause (ambiguous or temporally removed from the root cause) or intermittent and unpredictable.
  • Example, the kernel crashing intermittently with an out-of-memory bug on PlanetLab. Or the kernel freezing every 1–7 days. Or ping latencies increasing dramatically.
  • Medical analogy: easy problems are single, localised injuries, while a hard one might be a vague sense of malaise.
  • Chopstix treats systems as patients (using vital signs, symptoms and samples), and correlate all this information to pronounce a diagnosis. Monitor the whole system (holistic approach), continuously.
  • Strategy is to monitor low-level events (scheduling, I/O, syscalls, memory allocation, cache misses, etc.) which are omnipresent and capture changes in system behaviour. The magnitudes of these events are called “vital signs”. A deviation is a “symptom”.
  • Obviously can’t log all of these events, so could do uniform sampling, which is biased in favour of high-frequency events. So use frequency-dependent sampling, which uses a sketch (approximation of the frequency distribution of a set of events). Sampling rate is a function of the instantaneous frequency (in this case, logarithm of the frequency).
  • On an event trigger, extract a signature of the event, hash it, and update the sketch (which approximates the instantaneous frequency). If a sampling decision is made, collect the sample, and perform logging. Everything apart from extracting the detailed information is lightweight. This is efficient because of the principle of locality (for e.g. syscall used, it’s likely to be the same as the last one). At the end of an epoch (60s), store the log on disk, and update the sketch.
  • Chopstix server runs a polling process, and a time-series DB that enables monitoring via a GUI.
  • Event set is fine-grained, but the overhead is 1% of CPU utilisation and <50MB/day of disk consumption on each node.
  • Can drill down to interesting parts of the vital sign graph, which lets you see what is running, including processes, EIPs, stack traces, etc.
  • Might be able to use intuition (e.g. high CPU use means busy loop) or better use a library of diagnosis rules which combines different vital signs (e.g. low CPU, high scheduling delay, high system CPU might mean kernel bottleneck).
  • Example of how Chopstix tracked down an elusive bug. Some nodes would freeze every few days (or not), with no information on the console, no printk messages from KDB, stalled SSH sessions and heavy I/O reports from vmstat.
  • Chopstix said blocking and I/O vital signs shot up. Looking at stack traces revealed that journalling daemon was trying to commit a transaction at the same time as many transactions were being started. Thought this might be a deadlock. However, I/O performance was stable and both resource-blocking and I/O vital signs showed drops back to a normal, low value. So this behaviour might not be responsible for the stalls and the freeze.
  • But the spikes in the scheduling delay, coincident with spikes in kernel CPU utilisation mean that a loop in the scheduler was actually responsible for the freezes.
  • Implemented as a library for generating events (i.e. making additional vital signs), an efficient sketch data structure with adaptive sampling, various optimisations for collecting large amounts of event data (e.g. stack traces, etc.), and a time-series database in 4000 lines of OCaml.
  • Resolved a number of PlanetLab bugs.
  • Evaluated performance overhead using lmbench microbenchmark. The slowdown is between 0% and 2.6% (for getpid). Also with kernel compile and apache macrobenchmarks, which show almost no overhead in the benchmarks (much better than oprofile, and managed to collect more data).
  • Probability of error is a function of the distribution of counters in a sketch. By varying the size of the sketch, the probability of error is between 0.0001 and 0.00001%.
  • Question: compare work with DCPI from SOSP 1997? Similar to oprofile, which this beats.
  • Question: is it possible to capture the thinking of a systems guru looking at this output by data mining? Various knowledge-base systems for mining source code, and they’re not perfect: you really need real intelligence to analyse these data.
  • Question: useful for developer as a kernel hacker, but how could it be extended to abstract the vital signs to a higher level and give signals of more abstract health, so that a system could say, “I’m sick, I need to see a doctor”? A lot of intelligence could be coded into rules, and the rule database could grow over time (to give a SysMD, comparable with WebMD) that allows diagnosis by less experienced people.
  • Question: how scalable can this be made if you want hundreds or thousands of metrics? Haven’t carried out these experiments. The <1% utilisation is between 0.1 and 1%, so there should be more room to squeeze in other metrics.
  • Question: in AI in the 1970’s and 1980’s medical diagnosis with AI was prevalent, but there was an explosion after 20 or so rules, so there was a move to other systems such as Bayesian filtering. One of the reviewers was excited about rule-based systems, and wanted it cited, but it was hard to find. Maybe that should tell us something.

Automating Network Application Dependency Discovery: Experiences, Limitations, and New Solutions

  • Enterprise networks are attracting research interest. A bunch of clients, a server farm, router, firewall and connection to the IP backbone. Various applications run on these networks, such as VoIP, which may be critical to the business. In a large network, there are usually thousands of applications running simultaneously, thousands of people are doing IT support and lots of money is spent thereon.
  • Management is complicated because talking to single server is rarely sufficient to implement an application (e.g. MS Office Communicator, a VoIP/messenger app, uses DNS, Kerberos, VoIP, Director and many back-end (file, SQL) servers), so the correct functioning of an application depends on the correct functioning of all these servers.
  • Extracting dependency information is hard. Applications are heterogeneous, in terms of functionality and deployment. The knowledge of these dependencies is distributed across layers and locations. Also, applications continuously evolve, adding new services periodically, and reconfiguring/consolidating others.
  • e.g. if a Kerberos service is down, you can no longer use Communicator, but you don’t know why this is. You would typically check the Communicator backend servers, and it might not be obviously to go to the Kerberos one. Dependency information would accelerate and simplify the support process.
  • Current solutions are based on human knowledge and understanding of the system and its dependencies. This is expensive, error-prone and difficult to keep up-to-date. Want instead to automate this.
  • Sherlock and eXpose (MS, SIGCOMM papers) are related, based on co-occurrent dependency discovery. Also Project 5 and WAP5 (HP Labs) extract execution causality paths.
  • Introduce a new technique to discover service dependencies based on delay distribution. Identify the limits of this based on temporal analysis. Evaluated on MS’s enterprise network (first thorough evaluation).
  • Design goals: a generic solution for many applications. So do passive packet sniffing. Unintrusive way of obtaining data, and only parse TCP/IP headers (no application-specific knowledge). Minimise false dependencies without losing true dependencies, but it is hard to recover missing true dependencies. So minimise the filtering of false dependencies.
  • System called Orion. The time delay between dependent services reflects the typical processing and network delay. So identify these.
  • Identify service based on IP address, port and protocol. This leads to too many service pairs (so ignore ephemeral ports). Also problem is that dependencies exist between application messages, so only rely on TCP/IP on headers and aggregate packets into flow. This reduces bias introduced by long flows, and reduces the number of pairs.
  • What about a lack of samples? Orion needs a fair number of samples to infer dependencies. The solution is aggregation, so clients have similar dependencies, applications may be hosted on a cluster of servers, and the same service may be hosted on several ports. Can aggregate these.
  • Use FFT-based filtering to remove noise in the delay distributions.
  • Deployed Orion in MS enterprise network. Looked at Exchange, Office Communicator, Source Depot (like CVS), Distributed File System (software binaries held here) and Web (intranet sites).
  • Evaluation criteria, want to see missed dependencies (false negatives), incorrectly-inferred dependencies (false positives) and reduction ratio. Orion has no false negatives, and far fewer false positives than Sherlock and eXpose (though all non-zero).
  • Showed an example of 11 dependencies for MS Office Communicator.
  • As we see more flows, the false positives increase, but false negatives decrease. True positives eventually converge to the correct value.
  • Limitations: Orion requires training, isn’t applicable to P2P applications, and may miss certain kinds of interactions (like periodic backup tasks). It may include false positives.
  • Temporal analysis is limited (as no application-specific knowledge), but false positives can be made manageable.
  • Question: why did filtering induce a small peak in the delay distribution at the in the rightmost bins? This was an artifact of the filtering, but doesn’t affect the result.
  • Question: in a data centre, may have services dependent on services and so on, so a client action may go through lots of servers indirectly, so is the resolution enough to identify this without deep packet inspection? Could definitely do deep packet inspection, but temporal analysis can be made usable.
  • Question: with the prevalance of virtualisation technology, services could be migrated between machines, so does the technique take that into account? Service migration is a potential application of this information: could migrate dependent servers together.

Work-in-Progress Reports

  • Multikernel: an architecture for scalable multi-core operating systems. Multi-core scalability of OSs is a problem, because of locking and shared data. The interconnect on large-scale multi-core systems looks a bit like a network, so use distributed systems ideas to make things scale. Traditional approach is reduce lock granularity and partitioning state; blocking operations are the norm. In a multikernel, have no sharing by default (everything either partitioned or replicated), and cores run an agreement protocol, based on message passing between cores. Long-running operations become split-phase, and sharing is re-introduced as an optimisation between some cores (e.g. betwen cores that share a cache). In a multikernel, the kernel is replaced with a per-core CPU driver, and a monitor on each core that run an agreement protocol. So if an application wants to globally unmap some memory, it will ask its local monitor to do that, the monitors will communicate among themselves, and will have consensus before the unmap is implemented by the CPU driver. Barrelfish project is looking at how to make OSs scale to multi-core and with heterogeneous channel. Might give the opportunity to handle partial failures (through fault tolerance algorithms).
  • Transcendent memory. Memory is cheap, but should it be wasted? Many petabytes of memory across the world are sitting idle and wasted, but it is costing us money in power, cooling, etc. Virtualisation has shown us how to improve CPU utilisation, but we have still treated memory as cheap. So revoke ownership of memory from VMs that hoard it, and better utilise chunks left fallow by fragmentation. Want to improve ballooning and look at better ways to share money between VMs. We know that unused memory is used as a page cache, so can we improve cache management? Sometimes the hypervisor has memory unassigned to any VM. Solution is transcendent memory. Collect all memory in the system into a pool, and provide a PV interface to that to give indirect access strictly controlled by the hypervisor. Requires changes to or paravirtualisation of the OS. Can use the memory as a second-chance cache, a fast swap device (working today), or inter-domain shared memory, or a filesystem cache. If it works in a VMM, could it be used in a single OS, e.g. for autocompression of transcendent memory.
  • Towards Less Downtime of Commodity Operating Systems’ Reboot using Virtualization. OS reboots are essential (for patches etc.), even when something like iTunes is updated. The downtime is long, however. Solution is to use a VMM-level approach, applicable to commodity OSs without modifying the kernels. Uses snapshot technology, which restores the OS execution state from an image taken in advance. Restoring a snapshot is 6x faster than an OS reboot. Using VirtualBox at present. On reboot, clone the target VM and reboot the clone in the background. Once the clone finishes booting up, take a snapshot of the old OS. Then copy the state of the old OS into the rebooted clone. Provides a much shorter downtime.
  • Texen: Virtualization for HTM-Aware Operating Systems. TM is a simpler programming than locks, and they believe OSs will be using HTM in future. Solaris 10 is rumoured to be using it, and there was the SOSP 2007 work on TxLinux. Problem: a VMM will violate the assumptions in HTM because they deschedule virtual CPUs. Opportunity: paravirtualisation is an opportunity because of the uniform I/O model they impose. HTM is a replacement for locks: speculate that it is safe for critical sections to execute concurrently and detect conflicts in hardware. If a VCPU gets descheduled, it might not be possible to detect a conflict. But since transactions and I/O don’t mix well, we can extend the ISA to deal with I/O in critical sections. TeXen is TxLinux + Xen. Uses HTM for synchronisation.
  • SnowFlock. How do we instantiate VMs to deal with flash crowds in the popularity of a service running in the cloud? Takes minutes to start up a VMs, by which time you’ve missed the crowd. SnowFlock can clone VMs very quickly (in less than a second) and it imposes negligible runtime overhead. And it makes it easier to program because there is an API for forking new VMs. Available as free and open-source software.
  • Toward Differentiated Services for Data Centers. Why do we care about differentiated services? This is a hot topic. Cloud computing should offer availability, reliability, durability and performance. We think that replication is the key to provide differentiated services. Targeting application developers, not end users. Objective is to build a read/write only storage system that provides different availability to users with better server resource utilisation. Want to handle dynamic and heterogeneous hardware and software. What is a fine-grained approach: let client pick number of replicas, replication algorithm, network topology, etc. A coarse-grained approach is differentiated replication, which uses a reputation system for nodes (based on past behaviour) to choose them in the system in order to match the user’s requirements. Implemented a prototype based on Chord/DHash on a cluster of 22 nodes. Looked at synthetic and real traces (including one from Microsoft based on PC failure).
  • TCP Incast Throughput Collapse in Internet Datacenters. Problem that happens a lot in typical datacentre workloads, like the shuffle stage in MapReduce, in web search, and many other scenarios. Requests are broadcast to many nodes, they process for a while, and then they come back simultaneously, causing congestion. No clear TCP or network level solutions. Throughput collapses for even a small number of senders. Look at sequence number and RTT dynamics. In 10-to-1 case there is huge variation and the absolute values become very large. Looking at TCP solutions, like breaking synchronisation, less aggressive exponential backoff, better SRTT and RTO computation, and minor modifications to an OS kernel. Also non-TCP workarounds at application, network and link layers, but they all have limitations. Open to suggestions. (Switches that cap buffers at the order of an RTT might be worth looking at. Could just add more buffer space.)
  • Gridmix: A Toolset for Benchmarking Hadoop. Motivation is to find useful benchmarks for keeping track of Hadoop performance, and track how this changes across releases and different deployments (clusters). Observe that different people have studied the performance from different perspectives. Need to understand the performance impact of a patch. Also need to validate the performance of new clusters. Come up with five different types of MapReduce jobs of different sizes and different types, which try to push a Hadoop cluster to different extremes. One use case at Yahoo! has been profiling the change as more releases have come out: show that the execution time has decreased with each new version (from 108 minutes on 0.15.3 to 46 minutes on 0.18.0). Also used for performance engineering.
  • CrystalBall: Predicting and Preventing Inconsistencies in Deployed Distributed Systems. Reliable distributed systems are notoriously difficult to build: hard to get out all the errors that arise from realistic environments (asynchronous networks, node failures, race conditions). Even with a provably correct algorithm, programmer errors still happen. Can lead to DoS, data loss, etc. These are typically violations of safety properties. Use increased computing power to have nodes predict the consequences of its actions in the background. Run a state exploration algorithm on a consistent snapshot of the live neighbourhood, to predict what actions could lead to violations of safety properties. Compared to state of the art in distributed debugging: classic model checking is intrinsically limited to how deep it can go; as against log- or replay-based/live predicate checking. CrystalBall is deep online debugging. It can also do execution steering to alter predicted-to-fail nodes’ behaviours. Results in online debugging (found 13 bugs with small performance impact).
  • Co-scheduling of I/O and Computation in Commodity Based Storage Clusters. Looking at Hadoop scheduler, and how to schedule computation near different data sets. Map tasks currently scheduled based on data locality of a single chunk. Concerned with scientific applications, where there are dependence between chunks: a Map task will have to work with chunks stored across multiple nodes. Applications include bioinformatics, which assume each string is as long as a chunk. Current solution is to pre-process data to move all chunks touched by a single Map task to the same storage location (with a MapReduce task). Then do a second MapReduce job to analyse and process data. This might have limitations for productivity and performance. Want to introduce a scheduler that will do this for us automatically. Will quantify local and remote I/O, and work out where best to schedule a Map task.
  • An I/O Serializing Disk. Problem is that random writes are becoming more dominant. But serial writes are getting much faster. mmap isn’t scalable due to the dirty-page write-back flushing thread schedules effectively random writes (even though the elevator is used), especially if datasets are larger than RAM. Want that to go faster, with a log-structured file-system-style approach. Get better locality for large, indexed files. Solution called Honor. Sorting + I/O Redirecting + Logging. Logging is a simplified write ordering technique, used in a transactional file system (Valor). Before a process dirties a page, it registers undo and redo pages in the transactional file system, and then use a sorting disk which serialises disk I/O and stores metadata about the off-loaded writes. Reads are redirected to the sorting disk to get most recent update. Implemented as one large, virtual block device. Eventually the writes are written back to the underlying disks.
  • Zeno: Eventually Consistent Byzantine Fault Tolerance. Designed for storage backends used today in datacentres, like GFS and Dynamo (or Facebook’s Cassandra, or PNUTS). Rely on replication of data and computation to get fault tolerance and availability. Get geographical diversity as well. Favour availability over consistency at present, and work with weaker consistency semantics. The crash fault model is not sufficient: can have state corruption… database bugs are >50% non-crash faults. The Byzantine fault moel is a better candidate. Requires 3f + 1 replicas to tolerate f faults, and existing protocols favour strong consistency. Existing protocols do not solve the availability requirements of storage backends. Pick consistency level on a per-operation basis: linearisable consistency (strong) or eventual consistency (weak). Example of a shopping cart with a network partition: reconcile state by taking union.
  • Zzyzx: Scalable fault tolerance through Byzantine Locking. Goal is to build services that can handle arbitrary faults. Start seeing greater asynchrony and a variety of faults. The new technique is Byzantine locking, which gets higher throughput and lower latency than prior protocols, and hence greater scalability. Challenge is maintaining a consistent order of operations. Normally lots of communication, several message delays, and lots of cryptography. In Zzyzx, let clients order the operations. The Byzantine lock cordons off a portion of a state machine, and can then issue requests to 2f + 1 replicas with a single roundtrip and you’re done. Not like leasing. Throughput scales better with the number of responsive severs (for f = 1); Zyzzva doesn’t speed up at all, due to a primary server bottleneck. Lets you implement arbitrary distributed services, with better throughput, latency and scalability.
  • Fault Tolerance for Free. FT services are too expensive to program, run and maintain. Multi-* systems are becoming common and cheap. So the difficulty of programming and maintenance is now the real cost. Lagniappe is a programming environment and runtime that targets high-performance, request-processing applications. Maps applications to multi-* hardware. The interface for adapation is the same interface for FT replication. If the mix of tasks changes, need to adapt to run on different machines etc. Lagniappe FT enables the addition of a FT protocol to applications specified this way. Works for request processing applications. Currently building more FT protocols.
  • Device driver reliability. Problem is that they are buggy and cause OS crashes. Driver = device datasheet + OS documentation/headers. Create a driver implementation from these, obviously introducing bugs in the process. All the knowledge of how to program the device is contained in the device and the operating system. If we could represent this in an explicit and formal way, we could automate the process. All we need is a formal spec of the device protocol and the driver/OS protocol. Example of a simple network device with a control and data register. Can combine protocol state machines into a single state machine automatically, and prune this to a correct implementation. Finally simple to translate this into C code for the actual driver. Actually done this for a USB-to-Ethernet controller driver. The device vendor has to provide a formal specification.
  • CitySense: An Urban-Scale Open Sensor Testbed. Building a 100-node wireless sensor testbed deployed on streetlights and rooftops across Cambridge, MA. Monitoring air quality, weather, noise pollution, road traffic, etc. (with various sensors). Intended to be open like PlanetLab so that anybody can use the data and program against it. Nodes are small, single-board computers with 256Mb of RAM, a CF drive, dual 802.11 radios with high powered 900MHz radio and standard 802.11a/b/g. Applications include public health, urban sensing, homeland security, novel distributed systems… open-ended. About 15 nodes deployed between Harvard and BBN, and want to span the gap between them along Concord Avenue. Challenges: wireless mesh doesn’t work! Need disruption tolerant networking. Need slicing and experimental control methods (can’t just run PlanetLab on them). And need automatic failure recovery as can’t go back up streetlights every time the nodes fail.
  • Wifi-Reports: Improving Wireless Network Selection with Collaboration. Problem is how to select between a number of commercial APs (are ports blocked? How is performance?) when we have little information about each. Can we annotate this database with more information. Rely on user-submitted reports. Could do automatic measurements upon connection and provide this information to other clients. Challenges: location privacy (reports should not be linked to users, otherwise users can be tracked) but want to avoid fraud: need 1 report per AP per user. Solution is a new e-cash-like reporting protocol. Also need location context, as performance will depend on the location of the user. Need to estimate different loss regimes with distributed measurements. Did a measurement study of commecial APs at Seattle hotspot locations: there are a huge number of APs, and very large variance in performance. Often the official AP is not the best one (should use the one next door).
  • S3: Securing Sensitive Stuff. Need a TARP program (securiTy Against Rogue Pilferers) program for the virtual economy. Need to prevent data theft. Want to regulate flow of sensitive data according to high level security policy. Look at information flow control systems, which ensure confidentiality and integrity, and limit covert channels. But they require a new OS, new applications and sometimes new hardware. Simple goal: prevent data being stolen from a network. So just need to ensure that data is not able to leave the network. Only need to interpose at the exit points (network access, peripherals, etc.). Could therefore track sensitive data flow below the OS without modifying the OS itself. S3 is a network of hypervisors to track and enforce sensitive data flow in an enterprise. User OS is diskless, with a filesystem maintainted in Dom0. Host OS stores the sensitive files. Users tag sensitive files with labels and policies. Shadow memory is used to track world-level taint. Sensitive data flow and computation is tracked using taint-tracking techniques. Use HW virtualisation support, and use speculative user OS execution.
  • Communities as a first-class abstraction for information sharing. Online social networks are a popular way for people to connect and share content. Increasing variety and volume of content being shared on these sites. First challenge is of privacy and access control. Second is the issue of relevance — what is relevant to a user? Think communities can help us here: groups of densely connected users. Members of a community often share interest and enjoy mutual trust. In practice, we can automatically infer these from the social graph. Today, access control is limited to friends (or subsets thereof) or the entire world. Communities are a natural middle ground, enabling sharing beyond just friends. As for relevance, search facilities return aggregated global option. Communities represent shared interest and can natually be leveraged to improve relevance. PeerSpective is a social network-based web search tool. Indexes browsed pages, and web search collects results from Google and from friends. Results aggregated by community, to allow community-based sharing. And also enables finding more relevant results.

HotDep 2008

Wednesday, December 10th, 2008

To kick off my blog posts from OSDI, I’ll start with summaries from HotDep 2008. The order of the day is system dependability, some controversial papers have been included, and active audience participation has been encouraged. My notes presented below are attempts at an objective summary for each talk, and are fairly raw and unedited. If you spot any errors, let me know in the comments!

Software Reliability

Reverse-Engineering Drivers for Safety and Portability

  • Drivers aren’t portable. New OSs aren’t always backwards compatible. And drivers are a safety nightmare.
  • Techniques proposed for portability, isolation, static analysis, model checking, driver languages.
  • Idea is to take a log of hardware interaction (by recording the protocol between the device and the driver) and the original binaries, analyse them, and extract the protocol into a safe device driver. Uses driver templates (OS-specific boilerplate), fleshed out with the protocol.
  • View a driver as a state machine. State is heap data, local variables and CPU registers. Look at effect of a syscall on the state.
  • Use QEMU to snoop communication between the device and the driver. Present prototype only works on devices that are emulated by QEMU. 2–3x overhead during reverse engineering.
  • Generate a trace log of EIP, and writes to memory locations: associate this with the disassembled binary. Work out which instructions are relevant for controlling the device (i.e. that modify hardware state). These are then converted to C code, and can be combined with OS-specific boilerplate template. GIves you source code which can be compiled into a kernel module and loaded as usual.
  • Tested by porting the NE2K driver from Windows to Linux, and from Linux source to safe Linux source. Generated a 500kb trace by sending, receiving and setting the MAC address, which covered 50% of the basic blocks in the original driver. No performance overhead in the synthesised driver: even though it is incomplete, it was still usable.
  • Required half a day to specialise the driver template manually, but this could be automated. Much harder to do when porting between operating systems, because of API difference. Errors occur in manual translation.
  • Can formally prove the correctness of synthesised drivers: the driver template will not crash the OS, and the specialisations are safe by construction. Can also prove liveness.
  • Some legal concerns about reverse engineering (specifically decompiling). Can mitigate these by including no original binary code in the synthesised driver.
  • Want to trace real hardware, deal with real-time constraints, handle OS API mismatches better.
  • Aim is to get hardware vendors to provide formal specifications.

“Otherworld”: Giving Applications a Chance to Survive OS Kernel Crashes

  • Cannot avoid software faults. 30–50% of TCO is spent on recovering from faults. Major vendors are now looking at dependability rather than performance.
  • Concentrate on kernel faults because these are hardest to recover from. Can minimise consequences by: checkpointing (expensive), hardware redundancy (expensive and complex) or microkernels/SFI (not widely available).
  • Afterwards, the kernel state is corrupted and cannot be trusted. This affects all running applications, but rarely corrupts application state: only 18% of cases. Application specific recovery reduces this to 1–2% of cases. So give the application a chance to restore its data.
  • Can we reboot the kernel without destroying everything on the system? We should be able to. Kernel is logically well isolated from other components.
  • Challenge: kernel contains critical data belonging to applications. e.g. the page tables, file descriptors, sockets, etc. Need a component that can manage these after a fault.
  • Idea is to have two kernels: a main kernel and a crash kernel. Main kernel is active by default, but transfers control to crash kernel on fault. Crash kernel sucks information from the main kernel when it is initialised, and takes over the system. Processes need to register a crash procedure which is analogous to an application exception handler: called when the kernel crashes.
  • On crash, we have an initialised, undamaged kernel, but the main kernel data is preserved and accessible to the main kernel.
  • Main and crash kernels are the same version of Linux. Hardware state is not currently preserved. Crashes detected are limited to those detectable by the OS itself.
  • The crash kernel recovers state for each process (could use checksums to detect corruption here), creates a new process descriptor, copies the address space from the main kernel and restores resources belonging to the process.
  • Crash kernel calls the resurrection function in the application to notify it that a kernel crash has occurred.
  • Finally, crash kernel reclaims the resources occupied by the main kernel, and loads a new crash kernel. Now we again have a fully functional system.
  • Automatic resurrection of: application physical memory, pages swapped out, mmap’ed files and open files. Currently don’t restore: network connections, pipes, framebuffer and threads.
  • Benefits: no runtime speed overhead, fixed small memory overhead, applies to monolithic and microkernel OSs, requires no special hardware, and restores (an unlimited amount of) the most recent state.
  • Evaluated on Linux 2.6.18, using kdump to load crash kernel. Tested with four classes of application (an editor, database serer, web server and Berkeley Lab Checkpoint/Restart). Considered kernel crashes triggered by BUG_ON(). Faults are injected by triggering the failed assertions (not corrupting memory).
  • Added 25 lines of code as a crash procedure to the JOE text editor. Simply saves all opened documents and restarts the editor with the same open documents. Totally transparent to the editor’s user.
  • Added 50-line crash procedure to MySQL that retrieves in-memory tables, saves them to disk and restarts the server. Also 25 lines to the start-up code to retrieve this saved data. Gives reliable in-memory databases, which are 1.5–140 times quicker than disks.
  • Added 1500 lines of code to the Linux kernel.
  • Still need to work on detecting and preventing data corruption, keeping applications running without interruptions, resurrecting other resources and resurrecting groups of interacting applications.

Deprogramming Large Software Systems

  • We already have programming and compiling, and decompiling. Can we extract programmer thought by deprogramming?
  • Use case: a programmer has to maintain a system that he/she has to written, which contains hundres of classes. May have access to documentation, but has access to source code. Goal is to understand how it works and be able to fix it.
  • One approach is to recognise patterns in class naming (e.g. NetworkConnection). Or higher-level design patterns (e.g. MVC).
  • Programmer needs a high-level description of the software: i.e. what the designer would say when describing the system. Deprogramming extracts these patterns automatically.
  • Algorithm: 1. Statically extract dependencies from Java bytecode to get dependency graph. 2. Augment graph with dynamic analysis. 3. Analyse the graph to do automatic pattern matching of “ideas”. 4. Generate documentation.
  • Dependency graph: highly expressive abstraction of the code. Nodes are code constructs (classes, methods, fields), and edges are directed relations. Omit details of variable names and sequencing of operations. Should be sufficient to detect design patterns.
  • Dynamic analysis can e.g. show what concrete class is invoked through an abstract interface; track uses of reflection.
  • Patterns are defined in a DSL that allows the specification of subgraphs. Allows the expression of any subgraph and can match a wide range of patterns.
  • Pattern matching is equivalent to subgraph isomorphism problem. Uses the VF2 algorithm with some modifications for type-checking.
  • Can search for anti-patterns and identify targets for refactoring. Potential for auto-refactoring of these cases (not done at present).
  • Can also detect semantic-level copy-paste. Compare the structures of two programs by comparing their dependency graphs.
  • Can generate documentation from this, e.g. UML diagrams from the patterns that exist.
  • Can identify the fingerprint of a particular programmer: he/she will have preferences for particular patterns or combinations of patterns, or modularisations. Could extract this and compare to a database to find the author (e.g. for finding out who wrote malicious code). Assumes everyone is unique, though.
  • At present, don’t infer patterns (but that would be useful). Only identify known patterns.
  • Plan to evaluate the speed of the analyses and the pattern matching. Pattern matching efficiency is on the order of a couple of minutes (for small patterns). Also want to measure precision: no false negatives because algorithm is exact, but some false positives arise from an insufficiently precise definition of the pattern graph.
  • Possibility that anti-patterns do not appear in such consistent forms in the dependency graph.

Redundancy

[Unfortunately, I was presenting in this session, so I didn't take notes on these talks.]

A Spin-Up Saved Is Energy Earned: Achieving Power-Efficient, Erasure-Coded Storage

  • [A nice summary can be found here.]

Spread-Spectrum Computation

  • [Hard to be objective about this one, as I was the first author and presenter, but you can read the paper, and a summary of the presentation here.]

Modeling

Toward Quantifying System Manageability

  • Operations cost 50–80% of IT budget, many billions of dollars a year. Should make manageability a first-order design goal: currently we have too many knobs to tweak. Oracle 10 has 220 initialisation parameters, 1477 tables of system parameters and an 875-page admin manual.
  • Engineering requires measurement. Want to be able to compare systems to one another and assess progress. Today, proposing an objective, quantitative metric that is directly measurable. Not qualitative. It’s easy to show that Oracle 10g is better than IBM DB2 and vice versa, depending on the reviewer.
  • Manageability is a predictor of the amount of human effort and risk of mistake involved in keeping a system operating at a satisfactory level. Management is a set of tasks, a task is a sequence of atomic steps. Task complexity is estimated by the number of atomic steps (intuition is that many steps means many opportunities for mistake).
  • Do tasks include troubleshooting? It’s possible that the majority of management time and effort is spent there. Might be hard to quantify this.
  • Manageability is time period over the sum for each task of the product of the number of steps, task duration and a weight which is based on the frequency that a particular task arises (opportunity for a standardisation body to come in here and suggest these). Non-dimensional measure, defined as being a “Management Unit” (MU). Relies on being able to certify atomicity: perhaps using an online reputation system.
  • Choice taken to exclude many things from the formula: want to keep the manageability metric manageable. Want it to be easy to remember and apply. Principle is to skip aspects whose absence simplifies the metric more than the loss of accuracy (e.g. system size, data volume, using averages instead of distributions).
  • Metric is: objective, directly measurable, simple, intuitive and widely applicable. (Question of whether the set of tasks is really objective, but could be standardised.)
  • Need to come up with lists of tasks that define the management workloads for each class of systems. The list would have a count of the number of times each task would be carried out over a specific period of time (e.g. 3 years): these give you the weights.
  • Example of installing Oracle on Linux. This is a single task, with 3 sub-tasks, all defined in the admin manual.
  • Possible to include MU measurements in business decisions, assuming the cost of an MU (or the MU capacity of e.g. a DBA). (Question of whether you could spend less management time for a less optimised system, and how that would affect the cost trade-off.)
  • Patterson quote: “For better or worse, benchmarks shape a field.” So this is an attempt to shape the field.

Towards Automatic Inference of Task Hierarchies in Complex Systems

  • What do we mean by complex systems? One that is so complex that we can’t reason about it intuitively. So we tend to simplify and model them. Developers can represent a system as a hierarchical task model, which allows the encapsulation of implementation details as high-level tasks: e.g. in map-reduce we care about the map and reduce functions, not invidual message sends and receives (or the details of fault-tolerance, etc.).
  • The work is to try to infer these models with minimal manual assistance from real systems. Tool called Scalpel, evaluated on Apache and the PacificA storage system.
  • Give a number of function calls, want to identify task boundaries, and correctly associate dependencies between tasks.
  • Tracing of calls and the parameters to signal, wait, send and recv. Use R2 in the implementation.
  • Identifying leaf tasks as the smallest unit of work in the task model. Look at synchronisation points in the trace, and reason the dependency only happens on these boundaries.
  • Constructing a causal graph, using the happens-before relation, but want to distinguish between causal dependency and things like mutual exclusion. Use heuristics, like observing queues and event-based code.
  • Want then to identify frequently-occurring patterns in the causal graph, which are then recursively replaced with super nodes. This identification done by using hash functions for fingerprinting or by canonicalising the sub-graph and serialising it deterministically.
  • Case study of a performance bug in PacificA. Discovered an issue during stress tests. Used task-level profiling based on the inferred task model, and took a top-down approach to identifiying the problem. Saw that the committing task could not saturate network bandwidth while its CPU usage remained low. Discovered that the sender was sleeping for a second during the send-packet task, to approximate flow control.

Consistability: Describing Usually Consistent Systems

  • Distributed storage systems implemented as a group of replicas acting as a data store. Ideally the system would behave like a single replica.
  • Consistency semantics define the guarantee provided by the storage system about what values can be read from it, both with and without concurrent operations. Existing notions focus on the worst-case guarantees, but don’t look at how often the worst case might arise.
  • Opportunity exists to improve these notions to include a dependency on the operating conditions (e.g. the number of failures that might be seen), and the probability that these conditions might arise.
  • Consistability: consistency guarantee dependant on operating conditions. Similar notion to performability. Enables the specification of SLAs that quantify provided consistency.
  • Formal model of consistability as a mapping from set of failure scenarios to the set of achieved consistency levels. Consistability is the probability that a given consistency class will be satisfied given the expected distribution of failures.
  • Performability is an open problem. So are transitions: what happens to consistency when you have reads and writes happening in different failure scenarios? Assume that these transitions are rare, otherwise the guarantees can be inaccurate. Also, can we verify SLAs based on consistability? This might require an oracle.

Distributed Systems

Byzantium: Byzantine-Fault-Tolerant Database Replication Providing Snapshot Isolation

  • Database systems are pervasive and used for many applications. But they may exhibit Byzantine faults. Want to build a middleware system that uses an off-the-shelf DBMS and gives BFT.
  • First version aims to be as efficient as possible, using an existing BFT replication engine.
  • Usually talk in terms of ACID properties, but there are actually four ANSI-defined kinds of isolation, based on the set of phenomena that might be observed (e.g. dirty read, non-repeatable read, phantom read). Vendors have added a few more as well.
  • Snapshot isolation: (logically) take a snapshot of the database, run transaction on the snapshot, and, on commit, abort transaction if has a write-write conflict with any other committed transaction since the transaction began (otherwise apply it). Supported by many vendors, easy for read-only transactions, and gives serialisability under common database workloads.
  • BFT replication algorithms exist, but tend to have a high overhead.
  • Don’t want to change the DBMS or applications (hide details in the JDBC driver), and build system on top of PBFT.
  • Goal is to be efficient, but BFT operations are costly. Need to minimise the number of operations that run the BFT algorithm. Transactions help us: BEGIN and COMMIT are BFT operations, but the regular DB operations in the transaction are not.
  • On BEGIN, client chooses one replica as the coordinator, and does a BFT operation to ensure that all replicas have started a transaction. On a read, only the coordinator is invoked. Same for a write. For commit, a list of operations and the results is sent using a BFT operation to all replicas. The coordinator checks if snapshot isolation properties hold; the other replicas execute the operations, and, if the results are different from the logged ones, it aborts; otherwise commit only if snapshot isolation is maintained.
  • Correct replicas compute the same results because they execute in the same snapshot (due to total order on BEGIN and COMMIT). If there is a Byzantine replica, a quorum of correct replicas will still commit. If the coordinator is Byzantine, all correct replicas will abort on obtaining different results.
  • What about Byzantine clients? Just need to do some simple checks: that IDs are valid, that BEGIN occurs before COMMIT/ROLLBACK, and that (at the coordinator) the list of operations is the expected one.
  • Handling ROLLBACK could just do a BFT ROLLBACK operation and make no change to the database, but a Byzantine coordinator might caused the client to observe incorrect results that led to the ROLLBACK decision, so instead run a ROLLBACK in the same way as a COMMIT (which informs the client if it was rolling back due to incorrect results from the coordinator).
  • First database that tolerates Byzantine faults while providing snapshot isolation (previously only serialisability) which gives better concurrency.

Pretty Good Packet Authentication

  • At present, packets can’t be authenticated. It’s possible to spoof the source address of packets, or change the contents of a packet in the network. This causes problems with e.g. spam blacklists, false accusations (of contacting an illegal BitTorrent tracker), unverifiable complaints (of e.g. port scanning) and plausible deniability.
  • There is a spectrum of possible solutions, from weak to strong: status quo, additional ingress filtering, IP traceback, clean-slate designs (AIP), cryptography and biometrics, brain scanner! But these strongest solutions are infeasible, and have privacy issues. Clean-slate designs would be nice, but it’s too expensive to replace all the routers in the network. PGPA is in the middle of the spectrum.
  • Given a packet, a source address and a timestamp, the ISP owning the source address can verify whether the packet was sent at approximately that time. Requires trust in the ISP.
  • PGPA protects user’s privacy. You must have the entire packet to ask a question about it. (And there is sufficient information in a packet that it is difficult to guess what the packet might look like for, e.g., connecting to a given website.)
  • For spam blacklists: the recipient can ask the ISP that a packet appears to come from if it actually came from there (so no point in spoofing the source address).
  • For false accusations: it’s now possible to check if a packet actually came from the accused’s machine (although we don’t know who sent it).
  • For unverifiable complaints: can now verify if the port-scan packets were actually sent.
  • Question of what if the packets come from a hijacked, innocent machine, e.g. via a botnet. So attacks would be launched at the application level, and the IP address is immaterial.
  • PGPA limitations: associates packets with addresses, not users; reveals only that packets were sent but not why; and assumes no collusion between ISPs and users. But very simple, effective against real-world problems, compatible with anonymity, privacy-preserving, straightforward to implement and plausible to deploy.
  • PGPA needs to keep a record of past traffic: a timestamp and a hash. These monitors should be placed on the access link, making deployment easier because the backbone is not changed. Could be at the user’s premises, which is inexpensive and scalable, but makes it possible for the user to destroy the device (and hence evidence). Could be at the ISP, which is easy to deploy, but relies on user trusting the ISP to report correct results. Could be at both, which requires less trust, but more overhead. Device could be tamper-evident.
  • Storage requirement for a DSL connection: 1Mbps upstream fully utilised with 40-byte packets, means 3125 packets/s. Storing a SHA-1 hash and a 32-bit timestamp per packet would need 187Gb/month. Would require a single hard disk per user in the worst case.

Dependable Self-Hosting Distributed Systems Using Constraints

  • Cloud computing infrastructure is becoming more prevalent. These provide different computation and communication resources, with their own pricing policies. More and more applications are getting deployed on these clouds because there is no need to buy physical resources. The scenario is trying to deploy an online service, but how do we select appropriate providers and resources? Want to save money here.
  • One possibility is to deploy a service with a single provider, renting three instances in the US, and two in Europe. As the load from different parts of the world, you could rent additional resources in each market. But there’s a risk of correlated failure (e.g. all the machines in the US go down at once).
  • Another challenge then is how to respond to outage from a particular provider.
  • What if the requirements change? For example, the service could suddenly become big in China (but the service is hosted in the US or Europe). So we need to adapt to changes in application load.
  • The pricing policy might change. For example, a cheaper hosting service might come online in China, but if you just use a single provider then you might miss this benefit.
  • Approach is to bundle the management logic with the application instances. Use P2P-style self-organisation.
  • Need a management runtime that lets application deploy itself, by monitoring load, resources, pricing, etc. It should adapt to changing conditions, acquire and release VMs as needed, and deploy new instances dynamically.
  • Has been seen before in “worms”, intelligent agents, and autonomic systems.
  • Question of whether the system would be trusted with your credit card number: the eponymous constraints will constrain the budget.
  • The addressed challenge is how to specify autonomous application behaviour. For example (PsEPR): 6–10 well-connected nodes, geographically distributed and lightly-loaded (fixed constraints). Aim to maximise CPU cycles, minimise network diameter and reduce budget and migration cost (dynamic constraints). This becomes a constraint optimisation problem, so apply Constraint Logic Programming.
  • Constraint Programming: specify constraints, utility function and cost function (i.e. objective function). Logic Programming: heterogeneous resource information.
  • System called Rhizoma. Nodes running it (each node that a customer is using) form an overlay: a single node acts as coordinator. Need a monitoring system like CoMon in PlanetLab: could be provided by a third party.
  • Initially deployed on PlanetLab: more challenging than cloud providers due to heterogeneity and flakiness. (Soon aim to do this on EC2.) Measured utility over time, in terms of CPU utility over time.
  • Key ideas: distributed systems can be self-hosted, and use CLP to express desired behaviour. This should lead to better expressiveness, optimisation (of performance and cost) and responsiveness (to changes).
  • Open questions: scalability of the CLP solver (use different reasoning algorithms, but don’t waste time on getting optimal solution (which might be more costly than having a suboptimal solution), and just use some heuristic knowledge to speed things up), application usage of CLP (reporting application and platform integration (trying different clouds, and running on personal networks (collection of personally-owned resources)). Need to have applications that are able to request more resources.
  • Question of how to expressive the requirements of the application: other approaches have foundered at this point, will CLP be better? CLP appears to be quite high-level for specifying the utility and cost functions, and the constraints.

EuroSys student travel grant winners announced

Wednesday, November 26th, 2008

We are delighted to announce that the winners of the 2008 EuroSys Student Travel Grant awards are Horatiu Jula from EPFL and Derek Murray from Cambridge University. The applications from these two candidates were outstanding, and it was not an easy task to select the two winners from an extremely strong set of entries. We wish to thank all the applicants for entering the competition, and we hope they enjoyed writing the pieces about their research and their favorite paper from previous OSDI proceedings as much as we enjoyed reading them.

From the winning entries, Horatio chose “Securing Software by Enforcing Data-flow Integrity” (Castro et al, OSDI 2006), calling the technique to prevent software attacks such as buffer overflows “elegant and effective” and contrasting with “XFI: Software Guards for System Address Spaces” (Erlingsson et al, OSDI 2006). Derek selected “MapReduce: Simplified Data Processing on Large Clusters” (Dean & Ghemawat, OSDI 2004) which he says is “particularly interesting because it describes a technique and a system that successfully operates at massive scale”, and he observes that “its publication heralded the start of a trend”. He hopes to see more papers in the future that deal with the challenge of massive-scale data sets, although he would also like “more emphasis on comparative evaluation”.

Horatiu and Derek will be attending the OSDI 2008 conference from December 8th to December 10th, and will report on the conference in this blog. If you cannot attend OSDI, or if you will attend the conference and want to know Horatiu and Derek’s thoughts, stay tuned!