EuroSys
The European Professional Society on Computer Systems

European chapter of the
Special Interest Group on Operating Systems (SIGOPS)
of the Association for Computing Machinery (ACM)
Home Join or renew membership EuroSys for Students EuroSys for Faculty Job Offers Activities Systems Directory Systems Events and Blog Eastern Europe Initiative Member Area Member News Officers and Volunteers Useful links Press releases

OSDI 2008: Day 2

File Systems

SQCK: A Declarative File System Checker

  • File systems are becoming larger and must be reliable, but they still get corrupted (hardware errors, firmware bugs, FS bugs, etc.), and require immediate repair to get out of an inconsistent state (which can end up causing even more damage).
  • Who should repair this? Is journalling the solution to every problem? Write-ahead log only prevents write inconsistencies due to system crashes, but this doesn’t cover all problems. Should the FS repair itself online? There isn’t enough machinery in the FS to do this: all it can do is, e.g. mount itself read-only. So we need a utility, and we have one: fsck. A trusted utility which is the last line of defense for bringing back consistency. XFS says it never needs fsck, but it uses it in the end, due to these other problems. All file systems have their own variety of fsck.
  • Fsck is complex: it has a big task, must read all FS metadata and turn any corrupt image to a consistent image. e.g. is a data-block being shared by two inodes? Complexity arises from them being implemented in C, which is hard to reason about. Ext2 fsck does 150 checks in 16kloc; XFS does 340 checks in 22klock. Looks like hundreds of if-check statements. But fsck is untouchable: you can’t modify it because it is (i) crucial and (ii) changing it might introduce new problems.
  • Are current checkers really reliable? If not, how should we build robust checkers?
  • Analysed e2fsck, which does inconsistent repair (sometimes), making the file system unreadable in some cases. So fsck makes an incorrect repair, making the FS more inconsistent. Sometimes it is consistent but not “correct”, deleting valid directory entries and loses a huge number of files.
  • Lesson: complexity is the enemy of reliability. A big task with a bad design leads to complexity and unreliability. Want a high-level approach to have simplicty. So use a SQL-based declarative language to write checks, which means fewer lines of code. Evaluated the simplicity (150 queries vs. 16kloc) and reliability. Also showed flexibility and reasonable performance.
  • e2fsck must cross-check all ext2 metadata: an indirect pointer must not point to the superblock; a subdir should only be accessible from one directory. Injected a single corruption at a time and tried to analyse ho e2fsck repairs this single corruption. Only corrupted on-disk pointers (make indirect pointer point to superblock; make directory entry point to another directory). Usually a corrupt pointer is simply cleared to zero.
  • Inode has an indirect pointer that should point to an indirect block. But if it points to a superblock. First e2fsck finds that the point is invalid, and zeroes. But if it does point to an indirect block, it has to check the content of the indirect block (checking if any entry points outside the FS range), and garbage pointers are zeroed. At least, this is the ideal.
  • So in e2fsck, if the inode indirect pointer points to the superblock, it will assume that the superblock is a corrupt indirect block and zero some of the entries in the superblock. Then it will check the indirect pointer, find that it is pointing to the superblock, and zero it. The incorrect ordering fatally corrupts the FS.
  • Can also have consistent but incorrect repair. We have information about the .. pointer of directories, so we can identify when two directories point to the same subdirectory, and can resolve this. But e2fsck actually chooses the directory with a lower inode number, and then rewrites the .. pointer. e2fsck doesn’t use all available information.
  • Also policy-inconsistent and insecure problems (in paper).
  • These are not simple implementation bugs.
  • Fsck must do hundreds of checks, and some complex cross-checks, between single and multiple instances of the same and different structures. But all of these must be ordered correctly.
  • In SQCK, use a declarative query language (i.e. SQL) because it makes high-level intent clear. And it is fit for cross-checking massive amounts of information. Simplified e2fsck into 150 queries. Each check/query is easy to understand, and it is flexible, enabling the plugging in and removal of differnt queries.
  • Using SQCK, take a FS image, load all of its metadata into temporary database tables, created fresh when SQCK starts running. e.g. inodeTable, dirEntryTable. Checks and repairs run on the database tables. Finally flush any modifications and delete tables.
  • Example of cross-checking a single instance of a structure: find a block bitmap that is not located within its block group. Basically a range check on whether the block bitmap is within the first and last block of the block group. Simple SELECT … NOT BETWEEN … query.
  • Example of finding false parents of subdirectories. The e2fsck code is wrong here. Join on parents and children where there are two parents with different inodes.
  • Declarative repairs must also be performed. Some are just update queries, where a table is updated with a new value in some field (and a dirty bit is set). But a repair might also be a series of queries, which can be combined with C code, though the actual work is done in the SQL.
  • 150 queries, 1100 lines of SQL statements. The new checker passes hundreds of corruption scenarios. Easy to add new checks and repairs, to get different versions of e2fsck. Introduce some performance optimisations: total runtime is slightly longer (up to 50% more than e2fsck), based on MySQL prototype. Future optimisations include hierarchical checks and concurrent queries.
  • Question: in paper, the MySQL is held in a 0.5GB ramdisk; given an FS of some size, how big need the ramdisk? Half-full terabyte-sized disk needs a 0.5GB metadata DB.
  • Question: what about the complexity of the scanner-loader which has to understand the entire FS? e2fsck also includes 14kloc which do this scanning, not counted in the comparisons.
  • Question: what about ordering the SQL statements to ensure that correct repairs? Haven’t looked at automating the ordering, but it is easier to reimplement different orders.
  • The times include only the times for loading the DB, scanning and performing the checks, assuming correct FS.
  • Question: could flushing introduce errors? e2fsck doesn’t use journalling to introduce repairs, so a crash of e2fsck would cause a (different) inconsistent state. But could make the DB flusher use a journalling approach quite easily.
  • Question: wouldn’t it be better to declare what the FS should look like, rather than how to check it? Started that way, making our own declarative language, but thought it would be less work to use SQL, which has been honed by the DB community.

Transactional Flash

  • Want to replace disks with solid-state devices, but they are quite different. However today they are used as traditional disk replacements, but this misses an opportunity to export new abstractions that better use the properties of SSDs.
  • TxFlash provides transactional guarantees to higher software layers. WriteAtomic abstraction updates multiple pages atomically. This can simplify the software layer of transactional systems. Today use CoW or logging to provide these guarantees, but TxFlash can move this to hardware.
  • Could use a traditional log-based protocol, but this paper introduces a new protocol: cyclic commit.
  • Key results: implemented TxFlash on a simulator and an emulator running on a real SSD. Want to understand how cyclic commit performs compared to traditional commit. Improves by 95% for small transactions. Also want to know the overhead of transactional Flash. Overhead is <1%: essentially free. And what is the end-to-end performance improvement. The running time is reduced by up to 65% for I/O intensive workloads.
  • Architecture of an SSD: a controller connected to a set of Flash packages. No moving parts, giving fast random accesses. The memory is non-overwrite in nature: must first erase before writing, and cost of erasure is high, so better to use a log-structured file system. Each page has a 4KB data part and 128-byte metadata (checksum, ID, version number, etc.). Traditionally export read and write. TxFlash exports WriteAtomic and Abort.
  • Programming model consists of only writes, and no reads. Straightforward to extend to a more general transactional construct.
  • Suitable for transactional support: non-overwrite pages mean minimal extra overhead in tx-support. Also, CoW and logging lead to fragmentation but Flash has fast random access. Plus, since SSDs are quite new, there is an opportunity for new abstractions.
  • Commit protocol is used to assure atomicity of multiple updates. Must work correctly in the presence of transaction aborts and multiple failures. Need a recovery module (run when system starts to regain consistency) and garbage collector (to remove old versions of pages).
  • Traditional log-based protocol will first write data that must be committed, and then write a log entry to say that the data has been written. Recovery is simple (scan the log to replay transactions if need be) and the garbage collector is simple (just copy valid page versions to fixed locations). Problem is the space overhead with the per-commit block for each transaction. Also a performance overhead of write ordering, which reduces parallelism. Overheads become large for small transactions.
  • T1 updates pages A, B, C. The TxFlash device initialises the pages in memory and stores a point to the next logical page of the transaction (from A to B, B to C and C to A). All pages are written in parallel which means we have a cycle and therefore have a successful commit. If we have a broken cycle, this doesn’t quite imply that the transaction aborted. Could have a point from A to B and B to C, but no version of C (and pointer to C). But what if A, B, C committed, then C, E, F was committed? The previous version of C is obsolete, but this means that the A, B, original-C cycle is broken, which looks like the abort case.
  • Simple Cyclic Commit: if a page version exists, any previous version of the same page present or referenced in the storage must be committed. Aborted pages and their references are erased before writing a newer version. Resolves the broken cycle ambiguity: the presence of C’ means that C has been committed, and if the original version of C is committed, then T1 is committed. In the abort case, the edge from B to C must be deleted.
  • Recovery and garbage collection is easy to understand and implement, and does not add extra code complexity. The recovery module actually performs better.
  • Weakness: need to erase all aborted versions before creating a new version for a page, which is costly under a high rate of transaction aborts.
  • Back-pointer cyclic commit improves this (details in the paper).
  • Implemented on a simulator and an SSD-based emulator. TxFlash simulator extended from previous work on SSD. The emulator is a pseudo-device driver which exports the WriteAtomic() API. TxExt3 is a modified ext3 file system that uses WriteAtomic.
  • Transactions per second improves with cyclic commit (simple and back pointer versions perform similarly): improvement better when small transactions, but approximately equal as tx-size tends to infinity.
  • No appreciable overhead for IOzone, Linux-build, Maildir benchmarks; and very small overhead for TxFlash on TPC-B due to metadata. Using traditional commit, Maildir and TPC-B show larger overhead. (Compared to a regular SSD, non-transactional.)
  • End-to-end impact: compared to regular ext3 on an SSD, TxExt3 on TxFlash improves run time by up to 65%, but for Linux-build the impact is negligible (compute bound).
  • TxExt3’s JBD is less than 50% the size of ext3’s JBD (journalling module). Most reduction from removing journalling and revoke modules.
  • Advances in hardware devices can lead to new abstractions and advances in the software layer. SSDs are suitable for transactional constructs with low overhead and code complexity. TxFlash simplifies software and improves performance of end applications. Cyclic commit is an alternative to traditional commit for systems with small transactions.
  • Question: concerned about the deployment incentives that would enable this to be deployed in the real world (an OS would require the code to support transactions on legacy devices anyway, so the simplicity argument doesn’t hold)? Some specialised OSs (like those for the Xbox on game controllers, or phones) don’t want to pay cost of implementing all this in software, so TxFlash might apply here.
  • Question: have you given thought to what mechanisms might be used to identify transactions over existing interfaces? Or could you export a fuller view of Flash (including metadata) up to a host? Insufficient to export the metadata to a higher-level software layer, so either you export all of the control, or do it at the device level and save the complexity from the software.
  • Question: not sufficient just to make sure there are no versions of the aborted page, but there have to be no pointers. How do you find all the pointers? Maintain metadata structures in the system which let you know the last aborted transaction for a particular page. This is sufficient because it is the only thing you have to erase when committing a new version.
  • Question: the microbenchmarks suggest that the best improvement is 2x, but the macrobenchmarks suggest a 3x improvement. How come? Benefits arise from not doing additional checkpointing (moving data from the log to the actual location). Also from reducing the write ordering problem. Microbenchmarks only show the benefit from eliminating write ordering, or separately for removing the checkpointing. The end-to-end benchmarks showed both.
  • Question: the SSD has software to process the FTL. Would you want to move things to the FTL to improve energy costs or write performance? If you remove code from the FTL, the job of the FTL (like remapping, cleaning, load levelling) has to be moved to the software layer. Doing this might not be a great idea, because not all Flash devices would want this complexity moved to software (especially for resource-constrained devices). Simply return the wrong parents and make the appropriate repairs.

Avoiding File System Micromanagement with Range Writes

  • Today’s file system views a disk as a linear array of blocks. This gives a simple-to-use interface but impacts performance because it hides some details.
  • High-level goal is to place a goal near other related blocks, but it can only make a low-level command to write a block to a specific address. Classic example of micromanagement.
  • Cost of this: unnecessary positioning time, because the FS knows what is free, but the disk knows what is close to the current position.
  • Previous solutions: block remapping, whereby disks select the closest block on the fly and maps it to the requested block, requiring metadata. An object-based interface deals with writing objects to disk but also requires lots of metadata.
  • Want FS to specify a set of options for blocks to write to, and lets the disk select the best from among these. The result is an efficient write with minimal change.
  • How do we specify range writes? Normally would specify an address, the data and its length. Simple way would be a pair: start address and end address. More sophisticated would be to specify a list of pairs, or a large range with an exclusion bitmap. Need to get back from the disk a status (successful?) and the address to which it was written.
  • Case of overlapping ranges for two range writes: the disk must note that it has selected a block from the overlap and ensure that that block is masked out from subsequent writes. Need a write barrier to allow the disk to forget which blocks were written and flush its metadata structures.
  • Disk scheduler should consider all options within each range write, but select only one. Traditional scheduling has a queue which is scanned and blocks are selected to some policy. Expand-and-cancel scheduling changes a range write into several possible writes in a regular disk scheduling queue (then cancels all other options when one is selected), which is easily implemented in standard disk scheduler. Added 1000 loc to Disksim, for a particular HP disk (relatively old, but still applicable).
  • Do multiple requests solve the positioning time problem? Use an SPTF scheduler with workload based on random writes and writes spanning across block group. As the queue size grows, total time gets less. The seek time reduces but not as much as the rotation time. The problem remains with multple requests, and is greater at smaller queue size.
  • Evaluated scheduler with range writes, with random writes, range spanning a track and writes spanning across block group. Now the rotation time is reduced for all queue sizes.
  • What is the benefit of blockgroup-sized range writes? The seek and rotation times are significantly reduced for all queue sizes.
  • How to incorporate range writes into a filesystem? Need to preserve sequentiality, determine proper ranges and handling delayed address notification.
  • Loss of sequentiality may impact read performance. Solution is to issue larger writes.
  • How to determine the size of the range? Larger range implies more benefit. Solution is to report largest range that best matches layout policy.
  • How to handle delayed address notification? The FS knows the address after the write completion. So there are ordering constraints: data block must be written before inode. So only use range writes for data blocks, not metadata, e.g.
  • Evaluated on an ext2 FS simulator on a modified disksim. For first data block of file, give the disk some flexibility. The next blocks are, as far as possible, written sequentially.
  • Evaluated on a synthetic workload of 1000 4kb files on an empty FS. Slightly better performance for the single process case, but much better when multiple processes are writing files at once. The layout that results is similar in both cases (so the high level policy is the same, but we get better performance).
  • Also evaluated on untar, PostMark and Andrew benchmarks. 16–35% improvement on untar and PostMark. Andrew has more reads, so the performance is very slightly worse.
  • Journalling system case study. No seek delay but some rotation delay. See a sequential workload. Approach is to convert a write-ahead log to a write-ahead region. Trick is to write to next rotationally closest location. Add a software layer between ext3 journalling layer and disk, which uses a disk performance model, which spreads out the transaction across the disk and reduces rotational delay.
  • Mounted ext3 journal on Bark (software layer) as a pseudo-device, and ran TPC-B on it. Improves performance in both cached and non-cached cases. Also improves the distribution of write times: 99% of requests have write times less than 2ms (when rotation time is 6ms).
  • Want to look at RAID integration, flash SSD integration, and Range Reads.
  • Question: change the disk interface and change the file-system to improve write latency… shouldn’t I buy a cheap disk and put some Flash in front of it? Existing disks could benefit from this approach, and this could improve the situation on Flash as well.
  • Question: what is the right way to handle dependencies between data writes when you need to know the location of preceding writes? Have seen this problem already in the file-system, so higher levels should take care of this ordering.
  • Question: if you ever do overwrites, you’ll need to do a second write? Don’t address this, but it is an interesting area for future work. Were all evaluations done using empty file systems, which have fortuitous fragmentation properties? Evaluated with half-full systems in the paper.
  • Question: range writes might not be helpful on Flash in front of the disk, because rotational latency becomes less important and you should maybe even use slower-rotating disks for better power consumption.
  • Question: is it true that you’ll get more fragmentation? Haven’t really looked at this.

Programming Language Techniques

Binary Translation Using Peephole Superoptimizers

  • Binary translation is ability to run software for one ISA on hardware implementing a different ISA. Applications include portability, virtualisation, backward- and forward-compatibility, on-chip binary translation and JVMs. Could exist at different layers of the stack: in a hypervisor, in the OS kernel (e.g. Rosetta) or as a user-level application (like a JVM).
  • Difficulties are performance, complexity of ISAs, retargetability (want to target different source or destination ISAs) and OS compatibility. Concentrate on performance, complexity and retargetability.
  • Technique is peephole superoptimisation.
  • Superoptimisation is a term from 1987 to describe a code generator that does brute-force search to find the optimal code for a translation. Enumerate all sequences up to a certain length, then compare each enumerated sequence with target function for equivalence. Originally scaled up to 12 instructions in a sequence. To support all instructions in an ISA, you can typically do 3 or 4 instructions.
  • Peephole superoptimisation is using a superoptimisation to infer peephole optimisations. This is pattern matching, when you search for a pattern and have a replacement template.
  • Get the instruction sequences and scan for sequences that can potentially be optimised. Canonicalise and store them.  Then do brute force optimisation on each of the target sequences, search for optimisations in a brute-force manner and check for equivalence.
  • To test for equivalence (in: two sequences, out: are they equivalence). First do an execution test on random input to see if they result in identical outputs. If you pass this, use a satisfiability solver to check equivalence. Usually fail the first test, so this is efficient.
  • Finally generate a table of peephole optimisations.
  • Approach is to use lots of peephole transformations to translate code from one ISA to another. Need a PowerPC to x86 register map (e.g. r1 -> eax; r2 -> ecx) for each peephole optimisation. This is the main difference between a peephole optimiser and a binary translation, and needs to be chosen wisely. The best code may require changing the register map from one code point to another. The choice of the map affects the choice of instructions and vice-versa.
  • Assume a cost model for instructions (register access versus memory access). Also capture the cost of switching registers to and from memory (because more registers on one ISA).
  • Could use a greedy strategy for register map selection, which minimises switching cost for each subsequent instruction. But this doesn’t give the optimal solution, in terms of instruction costs (and overall). So use dynamic programming to attempt to reach a near-optimal solution. It accounts for translations spanning multiple instructions and simultaneously perform instruction selection and register mapping.
  • Implemented a static user-level translator from ELF 32-bit PPC/Linux binary to ELF 32-bit x86/Linux binary. Translate most (but not all) system calls. Use PPC emulator for execution tets, and zChaff SAT solver for binary test.
  • Implementation issues: endianness, so need to convert all memory writes to big-ending (source) and all memory reads to little-endian (destination). Also compiler optimisations: PPC optimiser staggers data-dependent instructions to reduce pipeline stalls, but the solution is to cluster data-dependent instructions in basic block before translation.
  • Evaluated with the same version of gcc, using soft-float library and statically-linked input executables. Did microbenchmarks and ran SPEC CINT2000. Compared against natively-compiled code and against other binary translators (QEMU and Apple’s Rosetta).
  • 750 rules in the peephole table, but cost of constructing that is amortised across every run. Used same optimisation settings on both architectures.
  • Microbenchmarks include sort algorithms, binary search, Towers of Hanoi. Minimum performance is 61% of native; best performance is 319% of native (for fibonacci, under an unrealistic optimisation). Sometimes gcc generates better code for PPC, due to the preponderance of registers, and the register mapping algorithm does a good job of remapping on the CISC architecture.
  • On SPEC CINT2000, minimum performance is 42% of native, median is 67%, and best is 167% (no optimisation on twolf, using soft-float). Some translations failed under compiler optimisations.
  • On average, 3% faster than Rosetta for -O0; 12% faster for -O2. Included cost of translation when running on QEMU and Rosetta, their system does static translation. Takes 2–6 minutes to translate a 650KB executable with ~100K instructions. Could reduce this to <10s, because for 98K instructions on which 0.01% of the time is spent, we could use any register map.
  • Future work to look at JIT compilation and machine virtualisation.
  • Question: what pattern matching algorithm do you use, and would optimising this improve performance? Use something very simple, and an improvement could be made.
  • Question: is peephole superoptimiser thread-safe? It’s a code generator, so this doesn’t really mean anything. Designed in such a way that the translation should work on multi-threaded applications, but haven’t tested this.
  • Question: what about proper handling of synchronisation between two platforms, such as moving a register to a memory location and vice versa, so do you look at this? Have not been pulling memory locations into register (because x86 has fewer registers), so have not looked at this.
  • Question: how much of the optimisation techniques would work in a dynamic translator? The reason for using a static translator was for simplicity and the separation of translation and run time. But no reason why this couldn’t work in the dynamic case.
  • Question: can you really handle equivalence testing in the case of aliasing? The execution test could suffer problems with aliasing, but the satisfiability solver should catch this. See previous ASPLOS paper.

R2: An Application-Level Kernel for Record and Replay

  • Debugging tool for recording a run of execution and replaying it for debugging. Because it is difficult to reproduce some bugs when re-executing. It’s hard to apply comprehensive analysis with no interference at runtime (e.g. triggering timeout logic).
  • In a replay, exactly the same output will appear, and the same bug will arise. R2 captures nondeterminism, and makes it happen the same during the replay.
  • Have seen virtual machine approaches, which replays the application and the OS, but this is difficult to deploy. Also a library-based approach which replays the application only, but cannot replay some system applications (like asynchronous I/O).
  • Library-based tools inject a library into the application which do syscall interposition, and writes the results to a replay log. On replay, the interposition switches to consulting the log instead. But this doesn’t record all non-deterministic instructions (e.g. spinlock assembly code) or functions with unclear semantics (like socketcall and ioctl), so you have to write application specific stubs. Can also be too heavyweight, generating large logs for I/O intensive applications.
  • R2 allows developers to select functions that can be easy-and efficiently replayed. Abstract assembly code to higher level functions, and identifying calling functions with higher level semantics (recv rather than socketcall).
  • Selecting a replayable set of functions with the goal of capturing nondeterminism. Need to make a cut through the call graph. Need to track data dependencies between functions (e.g. if you call socket(), need to use the replayed value in read()). Possible to make several incorrect cuts, but came to a set of rules that developers can follow. Rule 1: any non-determinism should be below the interposed interface. Rule 2: all instances of unrecorded reads and writes to a variable should be either all below or all above the interface. Can be difficult to follow these rules in a complex call graph. Module APIs are good candidates, as they encapsulate internal state well.
  • Next challenge is to keep a deterministic memory footprint. Need the same memory addresses in both record and replay to ensure that different control flow isn’t followed. Need to make sure malloc returns the same addresses in both executions. Tool and application in same address space, and the tool runs differently during replay, so get different malloc behaviour.
  • Split space approach: system space (libraries and R2) and replay space (app). Split the user memory address space into two memory pools: native memory pool and deterministic memory pool. malloc in each case is appropriately mapped to each of these. R2 allows code annotation to ensure deterministic memory addresses.
  • Another challenge: deterministic execution order in multi-threaded applications. Especially in conjunction with dynamic memory allocation where this is kept deterministic (replay writing into a buffer before it is allocated). Need to maintain an inter-thread happens-before annotation: callback annotation for functions and sync annotation for synchronisation primitives.
  • Three categories of annotations: data transfer, execution order and optimisation.
  • Annotated prototypes converted via a PHP code template to a C++ code snippet.
  • R2 implemented on Windows in 20 kloc. Annotated the Win32, MPI and Sqlite interfaces. Executed on challenging system applications, including a web server, database, distributed system, virtual machine and network client.
  • Evaluated annotation effort, overall performance and effectiveness of customised interface.
  • One week to annotate the first 500 Win32 prototypes. 2 days to annotate MPI and SQLite. Cost of annotation is amortised.
  • 1% overhead of the stub code, 9% overhead when recording. Log size is 300x greater than the SQLite database looked at. Solution is to choose an interface with less I/O (use SQLite query interface rather than filesystem interface). Log size is now comparable to the database.
  • Space split also helps other in-process tools, e.g. in a model checker. Annotation and code generation also used in other projects.
  • Question: the interface choosing rules seem useful, but a human would screw it up, so is there any static checking or automatic for choosing the interface cut? Assumption is that you can use a modular API to select these interface. Latest work is to use compiler techniques to select these. Also to minimse log size.
  • Question: could the tool be used in production for more complex workloads, since the overhead for the microbenchmark is helped by caching? Have seen 10–20% overhead for larger benchmarks. Also, a problem is that the log size may become very large, but they have looked at checkpointing techniques.
  • Question: could you describe the kind of bugs that R2 does not replay faithfully? One problem is that you assume that, as an optimisation, the local disk operation is deterministic for read-only behaviour, but if during replay the file is missing, then you fail. What about data race problem? Assume that programs are data race free, and these may cause some problems to fail. Can always detect these problems, but not always reproduce them.
  • Question: do you need to reannotate new versions of applications? No, just annotate APIs (like Win32) but you have to annotate some interfaces by yourself.

KLEE: Unassisted and Automatic Generation of High-Coverage Tests for Complex Systems Programs

  • Awarded a best paper award.
  • Hard to write correct systems code, it has tricky control flow, complex dependencies and abusive use of pointer operations. Also lots of environmental dependencies, which the code has to anticipate. Also a possibility of malicious users.
  • KLEE is designed to check such complex systems code. Fully automatic symbolic execution tool, exploring a large number of paths in a program. Finds many bugs, and generates test cases for exploring tasks, which gets over 90% line coverage (for Unix utilities).
  • KLEE runs a program on unconstrained symbolic input. As the program runs, constraints on the symbolic values are generated, and symbolic execution is forked (adopting constraints) when branches are hit. Can then generate a test case for each path that may be followed.
  • KLEE starts with C code, feed it into LLVM compiler to generate LLVM bytecode. KLEE takes the symbolic environment and the LLVM bytecode to execute the code, interacting with the STP constraint solver. Outputs a collection of test cases.
  • Real programs have an exponential number of paths. Need to use search heuristics. Coverage-optimised search selects path closest to an uncovered instruction and favours paths that recently hit new code (as they are likely to hit new code in the future). Also random path search (in the paper).
  • Constraint solving is an inherently hard problem (NP-complete) and needs to be invoked at every branch. Two simple, effective optimisations (15x speedup).
  • Eliminate irrelevant constraints before sending them to the constraint solver, which works because each branch depends on a small number of variables. So eliminate the constraints which have no effect on the branch.
  • Also cache the results of the constraint solver for later branches. With a static set of branches, there are lots of similar constraint sets. If we see a subset of some constraint set, eliminating constraint does not invalidate a solution, so no need to re-solve. Adding constraints often doesn’t invalidate solutions as well.
  • Problem of interacting with the environment (e.g. reading data from the file system or the network or the end user). If every argument is concrete, forward a syscall to the OS. If they are symbolic, provide a model that can handle symbolic inputs: want to explore all possible legal interactions with the environment.
  • Easy to write these models. A symbolic file is represented by a symbolic buffer and an offset initialised to zero. Reads become memcpy()s. Can just write plain C code that is run by KLEE, which allows users to extend or replace the environment with no knowledge of KLEE internals. Currently implement files, command line arguments, pipes, etc.
  • Looked at the GNU coreutils application suite. 89 stand-alone applications (v6.10). Many functions, different authors, major environment interaction, and heavily-tested, mature code. Challenge to find bugs in this code. Most are between 3000–4000 loc. Some up to 10000 loc.
  • Methodology: run KLEE fully automatically on each application for one hour. Run the test cases on uninstrumented utility and measure line coverage using gcov. Coverage measurements are not inflated by potential bugs in KLEE.
  • Minimum coverage is ~63%, overall is 84%, average is 91% and median is 95%. 16 applications were 100% covered.
  • Coreutils also has an effective regression testing suite. Manual testing achieves an average of 68% coverage. In comparison, KLEE loses on 9 applications, and won on 80 of them.
  • Looked also at the Busybox suite for embedded devices (91% overall, average 94%). Lost on a single application and won on everything else. (Regression suite got 44% coverage with manual effort.)
  • Found some bugs in coreutils. 10 crash bugs (more than were found in the last three years, combined). KLEE also generates command lines which expose these crashes. Showed ten sample inputs which would cause segfaults.
  • In 453 applications, with 430 kloc, found 453 bugs (also looked at Minix and the HiStar kernel), and many of these have been reported. Reported 56 bugs in total.
  • KLEE is not limited to finding crash bugs, but can also find higher-level correctness bugs. Can prove asserts on a per-path basis. Constraints in KLEE are complete and bit-accurate. An assert is just a branch, so KLEE can look at the feasibility of each side of the branch. If the assertion failure path is infeasible, then the code is correct.
  • Can also use this to cross-check applications: two functions which implement the same interface, then assert(f(x) == f’(x)). If KLEE terminates without an error, the paths are equivalent. Otherwise, a mismatch is found. Were able to test coreutils and Busybox utilities with conformance with IEEE Std. 1003.1. Many paths were equivalent, but some had output mismatches.
  • Lots of important related work in many research groups. Distinguished KLEE by testing hundreds of benchmarks, and showed extensive coverage. Also the idea of cross-checking is new, as is support for a symbolic environment.
  • Question: about the long term goals, what are the scalability issues with looking at applications that are larger than the ones tested in the paper? A very large program thrown at KLEE will suffer from exponential explosion of paths, and difficult constraint solving. But there is not necessarily a correlation between the size of an application and the difficulty for KLEE of solving it. Can write a 5-line program that really blows up. The comparative advantage over manual testing may even improve.
  • Driver for something like HiStar is necessary, but can be inferred automatically (e.g. symbolic environment of file system can be taken from the contents of the actual file system).
  • Question: how do you generate a database for testing a database server? Interactions with a DB are symbolic. Previous work marked the FS as symbolic.
  • Question: where is the bottleneck as you scale up the size and complexity of the checked system? Constraint solver or number of states? Working close with constraint solving groups to better characterise constraint solving load. Some benign looking problems turn out to take a huge amount of time. KLEE uses an implementation of CoW at the object level to help with states. For coreutils, can have 100000s of concurrent states.
  • Question: did LLVM buy you anything over another symbolic execution technique? Used to do source-to-source translation. LLVM has useful tools and is a real-world compiler which can be helpful with real code.
  • Question: would running KLEE for 24 hours have beaten manual testing for the cases that a one-hour run didn’t? Some interactions with the environment are not effectively supported. Done at the syscall level, but might have been better to do at the libc level (but syscall interface is smaller and easier). Some problems with the constraint solver meant that running for 24 hours wouldn’t nearly be enough, due to exponential explosion in time. Ongoing work to improve this.
  • Question: what about proper checks vs. last-minute checks (in bulletproof code)? Can write checks in C code which gives the developer flexibility.
  • Question: what is the length of the error path for the 56 bugs? For many of them it varies, but 100s of branches must be hit in order to hit the bug. Very complex to find and reason about.
  • Question: did you try to apply tools in the related work to see if they would uncover the same bugs? No. Many of them are not available. Others are not directly comparable, as they aren’t symbolic execution tools.
  • Question: what does KLEE mean? Don’t know! Daniel came up with the name: perhaps a Dutch work. (van Renesse: apparently not a Dutch word.)

Security

Hardware Enforcement of Application Security Policies Using Tagged Memory

  • Software security is in a crisis. No good abstractions to enforce application security policies, so application software has to do it itself, leading to TCB bloat. Large software systems have millions of lines of code in the TCB, which makes it difficult to eliminate bugs or prove correctness. Linux has gone from 170000 lines to over 6 million in the latest version.
  • Virtualisation often used to minimise TCB, which is good for partitioning applications, but not totally adequate.
  • Current hardware is also inadequate. Kernel data structures require fine-grained protection (e.g. process data structures). Some struct members may be globally readable, but not writable. Or permissions should only be given to owners. One possibility is page-alignment, but this is complex and expensive. Paging also does not associate policy with physical resources (problem of aliasing).
  • Proposal is tagged memory. Associate each word of memory with a 32-bit tag, and map tags to RWX permission for protection domain. Gives fine-grained access control, and simplified security enforcement. Helps even with a compromised OS. Policies are tied to physical resources. Allows for a smaller TCB.
  • Evaluated by building a full system on an FPGA.
  • In proposed system, kernel is shrunk to a trusted security monitor (TCB for entire system), responsible for labelling all resources. One or more kernels providing abstractions but no longer fully trusted. System resembles a distributed system.
  • Prototype: LoStar, based on Loki + HiStar. HiStar places all OS abstractions on top of a few types of objects, which can be labelled and control r/w access. Security monitor translates these labels into hardware tags and controls access to shared memory using tags. Loki is the hardware layer, with a tag permission cache in hardware, populated by the monitor.
  • File has an inode, which has some label in the kernel, and a mapping to the pages which comprise it. The monitor maps tags to labels, and the tag permissions are enforced in hardware.
  • Each user’s thread has its own kernel instance, and a tag permission cache. When you try to access something with a tag that isn’t cache, a trap occurs, the kernel looks at the label, and, if appropriate, gives access by loading permissions into the cache. Threads and memory objects have tags associated with them, so can’t directly modify someone else’s runtime state. Monitor provides a yield-to primitive for thread collaboration. Monitor has no scheduler; scheduling done in the kernel, and transitioning between kernel instances is possible. Cannot spoof objects either.
  • For collaboration, could grant access to someone else, and then try to corrupt it. But when a grant is accepted, the tags are changed to belong to the grantee (and this is disallowed).
  • Kernel objects are also tagged. If trying to delete references to a kernel object, because the fields (such as the refcount) are tagged in a fine-grained manner. References can be created by non-owner, and non-owner references can be deleted, but the refcount cannot be written directly (monitor does this).
  • Garbage collector runs in “idle garbage collection” domain, outside the TCB.
  • Can have a hashtable of objects which might individually be modifiable by owners, but not the linked list pointers (so cannot delete someone else’s object).
  • Prototype based on a full-system FPGA. Runs on a modified SPARC processor with an in-order 7-stage pipeline. Loki logic overhead is 19%, which would be a much lower fraction in more complex CPUs. No overhead in clock frequency.
  • Instruction and data caches had to be extended to add cache bits. Permission cache had to be added, accessed on instruction fetches and on loads and stores. Permission checks happen at the trap stage. The pipeline isn’t deepened, which is why there is no clock frequency overhead.
  • Tags tend to exhibit spatial locality. Could use page-level tags and where necessary switch down to word-level tags. Page tag array which can either have a tag in it, or a pointer to an array of word-level tags.
  • Monitor updates tags and permission cache entries. Traditionally have user and kernel modes. Add a monitor mode (ring -1) which has access to all physical memory, and only monitor mode can read and write tags. (Kernel reads tags only; User doesn’t see tags.) This helps to remove the OS from the TCB.
  • TCB reduced from 11600LOC to 5200LOC. Currently based on a RAM disk, but need to extend tags to the hard disk if it is to work with disks (could use a small amount of Flash to store tags).
  • Performance impact is negligible.
  • Significant number of tags are word-level, so fine-grained tags are necessary. Need many tags to protect users, processes, file descriptors.
  • Loki port to a Xilinx XUP costs $300 for academics, $1500 for industry, and the full RTL and HiStar-SPARC distribution is available.
  • Question: can you contrast Loki fine-grained memory protection with Mondriaan? Can you implement LoStar on top of Mondriaan? MMP provides independent protection domains at the hardware level. But LoStar was based on extending application security policies all the way to hardware. Mondriaan relies on the correctness of the translation mechanisms, whereas Loki is orthogonal. Loki and HiStar were designed to work together so they probably play better together. Builds on MMP.
  • Question: does the garbage collector have implicit access to all kernel data structures? It would have to call into the monitor to do this, but tags are used to protect this.

Device Driver Safety Through a Reference Validation Mechanism

  • Cornell is building a new OS called the Nexus. Addressing a key problem in constructing the trusted microkernel.
  • Drivers take up the majority of OS code, but they shouldn’t be trusted. They have a high fault rate and are written by different people from the core kernel authors. They run in the kernel and have privileged access to hardware. Loads to possibilities for trojans, privilege escalation, bugs, etc.
  • Nexus will have secure drivers. Want a small and auditable TCB, which is incompatible with having drivers in the TCB. Too big, too complex, too changeable to audit.
  • Not enough to move drivers to user-space. Only prevents direct attacks by the drivers on the kernel. Devices can also compromise the kernel integrity (using DMA, using interrupts (DoS)). One approach would be to use an IOMMU to stop the device DMAing to the kernel. What if we had a reference monitor in software, which would be more flexible?
  • Nexus puts a reference validation mechanism, acting as a filter, in the kernel, which monitors references by the drivers in user-space. Want to factor out device-specific code from the reference validator. The validation mechanism is parameterised by a device-specific reference monitor, which covers anything device-specific.
  • Device Driver Reference Monitors (DDRMs) can prevent illegal reads/writes by drivers (like an IOMMU), priority elevation, processor starvation, physical damage of the device by the driver, resource exhaustion of the device.
  • I/O model. Nexus is built to run on x86 and use the PCI bus (nothing about the architecture is specific to these). Allowable operations are I/O ports, MMIO, PCI configuration registers, interrupts, shared interrupts, DMA and deep data structures.
  • Have a language for writing specifications, which expresses allowable sequences of operations. A stateful filter, because some operations are illegal sometimes and legal at others. So the reference monitor is a state machine. Each legal operation is a transition in the state machine. Specify variables, alias names and state transitions.
  • Don’t enumerate every state in the state machine, instead encapsulate them in the state variables. Some are device-specific, and some are predefined by the DDRM (e.g. the memory that the driver has access to may be a state variable, but defined at runtime).
  • A transition is a combination of an event and state. Predicate (boolean expression over operation and state variables) followed by actions (modification of state variables). Written in a safe, declarative language (easier to reason about than C).
  • An illegal transition kills the driver, treated much like a segfault. No reason for a good driver to violate the specification: if it does, it isn’t a good driver. The DDRM runs a trusted reset routine from its specification to put the device in a harmless state (otherwise could cause havoc with the system).
  • Drivers are 99% source compatible with Linux. All operations that a Linux driver expects became syscalls in the Nexus kernel. Added traps and macros to capture the actual I/O. Unmonitored DMA mapped directly into driver memory: works for contents of network packets and contents of disk blocks. (Don’t map the data structures which are security-sensitive.) Added an untrusted disable interrupts operation, which only defers its own interrupts, and pauses the execution of all that driver’s other threads. Weaker than sti, but good enough for most drivers (only want to stop reentrancy). This has the bonus of lowering interrupt latency for other drivers.
  • Written drivers for i810 audio, e1000 Ethernet card, USB controller, a USB mouse and a USB disk.
  • Measured TCB, performance and robustness against attacks.
  • Between 0.4–1.3% of lines of code changed in each driver. Added lines as glue code. Specifications are 150–500 loc, which are at lesat an order of magnitude smaller than the drivers themselves. No need for a spec for the USB mouse and disk (the USB controller spec is all that needs to be monitored).
  • Performance evaluation with gigabit throughput test, audio and network responsiveness tests, ping latency, and end-to-end CPU utilisation during video streaming. Compared a Linux kernel, Nexus kernel, Nexus unsafe, Nexus with an empty spec, and Nexus safe (with complete specs and checks).
  • Very little impact on receive throughput; some degradation for small packet sizes when sending.
  • Audio latency adds 46.5 microseconds. But the driver itself has hundreds of milliseconds to respond to these interrupts.
  • CPU latency from 0.8% on Linux to 2% on safe Nexus for streaming video at 1Mbps.
  • Also measured robustness to attacks using perturbation testing (probabilistically modifying driver operations, changing register addresses, length, values read/written). And made a suite of malicious drivers for each execution environment.
  • Perturbation testing: in no case did they crash the kernel or damage the hardware. In a few cases the driver confused itself. With a null specification, after 31 tries, it damaged the hardware, and they stopped retrying this.
  • Safe kernel avoids direct kernel r/w by driver, livelock due to interrupts, DMA kernel crash, DMA kernel read/write, DMA kernel code injection and DMA write device memory.
  • Question: performance data only looked at throughput, which is meaningless because of the CPU utilisation increase. What was the impact of this? The CPU impact was roughly a doubling, captured by the end-to-end benchmark.
  • Question: how would more demanding devices like FireWire or high-speed USB cope, where jitter might be more catastrophic? The latency didn’t have any effect. Did USB 1.1 for this paper and have an experimental USB 2.0, both of which are functionally correct despite a change in latency. But it only degrades performance, never correctness (so no dropped frames when copying video from a camera).
  • Question: who would write DDRMs? Devices manufacturers, or could be open source, or could get a third performance.
  • Question: is having the reference monitor in the kernel necessary for performance? Has to be trusted anyway, so didn’t see a benefit to putting it in user-space, but it could be.
  • Question: were state machines developed by looking at the driver or by looking at what the hardware allows? Looked at both the drivers that we were trying to port, and the specifications of the hardware.
  • Question: how common is damaging the hardware with a malicious driver? e.g. Ask a device to overclock itself, or overwrite flash and abort it halfway which could cause it never to reboot.
  • Question: could you compare the security of the Nexus driver model to the Windows user-mode device model which also runs USB devices in usermode? They are quite similar, but Windows applies it only to devices that don’t do direct I/O.
  • Question: might an IOMMU be useful to enable lazy trapping (like how shadow page tables are used in VMMs) and validation in advance? Problem is that lots of accesses are individual commands, so it is difficult to validate them in advance.
  • Question: did you consider reusing an existing specification language for drivers like “devil”(?) or the MS Singularity language? Took inspiration from the languages but didn’t reuse any of the specifications that were written for them.

Digging for Data Structures

  • Looking at polymorphic viruses. AV programs can’t detect them. You’re running a firewall, you’re still going to get infected, and you’re still going to have to wipe your machines.
  • Signature checkers are basically grep. Can obfuscate by encryption/packing, polymorphism or putting garbage in there. Most of these aren’t even widely used yet!
  • We’d like an AV technique where obfuscation would destroy the human element: we’d be able to tell that code was obfuscated. Reckon that the commonality will be in the data structures (and these won’t be obfuscated).
  • So detect programs based on their data strctures. Emphasis on the field types, not the actual content. Encrypting memory might hide the contents, but the patterns should still be evident.
  • Also need to infer data structures without debugging symbols.
  • Problem: image looks random. We know where the heap is, so we can reasonably reliably infer heap pointers by looking at four-byte intervals. There is a probabilstic mapping between block and atomic types: small integers are usually data, for example.
  • Compute given a value, what is the block type. Is it a string, pointer, zero? No, then it’s probably an integer. Can also divide the heap up into collections of blocks that appear to have the same signature.
  • Can also infer the types of pointers from the memory that is found at those locations.
  • Need some mathematics to determine the number of classes, and whether to put a given object in a given class. Use a standard unsupervised Bayesian classifier, details in the paper.
  • System called Laika, implemented in about 5000 lines of Lisp.
  • Computationally expensive problem. But only 30% of objects contain pointers, there are a large number of strings and typed pointers are necessary. Also some problems with unions (between integers and pointers), and tail accumulator arrays (where the last element is a variable size).
  • Ran programs in GDB, trapping on malloc. 7 real test programs (things like xpdf). Measured probability of correct classification. Want to optimise P(real|laika) and P(laika|real), both around 70% which can improve slightly with some malloc information.
  • Laika virus detector can be written in 50 lines of code. Assume Laika correctly clusters types into classes. Take two programs and weld them together. Measure how mixed each class is and take the weighted average. A mixture ratio close to 50% means that a lot of data structures are shared. If the mixture ratio is less than some threshold (tested on Kraken), then we have a match. Estimated accuracy is 99.4–99.9%. Always correctly classified a program as itself over 150 tests. Expected number of errors was 0.33. Does much better than ClamAV (83–85% detection).
  • How might virus authors adapt to this system? They obviously will, as generic virus detection is undecidable. Mixture ratio is a simple first cut, but both sides can probably do better. Laika synergises well with other approaches, so there’s defence in depth.
  • Simplest attack: memory encryption (but all programs use data structures, so some patterns will still be evident)… XOR all reads and writes with a key. Could shuffle field orders (only removes 50% of the information). Could mimic structures from a benign program (like Firefox, say), but other heuristics should reveal that a program doing this isn’t Firefox.
  • High-level structure requires more structure, and very simple programs don’t have it. But as these botnets become more sophisticated, these will require more structure. Also there is computational expense.
  • Best related work is all from Wisconsin.
  • Question: what are the challenges involved with getting a consistent snapshot of memory? (Some malware is careful not to do all of its evil at once, and how do you figure out when that is?) Just ran for five minutes and looked at the data structures… and did pretty well.
  • Question: class of non-malicious applications that make extensive use of obfuscation for protecting IP, so can you differentiate these? Could sign binaries.
  • Question: how does using different languages and compilers affect this? Well, it would fail on a Lisp virus. On Java, you have a lot of metadata from all the classes. Just looked at C++.
  • Question: attacker code might try to detect the memory-detection sandbox, and try to confuse you, so can you sidestep this? We rely on a working sandbox, and it did for the three viruses considered.
  • Question: what if you detect that several programs share the same data structures, such as the libc ones? The main trick is to train based on actual viruses, and compare to these.
  • Question: should you forget about the bad guys, and use this technique to understand programs that might be buggy? A PoPL paper on doing just this, which assumes knowing the structure definitions. So this would only be useful when you don’t know your data structure definitions. (Might want to do something like failure-oblivious computing so you can fake it?) Might work.
  • Question: will Laika get confused from strings that are encoded in different languages, using UTF-8? Detect UTF-8, and just called it ASCII.
  • Question: what if you serialise and encrypt an object before putting it on a heap, and only keep it in plaintext on the stack? The trick is to be careful about the kind of detection used. This behaviour would be very conspicuous, and Laika would get confused.
  • Question: do you have to run Laika against everything for five minutes, and compare against every possible signature? Run once, take a snapshot, then compare against all signatures. Polynomial time in the number of classes and the number of objects.

Dealing with Concurrency Bugs

Finding and Reproducing Heisenbugs in Concurrent Programs

  • Concurrent executions are highly non-deterministic, and rare thread interleavings can result in Heisenbugs. These are difficult to find, reproduce and debug. Just observing the bug can “fix” it. The likeliness of interleaving changes when adding printfs. In practice, developers can spend weeks on finding and fixing these.
  • CHESS is a specialised user-mode scheduler, attaches to your program and controls all scheduling non-determinism. Guarantees that every program run takes a different thread interleaving, but then lets you reproduce any particular run.
  • Motivating example of a “thread-safe” bank account class, but with a piece of incorrect synchronisation. Can run the program with “stress”: additional threads that do random stuff.
  • Demonstration and of course the bug didn’t turn up: illustrates the point about Heisenbugs.
  • CHESS is a collection of .NET or Win32 wrappers which trap all synchronisation-related calls, and puts the program under the control of CHESS. Every run takes a different interleaving, and the interleaving can be reproduced for every run. Developing a version of CHESS for the NT kernel.
  • Goal: any error found by CHESS is one that is possible in the wild (no false positives, soundness of CHESS, CHESS does not introduce any new behaviours). Don’t want people chasing CHESS bugs! Also want any error that is possible in the wild should be found by CHESS: completeness, no false negatives, but very difficult. Need to capture all sources of non-determinism, and explore that, which leads to a state-space explosion. Goal is to beat stress-testing.
  • CHESS is a scheduler. Its goal is not to find bugs (need to introduce assertions to do that). CHESS can find deadlocks and livelocks. It drives the program along an interleaving of choice.
  • CHESS introduces calls to the CHESS scheduler at potential preemption points. Inserted before every synchronisation opeation. Could insert a random sleep at schedule points. This does not introduce new behaviour, simply modelling possible preemptions, and guarantees starvation-freedom.
  • Two main improvements over random sleeping. First, capture the happens-before graph. Any delays that result in the same happens-before graph are equivalent, so only need to explore one of them. Second, understand the synchronisation semantics, so avoid exploring delays that are impossible and identify when threads can make progress. Does this by maintaining a run queue and a wait queue (mimicking the OS scheduler state).
  • Emulate execution on a uniprocessor, and enable only one thread at a time. This controls the order of data races. Reducing the parallelism is the price we pay to get determinism. What about multiprocessor-only bugs? Any concurrent execution can be emulated on a uniprocessor. So we introduce schedule points before synchronisations, volatile accesses and interlocked operations. But can also introduce these points before every memory access, to get even more accuracy: capture all sequentially-consistent executions. Even have a non-sequentially-consisten execution model, which understands the x86 memory model intimately.
  • What about input nondeterminism? Rely on the user to provide a test harness that makes the input deterministic (can use techniques like KLEE and R2 to do this).
  • Translate all the .NET/Win32 synchronisations (often complex and under-specified) into CHESS scheduler abstractions. Enables handling multiple platforms. e.g. Asynchronous read file is like forking off a child task.
  • Problem of state-space explosion O(n^nk) interleavings (k scheduling points, n threads). CHESS only inserts preemptions at a small number of places, which gives a state-space that is polynomial in k.
  • Found a Heisenbug in  MS Concurrency Coordination Runtime. CHESS found it in a minute. Found many other concurrency bugs, and other people have used the tool to find bugs themselves.
  • Current status: working to get it shipped as an Add-on to Visual Studio. But there is a command-line version available on the website. Now looking at kernel-mode CHESS, search heuristics, and bugs in IE, Windows Shell and graphics libraries.
  • Systematic exploration of scheduling nondeterminism can be more effective than stress-testing.
  • When designing APIs and abstractions, please think about the plight of the programmer who has to program on top of them. Minimise the nondeterminism, and provide good specifications!
  • Question: have you thought about other sources of nondeterminism, such as rand() and time? Time is handled, so e.g. a timeout would happen in a CHESS execution. But gettimeofday() isn’t considered. You would be better off using symbolic execution with something like KLEE. Random numbers are determinised, but the choices are not explored.
  • Question: how do you handle home-grown spinlocks? We do handle spinlocks, as part of guaranteeing starvation-freedom. The backoff in a spinlock, and hyperthread yielding, is handled by CHESS.
  • Question: how do you handle loops and recursions? It is a dynamic analysis tool, and we assume that the user has given a test program that terminates, and it will just run it.
  • Question: how large are the tests that you use, in terms of lines of code? Run on things like Windows Explorer. Usually take existing stress tests, and fork fewer threads. What about other model checkers? Haven’t looked at them, but think CHESS is the only model checker that could deal with things that large.

Gadara: Dynamic Deadlock Avoidance for Multithreaded Programs

  • Parallel programming is tough. It’s hard to parallelise algorithms and data structures. In this talk, looking at deadlocks. Might need to look at a large part of a program to understand why a deadlock occurs. Just composing two bits of deadlock-free code is not guarateend to be deadlock-free. And multicore is making parallel programming inevitable.
  • Goal is to avoid circular mutex wait deadlocks by postponing lock acquisitions.
  • Using discrete control theory to perform a source-to-source translation that inserts extra code to avoid deadlocks.
  • Architecture: start with a C program, get control flow graph, translate it into a Petri net (formal model from discrete control theory), then do control logic synthesis to get control logic, which is compiled together with the original source to get an instrumented executable. The control logic observes and controls interactions between threads.
  • Look at the example of the dining philosophers problem. Random fork-grabbing gives a simple control flow graph with four basic blocks.
  • Petri net is a bipartite graph with two kinds of node. Circle is a place and bar is a transition. Arc connect places to transitions and transitions to places. Tokens are passed through the Petri net. When a transition fires, it takes a token from one input and puts it in every output. Can model pthread_create(), lock acquisition and lock release. A token is a thread of execution or the availability of a lock. A place is a basic block or statement. A transition is a jump or lock acquisition/release.
  • Simple translation from control flow graph to Petri net, augmented with the identity of locks.
  • Showed a huge Petri net for OpenLDAP, but need automated tools.
  • Siphon (deadlock) analysis. Siphon = set of places whose input transitions are a subset of their output transitions. Once empty, a siphon is empty forever. Proven that a deadlocked Petri net has at least one empty siphon, and structural analysis makes it easy to find them. No state-space explosion.
  • Deadlock freedom means no empty siphon, so the total number of tokens in a siphon should always be >= 1. Specify this as linear inequalities. Control logic adds a new control place which guarantees that the linear inequality is always satisfied. Control places are local, independent, concurrent and easy to implement. Adding control places may add new siphons, but this can be prevented by iterating on the Petri net.
  • A control places is implemented by a token variable and a lock/condition variable pair. Control transition checks and updates the tokens in the control places, using a “gadara lock”. Observe transition updates the tokens in control places, using a “gadara replenish” operation.
  • Most lock/unlock operations are unaffected, so no overhead.
  • Gives provably deadlock free code!
  • Moves most of the computationally expensive analysis offline, and includes only modest overhead at runtime for the control places.
  • Did a case study on OpenLDAP. Showed a deadlock bug, and how it results in a siphon in the Petri net. (And how the siphon can be fixed with a control place, and how the code is modified.)
  • Gives maximal permissiveness and maximal concurrency. Don’t want to block any thread or transition unless it will lead to a deadlock state in the worst case.
  • Bug was identified and fixed manually, but newly-added code reintroduced the bug. A second fix required a complete rewrite. Gadara would have had deadlock-free code at all points.
  • Challenges: need function annotations to provide some form of data flow analysis (does calling a function result in acquiring a lock?). Type analysis helps with dynamic locks to show how many locks and which lock instances are acquired. Gadara chooses to be conservative when the model is imperfect. What about uncontrollable programs with blocking syscalls and IPC? These may give incurable deadlocks, but Gadara warns about these. Also C features such as function pointers. Don’t care about function pointers outside critical sections, only those within critical sections. For the rest, use annotations or type signatures to conservatively assume the set of functions that might be used.
  • Evaluated on OpenLDAP, PUBSUB benchmark and BIND. Tool has a 11% overhead with an add/delete workload on OpenLDAP.
  • OpenLDAP: 1795 functions, 495 after pruning irrelevant functions. 41 lock types, 25 of 28 ambiguous functions were annotated in an hour. Four siphons found, including one confirmed deadlock from the bug database, one false positive (required data flow analysis) and two previously unknown deadlock bugs. Performance overhead only when the workload exercises deadlock-prone paths.
  • Gadara eliminates all deadlocks and is provably correct. Only modest runtime overhead because it is maximally permissive and calculations are performed offline (and most locks are not deadlock-prone).
  • Many existing approaches, but all differ because only Gadara uses control theory. DCT is a principled foundation for dynamic deadlock avoidance. Works on real programs, and useful in several scenarios (rapid development, corner-cases in mature code and end-user bug fixing).
  • Question: assuming a program with recursion, and the procedure has a summary of what happens to tokens, is it possible to use this information for the synthesis? (Taken offline.)
  • Question: why not just fix the deadlocks if you know where they are and you know how to fix them? Gadara may have some false positives. Admittedly you could get better performance by doing this, but Gadara is automated.
  • Question: did you consider just adding a simple lock acquire and release before and after the siphon? In some cases this would work, but for e.g. 5 dining philosophers, the control logic is more involved.
  • Question: what would be the scalability if you weren’t limited by the hardware (2 cores, limited memory)? Similar result in the paper for the 16-core case.

Deadlock Immunity: Enabling Systems to Defend Against Deadlocks

  • A real system can be in one of two states, running or failed. When it hits a bug, it fails. Eventually it will come back, another bug will strike and take the system down. But when the first bug occurs again, the system will still fail. It doesn’t learn from past failures.
  • Give it an immune system. On seeing a bug, save an antibody to match the bug, and when you see that bug again, the antibody will neutralise it and let the system continue as if nothing had occurred. Like immunity against a pathogen.
  • Dimmunix gives immunity against deadlock bugs for Java and C++ programs.
  • Run a program, learn the executions that lead to deadlock, save their fingerprints in persistent history and use them to identify deadlock cases in future so as to avoid them.
  • Say you use CHESS and Gadara to rid your web browser of any deadlock, but then you load a plugin that hasn’t been so vetted, and it introduces a deadlock bug. Software vendor can distribute the antibody to end-users, or end-users can run Dimmunix to defend against deadlocks in some buggy code.
  • Lock inversion leads to a deadlock bug. Look at the stack trace which leads to the lock inversion, and abstract away parameters. The call flow is the signature that we will save (a deadlock pattern). Actually we want the list of EIPs from the call stack to identify particular call sites.
  • A deadlock bug leads to a single deadlock pattern, and a deadlock pattern can lead to several hangs. (Only in two cases did they see >1 patterns for a single bugs.)
  • Dimmunic intercepts calls to lock/unlock, detects the deadlocks, saves the signatures and avoids executions that match the patterns. The history of patterns is the immune system.
  • Versions for Java and POSIX threads. Java: instrument the bytecode. PThreads: shim library to add instrumentation. This means we need no assistance from programmers (no annotations) and users, and no need for source code. Works for closed source and commercial software as well.
  • Use a resource allocation graph (RAG) to detect deadlocks. Hold edge from lock to thread holding it. Allow edge from thread to lock that it is blocked on. A deadlock will be a cycle in this graph. But once you hit this cycle, it’s too late to do anything about it, but Dimmunix will save the pattern that lead to it. Add a label to the hold edge which indicates the call stack of the thread when it took the lock. These are collected to form the signature, which is saved in the history.
  • Now you’re hung, so you’ll have to take steps to restart the threads or the program.
  • Next you do avoidance. When there is an attempt to take a lock, see if there are any matches in the history. A partial match is okay, but if you have a complete match (all stacks are present), you do deadlock avoidance, by suspending the last thread that attempted to take a lock (in the signature). So Dimmunix slightly tweaks the thread schedule to avoid deadlock patterns arising.
  • Does it work? Took a number of C/C++/Java systems. Largest was MySQL (million lines of code). Went through their bug databases, looking for user-reported deadlock bugs. Managed to reproduce about 12 of them. Ran the test cases 100 times and they deadlocked 100 times. With Dimmunix, ran 100 times and only deadlocked the first time, worked every other time.
  • What about performance? Dimmunix does a lot of work, and if it is in the critical path, it would slow down the problem. Aimed to make it as asynchronous as possible, moving a lot of work into a separate thread, and used lock-free data structures. Paper has implementation details. Measured overhead on a lock-intensive microbenchmark, looked at lock operations per second. Dimmunix had very little overhead compared to baseline, though it increased at 512 and 1024 threads. Generated a history of 64 synthetic deadlocks. Dimmunix introduced thread suspends, which account for the overhead. 0.6–4.5% overhead for up to 1024 threads.
  • Need to match signatures on the critical path. Therefore hash the call stacks and index into precomputed in-memory tables. Keep a thread-local cache of data structures to reduce contention. Common case is no matches, which is efficient. By varying the number of signatures in history up to 256, the lock operations per second overhead doesn’t really change. (And you wouldn’t expect to have that many signatures anyway.)
  • Suspends can induce starvations because you’re yielding waiting for a blocked thread to make progress. (Haven’t seen this in reality, but can construct programs that do this.) So introduce yield edges to the resource allocation graph from the suspended thread to the one that caused the suspension (by holding a lock): label on edge is the label from the corresponding hold edge. This leads to a cycle in induced starvation cases, and can be detected in just the same way.
  • Dimmunix can have false positives, if you try to avoid a deadlock that would not have occurred. They don’t impact safety, but can impact performance (sometimes positively, by reducing contention). Trick is to adjust the precision of matching the call stack to generalise the pattern: how many call frames should you choose. Use a heuristic what-if analysis post-factum. If it sees no inversion then you have a false positive, and you eventually increase the precision. Dimmunix focuses on control flow and doesn’t look at input dependencies: no concrete program state in the signatures, so you get false positives but also get general, portable signatures.
  • Someone has to experience the first occurrence but can then immunise everyone. It cannot affect deadlock-free programs and cannot induce incorrect outputs in (non-real-time) programs. It must be aware of all synchronisation mechanisms (so homegrown spinlocks will not be identified).
  • Complements any approach that is aimed at reducing deadlocks (including Gadara, and it’s still useful because of bugs outside the code that you have checked like the web browser plugin case), including static analysis, model checking and transactional memory.
  • Question: could you extend Dimmunix to provide a trace for programmers to debug their code? One of my students is working on that: how to reproduce a deadlock given a signature. Trying to make it work for programs where model checking is still not applicable (millions of lines of code).
  • Question: what about when your immune system misbehaves? Say you have some functionality in your program in which all execution paths are deadlock-prone: the conservative approach won’t let you make any progress. One possibility is to log the signature every time you do an avoidance and use a pop-up blocker-like interface to tell the user that a deadlock was avoided, and you can prevent it avoiding it in future. The other possibility is to put an upper bound on the yielding.
  • Question: can this idea be generalised to other symptoms like segfaults? Another student is looking at this idea, applying failure immunity to other situations, like resource leaks, and we hope to be able to generalise it.
  • Question: seems like microbenchmark stopped scaling at a certain point? The graph flattens out and there is a limit to how many lock acquisitions you can do per second. Also the curve seems to be dropping off, but this is amplified by the logarithmic x-axis, so it’s really very slight.
  • Question: how many deadlock bugs that you find can actually be used as a vaccine (presumably with MySQL it would have to be independent of the application running on top)? But the patterns are specific to MySQL, and any application on top could use them.
  • Question: do you have any results with more frequent locking behaviour? Some workloads with 300000 locking operations on that kind of machines? As you vary the interval between lock acquisitions, the overhead is proportional to the number of instructions that you are placing in the critical path. In the scenario where you keep a lock for 1us then wait 1ms and acquire it again, we were trying to be more representative of a real system with I/O, or something like that. The overhead becomes signifcant when the interval is down at the microsecond level.

Tags:

Comments are closed.