EuroSys
The European Professional Society on Computer Systems

European chapter of the
Special Interest Group on Operating Systems (SIGOPS)
of the Association for Computing Machinery (ACM)
Home Join or renew membership EuroSys for Students EuroSys for Faculty Job Offers Activities Systems Directory Systems Events and Blog Eastern Europe Initiative Member Area Member News Officers and Volunteers Useful links Press releases

Archive for the ‘OSDI 2008’ Category

NSDI 2009: Day 1: Trust and Privacy

Thursday, April 23rd, 2009

Best paper awards for the first day:

TrInc: Small Trusted Hardware for Large Distributed Systems
Dave Levin, University of Maryland; John R. Douceur, Jacob R. Lorch, and Thomas Moscibroda, Microsoft Research

Sora: High Performance Software Radio Using General Purpose Multi-core Processors
Kun Tan and Jiansong Zhang, Microsoft Research Asia; Ji Fang, Beijing Jiaotong University; He Liu, Yusheng Ye, and Shen Wang, Tsinghua University; Yongguang Zhang, Haitao Wu, and Wei Wang, Microsoft Research Asia; Geoffrey M. Voelker, University of California, San Diego

Trust and Privacy

Session Chair: Steven Hand, University of Cambridge

TrInc: Small Trusted Hardware for Large Distributed Systems
Dave Levin, University of Maryland; John R. Douceur, Jacob R. Lorch, and Thomas Moscibroda, Microsoft Research

———————————————————

Large number of selfish and malicious behaviours in distributed systems can be described as Equivocation.

Equivocation: say something to A and say something else in conflict with that to B

Goal: A minimal required abstraction for elimination of equivocation

Using a non-decreasing counter and a key, the tool provides a new primitive: unique, once-in-a-lifetime attestation.

Trinket: the implementation as a trusted piece of hardware. The users need to attach this hardware to their system, to gain the benefits of TrInc.

Application 1: trusted log by TrInc
- the log can be changed only by append operation
- implement Attested Append-only Memory (A2M) using abstractions of TrInc
- compared to A2M, the TrInc size is smaller and semantics are simpler

Application 2: prevent under-report in Bit-torrent
- Node A can decide not to tell the other peers about block x that it has. In this way, because a few nodes posses the block, Node A can keep the other peers interested over the time and consequently reduces its own download time.
- The story starts when A receives x and give the receipt ACK to the sender but tells the others that it does not have block x (example of equivocation.)
- Solution: in the ACK of receipt, the node should put the number of blocks that it has and also the last block it has received.
- Again: reduction to monotonic timer and key

TrInc is implemented in Gemalto .NET SmartCards.

Micro benchmark results are not promising: It is very slow.

Perhaps the reason is that the current hardware is not designed to make the TrInc efficient; the future hardwares by keeping Trinc tasks in mind, could be efficient.

Sybil-Resilient Online Content Voting
Nguyen Tran, Bonan Min, Jinyang Li, and Lakshminarayanan Subramanian, New York University

———————————————————————

Websites rely on votes for ranking

Problem: vulnerable to Sybil attacks

Solution: using social networks

Assumption in their proposed system (SumUp): A central collector for votes and also the friendship graph

It uses Max-flow algorithm to collect votes over the graph

Congested edges could be:
- attack edges, because lots of fake identities are connecting to the rest of the graph via those edges.
- honest edges close to collector, because bigger sub-trees are connecting via them to the connector.
How to distinguish attack edges from honest edges?

Solution: capacity assignment to links via ticket distribution
Each node takes one ticket and divide the rest to the next following nodes.
- Nodes closer to main collector would receive more tickets and hence they can afford the congested links connected to them.
- However, nodes connected to attack edges would receive less tickets that is not enough for all fake identities connected to them.

How to obtain the optimal value for total number of tickets (i.e. v)?
Solution: Iterative approach; start with small v and double it after each iteration.

Furthermore, taking advantage of feedbacks from collector, it can do more:
- reduces the capacity on attack edge by penalizing the capacity on links towards the attack edges
- eliminates links with high penalties

It has been evaluated over YouTube, Flicker, and Synthetic graphs.

Q: What if the attacker is close to the collector, then it can be mistaken by a honest node?
A: Yes, it is true.

Bunker: A Privacy-Oriented Platform for Network Tracing
Andrew G. Miklas, University of Toronto; Stefan Saroiu and Alec Wolman, Microsoft Research; Angela Demke Brown, University of Toronto

———————————————————————–

Network tracing raises privacy concerns:
- Accidental disclosure
- Remote attacks over the Internet
- Operational attacks: the attacker has physical access to the hardware
- …

Previous Solutions: Anonymization Techniques to protect sensitive information
- Offline: has high privacy risk
- Online: has high engineering cost

Key insight: Record the data on buffers online and process the buffer offline
- The closed-box environment produces an anonymized trace as its only output.
- Make it safe-on-reboot: upon a reboot, all sensitive data gathered by the system is effectively destroyed.

To implement closed-box environment, they have to restrict I/O in that module.
Concern: How much restriction is sufficient and necessary?!

Debugging: The recoded information are not enough for debugging.
Solution: A debug mode that can be activated by rebooting the system.
Concern: Is there any guaranty that the debugger mode would not be used for normal usage and endanger the privacy of the users?

Security attacks:
A) Attacking the Restricted Interfaces of the Closed-Box VM: basically there is no algorithmic solution. They do their best, but there is no proof for safety.
B) Attack by direct access to the hardware: usually the attacker needs to reboot the system to take the hardware (like memory) off the system. Upon reset signal, all the data would be deleted.
- Concern: What about the attacks that does not reset the system?

They report less boring implementation because they used mostly python and scripts rather than C++!

Q: What about Bugs? They can break the security!
A: Yes, we just can do our best.

Q: People in security community has been working for decades to solve the problem of security, but still the systems are vulnerable against new unpredicted attacks. How can you guaranty the safety of your approach?
A: We have not tried to do any thing in that domain. We just do our best to use the available techniques to provide privacy.

OSDI 2008: Day 3

Wednesday, December 10th, 2008

Various Good Things

Difference Engine: Harnessing Memory Redundancy in Virtual Machines

  • Awarded a best paper award.
  • Trend towards consolidation in data centres. Want to consolidate large numbers of VMs on a small number of physical machines. e.g. Cloud, Enterprise, Thin-client computing. Also interested as virtualisation as a basis for testing: can test larger systems with less infrastructure.
  • Hurdles to consolidation: CPU utilisation is often highly bursty, and multi-core makes VM multiplexing more appealing. But machine memory is a hard limit on the number of VMs. Could just add more memory, which becomes expensive after some point, and has high power costs.
  • State of the art is identical page sharing (Waldspurger ‘02).
  • Opportunities for sharing beyond entire page sharing. Can introduce a second level of page fault (to memory) which is much faster than paging to disk.
  • Contributed set of memory management techniques to Xen: identical page sharing, similar page patching and in-memory compression. Explored a range of policies and mechanism. Did a detailed performance evaluation and found savings up to 90% on homogeneous workloads and 65% on heterogeneous, which is 1.6–2.5x what ESX gets.
  • Look at pesudo-physical to machine memory map. Identical pages, similar pages and cold pages. Identical pages map to the same machine frames, marking them both read-only and copy-on-write. Similar pages: one page is unmapped and metadata is stored, and the other page is mapped read-only. Write to one is trapped; read/write to the other is trapped. Finally, cold pages are unmapped, the metadata is tracked, and a smaller representation is stored in memory.
  • Implement a global clock for not-recently-used page selection policy.
  • Implement multiple hashes over page sub-regions to find similar page candidates.
  • Implement demand paging for physical memory overcommitment.
  • Global NRU clock by iterating over Xen’s P2M map, using referenced and modified bits. If recently modified, don’t choose as a candidate. If recently accessed (not modified), candidate for sharing. If no recent access, candidate for sharing or patching. If seen no recent access on several passes, then a candidate for compression as well.
  • Maintain two hashtables in Xen, one for sharing and the other for similarity. Request not-recently-modified pages from clock and insert hashes into the similary table, and a whole-page hash into the sharing table. On an identical hash, do sharing CoW.
  • For the not recently accessed pages, check the similarity table for candidate matches, calculate the best possible patch size and if there are savings then do the patching.
  • Identifying similar pages must be done without too much effort. Related work in finding similar web pages (Rabin fingerprints, min-wise hashing) but these are too resource intensive. So just do hashes over sub-page chunks and insert entries into the sharing hash table. If savings are more than 2kb, then do the patching. Two hashes over 64-byte chunks (policy question). If at least one chunk matches, calculate the patch and consider the overhead. This policy was often the best trade-off.
  • Demand paging because Xen assumes all physical memory is available. Difference Engine may require more memory than is available for a VM. Could end up overcommitted. Xen defers all I/O to Dom0, so implement a user-level swap daemon, communicating with Xen via an event channel. Swapd monitors total available memory monitor which will see a low watermark of available memory and do
  • Modifications to Xen 3.0.4. Available online. 15kloc for Difference Enginge. Additional 20kloc for patching (xdelta) and compression.
  • Evaluated with microbenchmarks (cost of operations), considered techniques in isolation to see relative benefits. Measured memory savings and performance effects for homogeneous and heterogeneous workloads. Compared to ESX with a scan of 10k pages/s.
  • 4 VMs, 512MB each, Debian 3.1 running dbench. Difference Engine achieves better savings than aggressive ESX at all points in time, though they tend to the same savings in the limit. While the benchmark runs, 1.5x more memory savings.
  • Heterogeneous workload: 3 512MB VMs running different OSs and applications. Win XP SP1 with Apache. Debian 3.1 running SysBench. Slackware 10.2 running dbench followed by IOZone. DE gets much better savings than ESX. Identical sharing gives most of the benefits, but patching also contributes a large proportion (less from compression).
  • Overhead vs. Xen was < 7%, and < 5% vs ESX aggressive.
  • Should be able to increase aggregate system performance. Ran RUBiS (eBay-style benchmark) with 4x 650MB VMs. Baseline of 4 VMs. Aggregate throughput increases most when using 6 VMs on Difference Engine (declines with 7). 40% more requests per second on 6 VMs versus the baseline. Before saturation, there is some performance overheads. Diminishing returns caused by paging to disk.
  • Why do this now? Driven by continuing increase in the gap between CPU processing and Disk I/O, especially with multi-core. Previously, the effort required was too great, and there were performance overheads. Page patching is more expensive than page compression. But overall sharing can improve performance.
  • Address Space Layout Randomisation is making it less pointful to do identical page sharing. Energy is a significant system resource: reducing memory footprint reduces both capital and operating expenditure.
  • Extending the approach to clusters, to allow VM colocation based on memory footprint.
  • Correction: the aggressive scan rate explains the delta — as actually capped at 500 pages per second per VM.
  • Question: did you look at the overhead delta between identical page sharing and compression and patching? Found lower CPU use but less savings. Believe this is a potential
  • Question: compression seems to give 5% savings at most… is it worth the complexity? Could flip the ordering and do compression before patching, which would make it yield more savings. Incorporating compression did not add significant complexity.
  • Question: keep a reserve of free pages in the hypervisor to service page faults quickly. How big is it, a percentage or fixed number? Currently a fixed number.
  • Question: in benchmarks comparing ESX and DE, did you have to do any synchronous paging or was the reserve enough? In the performance comparison carried out, paging had no significant effect. The 7% overhead included paging.
  • Question: if you do demand paging and identical page sharing alone, what do you think you would get? Not sure how to interpret memory savings? Interested in how much memory I need to get something done in a particular way? Results show that you can get a factor of 1.6–2.5x savings.
  • Question: 7% overhead seems like a lot, even though doing lazy compression and tricks like that. Have you done work to track whether the compression was worthwhile (or were you just going to throw it away anyway)? Global clock evaluated in the paper. 7% is worst case (in the applications we looked at). Know that there is more work to do in improving savings and reducing overhead.

Quanto: Tracking Energy in Networked Embedded Systems

  • Where have all the joules gone? When have they gone and why have they gone?
  • Looking at mote-class sensor networks. Wide dynamic range in power draw: 10uA to 10mA (sleep and active), with a 0.1–1% duty cycle.
  • Energy-efficient design has pervaded the research agenda: all innovations have energy as a first-class concern. But energy is measured in packets sent or time sleeping or bytes written. These are all crude approximates.
  • Want to measure energy itself. But this is challenging: collecting the results themselves, breaking the aggregate down into different components, and tracking activity on behalf of applications (and causal connections).
  • What is an activity? About connecting causal dots. An activity is a causally-connected set of operations whose distinct resource consumptions shuld be grouped together for accounting. Notion used in the Rialto OS, and also considered in Resource Containers.
  • Look at a toy example: the hello world example from TinyOS — blinking LEDs. First want to slice the energy usage by device (LEDs, CPU, etc.), and by logical activity (blinking the red LED, blinking the blue one, etc.).
  • Measurement is challenging. Wide horizontal (timescale) and vertical dynamic range. The particular mix is very application dependent. Vertical: 1uW to 50mW for different components.
  • Quanto uses iCount energy meter (IPSN paper). A microcontroller and counter attached to a voltmeter, switching regulator, a couple of capacitors. Simply count the cycles of a switching regulator.
  • Slicing challenge: breaking into parts… could add sensors to each domain, which is cumbersome and inaccurate. Not all energy sinks can be instrumented like this: variable voltages e.g.. Borrow notion of power state tracking: when a device’s power state changes, track that. Instrument device drivers, which export their power state through a very narrow interface making it simple to track changes. Estimate energy breakdowns with regression: snapshot the system-wide power states, the global energy use and the system clock for each transition. Generate a linear equation, which can be solved using weighted multivariate least squares.
  • Activity tracking challenge: there is a gap between what is measured and what matters. Previously people have looked at breakdown by subsystem, application, EIP and PID. We care about activities. Want to know why energy is being spent, and activity tracking answers this (gives us meaningful resource principles).
  • First annotate any abstraction which associates a label with an execution. Label is (origin-node, activity-id) pair. System software propagates these lables across subsystems, nodes, etc.
  • Causally-connected actions are painted with the same label. Initiate annotation with a runtime call to CPUActivity.set(LABEL_NAME).
  • Hard part. Issue of deferred computations: CPU might post a task (deferred function call), or put an object in a queue which will be dequeued by a worker. Modify task scheduler to add an activity field which is set on task posting and restored on task invocation. Queue items are tagged with an activity label as well.
  • Node-to-node communication: packets are tagged with a hidden activity field. Problem: every interrupt causes energy consumption before you know to whom to attribute it. Proxy activities are an ephemeral label, carried with a computation that hasn’t yet been identified, but which is later set when you know what label should go with it.
  • Concurrent activities on shared devices are described in the paper.
  • Look at a 48-second run of Blink. Compare oscilloscope ground-truth to the results from Quanto. R-values of 0.99 or better. Very small differences in the measured current per device.
  • Works well, but how much does it cost? Changed 38 files in the core OS, and added 13.
  • Space cost is 12 bytes per energy or activity sample, with an 800-sample RAM buffer.
  • Read system time in 19 cycles, read energy in 24 cycles and log a sample in 102 cycles (CPU running at 1MHz). The logging Blink app using 0.12% of CPU time, and 71% of CPU active time (CPU is almost always asleep, so using any CPU looks like using a lot of it). Energy cost is 0.08%, 0.41mJ. Quanto can monitor its own energy usage.
  • Can look at where have all the milliseconds gone? Noticed that CPU does 2x work for red LED as for green, and 2x for for green as for blue.
  • What’s the cost of false alarms in low-power listening? Send a preample that is longer than the listen-check period, to ensure that the receiver will hear it and wake-up. Could have noise that wakes up the receiver, wasting energy. How much, and how bad is it? Tried different channels and saw that channel 17 uses much more energy than channel 26, due to a few false positives. Supposed to be 2.2% duty cycle, but false alarms cause this to raise to 5.6%.
  • Does it work over the network? Wrote a ping-pong application between two nodes, and showed how the energy use propagates.
  • Found some bugs in TinyOS using power profiling.
  • Some limitations. Energy metering requires hardware support and calibration. Energy breakdown makes several assumptions. Activity tracking doesn’t support reentrancy.
  • A very simple bit of hardware plus a few lines of code lets you measure, slice, dice and track energy usage.
  • Future work is: deploying on new platforms, scaling from 2 to 1000 nodes, rolling out to the community. Research direction is to explore new frontiers in energy efficient sensor networks.
  • Question: some of the applications shown seemed fairly simplistic, so how would it scale with more complex subsystems, and how would it work? Saw too much data coming off the nodes. Rectified this problem and now do compression of the log data. Now looking at larger applications.
  • Question: most of Quanto did energy analysis offline, and benefits would come if you could do it online (and make decisions based upon it) so how do you think that you could adapt it to do it online? If you can do the regressions online, you can keep counters rather than all the log data, which would improve scalability. But wouldn’t that greatly increase the overhead? You would do regressions periodically, and future work will address that.
  • Question: you spend some effort in the device drivers to model the device state and find out the appropriate transitions etc. Do you want things from the hardware manufacturers to make it easier to extract this? Device drivers export power state information. If hardware had APIs to extract that information, that would be great. The fact that this is a small embedded system makes this easier; might be harder to do for something like a disk drive (and you might actually want to run Quanto inside the disk drive because of all the differnent activities going on in there).

Leveraging Legacy Code to Deploy Desktop Applications on the Web

  • Xax takes legacy code and uses it to enhance web applications.
  • Probably noticed that we are in the Web 2.0 world with AJAX doing interesting things in the browser. Gives a similar experience in the browser to what you used to need a desktop application for.
  • Microsoft is building Office in JavaScript. Google is doing something similar. Why?!
  • One reason is location and OS independence: can use it anywhere. Also have a strong notion of isolation: would be insane to run arbitrary code from a website, but web browsers let us do this (safely).
  • Write applications in a virtual execution environment. JavaScript (and Java and Flash) is type-safe which is safe but prohibits non-type-safe legacy code, typically written in C or C++ with unsafe pointer aliasing and unions, etc.
  • There are decades on knowledge in C/C++ code. Some components have multiple implementations (GUI framework, XML parser). Some are shipped with browser (JPEG decoder) and included in the browser to be used by the type-safe code. Heavy tail of valuable legacy code that isn’t ubiquitous (like a speech synthesiser, a RAW decoder for images, or GraphViz). They tried to port GraphViz to Java, but it was too much work to port and maintain it: but it’s still being maintained in a legacy application.
  • Could split application, putting graphical part on the client and run the legacy part on the server. Don’t get scaling that we would like to see and also introduces latency.
  • Could just run it in a client process (what ActiveX does). This is plainly wrong. Difficult to isolate the code effectively: a large surface inside the OS which is vulnerable to privilege escalation. Moreover, it isn’t OS independent. ActiveX is pretty deeply dependent on the Windows OS running underneath it.
  • No web-app framework provides legacy support, security and OS independence.
  • Likely you already have Firefox JS container and a Flash container… to these add a Xax container. Has four properties: an OS process with two features (MMU isolation from the browser and other applications, and no system call interface (a tiny hole that plumbs it to the Xax monitor). Use a device driver that disables the system call table for the “picoprocess” and plumbs all calls through to the Xax monitor.
  • The Xax monitor can give it more memory. All other services via the browser (so plumb the Xax monitor through to the browser). The Xax process looks like a remote web server: make an HTTP request to the Xax monitor and picoprocess.
  • A simple reduction argument about why it’s secure: no libraries are being shipped for the browser. If you patch a bug in the browser, it’s patched for Xax.
  • Added a platform abstraction layer on top of the minimal syscall interface, running inside the container. Not part of the TCB, but delivered with the monitor. Lets the application see a common interface, which gives OS independence.
  • Also get native CPU performance: no software overhead, limited only by the quality of the developers compiler and the hardware.
  • How do we get legacy code working in this environment? Will it really work for legacy code? Surely real programs make lots of syscalls! But it’s just a porting task: libraries and applications are portable, and the browser can be our kernel. C is portable!
  • Showed it working for a simple rasteriser. And then for OpenGL. And then for GhostScript. (Probably got the Adobe Reader, which is clearly part of the TCB, going by some of the privilege escalation bugs that have been seen in the past.) Putting GhostScript in Xax removes it from the TCB.
  • Got to the point where porting was fairly routine. Not much more than a few changes to the Makefiles.
  • Wrote a Python program using GraphViz to visualise IMDB. Ran the program on the desktop and it called 2700 syscalls to 29 different interfaces. To port, needed to disable static dependencies (e.g. on pango text processing library, might not be relevant; also X dependencies, so ask for an SVG file rather than an X window), restricting interfaces, rejecting unneeded syscalls (can turn off signal handling, for example, which Python uses), emulating syscalls internally (no filesystem, so ship data to Xax container in a big tar file) and letting the browser supplying remaining I/O (write Python bindings to JavaScript, reach into the DOM from Python, which gets sent to the JS in the browser).
  • Support 21 libraries and 3.7 million lines of legacy code. TCB is < 4500 loc. Xen, Java and Flash are 1–2 orders of magnitude bigger.
  • OS independence? Each demo came from Linux-based sourced code compiled with gcc to an ELF binary, but ran on Windows. Also taken Visual Studio-compiled binaries and run them on Linux.
  • Performance? Context switch times are ugly, especially in Linux, which uses ptrace to block system calls. Haven’t fixed it yet because overall performance doesn’t depend on context switches (do a large amount of work then ship bulk data to the browser). Full quantiative analysis in the paper.
  • Xax is secure and can increase security by subsuming existing trusted extensions (e.g. Acrobat Reader). It’s fast, deployable and portable.
  • Question: are there any limitations of this work? What would suck really bad? There is nothing you cannot do! Don’t want any special interface so that the security is never less than the browser. Performance limited by e.g. sending compressed PNGs of the OpenGL demo. If you have low-level blitting and sound interfaces from the browser, you could do better. Maybe put the browser renderer in Xax, or the Flash container to do video. Not sure how to use the GPU directly.
  • Question: the porting effort was stunning… presumably the chunk cut out must be huge. What portion of the lines of code are actually being touched? Certainly are pieces not being used. Don’t think it’s just 10%. Most of the machinery in OpenGL, e.g., are being used. Just chopping out the shim that talks to different interfaces. For each library, get most of the value out of it. Haven’t quantified which parts of the count count for the parts that will be used.
  • Question: what types of legacy code do you see being ported to this? What if it wants to do something a web browser doesn’t inherently support (e.g. opening a socket)? Not database servers, certainly. What about IM, Skype, etc.? Things like direct network access and disk access are orthogonal to whether the legacy code can be run. What should web browsers allow applications to do? Look at Google Gears, which adds a facility to talk to the local disk, and Xax could use this too. There are policy questions about what applications should be able to access. Would like to just port Office.
  • Question: existing frameworks like Flash have a fairly complicated security model, including whether Flash instances can intercommunicate, and also look at resource usage to prevent starvation, so there is a lot of infrastructure to limit the abuse that is possible. Have you thought about how to limit this? Not doing any resource constraints in the existing application. Most commodity OSs allow the monitor to ask the OS about this, and then throttle it.
  • Comment: for all graduate students who add features to the API to support their feature, this work shows how reducing the interface is possible and desirable. Thompson and Ritchie were right.
  • Question: could you do reasonable work to have communication between Xax silo? Will look at that, but haven’t yet.
  • Question: shouldn’t you just use a use-once VM? It’s extreme paravirtualisation. Should it just run on hardware directly? Considered that initially, as it makes porting easier, but you have a lot of heavyweight machinery, which you would want to optimise away and end up with something looking very like Xax.
  • Comment: chosen a different API, but there is an ABI as well, and could these concepts be applied by an ABI expert to turn a browser into a hypervisor.
  • Question: performance-related… how big is the tarball that you put together to run the legacy code, and how long does it take to deliver? Deliberately didn’t worry about that. Some of them were pretty big: didn’t try to optimise, so several megabytes. But we know how to cache and do differential compression, so there are plenty of techniques to make this less of a problem. Not a fundamental question about whether the technique will work.

Wide-Area Distributed Systems

FlightPath: Obedience vs. Choice in Cooperative Services

  • P2P is an increasingly popular way to deploy services, like internet radio, internet TV and file-sharing. Hard to ensure that every participant is following the protocol faithfully. Should tolerate Byzantine and Rational peers (e.g. free-riders).
  • Deal with rational participants informally by providing incentives and punishments, but there’s no formal basis for this.
  • BAR gossip is a P2P system that uses Nash equilibria to give a rigorous framework, at the sacrifice of performance.
  • Look at approximate equilibria as a trade-off between the two.
  • Epsilon (e) Nash equilibria. e = 0 ensures obedience is in the best interest, which offers few choices and hurts adaptability (to churn or transient failures). Cripple P2P to achieve game theoretic goals.
  • e >0 provides more choices and makes it possible for systems to adapt. Retain the rigour of a formal approach but also gives the opportunity for choices.
  • Compared jitter of FlightPath to BAR Gossip: much less for FlightPath.
  • Strict Nash approach makes it hard to reason about the protocol and encourages micromanagement. Increasing e improves this.
  • Application is P2P live streaming.
  • Basic trading protocol. Source sends stream updates to clients or peers. Peers run gossip protocol to disseminate updates between themselves. Updates are useful for finite time, the membership is static and everyone knows full membership lists. Everyone has public/private key pairs.
  • Measure jitter and average and peak upload bandwidth. Jitter is % of rounds when a peer failed to deliver everything it was supposed.
  • Assume rational peers derive benefit from delivering a jitter-free stream.
  • In each round, select a partner, exchange histories, and trade equal numbers of updates. Peer puts updates in a locked briefcase and sends it to the partner, along with a “promise” that the data is legitimate (from BAR gossip). Then exchange keys. This does not achieve no jitter 20% of peers get at least 4% jitter. For 200kbps data stream, use 100kbps extra bandwidth. Peak bandwidth is much larger than 512kbps (regular broadband).
  • Choices: avoided overloaded peers, avoiding departed peers, trading more when in trouble, erasure coding updates, preferring older updates to newer ones, etc. Made possible by shifting to e-Nash.
  • Need to temper choice with obedience. Allowing to choose partners permits more attacks on the system and allows rational abuse. Look at how we get controlled choice in selecting partners.
  • A membership list is divided into log N bins. A has a seed for round r, which is fed into a random number generator. Can further constrain partners by putting a constraint on the hash of the partnership pairs. Limits the number of Byzantine peers you could interact with.
  • This gives very low jitter, reduces overhead in average bandwidth and caps peak bandwidth.
  • Extending to dynamic membership, need to propagate membership changes. Divide rounds into epochs, tracker sends list of members to source, and the source turns the list into updates, sent out to trackers in future epochs. To join, talk to a tracker, and join the list for a future epoch, but causes some latency before you can begin trading. Let peers immediately begin trading by organising peers into tubs ordered by when they join. Trade with peers in the same tub and earlier tubs with some probability.
  • Evaluated on 400 emulab clients. Rounds of 2s, epochs of 40 rounds, 200kbps stream and 10 round deadline to update stream. Much less jitter than BAR Gossip. Impact on average bandwidth when nodes join is manageable and the tubs mean that this impact is fairly invariant (and doesn’t grow) as more nodes grow. Some jitter caused when nodes leave: 75% need to leave before the remaining ones are jitter… caused by the choice in partner selection. As churn per minute is increased, jitter gets worse, but nodes that have been in the system for longer are less affected than new arrivals. Jitter with attack only becomes bad (<4% nodes affected) with 16% malicious nodes. Several attacks explored, but this one hurts the most. FlightPath handles 10% of the system acting maliciously.
  • Based on the assumption that rational peers follow the protocol. Can have a utility function, based on jitter events per minute and the upload bandwidth used. Epsilon is defined as max utility gain from cheating over the utility from obeying. The maximum upload savings possible over the (expected benefit:cost ratio minus 1). Cannot provide small epsilon where the benefit to cost ratio is less than 3.36. For bigger benefit:cost ratio, can provide very small epsilon. Want e close to 0. Can think of this in terms of how much videos cost, and the cost of bandwidth.
  • The analysis provides rigour. Gives P2P live streaming with good stability, and tolerance to Byzantine and rational nodes, and churn.
  • Can join a live stream online.
  • Question: did you look at collusion between Byzantine attackers? For collusion between attackers to bring down the system, looked at all attackers acting at the same time, that was shown in the experiment. As for collusion to extract maximal data, the most effective ways to collude are a “beachhead” in which one peer is part of the system, and everyone else is outside, receiving forwarded data from the participant, but not giving anything back. Also a problem of reclusive nodes, but would need at least 30 nodes acting together to be a problem.
  • Question: challenge the assumption that we want equal contribution from all peers. Studies showed that people had great diversity in the resources available to them, and we should take advantage of the ones with an abundance of bandwidth. How well would you do without the tit-for-tat assumption? Currently working on dealing with network heterogenity, and it would be nice to do that, but there is a danger from the game theoretical perspective that adundant nodes would incur more cost, and it’s hard to incentivise this. Game theory is difficult.
  • Question: have you given thought to looking more at general content distribution (where utility stops being a step function based on a deadline and jitter)? It would be interesting to see how you could do bulk file distribution on a rigorous footing, but lots of this work wouldn’t apply, because we have focussed on timely delivery.

Mencius: Building Efficient Replicated State Machine for WANs

  • Replicated state machines are a widely-applicable approach. Got some servers and several clients. All servers will agree on the same result, using repeated consensus algorithms. Existing algorithms such as Paxos and Fast Paxos, etc.
  • Why wide-area? Want reliability and availability. Want to cope with the correlated failure of a colo due to fire, earthquake, etc. Also for consistency: would enable the coordination of wide-area applications.
  • Are existing protocols efficient enough in the wide-area? If not, how can we improve it?
  • Wide-area is more complex than simply not local area. Depends on distribution of servers and the distribution of clients. Fast Paxos is slower in terms of latency than Paxos with wide clients and local servers. Interested in the wide by wide case.
  • Assume up to f of n servers can fail by crashing. System load can be variable from each site, can be network-bound (large requests) or CPU-bound (small requests). Want low latency under low client load, and high throughput under high client load.
  • Paxos is a leader-based algorithm: one distinguished server. Leader gets a request, proposes it to the replicas, which accept it, then send a result back and the replicas learn the request. Low latency at the leader (2 steps), simple and with low message complexity. But high latency at non-leader replicas. And load balancing problem.
  • Fast Paxos in steady state: acceptances are broadcast among replicas, which saves a round. Load balanced and low latency at all servers under no contention. Under contention, risk of collisions, resulting in additional latency under contention. Also high message complexity, and requires more replicas.
  • Want a new algorithm that combines the pros of both. This is Mencius.
  • Mencius derived from Paxos: rotating leader, variant of consensus, and optimisations. Ensures safety (derived from existing protocol) but has a flexible design (allowing others to derive their own protocol).
  • Rotating the leader. Each consensus instance is assigned to a coordinator server. The coordinator is the default instance leader. Could use round-robin to assign this. Assignment scheme needs to be well-known. Server proposes client request immediately to next available coordinated instance (don’t wait for other servers). Server only proposes requests to clients it coordinates.
  • Benefits: all servers can propose requests directly. Leader is no longer the bottleneck (load balancing). Balanced network communication pattern.
  • Challenges: ensuring liveness while maintaining efficiency. Servers may observe different client loads, but requests can’t be committed until seeing all prior requests. Servers can skip their turns (propose a no-op) which leaves no gaps in the history. Challenge now to cut down the cost of skips: use simple consensus (restricted form), in which a coordinator can propose either a client request or a no-op, and non-coordinator can only propose no-op. So no-op can be learned in one message delay if the coordinator skips, and it is easy to piggyback no-ops to improve efficiency (essentially no extra cost).
  • If failure, faulty servers cannot do skips, so the solution is revocation in which non-faulty servers propose no-ops on behalf of faulty servers. Can optimised by revoking in advance to reduce delay, and revoking one large block at a time to amortise cost.
  • Mencius achieves all the good properties of both Paxos and Fast Paxos.
  • May have delayed commit which adds up to one round-trip extra delay (if two people propose at onces). Can have out-of-order commit if requests are commutable. This is a benefit of using simple consensus, and reduces latency.
  • Experimental evaluation: Mencius versus Paxos. Implemented in C and C++. Clique topology for three sites and a simple replicated register service. k registers: Mencius-k with rho bytes of dummy payload (rho = 4000 was network-bound and rho = 0 was CPU-bound). Wanted commutable requests (updates to different registers). Out of order commit option, and clients send requests at a fixed rate.
  • Maximum throughput for Paxos is 400 operations and latency goes to infinity. Mencius can achieve better latency, and doesn’t max out so quickly.
  • Mencius has a 200% improvement in throughput when network bound, and up to 50% improvement in the CPU-bound case. Enabling out-of-order commit hurts throughput.
  • Throughput measured under failure. Have time between when a failure is suspected and it is reported. Small decrease in throughput when a failure of a follower is initially suspected with Paxos. Big drop in throughput when Paxos leader crashes. Stall while leader fails, and also caused by duplicates. Mencius has a temporary stall when any server fails, then recovers quickly when the failure is reported. There is a peak after reporting caused by delayed commits. Then throughput decreases by the proportion to be expected by the decrease in resources, but still better than Paxos.
  • Look at fault scalablity by increasing the number of sites. Looked at 3, 5 and 7 sites. Mencius gets better throughput as the number of sites increases, whereas Paxos drops off.
  • More results in the paper (showed lower latency at all servers under low contention, and adaptability to available bandwidth).
  • Question: in the diagram of the recovery from failure, why was the throughput lower after failure? Mencius can use all links to improve throughput. 3 servers before the failure and 2 afterwards.
  • Question: what happens when a client contacts multiple coordinators (with rotating coordinators)? Best thing to do would be to contact a single server (unless it fails). So you may have to send requests to multiple failures, so replicated state machine would have to handle idempotency of requests.

OSDI 2008: Day 2

Wednesday, December 10th, 2008

File Systems

SQCK: A Declarative File System Checker

  • File systems are becoming larger and must be reliable, but they still get corrupted (hardware errors, firmware bugs, FS bugs, etc.), and require immediate repair to get out of an inconsistent state (which can end up causing even more damage).
  • Who should repair this? Is journalling the solution to every problem? Write-ahead log only prevents write inconsistencies due to system crashes, but this doesn’t cover all problems. Should the FS repair itself online? There isn’t enough machinery in the FS to do this: all it can do is, e.g. mount itself read-only. So we need a utility, and we have one: fsck. A trusted utility which is the last line of defense for bringing back consistency. XFS says it never needs fsck, but it uses it in the end, due to these other problems. All file systems have their own variety of fsck.
  • Fsck is complex: it has a big task, must read all FS metadata and turn any corrupt image to a consistent image. e.g. is a data-block being shared by two inodes? Complexity arises from them being implemented in C, which is hard to reason about. Ext2 fsck does 150 checks in 16kloc; XFS does 340 checks in 22klock. Looks like hundreds of if-check statements. But fsck is untouchable: you can’t modify it because it is (i) crucial and (ii) changing it might introduce new problems.
  • Are current checkers really reliable? If not, how should we build robust checkers?
  • Analysed e2fsck, which does inconsistent repair (sometimes), making the file system unreadable in some cases. So fsck makes an incorrect repair, making the FS more inconsistent. Sometimes it is consistent but not “correct”, deleting valid directory entries and loses a huge number of files.
  • Lesson: complexity is the enemy of reliability. A big task with a bad design leads to complexity and unreliability. Want a high-level approach to have simplicty. So use a SQL-based declarative language to write checks, which means fewer lines of code. Evaluated the simplicity (150 queries vs. 16kloc) and reliability. Also showed flexibility and reasonable performance.
  • e2fsck must cross-check all ext2 metadata: an indirect pointer must not point to the superblock; a subdir should only be accessible from one directory. Injected a single corruption at a time and tried to analyse ho e2fsck repairs this single corruption. Only corrupted on-disk pointers (make indirect pointer point to superblock; make directory entry point to another directory). Usually a corrupt pointer is simply cleared to zero.
  • Inode has an indirect pointer that should point to an indirect block. But if it points to a superblock. First e2fsck finds that the point is invalid, and zeroes. But if it does point to an indirect block, it has to check the content of the indirect block (checking if any entry points outside the FS range), and garbage pointers are zeroed. At least, this is the ideal.
  • So in e2fsck, if the inode indirect pointer points to the superblock, it will assume that the superblock is a corrupt indirect block and zero some of the entries in the superblock. Then it will check the indirect pointer, find that it is pointing to the superblock, and zero it. The incorrect ordering fatally corrupts the FS.
  • Can also have consistent but incorrect repair. We have information about the .. pointer of directories, so we can identify when two directories point to the same subdirectory, and can resolve this. But e2fsck actually chooses the directory with a lower inode number, and then rewrites the .. pointer. e2fsck doesn’t use all available information.
  • Also policy-inconsistent and insecure problems (in paper).
  • These are not simple implementation bugs.
  • Fsck must do hundreds of checks, and some complex cross-checks, between single and multiple instances of the same and different structures. But all of these must be ordered correctly.
  • In SQCK, use a declarative query language (i.e. SQL) because it makes high-level intent clear. And it is fit for cross-checking massive amounts of information. Simplified e2fsck into 150 queries. Each check/query is easy to understand, and it is flexible, enabling the plugging in and removal of differnt queries.
  • Using SQCK, take a FS image, load all of its metadata into temporary database tables, created fresh when SQCK starts running. e.g. inodeTable, dirEntryTable. Checks and repairs run on the database tables. Finally flush any modifications and delete tables.
  • Example of cross-checking a single instance of a structure: find a block bitmap that is not located within its block group. Basically a range check on whether the block bitmap is within the first and last block of the block group. Simple SELECT … NOT BETWEEN … query.
  • Example of finding false parents of subdirectories. The e2fsck code is wrong here. Join on parents and children where there are two parents with different inodes.
  • Declarative repairs must also be performed. Some are just update queries, where a table is updated with a new value in some field (and a dirty bit is set). But a repair might also be a series of queries, which can be combined with C code, though the actual work is done in the SQL.
  • 150 queries, 1100 lines of SQL statements. The new checker passes hundreds of corruption scenarios. Easy to add new checks and repairs, to get different versions of e2fsck. Introduce some performance optimisations: total runtime is slightly longer (up to 50% more than e2fsck), based on MySQL prototype. Future optimisations include hierarchical checks and concurrent queries.
  • Question: in paper, the MySQL is held in a 0.5GB ramdisk; given an FS of some size, how big need the ramdisk? Half-full terabyte-sized disk needs a 0.5GB metadata DB.
  • Question: what about the complexity of the scanner-loader which has to understand the entire FS? e2fsck also includes 14kloc which do this scanning, not counted in the comparisons.
  • Question: what about ordering the SQL statements to ensure that correct repairs? Haven’t looked at automating the ordering, but it is easier to reimplement different orders.
  • The times include only the times for loading the DB, scanning and performing the checks, assuming correct FS.
  • Question: could flushing introduce errors? e2fsck doesn’t use journalling to introduce repairs, so a crash of e2fsck would cause a (different) inconsistent state. But could make the DB flusher use a journalling approach quite easily.
  • Question: wouldn’t it be better to declare what the FS should look like, rather than how to check it? Started that way, making our own declarative language, but thought it would be less work to use SQL, which has been honed by the DB community.

Transactional Flash

  • Want to replace disks with solid-state devices, but they are quite different. However today they are used as traditional disk replacements, but this misses an opportunity to export new abstractions that better use the properties of SSDs.
  • TxFlash provides transactional guarantees to higher software layers. WriteAtomic abstraction updates multiple pages atomically. This can simplify the software layer of transactional systems. Today use CoW or logging to provide these guarantees, but TxFlash can move this to hardware.
  • Could use a traditional log-based protocol, but this paper introduces a new protocol: cyclic commit.
  • Key results: implemented TxFlash on a simulator and an emulator running on a real SSD. Want to understand how cyclic commit performs compared to traditional commit. Improves by 95% for small transactions. Also want to know the overhead of transactional Flash. Overhead is <1%: essentially free. And what is the end-to-end performance improvement. The running time is reduced by up to 65% for I/O intensive workloads.
  • Architecture of an SSD: a controller connected to a set of Flash packages. No moving parts, giving fast random accesses. The memory is non-overwrite in nature: must first erase before writing, and cost of erasure is high, so better to use a log-structured file system. Each page has a 4KB data part and 128-byte metadata (checksum, ID, version number, etc.). Traditionally export read and write. TxFlash exports WriteAtomic and Abort.
  • Programming model consists of only writes, and no reads. Straightforward to extend to a more general transactional construct.
  • Suitable for transactional support: non-overwrite pages mean minimal extra overhead in tx-support. Also, CoW and logging lead to fragmentation but Flash has fast random access. Plus, since SSDs are quite new, there is an opportunity for new abstractions.
  • Commit protocol is used to assure atomicity of multiple updates. Must work correctly in the presence of transaction aborts and multiple failures. Need a recovery module (run when system starts to regain consistency) and garbage collector (to remove old versions of pages).
  • Traditional log-based protocol will first write data that must be committed, and then write a log entry to say that the data has been written. Recovery is simple (scan the log to replay transactions if need be) and the garbage collector is simple (just copy valid page versions to fixed locations). Problem is the space overhead with the per-commit block for each transaction. Also a performance overhead of write ordering, which reduces parallelism. Overheads become large for small transactions.
  • T1 updates pages A, B, C. The TxFlash device initialises the pages in memory and stores a point to the next logical page of the transaction (from A to B, B to C and C to A). All pages are written in parallel which means we have a cycle and therefore have a successful commit. If we have a broken cycle, this doesn’t quite imply that the transaction aborted. Could have a point from A to B and B to C, but no version of C (and pointer to C). But what if A, B, C committed, then C, E, F was committed? The previous version of C is obsolete, but this means that the A, B, original-C cycle is broken, which looks like the abort case.
  • Simple Cyclic Commit: if a page version exists, any previous version of the same page present or referenced in the storage must be committed. Aborted pages and their references are erased before writing a newer version. Resolves the broken cycle ambiguity: the presence of C’ means that C has been committed, and if the original version of C is committed, then T1 is committed. In the abort case, the edge from B to C must be deleted.
  • Recovery and garbage collection is easy to understand and implement, and does not add extra code complexity. The recovery module actually performs better.
  • Weakness: need to erase all aborted versions before creating a new version for a page, which is costly under a high rate of transaction aborts.
  • Back-pointer cyclic commit improves this (details in the paper).
  • Implemented on a simulator and an SSD-based emulator. TxFlash simulator extended from previous work on SSD. The emulator is a pseudo-device driver which exports the WriteAtomic() API. TxExt3 is a modified ext3 file system that uses WriteAtomic.
  • Transactions per second improves with cyclic commit (simple and back pointer versions perform similarly): improvement better when small transactions, but approximately equal as tx-size tends to infinity.
  • No appreciable overhead for IOzone, Linux-build, Maildir benchmarks; and very small overhead for TxFlash on TPC-B due to metadata. Using traditional commit, Maildir and TPC-B show larger overhead. (Compared to a regular SSD, non-transactional.)
  • End-to-end impact: compared to regular ext3 on an SSD, TxExt3 on TxFlash improves run time by up to 65%, but for Linux-build the impact is negligible (compute bound).
  • TxExt3’s JBD is less than 50% the size of ext3’s JBD (journalling module). Most reduction from removing journalling and revoke modules.
  • Advances in hardware devices can lead to new abstractions and advances in the software layer. SSDs are suitable for transactional constructs with low overhead and code complexity. TxFlash simplifies software and improves performance of end applications. Cyclic commit is an alternative to traditional commit for systems with small transactions.
  • Question: concerned about the deployment incentives that would enable this to be deployed in the real world (an OS would require the code to support transactions on legacy devices anyway, so the simplicity argument doesn’t hold)? Some specialised OSs (like those for the Xbox on game controllers, or phones) don’t want to pay cost of implementing all this in software, so TxFlash might apply here.
  • Question: have you given thought to what mechanisms might be used to identify transactions over existing interfaces? Or could you export a fuller view of Flash (including metadata) up to a host? Insufficient to export the metadata to a higher-level software layer, so either you export all of the control, or do it at the device level and save the complexity from the software.
  • Question: not sufficient just to make sure there are no versions of the aborted page, but there have to be no pointers. How do you find all the pointers? Maintain metadata structures in the system which let you know the last aborted transaction for a particular page. This is sufficient because it is the only thing you have to erase when committing a new version.
  • Question: the microbenchmarks suggest that the best improvement is 2x, but the macrobenchmarks suggest a 3x improvement. How come? Benefits arise from not doing additional checkpointing (moving data from the log to the actual location). Also from reducing the write ordering problem. Microbenchmarks only show the benefit from eliminating write ordering, or separately for removing the checkpointing. The end-to-end benchmarks showed both.
  • Question: the SSD has software to process the FTL. Would you want to move things to the FTL to improve energy costs or write performance? If you remove code from the FTL, the job of the FTL (like remapping, cleaning, load levelling) has to be moved to the software layer. Doing this might not be a great idea, because not all Flash devices would want this complexity moved to software (especially for resource-constrained devices). Simply return the wrong parents and make the appropriate repairs.

Avoiding File System Micromanagement with Range Writes

  • Today’s file system views a disk as a linear array of blocks. This gives a simple-to-use interface but impacts performance because it hides some details.
  • High-level goal is to place a goal near other related blocks, but it can only make a low-level command to write a block to a specific address. Classic example of micromanagement.
  • Cost of this: unnecessary positioning time, because the FS knows what is free, but the disk knows what is close to the current position.
  • Previous solutions: block remapping, whereby disks select the closest block on the fly and maps it to the requested block, requiring metadata. An object-based interface deals with writing objects to disk but also requires lots of metadata.
  • Want FS to specify a set of options for blocks to write to, and lets the disk select the best from among these. The result is an efficient write with minimal change.
  • How do we specify range writes? Normally would specify an address, the data and its length. Simple way would be a pair: start address and end address. More sophisticated would be to specify a list of pairs, or a large range with an exclusion bitmap. Need to get back from the disk a status (successful?) and the address to which it was written.
  • Case of overlapping ranges for two range writes: the disk must note that it has selected a block from the overlap and ensure that that block is masked out from subsequent writes. Need a write barrier to allow the disk to forget which blocks were written and flush its metadata structures.
  • Disk scheduler should consider all options within each range write, but select only one. Traditional scheduling has a queue which is scanned and blocks are selected to some policy. Expand-and-cancel scheduling changes a range write into several possible writes in a regular disk scheduling queue (then cancels all other options when one is selected), which is easily implemented in standard disk scheduler. Added 1000 loc to Disksim, for a particular HP disk (relatively old, but still applicable).
  • Do multiple requests solve the positioning time problem? Use an SPTF scheduler with workload based on random writes and writes spanning across block group. As the queue size grows, total time gets less. The seek time reduces but not as much as the rotation time. The problem remains with multple requests, and is greater at smaller queue size.
  • Evaluated scheduler with range writes, with random writes, range spanning a track and writes spanning across block group. Now the rotation time is reduced for all queue sizes.
  • What is the benefit of blockgroup-sized range writes? The seek and rotation times are significantly reduced for all queue sizes.
  • How to incorporate range writes into a filesystem? Need to preserve sequentiality, determine proper ranges and handling delayed address notification.
  • Loss of sequentiality may impact read performance. Solution is to issue larger writes.
  • How to determine the size of the range? Larger range implies more benefit. Solution is to report largest range that best matches layout policy.
  • How to handle delayed address notification? The FS knows the address after the write completion. So there are ordering constraints: data block must be written before inode. So only use range writes for data blocks, not metadata, e.g.
  • Evaluated on an ext2 FS simulator on a modified disksim. For first data block of file, give the disk some flexibility. The next blocks are, as far as possible, written sequentially.
  • Evaluated on a synthetic workload of 1000 4kb files on an empty FS. Slightly better performance for the single process case, but much better when multiple processes are writing files at once. The layout that results is similar in both cases (so the high level policy is the same, but we get better performance).
  • Also evaluated on untar, PostMark and Andrew benchmarks. 16–35% improvement on untar and PostMark. Andrew has more reads, so the performance is very slightly worse.
  • Journalling system case study. No seek delay but some rotation delay. See a sequential workload. Approach is to convert a write-ahead log to a write-ahead region. Trick is to write to next rotationally closest location. Add a software layer between ext3 journalling layer and disk, which uses a disk performance model, which spreads out the transaction across the disk and reduces rotational delay.
  • Mounted ext3 journal on Bark (software layer) as a pseudo-device, and ran TPC-B on it. Improves performance in both cached and non-cached cases. Also improves the distribution of write times: 99% of requests have write times less than 2ms (when rotation time is 6ms).
  • Want to look at RAID integration, flash SSD integration, and Range Reads.
  • Question: change the disk interface and change the file-system to improve write latency… shouldn’t I buy a cheap disk and put some Flash in front of it? Existing disks could benefit from this approach, and this could improve the situation on Flash as well.
  • Question: what is the right way to handle dependencies between data writes when you need to know the location of preceding writes? Have seen this problem already in the file-system, so higher levels should take care of this ordering.
  • Question: if you ever do overwrites, you’ll need to do a second write? Don’t address this, but it is an interesting area for future work. Were all evaluations done using empty file systems, which have fortuitous fragmentation properties? Evaluated with half-full systems in the paper.
  • Question: range writes might not be helpful on Flash in front of the disk, because rotational latency becomes less important and you should maybe even use slower-rotating disks for better power consumption.
  • Question: is it true that you’ll get more fragmentation? Haven’t really looked at this.

Programming Language Techniques

Binary Translation Using Peephole Superoptimizers

  • Binary translation is ability to run software for one ISA on hardware implementing a different ISA. Applications include portability, virtualisation, backward- and forward-compatibility, on-chip binary translation and JVMs. Could exist at different layers of the stack: in a hypervisor, in the OS kernel (e.g. Rosetta) or as a user-level application (like a JVM).
  • Difficulties are performance, complexity of ISAs, retargetability (want to target different source or destination ISAs) and OS compatibility. Concentrate on performance, complexity and retargetability.
  • Technique is peephole superoptimisation.
  • Superoptimisation is a term from 1987 to describe a code generator that does brute-force search to find the optimal code for a translation. Enumerate all sequences up to a certain length, then compare each enumerated sequence with target function for equivalence. Originally scaled up to 12 instructions in a sequence. To support all instructions in an ISA, you can typically do 3 or 4 instructions.
  • Peephole superoptimisation is using a superoptimisation to infer peephole optimisations. This is pattern matching, when you search for a pattern and have a replacement template.
  • Get the instruction sequences and scan for sequences that can potentially be optimised. Canonicalise and store them.  Then do brute force optimisation on each of the target sequences, search for optimisations in a brute-force manner and check for equivalence.
  • To test for equivalence (in: two sequences, out: are they equivalence). First do an execution test on random input to see if they result in identical outputs. If you pass this, use a satisfiability solver to check equivalence. Usually fail the first test, so this is efficient.
  • Finally generate a table of peephole optimisations.
  • Approach is to use lots of peephole transformations to translate code from one ISA to another. Need a PowerPC to x86 register map (e.g. r1 -> eax; r2 -> ecx) for each peephole optimisation. This is the main difference between a peephole optimiser and a binary translation, and needs to be chosen wisely. The best code may require changing the register map from one code point to another. The choice of the map affects the choice of instructions and vice-versa.
  • Assume a cost model for instructions (register access versus memory access). Also capture the cost of switching registers to and from memory (because more registers on one ISA).
  • Could use a greedy strategy for register map selection, which minimises switching cost for each subsequent instruction. But this doesn’t give the optimal solution, in terms of instruction costs (and overall). So use dynamic programming to attempt to reach a near-optimal solution. It accounts for translations spanning multiple instructions and simultaneously perform instruction selection and register mapping.
  • Implemented a static user-level translator from ELF 32-bit PPC/Linux binary to ELF 32-bit x86/Linux binary. Translate most (but not all) system calls. Use PPC emulator for execution tets, and zChaff SAT solver for binary test.
  • Implementation issues: endianness, so need to convert all memory writes to big-ending (source) and all memory reads to little-endian (destination). Also compiler optimisations: PPC optimiser staggers data-dependent instructions to reduce pipeline stalls, but the solution is to cluster data-dependent instructions in basic block before translation.
  • Evaluated with the same version of gcc, using soft-float library and statically-linked input executables. Did microbenchmarks and ran SPEC CINT2000. Compared against natively-compiled code and against other binary translators (QEMU and Apple’s Rosetta).
  • 750 rules in the peephole table, but cost of constructing that is amortised across every run. Used same optimisation settings on both architectures.
  • Microbenchmarks include sort algorithms, binary search, Towers of Hanoi. Minimum performance is 61% of native; best performance is 319% of native (for fibonacci, under an unrealistic optimisation). Sometimes gcc generates better code for PPC, due to the preponderance of registers, and the register mapping algorithm does a good job of remapping on the CISC architecture.
  • On SPEC CINT2000, minimum performance is 42% of native, median is 67%, and best is 167% (no optimisation on twolf, using soft-float). Some translations failed under compiler optimisations.
  • On average, 3% faster than Rosetta for -O0; 12% faster for -O2. Included cost of translation when running on QEMU and Rosetta, their system does static translation. Takes 2–6 minutes to translate a 650KB executable with ~100K instructions. Could reduce this to <10s, because for 98K instructions on which 0.01% of the time is spent, we could use any register map.
  • Future work to look at JIT compilation and machine virtualisation.
  • Question: what pattern matching algorithm do you use, and would optimising this improve performance? Use something very simple, and an improvement could be made.
  • Question: is peephole superoptimiser thread-safe? It’s a code generator, so this doesn’t really mean anything. Designed in such a way that the translation should work on multi-threaded applications, but haven’t tested this.
  • Question: what about proper handling of synchronisation between two platforms, such as moving a register to a memory location and vice versa, so do you look at this? Have not been pulling memory locations into register (because x86 has fewer registers), so have not looked at this.
  • Question: how much of the optimisation techniques would work in a dynamic translator? The reason for using a static translator was for simplicity and the separation of translation and run time. But no reason why this couldn’t work in the dynamic case.
  • Question: can you really handle equivalence testing in the case of aliasing? The execution test could suffer problems with aliasing, but the satisfiability solver should catch this. See previous ASPLOS paper.

R2: An Application-Level Kernel for Record and Replay

  • Debugging tool for recording a run of execution and replaying it for debugging. Because it is difficult to reproduce some bugs when re-executing. It’s hard to apply comprehensive analysis with no interference at runtime (e.g. triggering timeout logic).
  • In a replay, exactly the same output will appear, and the same bug will arise. R2 captures nondeterminism, and makes it happen the same during the replay.
  • Have seen virtual machine approaches, which replays the application and the OS, but this is difficult to deploy. Also a library-based approach which replays the application only, but cannot replay some system applications (like asynchronous I/O).
  • Library-based tools inject a library into the application which do syscall interposition, and writes the results to a replay log. On replay, the interposition switches to consulting the log instead. But this doesn’t record all non-deterministic instructions (e.g. spinlock assembly code) or functions with unclear semantics (like socketcall and ioctl), so you have to write application specific stubs. Can also be too heavyweight, generating large logs for I/O intensive applications.
  • R2 allows developers to select functions that can be easy-and efficiently replayed. Abstract assembly code to higher level functions, and identifying calling functions with higher level semantics (recv rather than socketcall).
  • Selecting a replayable set of functions with the goal of capturing nondeterminism. Need to make a cut through the call graph. Need to track data dependencies between functions (e.g. if you call socket(), need to use the replayed value in read()). Possible to make several incorrect cuts, but came to a set of rules that developers can follow. Rule 1: any non-determinism should be below the interposed interface. Rule 2: all instances of unrecorded reads and writes to a variable should be either all below or all above the interface. Can be difficult to follow these rules in a complex call graph. Module APIs are good candidates, as they encapsulate internal state well.
  • Next challenge is to keep a deterministic memory footprint. Need the same memory addresses in both record and replay to ensure that different control flow isn’t followed. Need to make sure malloc returns the same addresses in both executions. Tool and application in same address space, and the tool runs differently during replay, so get different malloc behaviour.
  • Split space approach: system space (libraries and R2) and replay space (app). Split the user memory address space into two memory pools: native memory pool and deterministic memory pool. malloc in each case is appropriately mapped to each of these. R2 allows code annotation to ensure deterministic memory addresses.
  • Another challenge: deterministic execution order in multi-threaded applications. Especially in conjunction with dynamic memory allocation where this is kept deterministic (replay writing into a buffer before it is allocated). Need to maintain an inter-thread happens-before annotation: callback annotation for functions and sync annotation for synchronisation primitives.
  • Three categories of annotations: data transfer, execution order and optimisation.
  • Annotated prototypes converted via a PHP code template to a C++ code snippet.
  • R2 implemented on Windows in 20 kloc. Annotated the Win32, MPI and Sqlite interfaces. Executed on challenging system applications, including a web server, database, distributed system, virtual machine and network client.
  • Evaluated annotation effort, overall performance and effectiveness of customised interface.
  • One week to annotate the first 500 Win32 prototypes. 2 days to annotate MPI and SQLite. Cost of annotation is amortised.
  • 1% overhead of the stub code, 9% overhead when recording. Log size is 300x greater than the SQLite database looked at. Solution is to choose an interface with less I/O (use SQLite query interface rather than filesystem interface). Log size is now comparable to the database.
  • Space split also helps other in-process tools, e.g. in a model checker. Annotation and code generation also used in other projects.
  • Question: the interface choosing rules seem useful, but a human would screw it up, so is there any static checking or automatic for choosing the interface cut? Assumption is that you can use a modular API to select these interface. Latest work is to use compiler techniques to select these. Also to minimse log size.
  • Question: could the tool be used in production for more complex workloads, since the overhead for the microbenchmark is helped by caching? Have seen 10–20% overhead for larger benchmarks. Also, a problem is that the log size may become very large, but they have looked at checkpointing techniques.
  • Question: could you describe the kind of bugs that R2 does not replay faithfully? One problem is that you assume that, as an optimisation, the local disk operation is deterministic for read-only behaviour, but if during replay the file is missing, then you fail. What about data race problem? Assume that programs are data race free, and these may cause some problems to fail. Can always detect these problems, but not always reproduce them.
  • Question: do you need to reannotate new versions of applications? No, just annotate APIs (like Win32) but you have to annotate some interfaces by yourself.

KLEE: Unassisted and Automatic Generation of High-Coverage Tests for Complex Systems Programs

  • Awarded a best paper award.
  • Hard to write correct systems code, it has tricky control flow, complex dependencies and abusive use of pointer operations. Also lots of environmental dependencies, which the code has to anticipate. Also a possibility of malicious users.
  • KLEE is designed to check such complex systems code. Fully automatic symbolic execution tool, exploring a large number of paths in a program. Finds many bugs, and generates test cases for exploring tasks, which gets over 90% line coverage (for Unix utilities).
  • KLEE runs a program on unconstrained symbolic input. As the program runs, constraints on the symbolic values are generated, and symbolic execution is forked (adopting constraints) when branches are hit. Can then generate a test case for each path that may be followed.
  • KLEE starts with C code, feed it into LLVM compiler to generate LLVM bytecode. KLEE takes the symbolic environment and the LLVM bytecode to execute the code, interacting with the STP constraint solver. Outputs a collection of test cases.
  • Real programs have an exponential number of paths. Need to use search heuristics. Coverage-optimised search selects path closest to an uncovered instruction and favours paths that recently hit new code (as they are likely to hit new code in the future). Also random path search (in the paper).
  • Constraint solving is an inherently hard problem (NP-complete) and needs to be invoked at every branch. Two simple, effective optimisations (15x speedup).
  • Eliminate irrelevant constraints before sending them to the constraint solver, which works because each branch depends on a small number of variables. So eliminate the constraints which have no effect on the branch.
  • Also cache the results of the constraint solver for later branches. With a static set of branches, there are lots of similar constraint sets. If we see a subset of some constraint set, eliminating constraint does not invalidate a solution, so no need to re-solve. Adding constraints often doesn’t invalidate solutions as well.
  • Problem of interacting with the environment (e.g. reading data from the file system or the network or the end user). If every argument is concrete, forward a syscall to the OS. If they are symbolic, provide a model that can handle symbolic inputs: want to explore all possible legal interactions with the environment.
  • Easy to write these models. A symbolic file is represented by a symbolic buffer and an offset initialised to zero. Reads become memcpy()s. Can just write plain C code that is run by KLEE, which allows users to extend or replace the environment with no knowledge of KLEE internals. Currently implement files, command line arguments, pipes, etc.
  • Looked at the GNU coreutils application suite. 89 stand-alone applications (v6.10). Many functions, different authors, major environment interaction, and heavily-tested, mature code. Challenge to find bugs in this code. Most are between 3000–4000 loc. Some up to 10000 loc.
  • Methodology: run KLEE fully automatically on each application for one hour. Run the test cases on uninstrumented utility and measure line coverage using gcov. Coverage measurements are not inflated by potential bugs in KLEE.
  • Minimum coverage is ~63%, overall is 84%, average is 91% and median is 95%. 16 applications were 100% covered.
  • Coreutils also has an effective regression testing suite. Manual testing achieves an average of 68% coverage. In comparison, KLEE loses on 9 applications, and won on 80 of them.
  • Looked also at the Busybox suite for embedded devices (91% overall, average 94%). Lost on a single application and won on everything else. (Regression suite got 44% coverage with manual effort.)
  • Found some bugs in coreutils. 10 crash bugs (more than were found in the last three years, combined). KLEE also generates command lines which expose these crashes. Showed ten sample inputs which would cause segfaults.
  • In 453 applications, with 430 kloc, found 453 bugs (also looked at Minix and the HiStar kernel), and many of these have been reported. Reported 56 bugs in total.
  • KLEE is not limited to finding crash bugs, but can also find higher-level correctness bugs. Can prove asserts on a per-path basis. Constraints in KLEE are complete and bit-accurate. An assert is just a branch, so KLEE can look at the feasibility of each side of the branch. If the assertion failure path is infeasible, then the code is correct.
  • Can also use this to cross-check applications: two functions which implement the same interface, then assert(f(x) == f’(x)). If KLEE terminates without an error, the paths are equivalent. Otherwise, a mismatch is found. Were able to test coreutils and Busybox utilities with conformance with IEEE Std. 1003.1. Many paths were equivalent, but some had output mismatches.
  • Lots of important related work in many research groups. Distinguished KLEE by testing hundreds of benchmarks, and showed extensive coverage. Also the idea of cross-checking is new, as is support for a symbolic environment.
  • Question: about the long term goals, what are the scalability issues with looking at applications that are larger than the ones tested in the paper? A very large program thrown at KLEE will suffer from exponential explosion of paths, and difficult constraint solving. But there is not necessarily a correlation between the size of an application and the difficulty for KLEE of solving it. Can write a 5-line program that really blows up. The comparative advantage over manual testing may even improve.
  • Driver for something like HiStar is necessary, but can be inferred automatically (e.g. symbolic environment of file system can be taken from the contents of the actual file system).
  • Question: how do you generate a database for testing a database server? Interactions with a DB are symbolic. Previous work marked the FS as symbolic.
  • Question: where is the bottleneck as you scale up the size and complexity of the checked system? Constraint solver or number of states? Working close with constraint solving groups to better characterise constraint solving load. Some benign looking problems turn out to take a huge amount of time. KLEE uses an implementation of CoW at the object level to help with states. For coreutils, can have 100000s of concurrent states.
  • Question: did LLVM buy you anything over another symbolic execution technique? Used to do source-to-source translation. LLVM has useful tools and is a real-world compiler which can be helpful with real code.
  • Question: would running KLEE for 24 hours have beaten manual testing for the cases that a one-hour run didn’t? Some interactions with the environment are not effectively supported. Done at the syscall level, but might have been better to do at the libc level (but syscall interface is smaller and easier). Some problems with the constraint solver meant that running for 24 hours wouldn’t nearly be enough, due to exponential explosion in time. Ongoing work to improve this.
  • Question: what about proper checks vs. last-minute checks (in bulletproof code)? Can write checks in C code which gives the developer flexibility.
  • Question: what is the length of the error path for the 56 bugs? For many of them it varies, but 100s of branches must be hit in order to hit the bug. Very complex to find and reason about.
  • Question: did you try to apply tools in the related work to see if they would uncover the same bugs? No. Many of them are not available. Others are not directly comparable, as they aren’t symbolic execution tools.
  • Question: what does KLEE mean? Don’t know! Daniel came up with the name: perhaps a Dutch work. (van Renesse: apparently not a Dutch word.)

Security

Hardware Enforcement of Application Security Policies Using Tagged Memory

  • Software security is in a crisis. No good abstractions to enforce application security policies, so application software has to do it itself, leading to TCB bloat. Large software systems have millions of lines of code in the TCB, which makes it difficult to eliminate bugs or prove correctness. Linux has gone from 170000 lines to over 6 million in the latest version.
  • Virtualisation often used to minimise TCB, which is good for partitioning applications, but not totally adequate.
  • Current hardware is also inadequate. Kernel data structures require fine-grained protection (e.g. process data structures). Some struct members may be globally readable, but not writable. Or permissions should only be given to owners. One possibility is page-alignment, but this is complex and expensive. Paging also does not associate policy with physical resources (problem of aliasing).
  • Proposal is tagged memory. Associate each word of memory with a 32-bit tag, and map tags to RWX permission for protection domain. Gives fine-grained access control, and simplified security enforcement. Helps even with a compromised OS. Policies are tied to physical resources. Allows for a smaller TCB.
  • Evaluated by building a full system on an FPGA.
  • In proposed system, kernel is shrunk to a trusted security monitor (TCB for entire system), responsible for labelling all resources. One or more kernels providing abstractions but no longer fully trusted. System resembles a distributed system.
  • Prototype: LoStar, based on Loki + HiStar. HiStar places all OS abstractions on top of a few types of objects, which can be labelled and control r/w access. Security monitor translates these labels into hardware tags and controls access to shared memory using tags. Loki is the hardware layer, with a tag permission cache in hardware, populated by the monitor.
  • File has an inode, which has some label in the kernel, and a mapping to the pages which comprise it. The monitor maps tags to labels, and the tag permissions are enforced in hardware.
  • Each user’s thread has its own kernel instance, and a tag permission cache. When you try to access something with a tag that isn’t cache, a trap occurs, the kernel looks at the label, and, if appropriate, gives access by loading permissions into the cache. Threads and memory objects have tags associated with them, so can’t directly modify someone else’s runtime state. Monitor provides a yield-to primitive for thread collaboration. Monitor has no scheduler; scheduling done in the kernel, and transitioning between kernel instances is possible. Cannot spoof objects either.
  • For collaboration, could grant access to someone else, and then try to corrupt it. But when a grant is accepted, the tags are changed to belong to the grantee (and this is disallowed).
  • Kernel objects are also tagged. If trying to delete references to a kernel object, because the fields (such as the refcount) are tagged in a fine-grained manner. References can be created by non-owner, and non-owner references can be deleted, but the refcount cannot be written directly (monitor does this).
  • Garbage collector runs in “idle garbage collection” domain, outside the TCB.
  • Can have a hashtable of objects which might individually be modifiable by owners, but not the linked list pointers (so cannot delete someone else’s object).
  • Prototype based on a full-system FPGA. Runs on a modified SPARC processor with an in-order 7-stage pipeline. Loki logic overhead is 19%, which would be a much lower fraction in more complex CPUs. No overhead in clock frequency.
  • Instruction and data caches had to be extended to add cache bits. Permission cache had to be added, accessed on instruction fetches and on loads and stores. Permission checks happen at the trap stage. The pipeline isn’t deepened, which is why there is no clock frequency overhead.
  • Tags tend to exhibit spatial locality. Could use page-level tags and where necessary switch down to word-level tags. Page tag array which can either have a tag in it, or a pointer to an array of word-level tags.
  • Monitor updates tags and permission cache entries. Traditionally have user and kernel modes. Add a monitor mode (ring -1) which has access to all physical memory, and only monitor mode can read and write tags. (Kernel reads tags only; User doesn’t see tags.) This helps to remove the OS from the TCB.
  • TCB reduced from 11600LOC to 5200LOC. Currently based on a RAM disk, but need to extend tags to the hard disk if it is to work with disks (could use a small amount of Flash to store tags).
  • Performance impact is negligible.
  • Significant number of tags are word-level, so fine-grained tags are necessary. Need many tags to protect users, processes, file descriptors.
  • Loki port to a Xilinx XUP costs $300 for academics, $1500 for industry, and the full RTL and HiStar-SPARC distribution is available.
  • Question: can you contrast Loki fine-grained memory protection with Mondriaan? Can you implement LoStar on top of Mondriaan? MMP provides independent protection domains at the hardware level. But LoStar was based on extending application security policies all the way to hardware. Mondriaan relies on the correctness of the translation mechanisms, whereas Loki is orthogonal. Loki and HiStar were designed to work together so they probably play better together. Builds on MMP.
  • Question: does the garbage collector have implicit access to all kernel data structures? It would have to call into the monitor to do this, but tags are used to protect this.

Device Driver Safety Through a Reference Validation Mechanism

  • Cornell is building a new OS called the Nexus. Addressing a key problem in constructing the trusted microkernel.
  • Drivers take up the majority of OS code, but they shouldn’t be trusted. They have a high fault rate and are written by different people from the core kernel authors. They run in the kernel and have privileged access to hardware. Loads to possibilities for trojans, privilege escalation, bugs, etc.
  • Nexus will have secure drivers. Want a small and auditable TCB, which is incompatible with having drivers in the TCB. Too big, too complex, too changeable to audit.
  • Not enough to move drivers to user-space. Only prevents direct attacks by the drivers on the kernel. Devices can also compromise the kernel integrity (using DMA, using interrupts (DoS)). One approach would be to use an IOMMU to stop the device DMAing to the kernel. What if we had a reference monitor in software, which would be more flexible?
  • Nexus puts a reference validation mechanism, acting as a filter, in the kernel, which monitors references by the drivers in user-space. Want to factor out device-specific code from the reference validator. The validation mechanism is parameterised by a device-specific reference monitor, which covers anything device-specific.
  • Device Driver Reference Monitors (DDRMs) can prevent illegal reads/writes by drivers (like an IOMMU), priority elevation, processor starvation, physical damage of the device by the driver, resource exhaustion of the device.
  • I/O model. Nexus is built to run on x86 and use the PCI bus (nothing about the architecture is specific to these). Allowable operations are I/O ports, MMIO, PCI configuration registers, interrupts, shared interrupts, DMA and deep data structures.
  • Have a language for writing specifications, which expresses allowable sequences of operations. A stateful filter, because some operations are illegal sometimes and legal at others. So the reference monitor is a state machine. Each legal operation is a transition in the state machine. Specify variables, alias names and state transitions.
  • Don’t enumerate every state in the state machine, instead encapsulate them in the state variables. Some are device-specific, and some are predefined by the DDRM (e.g. the memory that the driver has access to may be a state variable, but defined at runtime).
  • A transition is a combination of an event and state. Predicate (boolean expression over operation and state variables) followed by actions (modification of state variables). Written in a safe, declarative language (easier to reason about than C).
  • An illegal transition kills the driver, treated much like a segfault. No reason for a good driver to violate the specification: if it does, it isn’t a good driver. The DDRM runs a trusted reset routine from its specification to put the device in a harmless state (otherwise could cause havoc with the system).
  • Drivers are 99% source compatible with Linux. All operations that a Linux driver expects became syscalls in the Nexus kernel. Added traps and macros to capture the actual I/O. Unmonitored DMA mapped directly into driver memory: works for contents of network packets and contents of disk blocks. (Don’t map the data structures which are security-sensitive.) Added an untrusted disable interrupts operation, which only defers its own interrupts, and pauses the execution of all that driver’s other threads. Weaker than sti, but good enough for most drivers (only want to stop reentrancy). This has the bonus of lowering interrupt latency for other drivers.
  • Written drivers for i810 audio, e1000 Ethernet card, USB controller, a USB mouse and a USB disk.
  • Measured TCB, performance and robustness against attacks.
  • Between 0.4–1.3% of lines of code changed in each driver. Added lines as glue code. Specifications are 150–500 loc, which are at lesat an order of magnitude smaller than the drivers themselves. No need for a spec for the USB mouse and disk (the USB controller spec is all that needs to be monitored).
  • Performance evaluation with gigabit throughput test, audio and network responsiveness tests, ping latency, and end-to-end CPU utilisation during video streaming. Compared a Linux kernel, Nexus kernel, Nexus unsafe, Nexus with an empty spec, and Nexus safe (with complete specs and checks).
  • Very little impact on receive throughput; some degradation for small packet sizes when sending.
  • Audio latency adds 46.5 microseconds. But the driver itself has hundreds of milliseconds to respond to these interrupts.
  • CPU latency from 0.8% on Linux to 2% on safe Nexus for streaming video at 1Mbps.
  • Also measured robustness to attacks using perturbation testing (probabilistically modifying driver operations, changing register addresses, length, values read/written). And made a suite of malicious drivers for each execution environment.
  • Perturbation testing: in no case did they crash the kernel or damage the hardware. In a few cases the driver confused itself. With a null specification, after 31 tries, it damaged the hardware, and they stopped retrying this.
  • Safe kernel avoids direct kernel r/w by driver, livelock due to interrupts, DMA kernel crash, DMA kernel read/write, DMA kernel code injection and DMA write device memory.
  • Question: performance data only looked at throughput, which is meaningless because of the CPU utilisation increase. What was the impact of this? The CPU impact was roughly a doubling, captured by the end-to-end benchmark.
  • Question: how would more demanding devices like FireWire or high-speed USB cope, where jitter might be more catastrophic? The latency didn’t have any effect. Did USB 1.1 for this paper and have an experimental USB 2.0, both of which are functionally correct despite a change in latency. But it only degrades performance, never correctness (so no dropped frames when copying video from a camera).
  • Question: who would write DDRMs? Devices manufacturers, or could be open source, or could get a third performance.
  • Question: is having the reference monitor in the kernel necessary for performance? Has to be trusted anyway, so didn’t see a benefit to putting it in user-space, but it could be.
  • Question: were state machines developed by looking at the driver or by looking at what the hardware allows? Looked at both the drivers that we were trying to port, and the specifications of the hardware.
  • Question: how common is damaging the hardware with a malicious driver? e.g. Ask a device to overclock itself, or overwrite flash and abort it halfway which could cause it never to reboot.
  • Question: could you compare the security of the Nexus driver model to the Windows user-mode device model which also runs USB devices in usermode? They are quite similar, but Windows applies it only to devices that don’t do direct I/O.
  • Question: might an IOMMU be useful to enable lazy trapping (like how shadow page tables are used in VMMs) and validation in advance? Problem is that lots of accesses are individual commands, so it is difficult to validate them in advance.
  • Question: did you consider reusing an existing specification language for drivers like “devil”(?) or the MS Singularity language? Took inspiration from the languages but didn’t reuse any of the specifications that were written for them.

Digging for Data Structures

  • Looking at polymorphic viruses. AV programs can’t detect them. You’re running a firewall, you’re still going to get infected, and you’re still going to have to wipe your machines.
  • Signature checkers are basically grep. Can obfuscate by encryption/packing, polymorphism or putting garbage in there. Most of these aren’t even widely used yet!
  • We’d like an AV technique where obfuscation would destroy the human element: we’d be able to tell that code was obfuscated. Reckon that the commonality will be in the data structures (and these won’t be obfuscated).
  • So detect programs based on their data strctures. Emphasis on the field types, not the actual content. Encrypting memory might hide the contents, but the patterns should still be evident.
  • Also need to infer data structures without debugging symbols.
  • Problem: image looks random. We know where the heap is, so we can reasonably reliably infer heap pointers by looking at four-byte intervals. There is a probabilstic mapping between block and atomic types: small integers are usually data, for example.
  • Compute given a value, what is the block type. Is it a string, pointer, zero? No, then it’s probably an integer. Can also divide the heap up into collections of blocks that appear to have the same signature.
  • Can also infer the types of pointers from the memory that is found at those locations.
  • Need some mathematics to determine the number of classes, and whether to put a given object in a given class. Use a standard unsupervised Bayesian classifier, details in the paper.
  • System called Laika, implemented in about 5000 lines of Lisp.
  • Computationally expensive problem. But only 30% of objects contain pointers, there are a large number of strings and typed pointers are necessary. Also some problems with unions (between integers and pointers), and tail accumulator arrays (where the last element is a variable size).
  • Ran programs in GDB, trapping on malloc. 7 real test programs (things like xpdf). Measured probability of correct classification. Want to optimise P(real|laika) and P(laika|real), both around 70% which can improve slightly with some malloc information.
  • Laika virus detector can be written in 50 lines of code. Assume Laika correctly clusters types into classes. Take two programs and weld them together. Measure how mixed each class is and take the weighted average. A mixture ratio close to 50% means that a lot of data structures are shared. If the mixture ratio is less than some threshold (tested on Kraken), then we have a match. Estimated accuracy is 99.4–99.9%. Always correctly classified a program as itself over 150 tests. Expected number of errors was 0.33. Does much better than ClamAV (83–85% detection).
  • How might virus authors adapt to this system? They obviously will, as generic virus detection is undecidable. Mixture ratio is a simple first cut, but both sides can probably do better. Laika synergises well with other approaches, so there’s defence in depth.
  • Simplest attack: memory encryption (but all programs use data structures, so some patterns will still be evident)… XOR all reads and writes with a key. Could shuffle field orders (only removes 50% of the information). Could mimic structures from a benign program (like Firefox, say), but other heuristics should reveal that a program doing this isn’t Firefox.
  • High-level structure requires more structure, and very simple programs don’t have it. But as these botnets become more sophisticated, these will require more structure. Also there is computational expense.
  • Best related work is all from Wisconsin.
  • Question: what are the challenges involved with getting a consistent snapshot of memory? (Some malware is careful not to do all of its evil at once, and how do you figure out when that is?) Just ran for five minutes and looked at the data structures… and did pretty well.
  • Question: class of non-malicious applications that make extensive use of obfuscation for protecting IP, so can you differentiate these? Could sign binaries.
  • Question: how does using different languages and compilers affect this? Well, it would fail on a Lisp virus. On Java, you have a lot of metadata from all the classes. Just looked at C++.
  • Question: attacker code might try to detect the memory-detection sandbox, and try to confuse you, so can you sidestep this? We rely on a working sandbox, and it did for the three viruses considered.
  • Question: what if you detect that several programs share the same data structures, such as the libc ones? The main trick is to train based on actual viruses, and compare to these.
  • Question: should you forget about the bad guys, and use this technique to understand programs that might be buggy? A PoPL paper on doing just this, which assumes knowing the structure definitions. So this would only be useful when you don’t know your data structure definitions. (Might want to do something like failure-oblivious computing so you can fake it?) Might work.
  • Question: will Laika get confused from strings that are encoded in different languages, using UTF-8? Detect UTF-8, and just called it ASCII.
  • Question: what if you serialise and encrypt an object before putting it on a heap, and only keep it in plaintext on the stack? The trick is to be careful about the kind of detection used. This behaviour would be very conspicuous, and Laika would get confused.
  • Question: do you have to run Laika against everything for five minutes, and compare against every possible signature? Run once, take a snapshot, then compare against all signatures. Polynomial time in the number of classes and the number of objects.

Dealing with Concurrency Bugs

Finding and Reproducing Heisenbugs in Concurrent Programs

  • Concurrent executions are highly non-deterministic, and rare thread interleavings can result in Heisenbugs. These are difficult to find, reproduce and debug. Just observing the bug can “fix” it. The likeliness of interleaving changes when adding printfs. In practice, developers can spend weeks on finding and fixing these.
  • CHESS is a specialised user-mode scheduler, attaches to your program and controls all scheduling non-determinism. Guarantees that every program run takes a different thread interleaving, but then lets you reproduce any particular run.
  • Motivating example of a “thread-safe” bank account class, but with a piece of incorrect synchronisation. Can run the program with “stress”: additional threads that do random stuff.
  • Demonstration and of course the bug didn’t turn up: illustrates the point about Heisenbugs.
  • CHESS is a collection of .NET or Win32 wrappers which trap all synchronisation-related calls, and puts the program under the control of CHESS. Every run takes a different interleaving, and the interleaving can be reproduced for every run. Developing a version of CHESS for the NT kernel.
  • Goal: any error found by CHESS is one that is possible in the wild (no false positives, soundness of CHESS, CHESS does not introduce any new behaviours). Don’t want people chasing CHESS bugs! Also want any error that is possible in the wild should be found by CHESS: completeness, no false negatives, but very difficult. Need to capture all sources of non-determinism, and explore that, which leads to a state-space explosion. Goal is to beat stress-testing.
  • CHESS is a scheduler. Its goal is not to find bugs (need to introduce assertions to do that). CHESS can find deadlocks and livelocks. It drives the program along an interleaving of choice.
  • CHESS introduces calls to the CHESS scheduler at potential preemption points. Inserted before every synchronisation opeation. Could insert a random sleep at schedule points. This does not introduce new behaviour, simply modelling possible preemptions, and guarantees starvation-freedom.
  • Two main improvements over random sleeping. First, capture the happens-before graph. Any delays that result in the same happens-before graph are equivalent, so only need to explore one of them. Second, understand the synchronisation semantics, so avoid exploring delays that are impossible and identify when threads can make progress. Does this by maintaining a run queue and a wait queue (mimicking the OS scheduler state).
  • Emulate execution on a uniprocessor, and enable only one thread at a time. This controls the order of data races. Reducing the parallelism is the price we pay to get determinism. What about multiprocessor-only bugs? Any concurrent execution can be emulated on a uniprocessor. So we introduce schedule points before synchronisations, volatile accesses and interlocked operations. But can also introduce these points before every memory access, to get even more accuracy: capture all sequentially-consistent executions. Even have a non-sequentially-consisten execution model, which understands the x86 memory model intimately.
  • What about input nondeterminism? Rely on the user to provide a test harness that makes the input deterministic (can use techniques like KLEE and R2 to do this).
  • Translate all the .NET/Win32 synchronisations (often complex and under-specified) into CHESS scheduler abstractions. Enables handling multiple platforms. e.g. Asynchronous read file is like forking off a child task.
  • Problem of state-space explosion O(n^nk) interleavings (k scheduling points, n threads). CHESS only inserts preemptions at a small number of places, which gives a state-space that is polynomial in k.
  • Found a Heisenbug in  MS Concurrency Coordination Runtime. CHESS found it in a minute. Found many other concurrency bugs, and other people have used the tool to find bugs themselves.
  • Current status: working to get it shipped as an Add-on to Visual Studio. But there is a command-line version available on the website. Now looking at kernel-mode CHESS, search heuristics, and bugs in IE, Windows Shell and graphics libraries.
  • Systematic exploration of scheduling nondeterminism can be more effective than stress-testing.
  • When designing APIs and abstractions, please think about the plight of the programmer who has to program on top of them. Minimise the nondeterminism, and provide good specifications!
  • Question: have you thought about other sources of nondeterminism, such as rand() and time? Time is handled, so e.g. a timeout would happen in a CHESS execution. But gettimeofday() isn’t considered. You would be better off using symbolic execution with something like KLEE. Random numbers are determinised, but the choices are not explored.
  • Question: how do you handle home-grown spinlocks? We do handle spinlocks, as part of guaranteeing starvation-freedom. The backoff in a spinlock, and hyperthread yielding, is handled by CHESS.
  • Question: how do you handle loops and recursions? It is a dynamic analysis tool, and we assume that the user has given a test program that terminates, and it will just run it.
  • Question: how large are the tests that you use, in terms of lines of code? Run on things like Windows Explorer. Usually take existing stress tests, and fork fewer threads. What about other model checkers? Haven’t looked at them, but think CHESS is the only model checker that could deal with things that large.

Gadara: Dynamic Deadlock Avoidance for Multithreaded Programs

  • Parallel programming is tough. It’s hard to parallelise algorithms and data structures. In this talk, looking at deadlocks. Might need to look at a large part of a program to understand why a deadlock occurs. Just composing two bits of deadlock-free code is not guarateend to be deadlock-free. And multicore is making parallel programming inevitable.
  • Goal is to avoid circular mutex wait deadlocks by postponing lock acquisitions.
  • Using discrete control theory to perform a source-to-source translation that inserts extra code to avoid deadlocks.
  • Architecture: start with a C program, get control flow graph, translate it into a Petri net (formal model from discrete control theory), then do control logic synthesis to get control logic, which is compiled together with the original source to get an instrumented executable. The control logic observes and controls interactions between threads.
  • Look at the example of the dining philosophers problem. Random fork-grabbing gives a simple control flow graph with four basic blocks.
  • Petri net is a bipartite graph with two kinds of node. Circle is a place and bar is a transition. Arc connect places to transitions and transitions to places. Tokens are passed through the Petri net. When a transition fires, it takes a token from one input and puts it in every output. Can model pthread_create(), lock acquisition and lock release. A token is a thread of execution or the availability of a lock. A place is a basic block or statement. A transition is a jump or lock acquisition/release.
  • Simple translation from control flow graph to Petri net, augmented with the identity of locks.
  • Showed a huge Petri net for OpenLDAP, but need automated tools.
  • Siphon (deadlock) analysis. Siphon = set of places whose input transitions are a subset of their output transitions. Once empty, a siphon is empty forever. Proven that a deadlocked Petri net has at least one empty siphon, and structural analysis makes it easy to find them. No state-space explosion.
  • Deadlock freedom means no empty siphon, so the total number of tokens in a siphon should always be >= 1. Specify this as linear inequalities. Control logic adds a new control place which guarantees that the linear inequality is always satisfied. Control places are local, independent, concurrent and easy to implement. Adding control places may add new siphons, but this can be prevented by iterating on the Petri net.
  • A control places is implemented by a token variable and a lock/condition variable pair. Control transition checks and updates the tokens in the control places, using a “gadara lock”. Observe transition updates the tokens in control places, using a “gadara replenish” operation.
  • Most lock/unlock operations are unaffected, so no overhead.
  • Gives provably deadlock free code!
  • Moves most of the computationally expensive analysis offline, and includes only modest overhead at runtime for the control places.
  • Did a case study on OpenLDAP. Showed a deadlock bug, and how it results in a siphon in the Petri net. (And how the siphon can be fixed with a control place, and how the code is modified.)
  • Gives maximal permissiveness and maximal concurrency. Don’t want to block any thread or transition unless it will lead to a deadlock state in the worst case.
  • Bug was identified and fixed manually, but newly-added code reintroduced the bug. A second fix required a complete rewrite. Gadara would have had deadlock-free code at all points.
  • Challenges: need function annotations to provide some form of data flow analysis (does calling a function result in acquiring a lock?). Type analysis helps with dynamic locks to show how many locks and which lock instances are acquired. Gadara chooses to be conservative when the model is imperfect. What about uncontrollable programs with blocking syscalls and IPC? These may give incurable deadlocks, but Gadara warns about these. Also C features such as function pointers. Don’t care about function pointers outside critical sections, only those within critical sections. For the rest, use annotations or type signatures to conservatively assume the set of functions that might be used.
  • Evaluated on OpenLDAP, PUBSUB benchmark and BIND. Tool has a 11% overhead with an add/delete workload on OpenLDAP.
  • OpenLDAP: 1795 functions, 495 after pruning irrelevant functions. 41 lock types, 25 of 28 ambiguous functions were annotated in an hour. Four siphons found, including one confirmed deadlock from the bug database, one false positive (required data flow analysis) and two previously unknown deadlock bugs. Performance overhead only when the workload exercises deadlock-prone paths.
  • Gadara eliminates all deadlocks and is provably correct. Only modest runtime overhead because it is maximally permissive and calculations are performed offline (and most locks are not deadlock-prone).
  • Many existing approaches, but all differ because only Gadara uses control theory. DCT is a principled foundation for dynamic deadlock avoidance. Works on real programs, and useful in several scenarios (rapid development, corner-cases in mature code and end-user bug fixing).
  • Question: assuming a program with recursion, and the procedure has a summary of what happens to tokens, is it possible to use this information for the synthesis? (Taken offline.)
  • Question: why not just fix the deadlocks if you know where they are and you know how to fix them? Gadara may have some false positives. Admittedly you could get better performance by doing this, but Gadara is automated.
  • Question: did you consider just adding a simple lock acquire and release before and after the siphon? In some cases this would work, but for e.g. 5 dining philosophers, the control logic is more involved.
  • Question: what would be the scalability if you weren’t limited by the hardware (2 cores, limited memory)? Similar result in the paper for the 16-core case.

Deadlock Immunity: Enabling Systems to Defend Against Deadlocks

  • A real system can be in one of two states, running or failed. When it hits a bug, it fails. Eventually it will come back, another bug will strike and take the system down. But when the first bug occurs again, the system will still fail. It doesn’t learn from past failures.
  • Give it an immune system. On seeing a bug, save an antibody to match the bug, and when you see that bug again, the antibody will neutralise it and let the system continue as if nothing had occurred. Like immunity against a pathogen.
  • Dimmunix gives immunity against deadlock bugs for Java and C++ programs.
  • Run a program, learn the executions that lead to deadlock, save their fingerprints in persistent history and use them to identify deadlock cases in future so as to avoid them.
  • Say you use CHESS and Gadara to rid your web browser of any deadlock, but then you load a plugin that hasn’t been so vetted, and it introduces a deadlock bug. Software vendor can distribute the antibody to end-users, or end-users can run Dimmunix to defend against deadlocks in some buggy code.
  • Lock inversion leads to a deadlock bug. Look at the stack trace which leads to the lock inversion, and abstract away parameters. The call flow is the signature that we will save (a deadlock pattern). Actually we want the list of EIPs from the call stack to identify particular call sites.
  • A deadlock bug leads to a single deadlock pattern, and a deadlock pattern can lead to several hangs. (Only in two cases did they see >1 patterns for a single bugs.)
  • Dimmunic intercepts calls to lock/unlock, detects the deadlocks, saves the signatures and avoids executions that match the patterns. The history of patterns is the immune system.
  • Versions for Java and POSIX threads. Java: instrument the bytecode. PThreads: shim library to add instrumentation. This means we need no assistance from programmers (no annotations) and users, and no need for source code. Works for closed source and commercial software as well.
  • Use a resource allocation graph (RAG) to detect deadlocks. Hold edge from lock to thread holding it. Allow edge from thread to lock that it is blocked on. A deadlock will be a cycle in this graph. But once you hit this cycle, it’s too late to do anything about it, but Dimmunix will save the pattern that lead to it. Add a label to the hold edge which indicates the call stack of the thread when it took the lock. These are collected to form the signature, which is saved in the history.
  • Now you’re hung, so you’ll have to take steps to restart the threads or the program.
  • Next you do avoidance. When there is an attempt to take a lock, see if there are any matches in the history. A partial match is okay, but if you have a complete match (all stacks are present), you do deadlock avoidance, by suspending the last thread that attempted to take a lock (in the signature). So Dimmunix slightly tweaks the thread schedule to avoid deadlock patterns arising.
  • Does it work? Took a number of C/C++/Java systems. Largest was MySQL (million lines of code). Went through their bug databases, looking for user-reported deadlock bugs. Managed to reproduce about 12 of them. Ran the test cases 100 times and they deadlocked 100 times. With Dimmunix, ran 100 times and only deadlocked the first time, worked every other time.
  • What about performance? Dimmunix does a lot of work, and if it is in the critical path, it would slow down the problem. Aimed to make it as asynchronous as possible, moving a lot of work into a separate thread, and used lock-free data structures. Paper has implementation details. Measured overhead on a lock-intensive microbenchmark, looked at lock operations per second. Dimmunix had very little overhead compared to baseline, though it increased at 512 and 1024 threads. Generated a history of 64 synthetic deadlocks. Dimmunix introduced thread suspends, which account for the overhead. 0.6–4.5% overhead for up to 1024 threads.
  • Need to match signatures on the critical path. Therefore hash the call stacks and index into precomputed in-memory tables. Keep a thread-local cache of data structures to reduce contention. Common case is no matches, which is efficient. By varying the number of signatures in history up to 256, the lock operations per second overhead doesn’t really change. (And you wouldn’t expect to have that many signatures anyway.)
  • Suspends can induce starvations because you’re yielding waiting for a blocked thread to make progress. (Haven’t seen this in reality, but can construct programs that do this.) So introduce yield edges to the resource allocation graph from the suspended thread to the one that caused the suspension (by holding a lock): label on edge is the label from the corresponding hold edge. This leads to a cycle in induced starvation cases, and can be detected in just the same way.
  • Dimmunix can have false positives, if you try to avoid a deadlock that would not have occurred. They don’t impact safety, but can impact performance (sometimes positively, by reducing contention). Trick is to adjust the precision of matching the call stack to generalise the pattern: how many call frames should you choose. Use a heuristic what-if analysis post-factum. If it sees no inversion then you have a false positive, and you eventually increase the precision. Dimmunix focuses on control flow and doesn’t look at input dependencies: no concrete program state in the signatures, so you get false positives but also get general, portable signatures.
  • Someone has to experience the first occurrence but can then immunise everyone. It cannot affect deadlock-free programs and cannot induce incorrect outputs in (non-real-time) programs. It must be aware of all synchronisation mechanisms (so homegrown spinlocks will not be identified).
  • Complements any approach that is aimed at reducing deadlocks (including Gadara, and it’s still useful because of bugs outside the code that you have checked like the web browser plugin case), including static analysis, model checking and transactional memory.
  • Question: could you extend Dimmunix to provide a trace for programmers to debug their code? One of my students is working on that: how to reproduce a deadlock given a signature. Trying to make it work for programs where model checking is still not applicable (millions of lines of code).
  • Question: what about when your immune system misbehaves? Say you have some functionality in your program in which all execution paths are deadlock-prone: the conservative approach won’t let you make any progress. One possibility is to log the signature every time you do an avoidance and use a pop-up blocker-like interface to tell the user that a deadlock was avoided, and you can prevent it avoiding it in future. The other possibility is to put an upper bound on the yielding.
  • Question: can this idea be generalised to other symptoms like segfaults? Another student is looking at this idea, applying failure immunity to other situations, like resource leaks, and we hope to be able to generalise it.
  • Question: seems like microbenchmark stopped scaling at a certain point? The graph flattens out and there is a limit to how many lock acquisitions you can do per second. Also the curve seems to be dropping off, but this is amplified by the logarithmic x-axis, so it’s really very slight.
  • Question: how many deadlock bugs that you find can actually be used as a vaccine (presumably with MySQL it would have to be independent of the application running on top)? But the patterns are specific to MySQL, and any application on top could use them.
  • Question: do you have any results with more frequent locking behaviour? Some workloads with 300000 locking operations on that kind of machines? As you vary the interval between lock acquisitions, the overhead is proportional to the number of instructions that you are placing in the critical path. In the scenario where you keep a lock for 1us then wait 1ms and acquire it again, we were trying to be more representative of a real system with I/O, or something like that. The overhead becomes signifcant when the interval is down at the microsecond level.

OSDI 2008 - Day 2 - Session 2

Wednesday, December 10th, 2008

Binary Translation Using Peephole Superoptimizers

Sorav Bansal and Alex Aiken, Stanford University

They introduces a new scheme for performing binary translation that produces code whose efficiency is comparable to existing binary translators with much less engineering effort. Instead of hand-coding the translation from one instruction set to another, the approach automatically learns translation rules using superoptimization techniques.
They have implemented a PowerPC-x86 binary translator and report results on small and large compute-intensive benchmarks. When compared to the native compiler, the translated code achieves median performance of 67% on large benchmarks and in some small stress tests actually outperforms the native compiler.
It is not clear whether their approach is thread-safe. Assuming the code on the source architecture is thread-safe, it is not clear if the produced code is thread-safe in the target architecture.
R2: An Application-Level Kernel for Record and Replay

Zhenyu Guo, Microsoft Research Asia; Xi Wang, Tsinghua University; Jian Tang and Xuezheng Liu, Microsoft Research Asia; Zhilei Xu, Tsinghua University; Ming Wu, Microsoft Research Asia; M. Frans Kaashoek, MIT CSAIL; Zheng Zhang, Microsoft Research Asia

Library-based record and replay tools aim to reproduce an application’s execution by recording the results of selected functions in a log, and during replay returning the results from the log rather than executing the functions. These tools must ensure that a replay run is identical to the record run.
These methods have limited usefulness, because function with side effects and reentrant function pose big problems in replaying programs if we don’t execute these functions - the replays may be infeasible/unrealistic.
To alleviate this problem, R2 allows developers to choose functions that can be recorded and replayed correctly. Developers annotate the chosen functions with simple keywords so that R2 can handle calls with side effects and multithreading. R2 generates code for record and replay from templates, allowing developers to avoid implementing stubs for hundreds of functions manually.
It is not clear how easy is to write such annotations. Moreover, if the annotations are wrong, spurious replays may be generated.
KLEE: Unassisted and Automatic Generation of High-Coverage Tests for Complex Systems Programs

Cristian Cadar, Daniel Dunbar, and Dawson Engler, Stanford University

KLEE is a symbolic execution tool for automatically generating tests with high coverage and discovering bugs. It is a timid tentative to make model checking practical for real applications. In fact, it is not clear from the paper what bugs KLEE tries to discover, but I suppose it tries to discover the same bugs model checkers use to address (e.g., memory errors like null pointer dereference and buffer overflow, or assertions).

KLEE shares many ideas with modern model checking techniques based on symbolic execution. In my opinion, KLEE doesn’t bring any important contribution in the area of bug discovery. All the techniques they used for optimizing the contraint solving (expression rewriting, propagation of constants, counterexample caching, implied value concretization, extracting independent constraints, etc.) are already employed by modern contraint solvers (e.g., SAT solvers), and all modern model checkers have some powerful constraint solver as backend.

They don’t provide any intuition of why the state scheduling strategies that KLEE uses (random path selection and coverage-optimized search), the metrics they use in these strategies (e.g., minimum distance to an uncovered instruction and call stack of a state) and the way they switch between strategies (round-robin) are efficient in practice. I suppose the minimum distance to an uncovered instruction is useful for a fast convergence to a high coverage of the tests. I suppose they use the call stack information to find tests having as few nested function calls as possible, and possibly to also bound the recursive calls. We can only speculate, since they don’t provide any clear intuition behind these heuristics they use. However, I believe these heuristics are only useful for converging faster to a good coverage. I don’t know how well they will work for bug discovery. They may harm the bug discovery, instead of helping it. In my opinion, finding a bug efficiently doesn’t have so much to do with the coverage of the symbolic execution, but rather with the amount of relevant states explored. They have to employ other heuristics for optimizing the bug discovery.

The programs they used to test KLEE have about 1KLOC on average (452 applications totaling 430 KLOC). Applications of this size don’t constitute a challenge for modern model checkers — they can easily handle 10KLOC applications. They also don’t compare KLEE to existing test generators and model checkers, on the same test C programs.

An important missing feature is the ability to symbolically execute multi-threaded programs. The state explosion is even worse for multi-threaded programs, compared to sequential programs (i.e., factorial vs. exponential blowup). There are already model checkers that can handle multi-threaded programs.

The only real contribution of KLEE seems to be the environment modeling. KLEE is able to execute programs that are environmentally-intensive. The environmnent is represented by command line arguments, file data, network packets, etc. All variables which denote inputs (e.g., command line arguments) are symbolic variables — they can have any initial value. Files whose names are stored by symbolic variables become symbolic files; the other files are concrete files. KLEE handles in an interesting way the symbolic files, by implementing a symbolic file system that efficiently simulates all I/O operations performed on that file. The I/O operations are executed natively on concrete files.

The native execution of I/O operations will constitute a bottleneck for programs that do heavy I/O on concrete files in certain states and revisit those states very often. I would therefore renounce at the notion of concrete file — I would consider all files symbolic. Moreover, it is not clear how the I/O operations on concrete files are isolated while symbolically executing different paths. Mechanisms like copy-on-write (or some sort of versioning and rollback) should be used on top of the I/O operations — they cannot be just executed natively.

OSDI 2008 - Day 2 - Session 1

Wednesday, December 10th, 2008

SQCK: A Declarative File System Checker

Haryadi S. Gunawi, Abhishek Rajimwale, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau, University of Wisconsin, Madison

They introduce SQCK file system checker based on a declarative query language. Declarative queries are a natural match for the cross-checking that must be performed across the many structures of a file system image.

The paper shows problems that the current e2fsck implementation encounters when handling corruption in directories. One problem is that e2fsck does not always use all of the information available to it regarding directories. e2fsck could use available information to keep a target directory with its correct parent and to reattach the lost directory to its parent instead of moving it to lost+found. Another problem is that, Ext2 contains replicas for important meta-data, such as pointers to the inode tables, but e2fsck does not always use this redundant information. They fix all these problems in SQCK.

By encapsulating the logic of a file system checker in a set of declarative SQL queries, they provide a more concise description of what the checker should do. The checks are written as SQL queries, which are executed by a MySQL server, which contains the file system meta-data.

Transactional Flash

Vijayan Prabhakaran, Thomas L. Rodeheffer, and Lidong Zhou, Microsoft Research, Silicon Valley

Transactional flash (TxFlash) is a novel solid-state drive (SSD) that exports a transactional API (WriteAtomic). The atomicity property is important in file systems and database systems for maintaining consistency across crashes and reboots. Having the storage layer provide transactional API improves reliability.

SSDs are more suited for transactions than hard disks, thanks to their copy-on-write nature, fast random reads and high-concurrency.

There seems to be a large amount of related work — there are already techniques that try to achieve atomicity for flash systems or disk drives, based on copy-on-write and logging. It appears that TxFlash is a new point in the design space, with clear advantages over the existing approaches, advantages which were well pointed out in the related work section.

Besides the transactional API on SSDs, the contributions of this work are: (1) new cyclic commit protocol that uses per-page metadata to eliminate the need for a separate commit record, and therefore is more efficient for small transactions (95% better throughput, compared to traditional commit, for transactions smaller than 100KB), and (3) TxExt3, a version of Linux Ext3 modified to use transactions on top of TxFlash, which reduces the run time by 65% for I/O intensive synchronous applications, compared to Ext3 on top of SSD. In TxExt3, instead of writing to the journal, the commit function uses the transactional commands, and there is no need for a separate checkpointing process as in Ext3.

TxFlash is an excellent piece of work and the paper is very well written. The novel ideas are very interesting and consistent. It also seems to be a very mature system — they have a new commit protocol and a new file system on top. It looks like nothing important is missing. In conclusion, I don’t find any significant flaws in this work.

Avoiding File System Micromanagement with Range Writes

Ashok Anand and Sayandeep Sen, University of Wisconsin, Madison; Andrew Krioukov,University of California, Berkeley; Florentina Popovici, Google; Aditya Akella, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, and Suman Banerjee, University of Wisconsin, Madison

They introduce range writes, a simple but powerful change to the disk interface that removes the need for file system micromanagement of block placement. By allowing a file system to specify a set of possible address targets, range writes enable the disk to choose the final on-disk location of the request; the disk improves performance by writing to the closest location and subsequently reporting its choice to the file system above. The result is a clean separation of responsibility; the file system (as high-level manager) provides coarse-grained control over placement, while the disk (as low-level worker) makes the final fine-grained placement decision to improve write performance.

For workloads that have significant write components (untar, PostMark), range writes boost performance (a 16% speedup for untar and roughly 35% for PostMark). Second, for workloads that are less I/O intensive (Andrew), range writes do not make much difference.

In my opinion, it is not worth to modify the file system interface and the hard drive controller just to get a 16-35% speedup. We could just buy fast drives, like SSDs, or use flash memories in existing drives.