HotDep 2008
Wednesday, December 10th, 2008To kick off my blog posts from OSDI, I’ll start with summaries from HotDep 2008. The order of the day is system dependability, some controversial papers have been included, and active audience participation has been encouraged. My notes presented below are attempts at an objective summary for each talk, and are fairly raw and unedited. If you spot any errors, let me know in the comments!
Software Reliability
Reverse-Engineering Drivers for Safety and Portability
- Drivers aren’t portable. New OSs aren’t always backwards compatible. And drivers are a safety nightmare.
- Techniques proposed for portability, isolation, static analysis, model checking, driver languages.
- Idea is to take a log of hardware interaction (by recording the protocol between the device and the driver) and the original binaries, analyse them, and extract the protocol into a safe device driver. Uses driver templates (OS-specific boilerplate), fleshed out with the protocol.
- View a driver as a state machine. State is heap data, local variables and CPU registers. Look at effect of a syscall on the state.
- Use QEMU to snoop communication between the device and the driver. Present prototype only works on devices that are emulated by QEMU. 2–3x overhead during reverse engineering.
- Generate a trace log of EIP, and writes to memory locations: associate this with the disassembled binary. Work out which instructions are relevant for controlling the device (i.e. that modify hardware state). These are then converted to C code, and can be combined with OS-specific boilerplate template. GIves you source code which can be compiled into a kernel module and loaded as usual.
- Tested by porting the NE2K driver from Windows to Linux, and from Linux source to safe Linux source. Generated a 500kb trace by sending, receiving and setting the MAC address, which covered 50% of the basic blocks in the original driver. No performance overhead in the synthesised driver: even though it is incomplete, it was still usable.
- Required half a day to specialise the driver template manually, but this could be automated. Much harder to do when porting between operating systems, because of API difference. Errors occur in manual translation.
- Can formally prove the correctness of synthesised drivers: the driver template will not crash the OS, and the specialisations are safe by construction. Can also prove liveness.
- Some legal concerns about reverse engineering (specifically decompiling). Can mitigate these by including no original binary code in the synthesised driver.
- Want to trace real hardware, deal with real-time constraints, handle OS API mismatches better.
- Aim is to get hardware vendors to provide formal specifications.
“Otherworld”: Giving Applications a Chance to Survive OS Kernel Crashes
- Cannot avoid software faults. 30–50% of TCO is spent on recovering from faults. Major vendors are now looking at dependability rather than performance.
- Concentrate on kernel faults because these are hardest to recover from. Can minimise consequences by: checkpointing (expensive), hardware redundancy (expensive and complex) or microkernels/SFI (not widely available).
- Afterwards, the kernel state is corrupted and cannot be trusted. This affects all running applications, but rarely corrupts application state: only 18% of cases. Application specific recovery reduces this to 1–2% of cases. So give the application a chance to restore its data.
- Can we reboot the kernel without destroying everything on the system? We should be able to. Kernel is logically well isolated from other components.
- Challenge: kernel contains critical data belonging to applications. e.g. the page tables, file descriptors, sockets, etc. Need a component that can manage these after a fault.
- Idea is to have two kernels: a main kernel and a crash kernel. Main kernel is active by default, but transfers control to crash kernel on fault. Crash kernel sucks information from the main kernel when it is initialised, and takes over the system. Processes need to register a crash procedure which is analogous to an application exception handler: called when the kernel crashes.
- On crash, we have an initialised, undamaged kernel, but the main kernel data is preserved and accessible to the main kernel.
- Main and crash kernels are the same version of Linux. Hardware state is not currently preserved. Crashes detected are limited to those detectable by the OS itself.
- The crash kernel recovers state for each process (could use checksums to detect corruption here), creates a new process descriptor, copies the address space from the main kernel and restores resources belonging to the process.
- Crash kernel calls the resurrection function in the application to notify it that a kernel crash has occurred.
- Finally, crash kernel reclaims the resources occupied by the main kernel, and loads a new crash kernel. Now we again have a fully functional system.
- Automatic resurrection of: application physical memory, pages swapped out, mmap’ed files and open files. Currently don’t restore: network connections, pipes, framebuffer and threads.
- Benefits: no runtime speed overhead, fixed small memory overhead, applies to monolithic and microkernel OSs, requires no special hardware, and restores (an unlimited amount of) the most recent state.
- Evaluated on Linux 2.6.18, using kdump to load crash kernel. Tested with four classes of application (an editor, database serer, web server and Berkeley Lab Checkpoint/Restart). Considered kernel crashes triggered by BUG_ON(). Faults are injected by triggering the failed assertions (not corrupting memory).
- Added 25 lines of code as a crash procedure to the JOE text editor. Simply saves all opened documents and restarts the editor with the same open documents. Totally transparent to the editor’s user.
- Added 50-line crash procedure to MySQL that retrieves in-memory tables, saves them to disk and restarts the server. Also 25 lines to the start-up code to retrieve this saved data. Gives reliable in-memory databases, which are 1.5–140 times quicker than disks.
- Added 1500 lines of code to the Linux kernel.
- Still need to work on detecting and preventing data corruption, keeping applications running without interruptions, resurrecting other resources and resurrecting groups of interacting applications.
Deprogramming Large Software Systems
- We already have programming and compiling, and decompiling. Can we extract programmer thought by deprogramming?
- Use case: a programmer has to maintain a system that he/she has to written, which contains hundres of classes. May have access to documentation, but has access to source code. Goal is to understand how it works and be able to fix it.
- One approach is to recognise patterns in class naming (e.g. NetworkConnection). Or higher-level design patterns (e.g. MVC).
- Programmer needs a high-level description of the software: i.e. what the designer would say when describing the system. Deprogramming extracts these patterns automatically.
- Algorithm: 1. Statically extract dependencies from Java bytecode to get dependency graph. 2. Augment graph with dynamic analysis. 3. Analyse the graph to do automatic pattern matching of “ideas”. 4. Generate documentation.
- Dependency graph: highly expressive abstraction of the code. Nodes are code constructs (classes, methods, fields), and edges are directed relations. Omit details of variable names and sequencing of operations. Should be sufficient to detect design patterns.
- Dynamic analysis can e.g. show what concrete class is invoked through an abstract interface; track uses of reflection.
- Patterns are defined in a DSL that allows the specification of subgraphs. Allows the expression of any subgraph and can match a wide range of patterns.
- Pattern matching is equivalent to subgraph isomorphism problem. Uses the VF2 algorithm with some modifications for type-checking.
- Can search for anti-patterns and identify targets for refactoring. Potential for auto-refactoring of these cases (not done at present).
- Can also detect semantic-level copy-paste. Compare the structures of two programs by comparing their dependency graphs.
- Can generate documentation from this, e.g. UML diagrams from the patterns that exist.
- Can identify the fingerprint of a particular programmer: he/she will have preferences for particular patterns or combinations of patterns, or modularisations. Could extract this and compare to a database to find the author (e.g. for finding out who wrote malicious code). Assumes everyone is unique, though.
- At present, don’t infer patterns (but that would be useful). Only identify known patterns.
- Plan to evaluate the speed of the analyses and the pattern matching. Pattern matching efficiency is on the order of a couple of minutes (for small patterns). Also want to measure precision: no false negatives because algorithm is exact, but some false positives arise from an insufficiently precise definition of the pattern graph.
- Possibility that anti-patterns do not appear in such consistent forms in the dependency graph.
Redundancy
[Unfortunately, I was presenting in this session, so I didn't take notes on these talks.]
A Spin-Up Saved Is Energy Earned: Achieving Power-Efficient, Erasure-Coded Storage
- [A nice summary can be found here.]
Spread-Spectrum Computation
- [Hard to be objective about this one, as I was the first author and presenter, but you can read the paper, and a summary of the presentation here.]
Modeling
Toward Quantifying System Manageability
- Operations cost 50–80% of IT budget, many billions of dollars a year. Should make manageability a first-order design goal: currently we have too many knobs to tweak. Oracle 10 has 220 initialisation parameters, 1477 tables of system parameters and an 875-page admin manual.
- Engineering requires measurement. Want to be able to compare systems to one another and assess progress. Today, proposing an objective, quantitative metric that is directly measurable. Not qualitative. It’s easy to show that Oracle 10g is better than IBM DB2 and vice versa, depending on the reviewer.
- Manageability is a predictor of the amount of human effort and risk of mistake involved in keeping a system operating at a satisfactory level. Management is a set of tasks, a task is a sequence of atomic steps. Task complexity is estimated by the number of atomic steps (intuition is that many steps means many opportunities for mistake).
- Do tasks include troubleshooting? It’s possible that the majority of management time and effort is spent there. Might be hard to quantify this.
- Manageability is time period over the sum for each task of the product of the number of steps, task duration and a weight which is based on the frequency that a particular task arises (opportunity for a standardisation body to come in here and suggest these). Non-dimensional measure, defined as being a “Management Unit” (MU). Relies on being able to certify atomicity: perhaps using an online reputation system.
- Choice taken to exclude many things from the formula: want to keep the manageability metric manageable. Want it to be easy to remember and apply. Principle is to skip aspects whose absence simplifies the metric more than the loss of accuracy (e.g. system size, data volume, using averages instead of distributions).
- Metric is: objective, directly measurable, simple, intuitive and widely applicable. (Question of whether the set of tasks is really objective, but could be standardised.)
- Need to come up with lists of tasks that define the management workloads for each class of systems. The list would have a count of the number of times each task would be carried out over a specific period of time (e.g. 3 years): these give you the weights.
- Example of installing Oracle on Linux. This is a single task, with 3 sub-tasks, all defined in the admin manual.
- Possible to include MU measurements in business decisions, assuming the cost of an MU (or the MU capacity of e.g. a DBA). (Question of whether you could spend less management time for a less optimised system, and how that would affect the cost trade-off.)
- Patterson quote: “For better or worse, benchmarks shape a field.” So this is an attempt to shape the field.
Towards Automatic Inference of Task Hierarchies in Complex Systems
- What do we mean by complex systems? One that is so complex that we can’t reason about it intuitively. So we tend to simplify and model them. Developers can represent a system as a hierarchical task model, which allows the encapsulation of implementation details as high-level tasks: e.g. in map-reduce we care about the map and reduce functions, not invidual message sends and receives (or the details of fault-tolerance, etc.).
- The work is to try to infer these models with minimal manual assistance from real systems. Tool called Scalpel, evaluated on Apache and the PacificA storage system.
- Give a number of function calls, want to identify task boundaries, and correctly associate dependencies between tasks.
- Tracing of calls and the parameters to signal, wait, send and recv. Use R2 in the implementation.
- Identifying leaf tasks as the smallest unit of work in the task model. Look at synchronisation points in the trace, and reason the dependency only happens on these boundaries.
- Constructing a causal graph, using the happens-before relation, but want to distinguish between causal dependency and things like mutual exclusion. Use heuristics, like observing queues and event-based code.
- Want then to identify frequently-occurring patterns in the causal graph, which are then recursively replaced with super nodes. This identification done by using hash functions for fingerprinting or by canonicalising the sub-graph and serialising it deterministically.
- Case study of a performance bug in PacificA. Discovered an issue during stress tests. Used task-level profiling based on the inferred task model, and took a top-down approach to identifiying the problem. Saw that the committing task could not saturate network bandwidth while its CPU usage remained low. Discovered that the sender was sleeping for a second during the send-packet task, to approximate flow control.
Consistability: Describing Usually Consistent Systems
- Distributed storage systems implemented as a group of replicas acting as a data store. Ideally the system would behave like a single replica.
- Consistency semantics define the guarantee provided by the storage system about what values can be read from it, both with and without concurrent operations. Existing notions focus on the worst-case guarantees, but don’t look at how often the worst case might arise.
- Opportunity exists to improve these notions to include a dependency on the operating conditions (e.g. the number of failures that might be seen), and the probability that these conditions might arise.
- Consistability: consistency guarantee dependant on operating conditions. Similar notion to performability. Enables the specification of SLAs that quantify provided consistency.
- Formal model of consistability as a mapping from set of failure scenarios to the set of achieved consistency levels. Consistability is the probability that a given consistency class will be satisfied given the expected distribution of failures.
- Performability is an open problem. So are transitions: what happens to consistency when you have reads and writes happening in different failure scenarios? Assume that these transitions are rare, otherwise the guarantees can be inaccurate. Also, can we verify SLAs based on consistability? This might require an oracle.
Distributed Systems
Byzantium: Byzantine-Fault-Tolerant Database Replication Providing Snapshot Isolation
- Database systems are pervasive and used for many applications. But they may exhibit Byzantine faults. Want to build a middleware system that uses an off-the-shelf DBMS and gives BFT.
- First version aims to be as efficient as possible, using an existing BFT replication engine.
- Usually talk in terms of ACID properties, but there are actually four ANSI-defined kinds of isolation, based on the set of phenomena that might be observed (e.g. dirty read, non-repeatable read, phantom read). Vendors have added a few more as well.
- Snapshot isolation: (logically) take a snapshot of the database, run transaction on the snapshot, and, on commit, abort transaction if has a write-write conflict with any other committed transaction since the transaction began (otherwise apply it). Supported by many vendors, easy for read-only transactions, and gives serialisability under common database workloads.
- BFT replication algorithms exist, but tend to have a high overhead.
- Don’t want to change the DBMS or applications (hide details in the JDBC driver), and build system on top of PBFT.
- Goal is to be efficient, but BFT operations are costly. Need to minimise the number of operations that run the BFT algorithm. Transactions help us: BEGIN and COMMIT are BFT operations, but the regular DB operations in the transaction are not.
- On BEGIN, client chooses one replica as the coordinator, and does a BFT operation to ensure that all replicas have started a transaction. On a read, only the coordinator is invoked. Same for a write. For commit, a list of operations and the results is sent using a BFT operation to all replicas. The coordinator checks if snapshot isolation properties hold; the other replicas execute the operations, and, if the results are different from the logged ones, it aborts; otherwise commit only if snapshot isolation is maintained.
- Correct replicas compute the same results because they execute in the same snapshot (due to total order on BEGIN and COMMIT). If there is a Byzantine replica, a quorum of correct replicas will still commit. If the coordinator is Byzantine, all correct replicas will abort on obtaining different results.
- What about Byzantine clients? Just need to do some simple checks: that IDs are valid, that BEGIN occurs before COMMIT/ROLLBACK, and that (at the coordinator) the list of operations is the expected one.
- Handling ROLLBACK could just do a BFT ROLLBACK operation and make no change to the database, but a Byzantine coordinator might caused the client to observe incorrect results that led to the ROLLBACK decision, so instead run a ROLLBACK in the same way as a COMMIT (which informs the client if it was rolling back due to incorrect results from the coordinator).
- First database that tolerates Byzantine faults while providing snapshot isolation (previously only serialisability) which gives better concurrency.
Pretty Good Packet Authentication
- At present, packets can’t be authenticated. It’s possible to spoof the source address of packets, or change the contents of a packet in the network. This causes problems with e.g. spam blacklists, false accusations (of contacting an illegal BitTorrent tracker), unverifiable complaints (of e.g. port scanning) and plausible deniability.
- There is a spectrum of possible solutions, from weak to strong: status quo, additional ingress filtering, IP traceback, clean-slate designs (AIP), cryptography and biometrics, brain scanner! But these strongest solutions are infeasible, and have privacy issues. Clean-slate designs would be nice, but it’s too expensive to replace all the routers in the network. PGPA is in the middle of the spectrum.
- Given a packet, a source address and a timestamp, the ISP owning the source address can verify whether the packet was sent at approximately that time. Requires trust in the ISP.
- PGPA protects user’s privacy. You must have the entire packet to ask a question about it. (And there is sufficient information in a packet that it is difficult to guess what the packet might look like for, e.g., connecting to a given website.)
- For spam blacklists: the recipient can ask the ISP that a packet appears to come from if it actually came from there (so no point in spoofing the source address).
- For false accusations: it’s now possible to check if a packet actually came from the accused’s machine (although we don’t know who sent it).
- For unverifiable complaints: can now verify if the port-scan packets were actually sent.
- Question of what if the packets come from a hijacked, innocent machine, e.g. via a botnet. So attacks would be launched at the application level, and the IP address is immaterial.
- PGPA limitations: associates packets with addresses, not users; reveals only that packets were sent but not why; and assumes no collusion between ISPs and users. But very simple, effective against real-world problems, compatible with anonymity, privacy-preserving, straightforward to implement and plausible to deploy.
- PGPA needs to keep a record of past traffic: a timestamp and a hash. These monitors should be placed on the access link, making deployment easier because the backbone is not changed. Could be at the user’s premises, which is inexpensive and scalable, but makes it possible for the user to destroy the device (and hence evidence). Could be at the ISP, which is easy to deploy, but relies on user trusting the ISP to report correct results. Could be at both, which requires less trust, but more overhead. Device could be tamper-evident.
- Storage requirement for a DSL connection: 1Mbps upstream fully utilised with 40-byte packets, means 3125 packets/s. Storing a SHA-1 hash and a 32-bit timestamp per packet would need 187Gb/month. Would require a single hard disk per user in the worst case.
Dependable Self-Hosting Distributed Systems Using Constraints
- Cloud computing infrastructure is becoming more prevalent. These provide different computation and communication resources, with their own pricing policies. More and more applications are getting deployed on these clouds because there is no need to buy physical resources. The scenario is trying to deploy an online service, but how do we select appropriate providers and resources? Want to save money here.
- One possibility is to deploy a service with a single provider, renting three instances in the US, and two in Europe. As the load from different parts of the world, you could rent additional resources in each market. But there’s a risk of correlated failure (e.g. all the machines in the US go down at once).
- Another challenge then is how to respond to outage from a particular provider.
- What if the requirements change? For example, the service could suddenly become big in China (but the service is hosted in the US or Europe). So we need to adapt to changes in application load.
- The pricing policy might change. For example, a cheaper hosting service might come online in China, but if you just use a single provider then you might miss this benefit.
- Approach is to bundle the management logic with the application instances. Use P2P-style self-organisation.
- Need a management runtime that lets application deploy itself, by monitoring load, resources, pricing, etc. It should adapt to changing conditions, acquire and release VMs as needed, and deploy new instances dynamically.
- Has been seen before in “worms”, intelligent agents, and autonomic systems.
- Question of whether the system would be trusted with your credit card number: the eponymous constraints will constrain the budget.
- The addressed challenge is how to specify autonomous application behaviour. For example (PsEPR): 6–10 well-connected nodes, geographically distributed and lightly-loaded (fixed constraints). Aim to maximise CPU cycles, minimise network diameter and reduce budget and migration cost (dynamic constraints). This becomes a constraint optimisation problem, so apply Constraint Logic Programming.
- Constraint Programming: specify constraints, utility function and cost function (i.e. objective function). Logic Programming: heterogeneous resource information.
- System called Rhizoma. Nodes running it (each node that a customer is using) form an overlay: a single node acts as coordinator. Need a monitoring system like CoMon in PlanetLab: could be provided by a third party.
- Initially deployed on PlanetLab: more challenging than cloud providers due to heterogeneity and flakiness. (Soon aim to do this on EC2.) Measured utility over time, in terms of CPU utility over time.
- Key ideas: distributed systems can be self-hosted, and use CLP to express desired behaviour. This should lead to better expressiveness, optimisation (of performance and cost) and responsiveness (to changes).
- Open questions: scalability of the CLP solver (use different reasoning algorithms, but don’t waste time on getting optimal solution (which might be more costly than having a suboptimal solution), and just use some heuristic knowledge to speed things up), application usage of CLP (reporting application and platform integration (trying different clouds, and running on personal networks (collection of personally-owned resources)). Need to have applications that are able to request more resources.
- Question of how to expressive the requirements of the application: other approaches have foundered at this point, will CLP be better? CLP appears to be quite high-level for specifying the utility and cost functions, and the constraints.