EuroSys
The European Professional Society on Computer Systems

European chapter of the
Special Interest Group on Operating Systems (SIGOPS)
of the Association for Computing Machinery (ACM)
Home Join or renew membership EuroSys for Students EuroSys for Faculty Job Offers Activities Systems Directory Systems Events and Blog Eastern Europe Initiative Member Area Member News Officers and Volunteers Useful links Press releases

OSDI 2008: Day 1

Opening Remarks

  • 193 submissions with three reviews each, through a two-level PC (16 heavy load and 17 light load). Three rounds of reviews.
  • At PC meeting, each paper had reviews from at least 5 heavy-load PC members.
  • 26 papers of which all were shepherded
  • Record attendance: 466 participants, 113 student grant awardees (some of whom stood up when asked).
  • 6 co-located workshops.
  • PC drawn half and half from industry and academia.
  • Used HotCRP and Banal for managing submissions.

Cloud Computing

DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language

  • Awarded a best paper award.
  • Problem is how to write distributed data-parallel programs for a compute cluster. Want to provide programmers with the illusion of programming for a single, sequential machine. Want it to scale up to multi-core and distributed resources. Want to use familiar languages and tools.
  • DryadLINQ programs written in Visual Studio, a plan is generated by DryadLINQ, and the program is executed in the cloud.
  • LINQ = Language INtegrated Query, available in current versions of Visual Studio. Provides traditional relational operators to work on datasets. Integrated into .NET languages (can invoke arbitrary .NET functions). Data elements are strongly-typed .NET objects. Extensible by adding custom operators and execution engines. DryadLINQ exploits this last fact.
  • DryadLINQ is one of many LINQ providers (also LINQ-to-object which is sequential, LINQ-to-SQL which uses a SQL database, and PLINQ which exploits multicore). The same program can run on any of these and scaled from single core up to cluster-level.
  • Dryad manages the execution: a general purpose distributed execution engine, in use for several years within Microsoft. A general DAG, with edges representing files, TCP, FIFO, etc. for communication between vertices (representing execution steps). Provides a vertex scheduler and fault tolerance.
  • Simple example: word count. SelectMany operator can take one dataset and return potentially many objects for each element (e.g. words from a document) — like the map function in MapReduce. Also GroupBy (regular SQL interpretation: combines identical keys into one dataset each) and Select (run once for each element).
  • DryadLINQ generates a Dryad execution graph, and code for executing the vertices and managing serialisation between the vertices. Output is stored in the cluster, but the client can obtain it through a DryadTable interface.
  • First step: generate distributed execution plan. Similar to optimisation in a parallel DB. Static optimisations include pipelining and eager aggregation (e.g. MapReduce combiner). Also dynamic optimisations: data-dependent partitioning (range or hash-based), dynamic aggregation, etc.
  • Also code generation: vertex code, serialisation code, callback code (to enable runtime optimisations).
  • LINQ query is separate from its local context, which means that all referenced objects in the query must be serialised and shipped to the cluster machines, as must any referenced DLLs.
  • Reduce is implemented as a sort, group-by and reduce function (local count, in the word count case).
  • DryadLINQ operations are converted a graph of vertices, and then vertices are multiplied so that they may be executed in parallel.
  • More interesting example: PageRank. Written in about 10 lines of code in Dryad LINQ. First join pages with ranks and propagate updates. Then re-accumulate the score for each page. Can see that program invokes DryadLINQ and DryadLINQ invokes program functions.
  • Note that the output of one round is already partitioned, so there is no need for data shuffling on joining in subsequent rounds. A shared memory FIFO is able to pass ranks from one iteration to the next, so no need to go via stable storage.
  • Execution can be combined with PLINQ on a single machine. And with LINQ-to-SQL which can allow retrieving data from a SQL database (and some functionality can be pushed into the database). Also runs locally for debugging purposes.
  • Currently works with any LINQ-enabled language, multiple storage systems. Internally released within Microsoft, and now released to some academic partners.
  • Learned that deep language integration worked out well: allowing programmer to use his favourite languages, tools and libraries. The useful enablers were LINQ extensibility, and .NET reflection and dynamic code generation; also the generality of Dryad (DAG model and runtime callbacks).
  • Future goal is to use a cluster as if it is a single computer. Firstly, what can be written with DryadLINQ. How can performance and usability be improved? How should concurrent jobs be scheduled? How can caching and incremental computation be exploited? What can be done with static checking?
  • Question: what control does the programmer have over the compilation process? Answer is to provide a policy to the Dryad execution runtime. At present, DryadLINQ hasn’t really dealt with this problem.
  • Question: what about continuous queries, instead of batch jobs? Dryad is a batch-processing system, so DryadLINQ inherits this property: would be an interesting research area to look at how to extend Dryad this way.
  • Question: what are the syntax and semantics of the language (like the relational algebra for SQL)? No explicitly parallel instructions are provided: the code is written in the relational algebra, which is highly parallelisable, and the execution plan is generated from these automatically. C# 3.0 has a syntax and semantics specified for LINQ, and these are inherited. Recursion is possible.
  • Question: systems like MapReduce have made people unaware of the sheer cost of a computation (which might use millions of dollars worth of resources) has any headway been made on how to give the developer an intuition about the cost of a computation? At the moment, we are still learning about this: it’s like a new toy. But there has been work into looking at optimisations, which could reduce this cost.
  • Question: have you talked to the ML and data mining people about how to model the performance and cost of DryadLINQ? ML and data mining researchers are using the tool right now, but understanding the performance is ongoing work.

Everest: Scaling Down Peak Loads Through I/O Off-Loading

  • Problem is that of I/O peaks on servers: that which is short, unexpected and high. These are typically uncorrelated across servers, and across volumes on a single server, and can lead to high response times.
  • Example is a trace from a production Exchange server (for 5000 MS employees). 7.2TB across 8 volumes, with hardware RAID, NVRAM and over 100 spindles. Despite all this provisioning, it frequently hits peak load: and the response time and load are 20x the average. Peaks occur but are very bursty.
  • Write off-loading inserts an Everest client in front of volumes that are being managed. Everest stores are placed on other machines in the cluster, which can export spare capacity. Under light load, Everest simply passes through reads and writes. When load gets higher, the writes are sent opportunistically to another Everest store that has low load. The base volume sees lower load and the response time improves. When the load recedes, the base volume now gets reads and writes passed through, and starts to reclaim blocks from the stores to which they were off-loaded. Each node would typically have a client, and a small portion of space set-aside as a store.
  • Peaks are uncorrelated across volumes, so less-loaded stores are typically available. Peaks must involve some writes, so off-loading the writes means reads see less contentions. And we see few foreground reads on off-loaded data: if they are frequently written, they are probably in the buffer cache.
  • Challenges. Wanted to have flexibility to send any write anywhere (always the least-loaded store to maximise load balancing). Wanted reads always to return latest version. State must be consistent and recoverable: plugging in at the block layer means applications above expect persistance — need to track latest and stale versions (for garbage collection). Want no meta-data writes to the base volume (because it would presumably be under stress).
  • Recoverable soft state is the meta-data to track off-loads. A map from block ID to its location and current version. Also need this map for stale versions. This meta-data cached in memory. Each off-loaded write has a header with the range of blocks used and the location. Meta-data is persisted on stores, so no synchronous writes to base volume: the base volume only needs to know the set of stores, which is small and infrequently changing. For a client to recover, it contacts all stores and gets the relevant meta-data; for a store to recover, it need only scan disk.
  • Stores are optimised for short-term, write-optimised storage. Store layed out as a simple, circular log, in a small file or partition on an existing volume. Not log-structured, data will be reclaimed so no cleaner. The store monitors underlying volume load.
  • Reclamation is a background task: read any blocks from a store that have been off-loaded, then write them to disk locally and delete the range. In practice, multiple concurrent reclamation threads are used.
  • Correctness invariants and consistency are maintained. I/O on an off-loaded range is always off-loaded. Reads are sent to the correct location, but writes do so to ensure that the latest version is always recoverable. Version deletion is only allowed if the new version is stable and written to a store, or if the data has been reclaimed and all older versions are deleted.
  • Evaluation based on replaying the Exchange server trace (24 hours long). Chose time segments with peaks, and extended them forward to see the effect of reclamation. Local server had 14 disks and 2TB of storage, so used a 3-volume subset of the Exchange trace. Chose max, min and median loaded volumes. A store was created on each volume, with 3% of disk space reserved. Clients were symmetric (could off-load to the other two volumes). Response time is significantly reduced for both reads and writes, both for mean and 99th percentile (numbers are 10 times higher).
  • Substantial improvement in latency for a real I/O server workload. Cannot tell end-to-end effects from an I/O trace however, so ran an OLTP benchmark with a SQL server binary that used an Everest client for I/O. 10′ warmup for caches, followed by 10′ measurement run. Transactions per minute improved by 3x by write off-loading, with only one extra disk. Just using log-structured FS gives 2x improvement, using 2-disks striped also gives about 2x improvement. By striping and log-structuring, you get the same 3x improvement as write off-loading (but it is easier to manage write off-loading).
  • Not a panacea: only works for short-term peaks, relying on burst behaviour. Cannot improve performance 24/7, because idleness is necessary for reclamation to work: in the long term this could lead to contention. Also presume that data is reclaimed before the store is filled up.
  • Question: synchronisation problem between updating the metadata on the client and persistent data on the server? Wait for the data to become persistent, and then can be sure that we have the version.
  • Question: can a sys-admin specify certain data ranges that should not be off-loaded because you might want to read them back straight away? Can be configured on a per-volume level, or a per-file level. Conceptually possible to do at either granularity.
  • Question: any analysis of how much correlation between volumes can be adapted to before the system breaks down? Not yet, and it’s not clear what kind of synthetic tests could be run to vary correlation.
  • Question: need to do anything special to support journalling file system, where the order of writes is important? Assume that if concurrent I/Os are sent in, then you are agnostic about the ordering, and write barriers could be put in at a higher level for serialisation.
  • Question: could larger caches achieve the same benefits? They’d have to be very large caches, because it deals with bursts of gigabytes.
  • Question: does the interposition of hardware RAID introduce any additional failure modes? The new dependency is between the client and the stores to which the writes are off-loaded, and these may or may not have hardware RAID.

Improving MapReduce Performance in Heterogeneous Environments

  • MapReduce is becoming popular in industry: even outside the usual suspects, due to open-source Hadoop. Promise is to allow running large scale jobs on large clusters of machines (20PB/day at Google).
  • Also see utility computing becoming more popular, especially Amazon EC2, which lets you rent machines at a cost of 10 cents/VM/hour. Good for letting you try out a system at scale before paying for physical resources. But performance is not as predictable as if you would be running it in your own data centre. Lets you run experiments on 1000 machines for a price of $100 per hour.
  • Main result is that Hadoop is challenged by performance heterogeneity between nodes, due to contention. This breaks task scheduler assumptions. The new scheduler, LATE, reduces response time by a factor of 2.
  • MapReduce splits computations into independent parallel tasks. Hides the complexity of fault tolerance at this scale. One FT mechanism is responsible for the problems we see.
  • If a task fails, re-run it on a different node. If a node is very slow (a straggler), you run backup tasks. Stragglers could be due to a failing disk (recoverable read errors), or other reasons.
  • Backup task races with the original to see which wins, and when one finishes, the others are killed. Very easy to decide who the stragglers are in a homogeneous environment. But less obvious in the heterogeneous case.
  • Heterogeneity. VM is good at isolating the CPU and memory access, but I/O bandwidth is inherently shared, so disk and network contention means that everyone gets an equal share. The performance differs by 2.5x between machines that run with a single VM, and machines that share with others.
  • Backup tasks are run after all primary tasks are started, and nodes begin to become free. Tasks report a progress score from 0 to 1, and a backup task is launched if a task has progress more than a fixed threshold below the average.
  • This can lead to starting too many backups, which thrashes shared network bandwidth. It also might mean that (under heterogeneity), the wrong tasks are backed up, which stalls the backing up of the tasks which are actually stragglers. And backups could be started on the slow nodes. And if tasks start at different times, this can break (since tasks started later will obviously have less progress). For example, ~80% of reduce tasks arer backed up, most losing to the original tasks.
  • Idea (patented) is to use the progress rate instead of the progress values. This improves things, but can still select the wrong tasks as backups.
  • Key idea is to back up the task with the largest estimated finish time: LATE = Longest Approximate Time to End. Look forward instead of looking backward. On top of this, put in some sanity checks, capping the number of backup tasks, launching backups on fast nodes and only backing up tasks that are “sufficiently slow”.
  • To estimate the finish time, estimate progress rate as progress score over execution time. Estimated time left is the amount of progress left to make over the progress rate. Put at 10% cap on backups and use the 25th percentile to select slow nodes. These values were validated by sensitivity analysis.
  • Evaluated on EC2 (200–250 nodes) and a small local testbed. Self-contention generated by VM placement. Stragglers generated by background processes.
  • Ran a sort on EC2 with stragglers. Gain a speedup of 58% over native Hadoop, and 220% speedup over no-backup case (on average). 93% maximum speedup over native.
  • Heterogeneity is a growing challenge for parallel applications. Since MapReduce abstracts away the problems of programming at a large scale, it would be nice to abstract away heterogeneity as well.
  • LATE contributed back to Hadoop.
  • Question: surely the only variable that matters are the configurations of the machines on which the job are run, so why don’t you just use a greedy local strategy? More sophisticated strategies are possible.
  • Question: what about a small cluster where you have more tasks than nodes? First launch all the primary copies before considering backups.
  • Question: what if the individual tasks are heterogeneous? Currently assuming that tasks take the same amount of time, and that Hadoop is taking care of that. Would need to normalise by the end of the task, and prioritised based on expected gain.
  • Question: any advice for people who build utility computing services? Two things would have helped: would be good to have more monitoring and visibility into the system; and also to know the network and rack topology (to aim to place tasks close to the input data).

OS Architecture

Corey: An Operating System for Many Cores

  • About new OS interfaces to help applications scale with number of cores. Applications spend time in the kernel: e.g. Apache doing directory lookups and TCP processing. Even MapReduce spends time in the kernel. Want this time to scale with the number of cores.
  • Problem is contention on shared data structures in the kernel. Need serialisation and copying between caches. OS semantics requires these to be shared, unfortunately.
  • Current practice is to redesign kernel subsystems with finer-grained locking, RCU, etc., which has been lots of work for kernel developers.
  • Solution is to change the OS interface: applications don’t always need to share all this data. Want to allow applications to control how this is shared.
  • Three new interfaces: shares, address ranges and (not discussed) kernel cores.
  • Shares. Kernel must map an application-visible reference into the address of a kernel object: e.g. file descriptors. Measured the cost of using the FD table on a 16-core Opteron. Would expect or hope that dup()/close() operations (on different FDs) would scale linearly with the number of cores. But the drop in throughput from one to two cores is almost an order of magnitude. Throughput drops because of the high latency of retrieving data from a remote cache. On one core, it’s in L1; on an additional core, it has to go to the remote L1, which takes 40x as long. Also, the shared FD table is a bottleneck as accesses are serialised.
  • Can performance be improved? Some of this sharing is unnecessary, and could improve performance for these applications. How to figure out when it is and isn’t appropriate to share these structures? Solution is shares.
  • Shares let applications allocate a global table for FDs that are shared amongst cores, and allows FD allocation either in the shared or the global table.
  • Shares were added to Linux. FD system calls (sys_open, sys_dup, etc.) were extended to take a share ID argument. Can have threads with their own private FD table (using share_alloc() call), which gives the linear scaling we expected.
  • Few modifications to the kernel and applications required to get scalability.
  • Address ranges. Multiprocessor shared-memory application can be written to use a shared address space, or shared memory with mmap(). With a shared address struct, contention on mm_struct, even when modifying private mappings (e.g. cores growing their own stack). With private address spaces, one mm_struct per core, so no contention on updating private mappings. But, first time a page is touched, a soft page fault happens, which allocates physical memory if necessary: this will happen once per core with private address spaces, because there is no PTE sharing.
  • Neither option accurately represents how kernel data structures are being used.
  • Address ranges avoid contention while sharing PTEs. Each core has a private, root address range. For shared memory, you can add a child address range to the root, and map it into each root. Contention only when both manipulating the shared AR. Page tables corresponding to address ranges can also be shared. Therefore the soft page fault only happens once.
  • Also good for complex memory sharing patterns: not just shared or private.
  • Building an inverted index with MapReduce. Each core gets a partition of the input, builds up buckets with the locations of each term (map task). Reduce task on each core generates the inverted index. Goal is to have no contention when growing the address space using map (due to buckets growing). PTEs should be shared between map and reduce tasks.
  • Each map task’s bucket array is a different address range. The PTEs are then shared with all the reduce tasks, which means no additional page faults.
  • Address ranges implemented in Corey, with a low-level kernel interface for mapping memory and address ranges. Performance evaluated by building an inverted index of a 1Gb file. Throughput improves better on Corey than Linux after 8 cores, then Corey is 25% better than Linux on 16 cores.
  • Built on research on NUMA operating systems and multicore.
  • In future, finish Linux interface changes, to experiment with larger workloads, and answer how much of the interface needs to be changed. Also want to look at how to use caches better to reduce the cost of manipulating shared data. How can the large aggregate cache capacity be exploited.
  • Also see kernel cores, in the paper.
  • Question: memory subsystems at the moment are dumb, but what if you had an ALU in your memory subsystem which could do atomic operations for you? Probably a lot.
  • Question: were the shares evaluated using more-realistic workloads? In Corey, shares are used for everything, and there was lots of contention on these shared data structure. Some Apache benchmarks scaling up to 16 cores were showing lots of FD table operations.
  • Question: how could the ideas of reducing sharing be applied to other systems like databases or libraries? Haven’t thought about it.
  • Question: as the number of cores increases, wouldn’t any sharing kill you? In HPC systems, message-passing is used, so wouldn’t this be a better approach? The focus is on system level applications (not HPC workloads), but interested to look at how difficult it would be to implement applications without shared memory. Seems like you have sharing globally across the whole chip, or keeping things private, so as that chip gets bigger, is having global sharing a scalable model? The Corey model allows semi-private shares, as well as global or private.
  • Question: when there are private copies of things, there are issues with consistency, when doing something like a global unmap, aren’t there? Is there an operation to globally unmap a page? Not dealing with private copies, rather private address spaces, so that operation wouldn’t exist.
  • Question: reminded of early NUMA work, but is it possible for the OS to automatically pull out the locality? A topic for future work, but it is in some ways to tell the OS exactly what you want, and then get it. But it might be possible to extract some benefit.

CuriOS: Improving Reliability through Operating System Structure

  • Looking at faults in systems: hardware (stuck-at faults, bit-flips, unresponsive devices, soft error rate a growing problem) and software faults (race conditions, invalid pointers, uninitialised variables) with something like 20–30 bugs per 1000LOC.
  • Could manage OS errors by writing OS in a type-safe language, or prove using a formal model. The other approach would be to let an error happen and then recover from it (microkernel approach).
  • Restart-based recovery: when a microkernel server fails, it dies and is recreated — hopefully everything is fine. Tested this theory on Minix3 and L4/Iguana with fault injection. Minix3 has inter-component isolation, and has a reincarnation server to restart failed services. Injected faults in the network service, which can restart, but all existing network connections fail. On L4, killed the timer service, but all OS components registered to receive periodic interrupts stop working. On Chorus, restarted the file system service, and all file handles became invalid and the system is unusable. On EROS, there is no protection between OS components, but it does periodic checkpointing of complete system state, so can recover by loading from a checkpoint. This works, but there is overhead with checkpointing.
  • Problem is that there is server state that is lost when the server is restarted. The fresh state doesn’t reflect the existing connections, handles, registrations etc.
  • Restart-recovery works for stateless components. To get client-transparency in stateful components, you need: transparency of addressing, client suspension during the recovery, client-related state persistence, and client-related state isolation (errors in the service mustn’t corrupt it).
  • No microkernel tested provides all four of these. None provides isolation of client-related state.
  • CuriOS partitions server state per client and moves client-based information in separate regions of memory (notionally with the client), persisted across restarts. “Server state regions”: isolated from the clients and the server.
  • When a client interacts with the server, the client info is mapped temporarily into the server’s address space. On response, it gets unmapped. This gives error containment: it can only affect the client which is currently interacting with the service.
  • CuriOS is object-oriented, written in C++, and uses hardware-based protection to isolate the components. CuiK kernel runs in privileged mode, all else in unpriviliged mode.
  • Protected objects are the unit of isolation (process), with restricted privileges, and private page tables, heap and stack. Implemented in C++ using inheritance.
  • Single address space OS with different access permissions for different components (protected objects).
  • Server state management: SSR created per client-server binding, destroyed during unbinding, and mapped using PTEs. Service is restarted if an internal exception is raised. Can access SSRs during recover to recreate internal state, and check SSRs for consistency. If multiple restart-retry attempts fail, need to escalate the exception to the client. Might sacrifice one client (e.g. that gave a malformed request) to keep the rest of the system running.
  • Gives all four necessary properties for restart-recovery of stateful services.
  • Periodic timer manager provides notifications to clients. In SSR, store timer period and start time. Server stores list of pending notifications. On restart, server recreates its queue by looking at all of the SSRs.
  • Network service based on LWIP, converted to C++. All failed assertions (checking data structure consistency) throw exceptions, and state management code modified to use SSRs. Each socket object is a client, and the server is a TCP service protected objects. After crash, the IP configuration is reloaded from configuration: in the worst case you might miss a couple of packets, but TCP will obviously recover from that.
  • Implemented for mobile devices on ARM (due to cellphone company funding). Run on TI hardware, and QEMU. Did fault injection using QEMU, with memory aborts and register bit-flips. System is recovered if it can run processes and access its file system. Also recovered if at most one client was affected by the recovery. Get over 87% recovery. 100% recovery for memory aborts, but less for register bit-flips with error propagation.
  • Performance evaluation, measured number of additional instructions (microbenchmark on a protected call) and the time overhead. ~4000 extra instructions and 200–400us time overhead.
  • Memory overhead due to additional page tables per protected objects, but memory is cheap (could use Mondriaan memory protection).
  • CuriOS built upon ChoicesOS code base. The services each required 50–100 extra lines of code.
  • Focus was on error confinement and reducing error propagation. Error detection was orthogonal, and a better scheme would give better recovery.
  • Question: how does exception escalation work? Only propagated to a single client, which is allowed to fail.
  • Question: is it always safe to reuse client data from before restart, after the restart? Any inconsistency should be resolved by the server on restart, when it has access to all the SSRs.
  • Question: SSR management is implemented as a kernel service, but could it be a user-level service, and how would you handle failures of the SSR manager? Assume the SSR manager is a small service, so it won’t fail (and hence it would be fatal). Could be implemented at a user-level, as it is just a mapping store.
  • Question: how many of the evaluated bit-flips ended up corrupting memory? They ended up causing bad memory accesses, which is how they were detected. Some of the bit-flips ended up being no-ops. About 70% of the bit-flips don’t manifest as any error. In the evaluation, only the visible ones were counted.
  • Question: type-safe OS would not help in many of these cases (myth: busted). Is it overly weak to rely on a component detecting its own faults? Just using a fault detector that we got for free. Should the fault detector be isolated itself? Yes, it could be. Detection is basically an orthogonal problem, and we’re only talking about recovery.

Redline: First Class Support for Interactivity in Commodity Operating Systems

  • Want better responsiveness. Applications are increasingly interactive (GUIs, multimedia). However, if you have a concurrent kernel compilation while watching a video, you get massive jitter in the frames per second. Redline only drops a couple of frames, and only occasionally.
  • Problem is resource management. Resource managers don’t take response time into account. e.g. LRU page replacement is bad for an interactive process, which doesn’t use memory often. Also CPU, memory and I/O contention.
  • Interactive systems are between real-time systems and general purpose systems. Some slow-down can be tolerated, as long as the system reacts quickly. Relatively low performance guarantee but better resource isolation (than a general purpose system).
  • Want to maintain system responsiveness even when there is heavy overload.
  • Redline does this by coordinating multiple resources (CPU, memory and disk I/O) with a lightweight specification of which processes are interactive. Uses adaptive admission and load control, using recent history as feedback.
  • A specification first passes through admission control, and then hits the CPU scheduler, I/O management and memory management. Feedback then from these units through the load manager.
  • Specification management by associating binaries with a few parameters. A single mouse click can touch many processes (X, kernel threads, daemons, user-selected application). Anything that might directly or indirectly impact the response time is included in Redline’s reckoning. A specification is a pathname, a binary interactive type, a CPU reservation (ratio of time reserved), memory protection period, disk I/O priority and inheritable/revocable flag. Revocable means that an interactive application can be made best effort under load.
  • Currently, CPU bandwidth allocated in a relative manner. If you launch lots of processes, priority boost becomes less useful. Redline allows reservation of CPU bandwidth, then uses a earliest deadline first scheduler. Interactive tasks go into EDF and proportional-share runqueues. So after the reservation is used, you can still use proportional sharing. EDF requests are processed first, before proportional shares.
  • Redline tries to be permissive, accepting the task first, and solving overload when necessary. The admission test is permissive, based on actual load and specifications. Overload is managed by revocation (which can be reversed under better conditions).
  • In standard virtual memory, interactive apps are vulnerable because they don’t touch pages very fast, so they are vulnerable to LRU, so the OS will shrink the working set size, leading to more pages being evicted, more page faults and therefore an even lower memory reference speed: a vicious circle.
  • Therefore Redline protects the working set size of interactive applications. Pages are protected for some period, by default 30 minutes. If the working set can’t be held in memory, revoke the interactivity.
  • Applications that occasionally allocate memory will run into page reclamation, affecting response time, so a small reservation is made for applications that need this: fast small memory requests (but can’t take too much). If best-effort applications request memory too much, a speed bump is inserted, to slow memory reference.
  • Disk I/O management stretches through page cache, journalling file system and the block device layer. The page cache caches dirty data that goes to disk. A journalling file system writes data first, followed by metadata. If an editor saves a small file, it will block because the pages must be laundered; fsync() will give a long delay before committing (20–30s). Solution is to keep different dirty thresholds for each type of process. Best-effort tasks must launder pages earlier which prevents large compound transactions.
  • Evaluated a prototype in Linux 2.6.20.5 with CFS patch. Specifications for 100+ applications, and 70+ kernel threads.
  • Looked at a fork-bomb. Both a best-effort or interactive fork-bomb will barely introduce jitter into a running video. Linux CFS will barely show any frames.
  • Same for a malloc-bomb: Redline doesn’t do quite so well (still much better than CFS), and shows a couple of large jitters, lasting a couple of seconds each (worse when the bomber is interactive).
  • Also for heavy direct I/O, Redline does very well.
  • Question: is the application at the right granularity to address this: some applications might have both background (best-effort) and foreground (interactive) tasks? These applications would be split into threads, so could do this at the thread level.
  • Question: what is the effect on the end-user of rejecting requests at the specification level? Will the computer stop doing what you want it to do? Or should I restart my laptop to get it back to a usable state? Redline never kills anything. It’ll never be worse than a normal Linux system. No tasks rejected, but it might not get the guarantee that you hoped for.
  • Question: why were different specifications used for each resource when in a unified model, you would expect to use the same thing for all kinds of resource? Different kinds of resources need different kinds of reservations. Redline attempts to have sensible defaults (memory protection and I/O was not varied in the shown experiments).
  • Question: though a mouse click touches many processes, how does this priority propagate through the system? init is specified as an interactive application, allowing all applications to be interactive for a short period of time. So after a process has forked, you don’t know what it will try to execute. So it gets to be interactive for a millisecond. When the exec is made, the specification is consulted.
  • Question: lots of opportunities to starve best-effort processes, so how would this compare to the Real-Time Linux scheduler (since CFS is supposed to be more fair)? CFS is based on fair queuing, but doesn’t guarantee how much resource will be received. So we need some way of isolating this: some starvation might be necessary.

Monitoring

Network Imprecision: A New Consistency Metric for Scalable Monitoring

  • Challenge is how to safeguard accuracy during network failures. Monitoring is important.
  • Motivating example: detecting DDoS attacks from PlanetLab: want to scale to a large number of nodes and high data volume. Want to detect attacks in real-time. And need accuracy despite failures: the key challenge addressed by this talk.
  • Example: computing traffic rate to each destination IP from all PL nodes. Half of the reports deviate by >30% from the true value. 20% of the reports deviate by >65%.
  • Network imprecision bounds the inaccuracy/uncertainty due to failures. Converts best-effort results into reliable results. It will tell applications whether or not to trust the accuracy of the result. By quantifying the system stability, it’s possible to improve accuracy by 5–10x. Implemented efficiently even with active probing of nodes, by explointing overlay symmetry, making it O(N).
  • Current techniques for scalable monitoring. Aggregation trees spanning all nodes in the system (using DHTs) can be used to summarise information. If an internal node fails, the entire subtree is lost; reconfiguration can cause double-counting. So improving scalability risks failure amplification.
  • Or could do arithmetic filtering. But if a subtree doesn’t send updates, is this because it hasn’t significantly changed, or it’s suddenly become unreachable?
  • Or could do temporal batching, by collecting individual updates at internal nodes, and combining them into a single report. But a short disruption can lead to a large blocking period across a large part of the tree.
  • Want to find out the top 100 destination IPs receiving highest traffic from PlanetLab. As the batching interval increases from 15–60 seconds, the load reduces by 5x, but there are diminishing returns. As the arithmetic filtering (deviation from average) becomes more liberal, there are 10–30x reductions in the probing bandwidth.
  • These scalability techniques imperil accuracy.
  • Do we face the CAP dilemma? Need to accept that the system is unreliable, due to node failures, link failures and overlay reconfigurations. Instead, quantify system stability. If the system is stable (all nodes are up), we can trust the monitoring system absolutely. Otherwise, the result should be distrusted. But in large-scale systems, stability is not binary.
  • Many sources of instability that manifest themselves as missing/delayed updates, and double-counted updates.
  • Missing/delayed inputs: quantify N_reachable (lower bound on number of nodes whose recent inputs are guaranteed to be in the result) and N_all (estimated number of nodes in the system). So now we know how much of the answer comprises stale inputs.
  • Double-counted updates. Define N_dup, an upper bound on the number of nodes whose inputs may be double counted. So now we know how much of the answer is impacted by overlay reconfigurations.
  • Together, these metrics are useful to expose the impact of disruptions on monitoring accuracy.
  • NI is application-independent, inexpensive (efficient to compute) and flexible: i.e. it is a mechanism. Applications can set the policy themselves (by filtering inconsistent results, performing redundant aggregation or on-demand reaggregation). No one-size-fits-all.
  • By applying NI-based filtering, can get 80% of reports to have <15% error. Do this by choosing results where NI is less than a particular value. NI is the proportion of unreachable nodes plus the proportion of duplicated results. So the technique here is to discard results when imprecision is high.
  • Another technique is redundant aggregation, using e.g. k trees, and then pick the best result. One of the trees is likely to route round disruption and hence give a more precise result. Can measure the value of k which gives the best improvement in NI (k = 4 gives a 5x improvement in accuracy). By combining the two techniques (filtering and k-aggregation), can get a 10x improvement in accuracy.
  • Implementing NI is simple: just a count aggregate. However it is difficult to implement these efficiently. Requires active probing of each edge in each tree, which would naïvely require a lot of messages.
  • Insight is that DHT-trees form a butterfly network. Common calculations can be reused across different trees, which reduces load from O(N * d) messages/node to O(d * log N) messages per node. For N = 1024, this requires about 5 messages/node/second.
  • Question: an extension to quality of information from the DB community. Wouldn’t it be nice if I could ask a monitoring system for a particular level of accuracy or NI or whatever, and could you construct me a system that gives me that level of accuracy? The approach has been to separate the measurement mechanism from the policy that sets the accuracy bounds, but you can’t guarantee an accuracy bound due to the CAP problem.
  • Question: looks like aggregation techniques from sensor-networks, so why not look at order- and duplicate-insensitive aggregation functions from that literature? These are complementary because in principle these try to minimise the impact of disruptions on accuracy. They are a policy that can be applied in this framework.

Lightweight, High-Resolution Monitoring for Troubleshooting Production Systems

  • System called Chopstix. Motivation is that there will always be bugs in production systems, aggravated by increases in system complexity, despite analysis and testing frameworks (tools are far from perfect). Need to be able to deal with them when they do happen.
  • There are easy and hard bugs. Easy bugs come with a precise description of the circumstances in which they happen. (e.g. trying to do a ping and finding that an address family is not supported by protocol.) The hard kind would be when the ping time is all over the place.
  • Hard bugs are hard to characterise, reproduce, trace to a root cause (ambiguous or temporally removed from the root cause) or intermittent and unpredictable.
  • Example, the kernel crashing intermittently with an out-of-memory bug on PlanetLab. Or the kernel freezing every 1–7 days. Or ping latencies increasing dramatically.
  • Medical analogy: easy problems are single, localised injuries, while a hard one might be a vague sense of malaise.
  • Chopstix treats systems as patients (using vital signs, symptoms and samples), and correlate all this information to pronounce a diagnosis. Monitor the whole system (holistic approach), continuously.
  • Strategy is to monitor low-level events (scheduling, I/O, syscalls, memory allocation, cache misses, etc.) which are omnipresent and capture changes in system behaviour. The magnitudes of these events are called “vital signs”. A deviation is a “symptom”.
  • Obviously can’t log all of these events, so could do uniform sampling, which is biased in favour of high-frequency events. So use frequency-dependent sampling, which uses a sketch (approximation of the frequency distribution of a set of events). Sampling rate is a function of the instantaneous frequency (in this case, logarithm of the frequency).
  • On an event trigger, extract a signature of the event, hash it, and update the sketch (which approximates the instantaneous frequency). If a sampling decision is made, collect the sample, and perform logging. Everything apart from extracting the detailed information is lightweight. This is efficient because of the principle of locality (for e.g. syscall used, it’s likely to be the same as the last one). At the end of an epoch (60s), store the log on disk, and update the sketch.
  • Chopstix server runs a polling process, and a time-series DB that enables monitoring via a GUI.
  • Event set is fine-grained, but the overhead is 1% of CPU utilisation and <50MB/day of disk consumption on each node.
  • Can drill down to interesting parts of the vital sign graph, which lets you see what is running, including processes, EIPs, stack traces, etc.
  • Might be able to use intuition (e.g. high CPU use means busy loop) or better use a library of diagnosis rules which combines different vital signs (e.g. low CPU, high scheduling delay, high system CPU might mean kernel bottleneck).
  • Example of how Chopstix tracked down an elusive bug. Some nodes would freeze every few days (or not), with no information on the console, no printk messages from KDB, stalled SSH sessions and heavy I/O reports from vmstat.
  • Chopstix said blocking and I/O vital signs shot up. Looking at stack traces revealed that journalling daemon was trying to commit a transaction at the same time as many transactions were being started. Thought this might be a deadlock. However, I/O performance was stable and both resource-blocking and I/O vital signs showed drops back to a normal, low value. So this behaviour might not be responsible for the stalls and the freeze.
  • But the spikes in the scheduling delay, coincident with spikes in kernel CPU utilisation mean that a loop in the scheduler was actually responsible for the freezes.
  • Implemented as a library for generating events (i.e. making additional vital signs), an efficient sketch data structure with adaptive sampling, various optimisations for collecting large amounts of event data (e.g. stack traces, etc.), and a time-series database in 4000 lines of OCaml.
  • Resolved a number of PlanetLab bugs.
  • Evaluated performance overhead using lmbench microbenchmark. The slowdown is between 0% and 2.6% (for getpid). Also with kernel compile and apache macrobenchmarks, which show almost no overhead in the benchmarks (much better than oprofile, and managed to collect more data).
  • Probability of error is a function of the distribution of counters in a sketch. By varying the size of the sketch, the probability of error is between 0.0001 and 0.00001%.
  • Question: compare work with DCPI from SOSP 1997? Similar to oprofile, which this beats.
  • Question: is it possible to capture the thinking of a systems guru looking at this output by data mining? Various knowledge-base systems for mining source code, and they’re not perfect: you really need real intelligence to analyse these data.
  • Question: useful for developer as a kernel hacker, but how could it be extended to abstract the vital signs to a higher level and give signals of more abstract health, so that a system could say, “I’m sick, I need to see a doctor”? A lot of intelligence could be coded into rules, and the rule database could grow over time (to give a SysMD, comparable with WebMD) that allows diagnosis by less experienced people.
  • Question: how scalable can this be made if you want hundreds or thousands of metrics? Haven’t carried out these experiments. The <1% utilisation is between 0.1 and 1%, so there should be more room to squeeze in other metrics.
  • Question: in AI in the 1970’s and 1980’s medical diagnosis with AI was prevalent, but there was an explosion after 20 or so rules, so there was a move to other systems such as Bayesian filtering. One of the reviewers was excited about rule-based systems, and wanted it cited, but it was hard to find. Maybe that should tell us something.

Automating Network Application Dependency Discovery: Experiences, Limitations, and New Solutions

  • Enterprise networks are attracting research interest. A bunch of clients, a server farm, router, firewall and connection to the IP backbone. Various applications run on these networks, such as VoIP, which may be critical to the business. In a large network, there are usually thousands of applications running simultaneously, thousands of people are doing IT support and lots of money is spent thereon.
  • Management is complicated because talking to single server is rarely sufficient to implement an application (e.g. MS Office Communicator, a VoIP/messenger app, uses DNS, Kerberos, VoIP, Director and many back-end (file, SQL) servers), so the correct functioning of an application depends on the correct functioning of all these servers.
  • Extracting dependency information is hard. Applications are heterogeneous, in terms of functionality and deployment. The knowledge of these dependencies is distributed across layers and locations. Also, applications continuously evolve, adding new services periodically, and reconfiguring/consolidating others.
  • e.g. if a Kerberos service is down, you can no longer use Communicator, but you don’t know why this is. You would typically check the Communicator backend servers, and it might not be obviously to go to the Kerberos one. Dependency information would accelerate and simplify the support process.
  • Current solutions are based on human knowledge and understanding of the system and its dependencies. This is expensive, error-prone and difficult to keep up-to-date. Want instead to automate this.
  • Sherlock and eXpose (MS, SIGCOMM papers) are related, based on co-occurrent dependency discovery. Also Project 5 and WAP5 (HP Labs) extract execution causality paths.
  • Introduce a new technique to discover service dependencies based on delay distribution. Identify the limits of this based on temporal analysis. Evaluated on MS’s enterprise network (first thorough evaluation).
  • Design goals: a generic solution for many applications. So do passive packet sniffing. Unintrusive way of obtaining data, and only parse TCP/IP headers (no application-specific knowledge). Minimise false dependencies without losing true dependencies, but it is hard to recover missing true dependencies. So minimise the filtering of false dependencies.
  • System called Orion. The time delay between dependent services reflects the typical processing and network delay. So identify these.
  • Identify service based on IP address, port and protocol. This leads to too many service pairs (so ignore ephemeral ports). Also problem is that dependencies exist between application messages, so only rely on TCP/IP on headers and aggregate packets into flow. This reduces bias introduced by long flows, and reduces the number of pairs.
  • What about a lack of samples? Orion needs a fair number of samples to infer dependencies. The solution is aggregation, so clients have similar dependencies, applications may be hosted on a cluster of servers, and the same service may be hosted on several ports. Can aggregate these.
  • Use FFT-based filtering to remove noise in the delay distributions.
  • Deployed Orion in MS enterprise network. Looked at Exchange, Office Communicator, Source Depot (like CVS), Distributed File System (software binaries held here) and Web (intranet sites).
  • Evaluation criteria, want to see missed dependencies (false negatives), incorrectly-inferred dependencies (false positives) and reduction ratio. Orion has no false negatives, and far fewer false positives than Sherlock and eXpose (though all non-zero).
  • Showed an example of 11 dependencies for MS Office Communicator.
  • As we see more flows, the false positives increase, but false negatives decrease. True positives eventually converge to the correct value.
  • Limitations: Orion requires training, isn’t applicable to P2P applications, and may miss certain kinds of interactions (like periodic backup tasks). It may include false positives.
  • Temporal analysis is limited (as no application-specific knowledge), but false positives can be made manageable.
  • Question: why did filtering induce a small peak in the delay distribution at the in the rightmost bins? This was an artifact of the filtering, but doesn’t affect the result.
  • Question: in a data centre, may have services dependent on services and so on, so a client action may go through lots of servers indirectly, so is the resolution enough to identify this without deep packet inspection? Could definitely do deep packet inspection, but temporal analysis can be made usable.
  • Question: with the prevalance of virtualisation technology, services could be migrated between machines, so does the technique take that into account? Service migration is a potential application of this information: could migrate dependent servers together.

Work-in-Progress Reports

  • Multikernel: an architecture for scalable multi-core operating systems. Multi-core scalability of OSs is a problem, because of locking and shared data. The interconnect on large-scale multi-core systems looks a bit like a network, so use distributed systems ideas to make things scale. Traditional approach is reduce lock granularity and partitioning state; blocking operations are the norm. In a multikernel, have no sharing by default (everything either partitioned or replicated), and cores run an agreement protocol, based on message passing between cores. Long-running operations become split-phase, and sharing is re-introduced as an optimisation between some cores (e.g. betwen cores that share a cache). In a multikernel, the kernel is replaced with a per-core CPU driver, and a monitor on each core that run an agreement protocol. So if an application wants to globally unmap some memory, it will ask its local monitor to do that, the monitors will communicate among themselves, and will have consensus before the unmap is implemented by the CPU driver. Barrelfish project is looking at how to make OSs scale to multi-core and with heterogeneous channel. Might give the opportunity to handle partial failures (through fault tolerance algorithms).
  • Transcendent memory. Memory is cheap, but should it be wasted? Many petabytes of memory across the world are sitting idle and wasted, but it is costing us money in power, cooling, etc. Virtualisation has shown us how to improve CPU utilisation, but we have still treated memory as cheap. So revoke ownership of memory from VMs that hoard it, and better utilise chunks left fallow by fragmentation. Want to improve ballooning and look at better ways to share money between VMs. We know that unused memory is used as a page cache, so can we improve cache management? Sometimes the hypervisor has memory unassigned to any VM. Solution is transcendent memory. Collect all memory in the system into a pool, and provide a PV interface to that to give indirect access strictly controlled by the hypervisor. Requires changes to or paravirtualisation of the OS. Can use the memory as a second-chance cache, a fast swap device (working today), or inter-domain shared memory, or a filesystem cache. If it works in a VMM, could it be used in a single OS, e.g. for autocompression of transcendent memory.
  • Towards Less Downtime of Commodity Operating Systems’ Reboot using Virtualization. OS reboots are essential (for patches etc.), even when something like iTunes is updated. The downtime is long, however. Solution is to use a VMM-level approach, applicable to commodity OSs without modifying the kernels. Uses snapshot technology, which restores the OS execution state from an image taken in advance. Restoring a snapshot is 6x faster than an OS reboot. Using VirtualBox at present. On reboot, clone the target VM and reboot the clone in the background. Once the clone finishes booting up, take a snapshot of the old OS. Then copy the state of the old OS into the rebooted clone. Provides a much shorter downtime.
  • Texen: Virtualization for HTM-Aware Operating Systems. TM is a simpler programming than locks, and they believe OSs will be using HTM in future. Solaris 10 is rumoured to be using it, and there was the SOSP 2007 work on TxLinux. Problem: a VMM will violate the assumptions in HTM because they deschedule virtual CPUs. Opportunity: paravirtualisation is an opportunity because of the uniform I/O model they impose. HTM is a replacement for locks: speculate that it is safe for critical sections to execute concurrently and detect conflicts in hardware. If a VCPU gets descheduled, it might not be possible to detect a conflict. But since transactions and I/O don’t mix well, we can extend the ISA to deal with I/O in critical sections. TeXen is TxLinux + Xen. Uses HTM for synchronisation.
  • SnowFlock. How do we instantiate VMs to deal with flash crowds in the popularity of a service running in the cloud? Takes minutes to start up a VMs, by which time you’ve missed the crowd. SnowFlock can clone VMs very quickly (in less than a second) and it imposes negligible runtime overhead. And it makes it easier to program because there is an API for forking new VMs. Available as free and open-source software.
  • Toward Differentiated Services for Data Centers. Why do we care about differentiated services? This is a hot topic. Cloud computing should offer availability, reliability, durability and performance. We think that replication is the key to provide differentiated services. Targeting application developers, not end users. Objective is to build a read/write only storage system that provides different availability to users with better server resource utilisation. Want to handle dynamic and heterogeneous hardware and software. What is a fine-grained approach: let client pick number of replicas, replication algorithm, network topology, etc. A coarse-grained approach is differentiated replication, which uses a reputation system for nodes (based on past behaviour) to choose them in the system in order to match the user’s requirements. Implemented a prototype based on Chord/DHash on a cluster of 22 nodes. Looked at synthetic and real traces (including one from Microsoft based on PC failure).
  • TCP Incast Throughput Collapse in Internet Datacenters. Problem that happens a lot in typical datacentre workloads, like the shuffle stage in MapReduce, in web search, and many other scenarios. Requests are broadcast to many nodes, they process for a while, and then they come back simultaneously, causing congestion. No clear TCP or network level solutions. Throughput collapses for even a small number of senders. Look at sequence number and RTT dynamics. In 10-to-1 case there is huge variation and the absolute values become very large. Looking at TCP solutions, like breaking synchronisation, less aggressive exponential backoff, better SRTT and RTO computation, and minor modifications to an OS kernel. Also non-TCP workarounds at application, network and link layers, but they all have limitations. Open to suggestions. (Switches that cap buffers at the order of an RTT might be worth looking at. Could just add more buffer space.)
  • Gridmix: A Toolset for Benchmarking Hadoop. Motivation is to find useful benchmarks for keeping track of Hadoop performance, and track how this changes across releases and different deployments (clusters). Observe that different people have studied the performance from different perspectives. Need to understand the performance impact of a patch. Also need to validate the performance of new clusters. Come up with five different types of MapReduce jobs of different sizes and different types, which try to push a Hadoop cluster to different extremes. One use case at Yahoo! has been profiling the change as more releases have come out: show that the execution time has decreased with each new version (from 108 minutes on 0.15.3 to 46 minutes on 0.18.0). Also used for performance engineering.
  • CrystalBall: Predicting and Preventing Inconsistencies in Deployed Distributed Systems. Reliable distributed systems are notoriously difficult to build: hard to get out all the errors that arise from realistic environments (asynchronous networks, node failures, race conditions). Even with a provably correct algorithm, programmer errors still happen. Can lead to DoS, data loss, etc. These are typically violations of safety properties. Use increased computing power to have nodes predict the consequences of its actions in the background. Run a state exploration algorithm on a consistent snapshot of the live neighbourhood, to predict what actions could lead to violations of safety properties. Compared to state of the art in distributed debugging: classic model checking is intrinsically limited to how deep it can go; as against log- or replay-based/live predicate checking. CrystalBall is deep online debugging. It can also do execution steering to alter predicted-to-fail nodes’ behaviours. Results in online debugging (found 13 bugs with small performance impact).
  • Co-scheduling of I/O and Computation in Commodity Based Storage Clusters. Looking at Hadoop scheduler, and how to schedule computation near different data sets. Map tasks currently scheduled based on data locality of a single chunk. Concerned with scientific applications, where there are dependence between chunks: a Map task will have to work with chunks stored across multiple nodes. Applications include bioinformatics, which assume each string is as long as a chunk. Current solution is to pre-process data to move all chunks touched by a single Map task to the same storage location (with a MapReduce task). Then do a second MapReduce job to analyse and process data. This might have limitations for productivity and performance. Want to introduce a scheduler that will do this for us automatically. Will quantify local and remote I/O, and work out where best to schedule a Map task.
  • An I/O Serializing Disk. Problem is that random writes are becoming more dominant. But serial writes are getting much faster. mmap isn’t scalable due to the dirty-page write-back flushing thread schedules effectively random writes (even though the elevator is used), especially if datasets are larger than RAM. Want that to go faster, with a log-structured file-system-style approach. Get better locality for large, indexed files. Solution called Honor. Sorting + I/O Redirecting + Logging. Logging is a simplified write ordering technique, used in a transactional file system (Valor). Before a process dirties a page, it registers undo and redo pages in the transactional file system, and then use a sorting disk which serialises disk I/O and stores metadata about the off-loaded writes. Reads are redirected to the sorting disk to get most recent update. Implemented as one large, virtual block device. Eventually the writes are written back to the underlying disks.
  • Zeno: Eventually Consistent Byzantine Fault Tolerance. Designed for storage backends used today in datacentres, like GFS and Dynamo (or Facebook’s Cassandra, or PNUTS). Rely on replication of data and computation to get fault tolerance and availability. Get geographical diversity as well. Favour availability over consistency at present, and work with weaker consistency semantics. The crash fault model is not sufficient: can have state corruption… database bugs are >50% non-crash faults. The Byzantine fault moel is a better candidate. Requires 3f + 1 replicas to tolerate f faults, and existing protocols favour strong consistency. Existing protocols do not solve the availability requirements of storage backends. Pick consistency level on a per-operation basis: linearisable consistency (strong) or eventual consistency (weak). Example of a shopping cart with a network partition: reconcile state by taking union.
  • Zzyzx: Scalable fault tolerance through Byzantine Locking. Goal is to build services that can handle arbitrary faults. Start seeing greater asynchrony and a variety of faults. The new technique is Byzantine locking, which gets higher throughput and lower latency than prior protocols, and hence greater scalability. Challenge is maintaining a consistent order of operations. Normally lots of communication, several message delays, and lots of cryptography. In Zzyzx, let clients order the operations. The Byzantine lock cordons off a portion of a state machine, and can then issue requests to 2f + 1 replicas with a single roundtrip and you’re done. Not like leasing. Throughput scales better with the number of responsive severs (for f = 1); Zyzzva doesn’t speed up at all, due to a primary server bottleneck. Lets you implement arbitrary distributed services, with better throughput, latency and scalability.
  • Fault Tolerance for Free. FT services are too expensive to program, run and maintain. Multi-* systems are becoming common and cheap. So the difficulty of programming and maintenance is now the real cost. Lagniappe is a programming environment and runtime that targets high-performance, request-processing applications. Maps applications to multi-* hardware. The interface for adapation is the same interface for FT replication. If the mix of tasks changes, need to adapt to run on different machines etc. Lagniappe FT enables the addition of a FT protocol to applications specified this way. Works for request processing applications. Currently building more FT protocols.
  • Device driver reliability. Problem is that they are buggy and cause OS crashes. Driver = device datasheet + OS documentation/headers. Create a driver implementation from these, obviously introducing bugs in the process. All the knowledge of how to program the device is contained in the device and the operating system. If we could represent this in an explicit and formal way, we could automate the process. All we need is a formal spec of the device protocol and the driver/OS protocol. Example of a simple network device with a control and data register. Can combine protocol state machines into a single state machine automatically, and prune this to a correct implementation. Finally simple to translate this into C code for the actual driver. Actually done this for a USB-to-Ethernet controller driver. The device vendor has to provide a formal specification.
  • CitySense: An Urban-Scale Open Sensor Testbed. Building a 100-node wireless sensor testbed deployed on streetlights and rooftops across Cambridge, MA. Monitoring air quality, weather, noise pollution, road traffic, etc. (with various sensors). Intended to be open like PlanetLab so that anybody can use the data and program against it. Nodes are small, single-board computers with 256Mb of RAM, a CF drive, dual 802.11 radios with high powered 900MHz radio and standard 802.11a/b/g. Applications include public health, urban sensing, homeland security, novel distributed systems… open-ended. About 15 nodes deployed between Harvard and BBN, and want to span the gap between them along Concord Avenue. Challenges: wireless mesh doesn’t work! Need disruption tolerant networking. Need slicing and experimental control methods (can’t just run PlanetLab on them). And need automatic failure recovery as can’t go back up streetlights every time the nodes fail.
  • Wifi-Reports: Improving Wireless Network Selection with Collaboration. Problem is how to select between a number of commercial APs (are ports blocked? How is performance?) when we have little information about each. Can we annotate this database with more information. Rely on user-submitted reports. Could do automatic measurements upon connection and provide this information to other clients. Challenges: location privacy (reports should not be linked to users, otherwise users can be tracked) but want to avoid fraud: need 1 report per AP per user. Solution is a new e-cash-like reporting protocol. Also need location context, as performance will depend on the location of the user. Need to estimate different loss regimes with distributed measurements. Did a measurement study of commecial APs at Seattle hotspot locations: there are a huge number of APs, and very large variance in performance. Often the official AP is not the best one (should use the one next door).
  • S3: Securing Sensitive Stuff. Need a TARP program (securiTy Against Rogue Pilferers) program for the virtual economy. Need to prevent data theft. Want to regulate flow of sensitive data according to high level security policy. Look at information flow control systems, which ensure confidentiality and integrity, and limit covert channels. But they require a new OS, new applications and sometimes new hardware. Simple goal: prevent data being stolen from a network. So just need to ensure that data is not able to leave the network. Only need to interpose at the exit points (network access, peripherals, etc.). Could therefore track sensitive data flow below the OS without modifying the OS itself. S3 is a network of hypervisors to track and enforce sensitive data flow in an enterprise. User OS is diskless, with a filesystem maintainted in Dom0. Host OS stores the sensitive files. Users tag sensitive files with labels and policies. Shadow memory is used to track world-level taint. Sensitive data flow and computation is tracked using taint-tracking techniques. Use HW virtualisation support, and use speculative user OS execution.
  • Communities as a first-class abstraction for information sharing. Online social networks are a popular way for people to connect and share content. Increasing variety and volume of content being shared on these sites. First challenge is of privacy and access control. Second is the issue of relevance — what is relevant to a user? Think communities can help us here: groups of densely connected users. Members of a community often share interest and enjoy mutual trust. In practice, we can automatically infer these from the social graph. Today, access control is limited to friends (or subsets thereof) or the entire world. Communities are a natural middle ground, enabling sharing beyond just friends. As for relevance, search facilities return aggregated global option. Communities represent shared interest and can natually be leveraged to improve relevance. PeerSpective is a social network-based web search tool. Indexes browsed pages, and web search collects results from Google and from friends. Results aggregated by community, to allow community-based sharing. And also enables finding more relevant results.

Tags:

Comments are closed.