EuroSys
The European Professional Society on Computer Systems

European chapter of the
Special Interest Group on Operating Systems (SIGOPS)
of the Association for Computing Machinery (ACM)
Home Join or renew membership EuroSys for Students EuroSys for Faculty Job Offers Activities Systems Directory Systems Events and Blog Eastern Europe Initiative Member Area Member News Officers and Volunteers Useful links Press releases

OSDI 2008: Day 3

Various Good Things

Difference Engine: Harnessing Memory Redundancy in Virtual Machines

  • Awarded a best paper award.
  • Trend towards consolidation in data centres. Want to consolidate large numbers of VMs on a small number of physical machines. e.g. Cloud, Enterprise, Thin-client computing. Also interested as virtualisation as a basis for testing: can test larger systems with less infrastructure.
  • Hurdles to consolidation: CPU utilisation is often highly bursty, and multi-core makes VM multiplexing more appealing. But machine memory is a hard limit on the number of VMs. Could just add more memory, which becomes expensive after some point, and has high power costs.
  • State of the art is identical page sharing (Waldspurger ‘02).
  • Opportunities for sharing beyond entire page sharing. Can introduce a second level of page fault (to memory) which is much faster than paging to disk.
  • Contributed set of memory management techniques to Xen: identical page sharing, similar page patching and in-memory compression. Explored a range of policies and mechanism. Did a detailed performance evaluation and found savings up to 90% on homogeneous workloads and 65% on heterogeneous, which is 1.6–2.5x what ESX gets.
  • Look at pesudo-physical to machine memory map. Identical pages, similar pages and cold pages. Identical pages map to the same machine frames, marking them both read-only and copy-on-write. Similar pages: one page is unmapped and metadata is stored, and the other page is mapped read-only. Write to one is trapped; read/write to the other is trapped. Finally, cold pages are unmapped, the metadata is tracked, and a smaller representation is stored in memory.
  • Implement a global clock for not-recently-used page selection policy.
  • Implement multiple hashes over page sub-regions to find similar page candidates.
  • Implement demand paging for physical memory overcommitment.
  • Global NRU clock by iterating over Xen’s P2M map, using referenced and modified bits. If recently modified, don’t choose as a candidate. If recently accessed (not modified), candidate for sharing. If no recent access, candidate for sharing or patching. If seen no recent access on several passes, then a candidate for compression as well.
  • Maintain two hashtables in Xen, one for sharing and the other for similarity. Request not-recently-modified pages from clock and insert hashes into the similary table, and a whole-page hash into the sharing table. On an identical hash, do sharing CoW.
  • For the not recently accessed pages, check the similarity table for candidate matches, calculate the best possible patch size and if there are savings then do the patching.
  • Identifying similar pages must be done without too much effort. Related work in finding similar web pages (Rabin fingerprints, min-wise hashing) but these are too resource intensive. So just do hashes over sub-page chunks and insert entries into the sharing hash table. If savings are more than 2kb, then do the patching. Two hashes over 64-byte chunks (policy question). If at least one chunk matches, calculate the patch and consider the overhead. This policy was often the best trade-off.
  • Demand paging because Xen assumes all physical memory is available. Difference Engine may require more memory than is available for a VM. Could end up overcommitted. Xen defers all I/O to Dom0, so implement a user-level swap daemon, communicating with Xen via an event channel. Swapd monitors total available memory monitor which will see a low watermark of available memory and do
  • Modifications to Xen 3.0.4. Available online. 15kloc for Difference Enginge. Additional 20kloc for patching (xdelta) and compression.
  • Evaluated with microbenchmarks (cost of operations), considered techniques in isolation to see relative benefits. Measured memory savings and performance effects for homogeneous and heterogeneous workloads. Compared to ESX with a scan of 10k pages/s.
  • 4 VMs, 512MB each, Debian 3.1 running dbench. Difference Engine achieves better savings than aggressive ESX at all points in time, though they tend to the same savings in the limit. While the benchmark runs, 1.5x more memory savings.
  • Heterogeneous workload: 3 512MB VMs running different OSs and applications. Win XP SP1 with Apache. Debian 3.1 running SysBench. Slackware 10.2 running dbench followed by IOZone. DE gets much better savings than ESX. Identical sharing gives most of the benefits, but patching also contributes a large proportion (less from compression).
  • Overhead vs. Xen was < 7%, and < 5% vs ESX aggressive.
  • Should be able to increase aggregate system performance. Ran RUBiS (eBay-style benchmark) with 4x 650MB VMs. Baseline of 4 VMs. Aggregate throughput increases most when using 6 VMs on Difference Engine (declines with 7). 40% more requests per second on 6 VMs versus the baseline. Before saturation, there is some performance overheads. Diminishing returns caused by paging to disk.
  • Why do this now? Driven by continuing increase in the gap between CPU processing and Disk I/O, especially with multi-core. Previously, the effort required was too great, and there were performance overheads. Page patching is more expensive than page compression. But overall sharing can improve performance.
  • Address Space Layout Randomisation is making it less pointful to do identical page sharing. Energy is a significant system resource: reducing memory footprint reduces both capital and operating expenditure.
  • Extending the approach to clusters, to allow VM colocation based on memory footprint.
  • Correction: the aggressive scan rate explains the delta — as actually capped at 500 pages per second per VM.
  • Question: did you look at the overhead delta between identical page sharing and compression and patching? Found lower CPU use but less savings. Believe this is a potential
  • Question: compression seems to give 5% savings at most… is it worth the complexity? Could flip the ordering and do compression before patching, which would make it yield more savings. Incorporating compression did not add significant complexity.
  • Question: keep a reserve of free pages in the hypervisor to service page faults quickly. How big is it, a percentage or fixed number? Currently a fixed number.
  • Question: in benchmarks comparing ESX and DE, did you have to do any synchronous paging or was the reserve enough? In the performance comparison carried out, paging had no significant effect. The 7% overhead included paging.
  • Question: if you do demand paging and identical page sharing alone, what do you think you would get? Not sure how to interpret memory savings? Interested in how much memory I need to get something done in a particular way? Results show that you can get a factor of 1.6–2.5x savings.
  • Question: 7% overhead seems like a lot, even though doing lazy compression and tricks like that. Have you done work to track whether the compression was worthwhile (or were you just going to throw it away anyway)? Global clock evaluated in the paper. 7% is worst case (in the applications we looked at). Know that there is more work to do in improving savings and reducing overhead.

Quanto: Tracking Energy in Networked Embedded Systems

  • Where have all the joules gone? When have they gone and why have they gone?
  • Looking at mote-class sensor networks. Wide dynamic range in power draw: 10uA to 10mA (sleep and active), with a 0.1–1% duty cycle.
  • Energy-efficient design has pervaded the research agenda: all innovations have energy as a first-class concern. But energy is measured in packets sent or time sleeping or bytes written. These are all crude approximates.
  • Want to measure energy itself. But this is challenging: collecting the results themselves, breaking the aggregate down into different components, and tracking activity on behalf of applications (and causal connections).
  • What is an activity? About connecting causal dots. An activity is a causally-connected set of operations whose distinct resource consumptions shuld be grouped together for accounting. Notion used in the Rialto OS, and also considered in Resource Containers.
  • Look at a toy example: the hello world example from TinyOS — blinking LEDs. First want to slice the energy usage by device (LEDs, CPU, etc.), and by logical activity (blinking the red LED, blinking the blue one, etc.).
  • Measurement is challenging. Wide horizontal (timescale) and vertical dynamic range. The particular mix is very application dependent. Vertical: 1uW to 50mW for different components.
  • Quanto uses iCount energy meter (IPSN paper). A microcontroller and counter attached to a voltmeter, switching regulator, a couple of capacitors. Simply count the cycles of a switching regulator.
  • Slicing challenge: breaking into parts… could add sensors to each domain, which is cumbersome and inaccurate. Not all energy sinks can be instrumented like this: variable voltages e.g.. Borrow notion of power state tracking: when a device’s power state changes, track that. Instrument device drivers, which export their power state through a very narrow interface making it simple to track changes. Estimate energy breakdowns with regression: snapshot the system-wide power states, the global energy use and the system clock for each transition. Generate a linear equation, which can be solved using weighted multivariate least squares.
  • Activity tracking challenge: there is a gap between what is measured and what matters. Previously people have looked at breakdown by subsystem, application, EIP and PID. We care about activities. Want to know why energy is being spent, and activity tracking answers this (gives us meaningful resource principles).
  • First annotate any abstraction which associates a label with an execution. Label is (origin-node, activity-id) pair. System software propagates these lables across subsystems, nodes, etc.
  • Causally-connected actions are painted with the same label. Initiate annotation with a runtime call to CPUActivity.set(LABEL_NAME).
  • Hard part. Issue of deferred computations: CPU might post a task (deferred function call), or put an object in a queue which will be dequeued by a worker. Modify task scheduler to add an activity field which is set on task posting and restored on task invocation. Queue items are tagged with an activity label as well.
  • Node-to-node communication: packets are tagged with a hidden activity field. Problem: every interrupt causes energy consumption before you know to whom to attribute it. Proxy activities are an ephemeral label, carried with a computation that hasn’t yet been identified, but which is later set when you know what label should go with it.
  • Concurrent activities on shared devices are described in the paper.
  • Look at a 48-second run of Blink. Compare oscilloscope ground-truth to the results from Quanto. R-values of 0.99 or better. Very small differences in the measured current per device.
  • Works well, but how much does it cost? Changed 38 files in the core OS, and added 13.
  • Space cost is 12 bytes per energy or activity sample, with an 800-sample RAM buffer.
  • Read system time in 19 cycles, read energy in 24 cycles and log a sample in 102 cycles (CPU running at 1MHz). The logging Blink app using 0.12% of CPU time, and 71% of CPU active time (CPU is almost always asleep, so using any CPU looks like using a lot of it). Energy cost is 0.08%, 0.41mJ. Quanto can monitor its own energy usage.
  • Can look at where have all the milliseconds gone? Noticed that CPU does 2x work for red LED as for green, and 2x for for green as for blue.
  • What’s the cost of false alarms in low-power listening? Send a preample that is longer than the listen-check period, to ensure that the receiver will hear it and wake-up. Could have noise that wakes up the receiver, wasting energy. How much, and how bad is it? Tried different channels and saw that channel 17 uses much more energy than channel 26, due to a few false positives. Supposed to be 2.2% duty cycle, but false alarms cause this to raise to 5.6%.
  • Does it work over the network? Wrote a ping-pong application between two nodes, and showed how the energy use propagates.
  • Found some bugs in TinyOS using power profiling.
  • Some limitations. Energy metering requires hardware support and calibration. Energy breakdown makes several assumptions. Activity tracking doesn’t support reentrancy.
  • A very simple bit of hardware plus a few lines of code lets you measure, slice, dice and track energy usage.
  • Future work is: deploying on new platforms, scaling from 2 to 1000 nodes, rolling out to the community. Research direction is to explore new frontiers in energy efficient sensor networks.
  • Question: some of the applications shown seemed fairly simplistic, so how would it scale with more complex subsystems, and how would it work? Saw too much data coming off the nodes. Rectified this problem and now do compression of the log data. Now looking at larger applications.
  • Question: most of Quanto did energy analysis offline, and benefits would come if you could do it online (and make decisions based upon it) so how do you think that you could adapt it to do it online? If you can do the regressions online, you can keep counters rather than all the log data, which would improve scalability. But wouldn’t that greatly increase the overhead? You would do regressions periodically, and future work will address that.
  • Question: you spend some effort in the device drivers to model the device state and find out the appropriate transitions etc. Do you want things from the hardware manufacturers to make it easier to extract this? Device drivers export power state information. If hardware had APIs to extract that information, that would be great. The fact that this is a small embedded system makes this easier; might be harder to do for something like a disk drive (and you might actually want to run Quanto inside the disk drive because of all the differnent activities going on in there).

Leveraging Legacy Code to Deploy Desktop Applications on the Web

  • Xax takes legacy code and uses it to enhance web applications.
  • Probably noticed that we are in the Web 2.0 world with AJAX doing interesting things in the browser. Gives a similar experience in the browser to what you used to need a desktop application for.
  • Microsoft is building Office in JavaScript. Google is doing something similar. Why?!
  • One reason is location and OS independence: can use it anywhere. Also have a strong notion of isolation: would be insane to run arbitrary code from a website, but web browsers let us do this (safely).
  • Write applications in a virtual execution environment. JavaScript (and Java and Flash) is type-safe which is safe but prohibits non-type-safe legacy code, typically written in C or C++ with unsafe pointer aliasing and unions, etc.
  • There are decades on knowledge in C/C++ code. Some components have multiple implementations (GUI framework, XML parser). Some are shipped with browser (JPEG decoder) and included in the browser to be used by the type-safe code. Heavy tail of valuable legacy code that isn’t ubiquitous (like a speech synthesiser, a RAW decoder for images, or GraphViz). They tried to port GraphViz to Java, but it was too much work to port and maintain it: but it’s still being maintained in a legacy application.
  • Could split application, putting graphical part on the client and run the legacy part on the server. Don’t get scaling that we would like to see and also introduces latency.
  • Could just run it in a client process (what ActiveX does). This is plainly wrong. Difficult to isolate the code effectively: a large surface inside the OS which is vulnerable to privilege escalation. Moreover, it isn’t OS independent. ActiveX is pretty deeply dependent on the Windows OS running underneath it.
  • No web-app framework provides legacy support, security and OS independence.
  • Likely you already have Firefox JS container and a Flash container… to these add a Xax container. Has four properties: an OS process with two features (MMU isolation from the browser and other applications, and no system call interface (a tiny hole that plumbs it to the Xax monitor). Use a device driver that disables the system call table for the “picoprocess” and plumbs all calls through to the Xax monitor.
  • The Xax monitor can give it more memory. All other services via the browser (so plumb the Xax monitor through to the browser). The Xax process looks like a remote web server: make an HTTP request to the Xax monitor and picoprocess.
  • A simple reduction argument about why it’s secure: no libraries are being shipped for the browser. If you patch a bug in the browser, it’s patched for Xax.
  • Added a platform abstraction layer on top of the minimal syscall interface, running inside the container. Not part of the TCB, but delivered with the monitor. Lets the application see a common interface, which gives OS independence.
  • Also get native CPU performance: no software overhead, limited only by the quality of the developers compiler and the hardware.
  • How do we get legacy code working in this environment? Will it really work for legacy code? Surely real programs make lots of syscalls! But it’s just a porting task: libraries and applications are portable, and the browser can be our kernel. C is portable!
  • Showed it working for a simple rasteriser. And then for OpenGL. And then for GhostScript. (Probably got the Adobe Reader, which is clearly part of the TCB, going by some of the privilege escalation bugs that have been seen in the past.) Putting GhostScript in Xax removes it from the TCB.
  • Got to the point where porting was fairly routine. Not much more than a few changes to the Makefiles.
  • Wrote a Python program using GraphViz to visualise IMDB. Ran the program on the desktop and it called 2700 syscalls to 29 different interfaces. To port, needed to disable static dependencies (e.g. on pango text processing library, might not be relevant; also X dependencies, so ask for an SVG file rather than an X window), restricting interfaces, rejecting unneeded syscalls (can turn off signal handling, for example, which Python uses), emulating syscalls internally (no filesystem, so ship data to Xax container in a big tar file) and letting the browser supplying remaining I/O (write Python bindings to JavaScript, reach into the DOM from Python, which gets sent to the JS in the browser).
  • Support 21 libraries and 3.7 million lines of legacy code. TCB is < 4500 loc. Xen, Java and Flash are 1–2 orders of magnitude bigger.
  • OS independence? Each demo came from Linux-based sourced code compiled with gcc to an ELF binary, but ran on Windows. Also taken Visual Studio-compiled binaries and run them on Linux.
  • Performance? Context switch times are ugly, especially in Linux, which uses ptrace to block system calls. Haven’t fixed it yet because overall performance doesn’t depend on context switches (do a large amount of work then ship bulk data to the browser). Full quantiative analysis in the paper.
  • Xax is secure and can increase security by subsuming existing trusted extensions (e.g. Acrobat Reader). It’s fast, deployable and portable.
  • Question: are there any limitations of this work? What would suck really bad? There is nothing you cannot do! Don’t want any special interface so that the security is never less than the browser. Performance limited by e.g. sending compressed PNGs of the OpenGL demo. If you have low-level blitting and sound interfaces from the browser, you could do better. Maybe put the browser renderer in Xax, or the Flash container to do video. Not sure how to use the GPU directly.
  • Question: the porting effort was stunning… presumably the chunk cut out must be huge. What portion of the lines of code are actually being touched? Certainly are pieces not being used. Don’t think it’s just 10%. Most of the machinery in OpenGL, e.g., are being used. Just chopping out the shim that talks to different interfaces. For each library, get most of the value out of it. Haven’t quantified which parts of the count count for the parts that will be used.
  • Question: what types of legacy code do you see being ported to this? What if it wants to do something a web browser doesn’t inherently support (e.g. opening a socket)? Not database servers, certainly. What about IM, Skype, etc.? Things like direct network access and disk access are orthogonal to whether the legacy code can be run. What should web browsers allow applications to do? Look at Google Gears, which adds a facility to talk to the local disk, and Xax could use this too. There are policy questions about what applications should be able to access. Would like to just port Office.
  • Question: existing frameworks like Flash have a fairly complicated security model, including whether Flash instances can intercommunicate, and also look at resource usage to prevent starvation, so there is a lot of infrastructure to limit the abuse that is possible. Have you thought about how to limit this? Not doing any resource constraints in the existing application. Most commodity OSs allow the monitor to ask the OS about this, and then throttle it.
  • Comment: for all graduate students who add features to the API to support their feature, this work shows how reducing the interface is possible and desirable. Thompson and Ritchie were right.
  • Question: could you do reasonable work to have communication between Xax silo? Will look at that, but haven’t yet.
  • Question: shouldn’t you just use a use-once VM? It’s extreme paravirtualisation. Should it just run on hardware directly? Considered that initially, as it makes porting easier, but you have a lot of heavyweight machinery, which you would want to optimise away and end up with something looking very like Xax.
  • Comment: chosen a different API, but there is an ABI as well, and could these concepts be applied by an ABI expert to turn a browser into a hypervisor.
  • Question: performance-related… how big is the tarball that you put together to run the legacy code, and how long does it take to deliver? Deliberately didn’t worry about that. Some of them were pretty big: didn’t try to optimise, so several megabytes. But we know how to cache and do differential compression, so there are plenty of techniques to make this less of a problem. Not a fundamental question about whether the technique will work.

Wide-Area Distributed Systems

FlightPath: Obedience vs. Choice in Cooperative Services

  • P2P is an increasingly popular way to deploy services, like internet radio, internet TV and file-sharing. Hard to ensure that every participant is following the protocol faithfully. Should tolerate Byzantine and Rational peers (e.g. free-riders).
  • Deal with rational participants informally by providing incentives and punishments, but there’s no formal basis for this.
  • BAR gossip is a P2P system that uses Nash equilibria to give a rigorous framework, at the sacrifice of performance.
  • Look at approximate equilibria as a trade-off between the two.
  • Epsilon (e) Nash equilibria. e = 0 ensures obedience is in the best interest, which offers few choices and hurts adaptability (to churn or transient failures). Cripple P2P to achieve game theoretic goals.
  • e >0 provides more choices and makes it possible for systems to adapt. Retain the rigour of a formal approach but also gives the opportunity for choices.
  • Compared jitter of FlightPath to BAR Gossip: much less for FlightPath.
  • Strict Nash approach makes it hard to reason about the protocol and encourages micromanagement. Increasing e improves this.
  • Application is P2P live streaming.
  • Basic trading protocol. Source sends stream updates to clients or peers. Peers run gossip protocol to disseminate updates between themselves. Updates are useful for finite time, the membership is static and everyone knows full membership lists. Everyone has public/private key pairs.
  • Measure jitter and average and peak upload bandwidth. Jitter is % of rounds when a peer failed to deliver everything it was supposed.
  • Assume rational peers derive benefit from delivering a jitter-free stream.
  • In each round, select a partner, exchange histories, and trade equal numbers of updates. Peer puts updates in a locked briefcase and sends it to the partner, along with a “promise” that the data is legitimate (from BAR gossip). Then exchange keys. This does not achieve no jitter 20% of peers get at least 4% jitter. For 200kbps data stream, use 100kbps extra bandwidth. Peak bandwidth is much larger than 512kbps (regular broadband).
  • Choices: avoided overloaded peers, avoiding departed peers, trading more when in trouble, erasure coding updates, preferring older updates to newer ones, etc. Made possible by shifting to e-Nash.
  • Need to temper choice with obedience. Allowing to choose partners permits more attacks on the system and allows rational abuse. Look at how we get controlled choice in selecting partners.
  • A membership list is divided into log N bins. A has a seed for round r, which is fed into a random number generator. Can further constrain partners by putting a constraint on the hash of the partnership pairs. Limits the number of Byzantine peers you could interact with.
  • This gives very low jitter, reduces overhead in average bandwidth and caps peak bandwidth.
  • Extending to dynamic membership, need to propagate membership changes. Divide rounds into epochs, tracker sends list of members to source, and the source turns the list into updates, sent out to trackers in future epochs. To join, talk to a tracker, and join the list for a future epoch, but causes some latency before you can begin trading. Let peers immediately begin trading by organising peers into tubs ordered by when they join. Trade with peers in the same tub and earlier tubs with some probability.
  • Evaluated on 400 emulab clients. Rounds of 2s, epochs of 40 rounds, 200kbps stream and 10 round deadline to update stream. Much less jitter than BAR Gossip. Impact on average bandwidth when nodes join is manageable and the tubs mean that this impact is fairly invariant (and doesn’t grow) as more nodes grow. Some jitter caused when nodes leave: 75% need to leave before the remaining ones are jitter… caused by the choice in partner selection. As churn per minute is increased, jitter gets worse, but nodes that have been in the system for longer are less affected than new arrivals. Jitter with attack only becomes bad (<4% nodes affected) with 16% malicious nodes. Several attacks explored, but this one hurts the most. FlightPath handles 10% of the system acting maliciously.
  • Based on the assumption that rational peers follow the protocol. Can have a utility function, based on jitter events per minute and the upload bandwidth used. Epsilon is defined as max utility gain from cheating over the utility from obeying. The maximum upload savings possible over the (expected benefit:cost ratio minus 1). Cannot provide small epsilon where the benefit to cost ratio is less than 3.36. For bigger benefit:cost ratio, can provide very small epsilon. Want e close to 0. Can think of this in terms of how much videos cost, and the cost of bandwidth.
  • The analysis provides rigour. Gives P2P live streaming with good stability, and tolerance to Byzantine and rational nodes, and churn.
  • Can join a live stream online.
  • Question: did you look at collusion between Byzantine attackers? For collusion between attackers to bring down the system, looked at all attackers acting at the same time, that was shown in the experiment. As for collusion to extract maximal data, the most effective ways to collude are a “beachhead” in which one peer is part of the system, and everyone else is outside, receiving forwarded data from the participant, but not giving anything back. Also a problem of reclusive nodes, but would need at least 30 nodes acting together to be a problem.
  • Question: challenge the assumption that we want equal contribution from all peers. Studies showed that people had great diversity in the resources available to them, and we should take advantage of the ones with an abundance of bandwidth. How well would you do without the tit-for-tat assumption? Currently working on dealing with network heterogenity, and it would be nice to do that, but there is a danger from the game theoretical perspective that adundant nodes would incur more cost, and it’s hard to incentivise this. Game theory is difficult.
  • Question: have you given thought to looking more at general content distribution (where utility stops being a step function based on a deadline and jitter)? It would be interesting to see how you could do bulk file distribution on a rigorous footing, but lots of this work wouldn’t apply, because we have focussed on timely delivery.

Mencius: Building Efficient Replicated State Machine for WANs

  • Replicated state machines are a widely-applicable approach. Got some servers and several clients. All servers will agree on the same result, using repeated consensus algorithms. Existing algorithms such as Paxos and Fast Paxos, etc.
  • Why wide-area? Want reliability and availability. Want to cope with the correlated failure of a colo due to fire, earthquake, etc. Also for consistency: would enable the coordination of wide-area applications.
  • Are existing protocols efficient enough in the wide-area? If not, how can we improve it?
  • Wide-area is more complex than simply not local area. Depends on distribution of servers and the distribution of clients. Fast Paxos is slower in terms of latency than Paxos with wide clients and local servers. Interested in the wide by wide case.
  • Assume up to f of n servers can fail by crashing. System load can be variable from each site, can be network-bound (large requests) or CPU-bound (small requests). Want low latency under low client load, and high throughput under high client load.
  • Paxos is a leader-based algorithm: one distinguished server. Leader gets a request, proposes it to the replicas, which accept it, then send a result back and the replicas learn the request. Low latency at the leader (2 steps), simple and with low message complexity. But high latency at non-leader replicas. And load balancing problem.
  • Fast Paxos in steady state: acceptances are broadcast among replicas, which saves a round. Load balanced and low latency at all servers under no contention. Under contention, risk of collisions, resulting in additional latency under contention. Also high message complexity, and requires more replicas.
  • Want a new algorithm that combines the pros of both. This is Mencius.
  • Mencius derived from Paxos: rotating leader, variant of consensus, and optimisations. Ensures safety (derived from existing protocol) but has a flexible design (allowing others to derive their own protocol).
  • Rotating the leader. Each consensus instance is assigned to a coordinator server. The coordinator is the default instance leader. Could use round-robin to assign this. Assignment scheme needs to be well-known. Server proposes client request immediately to next available coordinated instance (don’t wait for other servers). Server only proposes requests to clients it coordinates.
  • Benefits: all servers can propose requests directly. Leader is no longer the bottleneck (load balancing). Balanced network communication pattern.
  • Challenges: ensuring liveness while maintaining efficiency. Servers may observe different client loads, but requests can’t be committed until seeing all prior requests. Servers can skip their turns (propose a no-op) which leaves no gaps in the history. Challenge now to cut down the cost of skips: use simple consensus (restricted form), in which a coordinator can propose either a client request or a no-op, and non-coordinator can only propose no-op. So no-op can be learned in one message delay if the coordinator skips, and it is easy to piggyback no-ops to improve efficiency (essentially no extra cost).
  • If failure, faulty servers cannot do skips, so the solution is revocation in which non-faulty servers propose no-ops on behalf of faulty servers. Can optimised by revoking in advance to reduce delay, and revoking one large block at a time to amortise cost.
  • Mencius achieves all the good properties of both Paxos and Fast Paxos.
  • May have delayed commit which adds up to one round-trip extra delay (if two people propose at onces). Can have out-of-order commit if requests are commutable. This is a benefit of using simple consensus, and reduces latency.
  • Experimental evaluation: Mencius versus Paxos. Implemented in C and C++. Clique topology for three sites and a simple replicated register service. k registers: Mencius-k with rho bytes of dummy payload (rho = 4000 was network-bound and rho = 0 was CPU-bound). Wanted commutable requests (updates to different registers). Out of order commit option, and clients send requests at a fixed rate.
  • Maximum throughput for Paxos is 400 operations and latency goes to infinity. Mencius can achieve better latency, and doesn’t max out so quickly.
  • Mencius has a 200% improvement in throughput when network bound, and up to 50% improvement in the CPU-bound case. Enabling out-of-order commit hurts throughput.
  • Throughput measured under failure. Have time between when a failure is suspected and it is reported. Small decrease in throughput when a failure of a follower is initially suspected with Paxos. Big drop in throughput when Paxos leader crashes. Stall while leader fails, and also caused by duplicates. Mencius has a temporary stall when any server fails, then recovers quickly when the failure is reported. There is a peak after reporting caused by delayed commits. Then throughput decreases by the proportion to be expected by the decrease in resources, but still better than Paxos.
  • Look at fault scalablity by increasing the number of sites. Looked at 3, 5 and 7 sites. Mencius gets better throughput as the number of sites increases, whereas Paxos drops off.
  • More results in the paper (showed lower latency at all servers under low contention, and adaptability to available bandwidth).
  • Question: in the diagram of the recovery from failure, why was the throughput lower after failure? Mencius can use all links to improve throughput. 3 servers before the failure and 2 afterwards.
  • Question: what happens when a client contacts multiple coordinators (with rotating coordinators)? Best thing to do would be to contact a single server (unless it fails). So you may have to send requests to multiple failures, so replicated state machine would have to handle idempotency of requests.

Comments are closed.