EuroSys
The European Professional Society on Computer Systems

European chapter of the
Special Interest Group on Operating Systems (SIGOPS)
of the Association for Computing Machinery (ACM)
Home Join or renew membership EuroSys for Students EuroSys for Faculty Job Offers Activities Systems Directory Systems Events and Blog Eastern Europe Initiative Member Area Member News Officers and Volunteers Useful links Press releases

OSDI 2008 - Day 1 - Session 1

DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language

Yuan Yu, Michael Isard, Dennis Fetterly, and Mihai Budiu, Microsoft Research Silicon Valley;Úlfar Erlingsson, Reykjavík University, Iceland, and Microsoft Research Silicon Valley;Pradeep Kumar Gunda and Jon Currey, Microsoft Research Silicon Valley

DryadLINQ is a distributed programming and execution engine. It is integrated into Visual Studio .NET. It automatically generates execution plans for a .NET program and runs the program in a distributed environment (cloud).

Its main component in is the distributed execution engine. DryadLINQ provides powerful abstractions for distributed programming, through the LINQ interface. The LINQ expressions are compiled into an execution plan which is executed in the cloud. For the distributed execution, DryadLINQ performs optimizations like pipelining, eager aggregation, dynamic aggregation, data-dependent partitioning, redundancy removal or I/O reduction. The distributed execution mechanism is similar to MapReduce. Between the map and reduce phases, it introduces a dynamic aggregation phase. DryadLINQ runs on top of Dryad distributed execution framework. DryadLINQ can efficiently execute distributed algorithms like PageRank.

DryadLINQ translates the .NET calls to its interface (e.g., ToDryadTable call) into LINQ expressions, which are then compiled into distributed Dryad execution plans (involving SQL queries over some input tables), which are executed by the Dryad engine. When Dryad completes a job, it writes the data into the output tables. The results are encapsulated in .NET objects and returned to the application.

Everest: Scaling Down Peak Loads Through I/O Off-Loading

Dushyanth Narayanan, Austin Donnelly, Eno Thereska, Sameh Elnikety, and Antony Rowstron,Microsoft Research Cambridge, United Kingdom

The Everest system takes care of the bursts in data center workloads by temporarily off-loading requests to a virtual store, during load peaks. The store layout is a circular log, optimized for short term writes. Everest runs at the low level, underneath the OS.

The data is periodically reclaimed in the background from the virtual store to the base volume. After the data is reclaimed to the base volume, it is deleted from the virtual store. Data is reclaimed when the peak is over.

Off-loading works well only for short time peaks. The challenge is to synchronize the offloading with the normal workloads, and to ensure data consistency.

Improving MapReduce Performance in Heterogeneous Environments

Matei Zaharia, Andy Konwinski, Anthony D. Joseph, Randy Katz, and Ion Stoica, University of California, Berkeley

The paper presents a new scheduler, called LATE (Longest Approximate Time to the End), for an open-source MapReduce implementation, called Hadoop. The contributions of the paper are essentially two. The paper describes how the assumptions on which Hadoop’s scheduler is based can break in heterogeneous environments, but also in homogeneous environments. Then, the paper proposes a simple improvement of the existing Hadoop scheduler and shows that the new scheduling algorithm (LATE) can improve the average response time by a factor of 2, compared to Hadoop.

The assumptions of the Hadoop scheduler, as presented in the paper, are: (1) it assumes that tasks progress at a constant rate throughout time, (2) nodes are equally fast, (3) speculatively reexecuting a slow task on a new node with a free task slot doesn’t disturb the other running tasks, (4) the progress score of a task is equal to the fraction of its total work that is completed, (5) tasks tend to finish in waves, so tasks with low progress scores are likely slow tasks, (6) different tasks of the same type (map or reduce) perform roughly the same amount of work.

All assumptions except (6) are broken in heterogeneous environments. To synthesize the problem with Hadoop scheduler, I would say it has two ma jor issues, which are address by LATE: (1) it spawns too many speculative tasks, that end up choking the other tasks, since in environments like EC2 tasks on the same physical host compete on resources (I/O, network), or preventing the system from starting new tasks by taking all the free task slots, and (2) the progress rate is too rigid, because it can mistake new tasks for slow tasks, due to the wrong assumption that tasks finish in waves.

They address these problems by speculatively executing the tasks that are estimated to finish the latest, rather than the tasks with smallest progress score (as Hadoop does). Therefore, instead of a task’s progress score, they estimate the progress rate of a task, as (progress score)/ (execution time), and then estimate the time left as (1- progress score)/(progress rate). The formula they use doesn’t address assumptions (1) and (2), i.e., they can still break, but they say this heuristic works well when the earlier phases of the tasks take longer than the last phases, which is the case with MapReduce. In consequence, I believe they just hide the problem with assumptions (1) and (2) by reducing the effects, instead of approaching it directly. However, they propose in future work section a direct approach — using machine learning to learn the capabilities of each node, in order to take more intelligent scheduling decisions. I would say that with this heuristic they directly address only assumptions (4) and (5).

They address the problem with assumption (3) by simply bounding (capping) the number of speculative tasks that are executed.

In conclusion, I would say they did a great job in presenting the drawbacks of the existing MapReduce task schedulers. The insights they provided already suggest possible solutions, which makes them very valuable. However they did a poor job in implementing possible solutions to the problems they mentioned. They implemented and evaluated only one heuristic, whereas they should have implemented and compared more heuristics, to show us why the heuristic they chose is the best they could find. Moreover, implementing the heuristic requires only small modifications in the Hadoop scheduler, because LATE seems to just slightly change some metrics — it just uses statistics about tasks (which were already available, I assume) in a slightly different way. Since it is so easy to slightly change the scheduler, they could have easily tried more heuristics.

Moreover, the future work section shows some interesting ideas that should have been explored: dynamic performance analysis using machine learning (to be used in scheduling decisions), or using a variance threshold instead of a percentile threshold. All these ideas could have been explored already; that would have made the paper more consistent. As of now, the paper is dry — its contribution was described in half a page.

Comments are closed.