EuroSys
The European Professional Society on Computer Systems

European chapter of the
Special Interest Group on Operating Systems (SIGOPS)
of the Association for Computing Machinery (ACM)
Home Join or renew membership EuroSys for Students EuroSys for Faculty Job Offers Activities Systems Directory Systems Events and Blog Eastern Europe Initiative Member Area Member News Officers and Volunteers Useful links Press releases

Archive for the ‘Middleware 2008’ Category

Middleware 2008 - Day 3

Saturday, December 6th, 2008

This is Middleware’s last day.

Again, these reports are my interpretation of what I got from the presentations. I do not claim that they are correct or that they reflect the intention of the authors. For some presentations I got a better understanding than for others and it will show in how coherent each report is. In all cases, read the proceedings for the real deal.

I skipped reporting on those papers for which I did understand the presentation well enough. In fact, since this is the last day and there are relatively few presentations, I will report only about the ones I liked the most.

Subscription Subsumption Evaluation for Content-Based Publish/Subscribe Systems

Hojjat Jafarpour et al. University of California at Irvine

The goal of this work is to prevent dissemination of redundant subscriptions in a content-based pub/sub system by using the concept of subsumption.

The main to approaches to solve this problem are covering and subsumption. A subscription S2 covers S1 if all messages that would be delivered to S1 are a subset of the messages that would be delivered to S2.

Subsumption extends this concept to a set of subscriptions. A set of subscriptions S1,S2,…,Sn subsumes Sx if the union S1 U S2 U … U Sn covers Sx.

The proposed system in this work targets a system whose content space is a d-dimensional space of numerical attributes, subscriptions are d-dimensional rectangles, and publications are d-dimensional points.

The basic problem is to determine if a new rectangle S is contained in the union of a set of pre-existing d-dimensional rectangles.

To solve this problem, the authors partition the content space into positive and negative spaces. The positive space is the parts of the space covered by at least one existing subscription, and the negative space is the complement, that is, the part not covered by any existing subscription. A new subscription is subsumed if its intersection with the negative space is empty.

They represent the negative space as a set of non-overlapping d-dim rectangles. If a new subscription intersects with any of these rectangles, it is not subsumed. For efficiency, the set of negative rectangles is indexed using a tree structure. For N subscriptions in a d-dimensional space, the basic algorithm generates O(N^d) neg-rectangles. This, of course, does not scale for high-dimensional spaces.

The alternative proposed by this work is to use an approximation algorithm and restrict the number of new negative rectangles added for each new subscription to be less than a constant K. This results in at most O(NK) after N active subscriptions. Their algorithm is guaranteed not to leaad to false negatives, although it may generate some false positives, but in this case correctness is not compromised.

To select the K rectangles to add, they propose to choose the top-K rectangles obtained with the basic algorithm that optimize a benefit/cost model. Subscription forwarding in this system is very simple. When a new subscription arrives, If the set of intersecting negative rectangles is empty, the subscription is subsumed and not forwarded. If not empty, it is forwarded and negative rectangles are recalculated.

The evaluated the system in a simulation environment with 10,000 subscriptions; 2,3,4 and 5 dimensional spaces with each dimension having a range between 0 and 1000 and with actual ranges for subscriptions taken from a Zipf distribution. They compared subsumption with covering and obtained an improvement of more than 50% in redundant subscription detection. They also found out that, while larger values of K result in greater reduction of redundant subscriptions, varying K from 50 to 100 does not yield much improvement.

They also compared optimizing benefit/cost against optimizing benefit and 1/cost alone and confirmed that benefit/cost results in a lower number of neg-rectangles and better detection of subsumed subscriptions.

Dexter - An Extensible Framework for Declarative Parameter Passing in Distributed Object Systems

Eli Tilevich et al. - Virginia Tech

This work presents an extensible framework for declarative parameter passing semantics for RMI-like Java distributed systems.

RPC/RMI-like systems work really well on a LAN, but one problem they have is that parameter passing semantics are not flexible. For example, in Java RMI, parameters that implement the marker interface Remote are passed by reference, while parameters that implement the marker interface Serializable are passed by copy.

This is problematic when one wants to pass objects of the same class using different semantics. For example, one may want to pass an object with a small sequence of strings by value, and another one by reference, and one may not be able to modify the code because it is part of a third-party library. Moreover, sometimes, one may want to use semantics that differ from the two provided by the system, especially in high latency and low bandwidth environments. Another problem is the fact that sometimes it is difficult to understand the application without examining the entire logic, especially when there are many subtypes.

This work proposes a declarative remote parameter passing scheme called Dexter for Java based on Java 5 annotations. For example, the following method declarations can be possible:
align(@RemoteRef list all, @Copy list candidates)
mutate(@CopyRestore list data)

The principle this system is based on is to treat parameter passing as a cross-cutting concern and implement it in the application space. Their framework combines aspect-oriented and generative techniques, and exposes the invocation context before and after call both on the client and the server using interceptors.

To implement new parameter passing semantics using the interceptor-based scheme, a developer needs to implement an interface called InterceptorPoint. To test their system, the authors implemented several parameter-passing semantics.

The first is Lazy semantics, which works by first passing the object by reference, and then passing it by value upon first use. This is useful in asynchronous distributed environments such as P2P applications.

Another implemented semantics is CopyRestore, which efficiently emulates the effect of local call-by-reference parameters. This is applicable for single-threaded clients and stateless servers.

Finally, they implement CopyRestore with Delta semantics, which is similar to CopyRestore, but instead of sending the entire object graph back, the server only sends back the changes. This provides performance gains proportional to the size of the graph.

The authors conclude that this system is more expressive and easier to read, maintain, extend and reuse than the standard schemes traditionally provided by RMI systems. Another advantage is efficiency as no transformations take place before the parameter is used in a remote way.

The overhead introduced by Dexter is dominated by the fact that the call is remote, as the latency of remote calls is orders of magnitude higher than for local calls.

Guido Urdaneta

Middleware 2008 - Day 2

Friday, December 5th, 2008

Again, these reports are my interpretation of what I got from the presentations. I do not claim that they are correct or that they reflect the intention of the authors. For some presentations I got a better understanding than for others and it will show in how coherent each report is. In all cases, read the proceedings for the real deal.

I skipped reporting on those papers for which I did understand the presentation well enough.

Profiling and Modeling Resource Usage of Virtualized Applications

Lucy Cherkasova et al. HP Labs

Virtualized data ceners have benefits such as lower hardware and energy costs through server consolidation, capacity on demand, agile and dynamic IT.

But there are many challenges. Applications are characterized by a collection of resource usage traces in a native environment, and the overhead caused by virtualization is difficult to figure out, as well as the effects
of consolidating multiple VMs in one host. These issues are important for capacity planning and efficient server consolidation.

Virtualization overhead has been measured in many papers, but none provides a methodology to predict it in a general way.

The authors observe that in most virtualization environments (such as Xen), the I/O is run by a domain separate from the virtual machine where the application runs. Since I/O is critical ni most server applications, any prediction methodology must predict both the resource usage by the I/O domain and the resource usage by the application itself within the VM.

The approach used for prediction uses an automated robust model generation scheme. First, some microbenchmarks are run on both native and virtual platforms. These benchmarks perform a range of I/O (net and disk) and CPU intensive tasks. Traces are gathered from these runs and used to generate a linear regression model of the relatioship between native and virtual executions. The generated models are specific to the platform, but not applications.

This general model can then be applied to any application traces to predict its requirements in a virtualized environment. The microbenchmark suite used in the experiments is based on well known, easy-to-run benchmarks like httperf and Apache JMeter, among others.

The models use in principle 11 variables (3 cpu-related, 4 net and 4 disk) and predict CPU utilization in the VM and I/O domain. A robust regression technique eliminates outliers, while stepwise regression eliminates irrelevant metrics.

The methodology was tested on two hardware platforms (2-way AMD Opteron 2.6GHz 64bit, and 4-Way Intel Xeon 1.6 GHz 32bit) with two applications: RuBBIS, which simulates an auction site and TPC-W, which simulates an e-commerce site. The hypervisor used was Xen. Monitoring was done with systat in the native environment, and with xenmon and xentop in the virtualized Xen environment. Measurements where every 30 seconds.

The results show that the model is very accurate for predicting virtualized I/O domain and VM performance, with 90% of I/O Domain predictions within 4% error, and 90% of VM predictions within 11% error.

They also show that that for different hardware platforms, the virtualization overhead is different, with the intel 4-way platform having an overhead up to 1.7, while the AMD 2-way platform has up to 1.4 overhead.

This model could also help assess and compare performance overhead of different virtualization software releases.

In the future, they want to try different model structures (not only linear regression), use VMware instead of Xen, try to find a minimal set of microbenchmarks to build the model, and evaluate virtual device capacity.

Prism: Providing flexible and fast filesystem cloning service for virtual servers

Atul Prakash et al. University of Michigan

This work is about an efficient approach to cloning of file systems. This scheme is designed to take advantage of the fact that multiple users share a significant amount of data and services such as backup and malware scanning.

The main goals of this project are to create a file system called Prism to that provides isolation (a separate filesystem to each user), efficient cloning (rapid creation of a new file system from an existing one), high performance (as fast as typical file systems), and efficient centralized services (malware scanning, etc.)

Compared to alternative file systems, Prism offers a number of advantages. For example, versioned file systems have no isolation; writeable snapshots (ZFS, Flexcone, others) provide no gain in centralized services and make it difficult to exclude specific files or directories; cloning via namespace manipulation (unionfs, alcatraz, etc.) adds translation costs, only clones a base file system (not clone of a clone) and gives private files to each user.

Prism is based on the ext3 linux file system and can be mounted via NFS, Samba or VM guests. It can create clones, manage permissions, and provide centralized services.

Prism uses a technique called asynchronous cloning, as opposed to synchronous cloning. Synchronous cloning refers to a system where the clone is usable only after the cloning operation finishes. It is possible

to exclude parts of a file system according to policy, and file-inodes are shared with copy on write. In this case, cloning time depends on the size of the file system and can be relatively slow (in the order of minutes). Asynchronous cloning, on the other hand, provides a usable clone instantaneously, while some actual cloning occurs in the background. The main challenge for asynchronous cloning is to provide the same semantics as synchronous cloning.

Prism runs two threads. One clones the file system in the background and the other clones files on demand as they are requested. To handle requests in the cloned file system while it is being cloned directories are expanded on demand and copy-on-write is used to write files. Writing a file in the parent file system being cloned is more difficult since simple solutions such as locking or marking files before cloning would be too slow. The solution is to add a logical timestamp field to the directory entry for each file and a “clone-start” logical timestamp field added to the file system. Comparing these two timestamps allows identifying files that are yet to be cloned. When a file nto yet cloned is written, it is aggressively cloned and then the write can be completed.

They tested Prism in a virtualized environment where Dom0 had the original file system and DomU the cloned file systems.Cloning a file system with 6 GB of data took 630 seconds using regular file copying;
synchronous cloning took 58.7 seconds; and asynchronous cloning took 0.2 seconds to provide a usable cloned file system, while the total background cloning took 58.7 seconds.

They also ran apache build and connectathon benchmarks and the results showed negligible overhead over a regular file system. The biggest gain was when malware-scanning multiple file systems when there are 2 or more file systems because files with shared inodes do not need to be scanned twice.

A disadvantage of Prism is that currently it does not support multiple concurrent cloning operations.

Toward Massive Query Optimization in Large-Scale Distributed Stream Systems

Younglan Zhou - University of South Denmark

This work proposes a system to optimize SQL-like queries in large-scale distribued stream systems such as data coming from sensors in a WSN, stock values from a financial application, telecomunications system. In this type of system, queries usually take a long time.

This work uses an extended SQL syntax with new semantics. An example is the statement:

select * from S1[NOW], S2[1MIN] where s1.a = s2.b

A first approach to implement these queries would be to fetch remote streams directly from remote servers to the node where the queries are to be executed, but this has disadvantages such as inefficient communication and tight coupling.

Distributed aggregation of stream data can improve communication efficiency but cannot avoid tight copupling. Tight coupling can be prevented by using a pub/sub system. In this case selections on attributes can be defined as subscriptions and brokers can build routing tables using those subscriptions.

The new problem is now how to take advantage of this pub/sub scheme for efficient distributed stream processing.

For each query, a subscription can be generated by using the unique name of the result stream. Queries can have overlapping contents, which might incur in duplicate data transfer. The solution for this is to partition the queries into multiple groups and generate a representative query for each group.

Subscriptions in a pub/sub environment contain a list of stream names S, a list of attributes P and a list of filters F. There are two main objectives. First, to balance the load; and second, to minimize communication costs.

This leads to two modeling problems. Constructing a network graph, and a query graph. Query distributions are modeled as a graph mapping problem: how to map each vertex in the query graph to exactly one vertex in the network graph.

Solving this problem requires considering the characteristics of the pub/sub overlay, the difficulty of constructing a global network graph and query graph, that queries and stream statistics change at runtime, and that producing this mapping is an NP-hard problem.

The approach taken by the authors is to construct the graphs hierarchically by grouping vertices into coarse vertexes that represent multiple vertexes and execute queries on these coarsed nodes. To produce the mapping, they use a greedy approach in a first phase and they refine the mapping in a second phase.

They evaluated their approach in a network topology with 4096 nodes and compared four approaches: a centralized ideal approach, the hierarchical approach, greedy and naive.

Their results show that the hierarchical approach is very close to centralized in efficiency, but in running time centralized is much worse.

This system can be summarized as a stream query system that uses pub/sub to optimize communications using a hierarchical query distribution algorithm.

iDataGuard: An interoperable security middleware for untrusted internet Data storage

Ravi Chandra Jammalamadaka et all. - University of California at Irvine
Currently there is a trend to outsource personal data to services such as Amazon S3 and others. However, those services are not necessarily trusted since they may be vulnerable to insider and outsider attacks. Many applications are being developed to leverage storage infrastructure of these service providers, and sold to individuals and organizations. If customers want to change provider, these applications have to be rewritten.

iDataGuard’s objective is to provide a trusted interoperable middleware with a uniform interface for those providers.

iDataGuard faces a number of challenges. The first concerns security: how to do key management? what is the granularity of the encryption? what is the encrypted storage model at the storage provier?. The second concerns the heterogeneity caused by having multiple providers. The third is the definition of an abstract service model that contains an abstract data model (data is a series of objects) and an abstract operation model (fetch, store, etc.).

Data objects are secured by a cryptographic subsystem and all operations on the files are translated into operations on the object model. Security is enforced on the content of the file system, the metadata of the file system and the file system structure. To guarantee data confidentiality, objects are encrypted with a randomly generated key and large objects are split into smaller objects. HMACs are used for data integrity.

iDataGuard uses an index to allow search on encrypted data. It supports the execution of keyword and pattern queries.

iDataGuard currently has implementations for gmail.com and Amazon S3 and performance tests show that its overhead is negligible.

Hybrid Pub/Sub Protocol

Mark Linderman et al. Air Force Research Lab USA

This work studies the problem of scalable content-based pub/sub in dynamic network environments with objects of greatly varying popularity.

They propose to exploit a modified bittorrent protocol to achieve dissemination of popular objects and use anti-entropy and aggressive propagation to make up for limitations of BitTorrent for less popular objects.

This system is subject to several constraints. It must use filtering to provide content-based functionality within topics, must support fixed and mobile nodes, access control, long term archiving query capabilities, and a simple client interface called CAPI.

The goals for this system are simplicity and robustness. Subscriber peers require only occasional connection to a broker/tracker, publishers require connection to a broker, and churn is expected to occur. Publishers publish to the broker, and the broker decides how to break the published data into pieces, and adds torrent functionality by hashing, assigning piece number, etc.

The system uses three methods for data delivery. For popular data, gossiping algorithms are used. For mid-popular data, they use a combination of gossiping with aggressive propagation, and unpopular data is sent directly to subscribers.

They show that aggressive propagation reduces server load, but that the load increases superlinearly with scale if timeouts are configured to be similar to the network latency.

Moara: Flexible and Scalable Group-Based Queryng System

Steve Ko et al. University of Illinois at Urbana-Champaign

This work deals with the problem of monitoring groups in large-scale infrastructures such as grids, clouds, etc. One of the biggest challenges with these infrastructures is management, which requires monitoring. Traditionally, monitoring has been done with centralized solutions like IBM Tivoly and HP OpenView, which are very expressive, but not very scalable. An alternative is to use distributed aggregation systems such as Astrolabe, Mon and Seaweed, which are scalable and responsive, but they do not support groups.

Groups are formed naturally in large-scale infrastructures. For example, given a set of machines running a particular service in a datacenter, management might be interested in queris such as the average CPU utilization of a slice, or determining which are the top-3 loaded machines running linux and executing a mapreduce task, or the list of machines in rack R with less than 2 VMs and CPU utilization less than 10% and memory utilization less than 5%.

It can be seen that users can target unions and intersections of groups, static and dynamic attributes, and that groups can be defined by predicates rather than by listing a set of nodes.

A scalable group monitoring system requires expressive aggregation queries and responsiveness while handling massive amounts of data.

The authors propose Moara, a group based querying system based on the notion of group trees. Moara implements a Pastry-based aggregation framework where different groups can share a tree and uses three optimization techniques: multi-group optimization, single-group optimization for dynamic groups, and single-group node-bypassing optimization.

As already noted, the basic mechanism is a group tree. Queries are sent to the root, flooded to the children and then aggregated back to the root.

When a query expression is complex it may be difficult to discover optimization opportunities for bandwidth and latency. A naive inefficient approach is to send a query to every group and then collect the data. The solution used in Moara is to rewrite queries using Conjuctive Normal Form (CNF) where an expression is always shown as a series of AND-ed terms. Then, one needs to choose only the smallest term. Semantic information can also be used. For example, if one group includes another one in an AND
operation, the bigger one can be ignored.

Moara tries to strike a balance between aggresive and lazy group-tree management. Aggressive management saves per-query cost, but increases management cost; while lazy management increases per-query cost, but saves management cost. Moara uses an addaptive strategy where they continuously track query cost and management cost.

They evaluated Moara in three testbeds: 200 nodes in PlanetLab (WAN), 500 instances in 50 machines (Emulab) and 10,000 nodes in a simulation environment. The results showed that latency grows sublinearly with respect to group size.

Guido Urdaneta

Middleware 2008 report - Day 1

Thursday, December 4th, 2008

Middleware started yesterday and I will be reporting on some of the presented papers.

These reports are only my interpretation of the presentations made in the conference. For a better understanding of the works please read the proceedings.

I skipped reporting on those papers for which I did understand the presentation well enough.

Diagnosing Distributed Systems with Self-Propelled Instrumentation

(Alexander Mirgorodsky - VMWare and Barton P Miller University of Wisconsin-Madison)

In this paper they propose a diagnosis tool targeted for distributed systems where problems are hard to detect, reproduce and debug, such as e-commerce web systems and HPC clusters and grids.

Their approach is to start an agent from a web browser that propagates through the system across processes. Users provide activation and deactivation events for the diagnostics system by means such as a link in a web page.

The agent is application-agnostic and follows the control flow within the process it sits in. It uses a technique called process hijacking to inject itself into the process to monitor. Once the agent is activated, it changes all call instructions so that they first call instrumentation code which in turn calls the original function.

To cross processes/hosts it sends messages through a socket to the other process/host, which must be running a daemon that hijacks the remote process and performs the same function call instrumentation procedure.

To diagnose problems, the tool analyzes control flows (e.g. per-request traces in a web server) and when something unusual is found it is flagged as a potential bug. A function coverage profile <p1,p2,…,pn> (pi=1 means function fi was called, pi=0 means fi was not called) is also constructed and analyzed. Bugs often result in call-path coverage differences.

One technique used to detect anomalies is to find outliers using unsupervised algorithm (they do not require examples of what an anomalous sample is).

After detecting anomalies, the next step is to find the cause. To do this, call paths in the anomalous flow are considered and compared with call paths in normal flows. Differences in the paths can lead to the cause of the problem. This is called “Coverage Analysis”.

Another problem is separating concurrent flows, which make things more complicated to analyze. They use a flow separation algorithm that uses certain heuristics such as the following:
given that a send normally goes before a recv call, some edges in a path where this does not happen can be removed, thus braking the original flow in multiple flows that correspond to concurrent users.

They use other similar solutions to support programs with queues, for which the previous heuristic does not work.

They show the usefulness of their system by testing it with the Condor grid middleware, deliberately making one of Condor’s components fail.

Performance Comparison of PHP and JSP as Server-Side Scripting Languages

Scott Trent et al. IBM Tokyo.
They compare the performance of PHP and JSP and try to answer the primary question: Is PHP really slower than JSP?
And the secondary questions: Does web server (apache, lighttpd) affect performance?
How to improve performance of scripting lanaguges and web servers tested?

They used the SPECweb2005 benchmark various PHP, JSP, Apache and Lighttpd configurations and found the following answers: 1. JSP is faster than PHP. 2. Lighttpd outperforms Apache. 3. Performance differences are so small that other considerations may influence web server systems design. Bonus: SSL web server implementations should cache and share negotiation data.

SPECweb2005 measures throughput in three scenarios: A banking system, an ecommerce, and a vendor support system. Each scenario presents different amounts of dynamic data and use of encrypted communication.
Banking: 100% encyption, 60% dynamic data
E-commerce: 14% encyption, 71% dynamic data
Vendor support: 0% encryption, 12% dynamic

The system architecture is a typical 3-tier web server-app server / DB server. Obvious performance bottlenecks were elminated through tuning of OS and web server parameters.

In general, lighttpd-JSP was the best, but only with a slight advantage over Apache-JSP and lighttpd-PHP which are essentially tied in 2nd place. 4th place goes to Apache-PHP.

Profiling data shows that encryption dominates performance, especially in lighttpd. Apache caches expensive SSL negotiation data, while lighttpd does not, which means that lighttpd performance can be improved further.

They also performed some microbenchmarks testing the scripting language only. In most pure CPU microbenchmarks Java is orders of magnitude faster than PHP. However, in some algorithms like MD5 or Levenshtein, PHP was better since it uses a C-coded library.

Something of note is that most official SPECweb2005 evaluations use JSP, even while but PHP is more commonly used. Also, rare web servers like Rock, Zeus and Sun are more evaluated than the more popular Apache (3 of 75 evaluations) and IIS (0 of 75). This paper claims to be the first to compare popular configurations.

Debugging and Testing Middleware with Aspect-Based Control-Flow and Causal Patterns

Luis D. Benavides Navaro et al. - EMN-INRIA, LINA

In this paper, they propose an aspect-based debugging system for distributed systems. Most current distributed debugging tools essentially apply sequential debugging techniques to distributed applications. The problem with this approach is that this approach provides little added value to sequential debugging. For example, they lack means for expressive definitions of distributed breakpoints or non-deterministic event relations.

This makes it difficult to debug errors that are triggered by a specific sequence of events. Moreover, the non-determinism inherent to concurrent execution of code makes this even more difficult.

The authors propose to extend the AWED system, which allows to define pointcuts (points of interest) in a disributed application, allowing code to be executed before and after the point where the pointcut is defined.

AWED provides a syntax to express remote pointcuts and stateful aspects, which apply to sequences of multiple events. However, this syntax still presents problems for many distributed applications. It is possible to get False Negatives because the pointcut may not match sequences that arrive in the wrong order. It is also possible to get Flase Positives because the pointcut may match wrong sequences that com from events at different hosts.

To solve these problems, they propose to extensions. The first is to add additional syntax to define causal sequences using the traditional causality definition defined for Lamport and Vector clocks. The second is to provide causal pointcuts with reordering. This applies to cases where a sequence of operations is well known, for example, a web cache issues a “prepare commit”. If a commit is monitored before a prepare, this would cause a spurious error.

The system is implemented using a decentalized architecture based on dynamic weaving and distributed aspects, unlike current centralized tools (e.g. eclipse, alinea). They provide two protocols: one based on causal tags for the AWED runtime, and one based on causal tags+clock increase for JGroups and Java RMI.

They evaluated the system by successfully debugging a real world bug reported for the JBoss Cache system.
They were able to use causal pointcuts to define “concrete breakpoints” matching the error condition and to define a deterministic test case.

They also compared their system with the eclipse debugger using a microbenchmark based on the JBoss Cache framework. Their distributed system outperformed the centralized remote debugging eclipse architecture in terms of requests handled per second.

Burstiness in multitier applications: symptoms, causes, and new models

Ningfang Mi et al. College of William and Mary
The traditional approach to capacity planning is to use mean value analysis (MVA) focusing on
the mean service time. This approach accounts for randomness and variability, but not burstiness.

The research question for this work is: How to characterize, quantify, measure and model burstiness?

The first part is to characterize burstiness. To do this they used TPC-W (a bookstore web application) and applied a bursty workload to it. They observed that when a burst occurs, it was typical that the bottleneck of the system switched from the application server to the database server. It is hard to take into account all things that cause this behavior (database locks, memory constraints and other low-level phenomena), so it is better to study the problem using a black-box approach.

They use an index of dispersion I (normalized variance of autocorrelation coefficients) to characterize burstiness. Their finding is that this index characterizes burstiness very well (shown by three workloads with zero, medium and high burstiness). Practical measurement of I is difficult because autocorrelation estimation is hard. They use an alternative definition: lim[t-->Inf] Var(Nt)/E[Nt]

To calculate this in practice, they repeat a measurement of the parameters for a moving time window.
After repeating this windows measurement, the variance of the parameters can be used to give a measure of the burstiness. Having a high variance means a high index of dispersion and high burstiness.

To model burstiness they use a MAP queueing network model (Markovian Arrival Processing). They parametrized this queueing model and compared it with a MVA model. They get similar results with workload with no burstiness, but when the workload has burstiness, MAPQN model is much better than MVA model.

Industry Track

A P2P Approach to Providing QoS Monitoring in Web Service Activities

Fariaz Karim - Intel

In this work the term web service activity refers to two or more web services that automate business process and use open standards such as SOAP. They can cross geographic regions and can be mission critical.

Some examples of QoS types are end-to-end latency, throughput and exception/errors. Anomaly detection in these variables must be done in real time, end to end, and over WANs. Poor QoS of activities may impact factory operations, which translates into economic losses.

One option is to use a centralized approach, for which there are many products from well-known vendors. This approach has scalability limitations such as the delay imposed by WANs, but also deployment limitations such as legal restrictions in certain countries with international network restrictions such as China.

Intel has instead adopted a P2P model in which every service monitors its own peers and QoS data flows together with activity data as SOAP headers. This also deals with deployment restrictions since firewalls inspect only the payload of SOAP headers, and if the payload is valid, the QoS headers are allowed to pass as well.

This P2P approach is combined with standard middleware technology provided by Microsoft’s .Net framework and allows for good QoS monitoring.

Demystifying Data Deduplication

Pin Zhou et al. - IBM Research Almaden

This work deals with the process of eliminating redundant copies of data. According to a study by IDC, data storage needs are outgrowing storage capacity both for individuals (due to content-rich data) and enterprises (due to digital records, regulatory compliance laws, etc.)

The idea is to reduce storage needs by removing redundant data. For example, 100 emails with the same 1MB attachment.

Some storage vendors offer deduplication products and claim 20-30X space reduction.

This work studies this problem by developing a taxonomy of available approaches studying three aspects: placement, timing and algorithm. The metrics studied for each approach are space savings, deduplication time, reconstruction time and resource consumption (CPU, network and disk I/O). These metrics are evaluated experimentally using a real-world dataset.

From the placement point of view, the available choices are to place the deduplication product at the client, to use a deduplication appliance, and to use storage arrays. In client-based solutions the client sends hashes to the deduplication server, then the server determines if duplicates exist and notifies the client. The client transmits only unique data. This option saves bandwidth, but increases client CPU usage.

Other solutions are deduplication applianced, which are special purpose systems that are generally used for easing backups; and storage arrays, which are similar to RAID arrays, but implement deduplication technology at the block, rather than file level.

With respect to timing, the options are in band and out of band. In-band deduplication is performed before actually writing. It is the only choice for client deduplication and can add latency to write operations. Out-of-bound deduplication is performed at regular intervals or when a high watermark is reached. This provides higher data ingestion speeds, but also lots of additional disk I/O since data is first written without deduplication, then read, and then rewritten with deduplication.

There are also multiple deduplication algorithms. Whole file hashing (WFH) hashes whole files with SHA1/MD5 or similar, and keeps only one instance of duplicate files. The disadvantage is that it does not help with files that differ only at small parts.

Fixed Block Hashing (FBH) divides files into fixed-sized blocks which are subsequently hashed discarding duplicate blocks. This provides better space saving, but more metadata and longer reconstruction time.

Variable Block Hashing (VBH) uses a rolling hash to divide the file in variable size blocks using algorithms such as Rabin fingerprinting.

Delta encoding is an algorithm that, given two files, generates a delta (patch) between both of them and only stores the reference file and the delta.

Data deduplication techniques were evaluated using three hash-based algorithms, a dataset consisting of backups of 16 employees in an enterprise environment totalling 2 million files and 450GB, and a one-time backup.

The folding factor (original size/deduplicated size) was about 1.5, which is not much. Varying the block size for the hashing algorithms changed the results by about 30%. It is clear that to achieve large factors (20 or 30) multiple full backups are required.

The metadata overhead is small in size compared with the data, but it can impact performance significantly. It can differ by more than 7 times for different block sizes.

Smaller block sizes provide little benefit in space saving, but cause significant degradation in reconstruction time. It can vary by more than 15 times for different block sizes.

The main conclusions obtanied by this study are that the efficiency of deduplication time depends on design choices; deduplication metadata is small in size, but can have a large impact in performance; deduplication inherent in backup data is not as large as 20-30 times as claimed by some vendors unless multiple full backups are used; and deduplication is data type dependent.

Note: an audience member argued that the motivation of this study is invalid, since it is a graph produced by IDC that claims that by 2011 storage manufacturers will be producing 800 Exabytes, while 1800EB will be required. The audience member says that this makes no sense and I agree with him. That graph says that in 2008 storage produced was going to be less than the information created during the year. How can that be true when disk prices are incredibly low?. However, I think this data deduplication business can probably still be valid for backups in enterprises, otherwise there would probably not be so many deduplication products on the market.

Darjeeling: a Java Compatible Virtual Machine for Microcontrollers

Niels Brouwers et al. - TU Delft

This work is about a Java compatible execution system suitable for wireless sensor nodes. The authors claim that middleware challenges for WSNs involve robustness, memory efficiency, multiple threads, power consumption and network heterogeneity, and that C language-based solutions fail are not robust due to lack of garbage collection, and do not support network heterogeneity. Moreover, memory efficiency and multithreading in C are mutually exclusive due to multiple stacks. Java, on the other hand, does not suffer from these limitations (according to the authors), except that it being slower makes it consume more power.

Prior work in this area is either proprietary software or not feature rich.

The authors propose a system called Darjeeling targeted at microcontrollers. It is not a JVM, since it uses static linking and a modified instruction set.

Darjeeling uses an “infuser” to, based on class files generated by an ordinary Java compiler, produce “infusion files” and “infusion headers” with linking information. It has a mark&sweep GC, multithreading, exceptions, ASCII strings and a small footprint.

Its limitations are that it supports only a Java subset, and does not support floating point, 64-bit types, class loaders, reflection, or multi-dimensional arrays (Note: Java never supported these anyway, probably refers to arrays of arrays). The class library is also limited.

Probably the most important feature of Darjeeling is its execution model, in which stack frames are heap objects. This allows stacks to grow and shrink in memory as needed without need to pre-allocate a stack with a worst-case size. Threads are lightweight and can be stopped at any time.

The main disadvantage is performance, since it is about 100 times slower than C in microbenchmarks.

Experiments with the threading system shows the advantage of the dynamic stacks.

Guido Urdaneta