Middleware started yesterday and I will be reporting on some of the presented papers.
These reports are only my interpretation of the presentations made in the conference. For a better understanding of the works please read the proceedings.
I skipped reporting on those papers for which I did understand the presentation well enough.
Diagnosing Distributed Systems with Self-Propelled Instrumentation
(Alexander Mirgorodsky - VMWare and Barton P Miller University of Wisconsin-Madison)
In this paper they propose a diagnosis tool targeted for distributed systems where problems are hard to detect, reproduce and debug, such as e-commerce web systems and HPC clusters and grids.
Their approach is to start an agent from a web browser that propagates through the system across processes. Users provide activation and deactivation events for the diagnostics system by means such as a link in a web page.
The agent is application-agnostic and follows the control flow within the process it sits in. It uses a technique called process hijacking to inject itself into the process to monitor. Once the agent is activated, it changes all call instructions so that they first call instrumentation code which in turn calls the original function.
To cross processes/hosts it sends messages through a socket to the other process/host, which must be running a daemon that hijacks the remote process and performs the same function call instrumentation procedure.
To diagnose problems, the tool analyzes control flows (e.g. per-request traces in a web server) and when something unusual is found it is flagged as a potential bug. A function coverage profile <p1,p2,…,pn> (pi=1 means function fi was called, pi=0 means fi was not called) is also constructed and analyzed. Bugs often result in call-path coverage differences.
One technique used to detect anomalies is to find outliers using unsupervised algorithm (they do not require examples of what an anomalous sample is).
After detecting anomalies, the next step is to find the cause. To do this, call paths in the anomalous flow are considered and compared with call paths in normal flows. Differences in the paths can lead to the cause of the problem. This is called “Coverage Analysis”.
Another problem is separating concurrent flows, which make things more complicated to analyze. They use a flow separation algorithm that uses certain heuristics such as the following:
given that a send normally goes before a recv call, some edges in a path where this does not happen can be removed, thus braking the original flow in multiple flows that correspond to concurrent users.
They use other similar solutions to support programs with queues, for which the previous heuristic does not work.
They show the usefulness of their system by testing it with the Condor grid middleware, deliberately making one of Condor’s components fail.
Performance Comparison of PHP and JSP as Server-Side Scripting Languages
Scott Trent et al. IBM Tokyo.
They compare the performance of PHP and JSP and try to answer the primary question: Is PHP really slower than JSP?
And the secondary questions: Does web server (apache, lighttpd) affect performance?
How to improve performance of scripting lanaguges and web servers tested?
They used the SPECweb2005 benchmark various PHP, JSP, Apache and Lighttpd configurations and found the following answers: 1. JSP is faster than PHP. 2. Lighttpd outperforms Apache. 3. Performance differences are so small that other considerations may influence web server systems design. Bonus: SSL web server implementations should cache and share negotiation data.
SPECweb2005 measures throughput in three scenarios: A banking system, an ecommerce, and a vendor support system. Each scenario presents different amounts of dynamic data and use of encrypted communication.
Banking: 100% encyption, 60% dynamic data
E-commerce: 14% encyption, 71% dynamic data
Vendor support: 0% encryption, 12% dynamic
The system architecture is a typical 3-tier web server-app server / DB server. Obvious performance bottlenecks were elminated through tuning of OS and web server parameters.
In general, lighttpd-JSP was the best, but only with a slight advantage over Apache-JSP and lighttpd-PHP which are essentially tied in 2nd place. 4th place goes to Apache-PHP.
Profiling data shows that encryption dominates performance, especially in lighttpd. Apache caches expensive SSL negotiation data, while lighttpd does not, which means that lighttpd performance can be improved further.
They also performed some microbenchmarks testing the scripting language only. In most pure CPU microbenchmarks Java is orders of magnitude faster than PHP. However, in some algorithms like MD5 or Levenshtein, PHP was better since it uses a C-coded library.
Something of note is that most official SPECweb2005 evaluations use JSP, even while but PHP is more commonly used. Also, rare web servers like Rock, Zeus and Sun are more evaluated than the more popular Apache (3 of 75 evaluations) and IIS (0 of 75). This paper claims to be the first to compare popular configurations.
Debugging and Testing Middleware with Aspect-Based Control-Flow and Causal Patterns
Luis D. Benavides Navaro et al. - EMN-INRIA, LINA
In this paper, they propose an aspect-based debugging system for distributed systems. Most current distributed debugging tools essentially apply sequential debugging techniques to distributed applications. The problem with this approach is that this approach provides little added value to sequential debugging. For example, they lack means for expressive definitions of distributed breakpoints or non-deterministic event relations.
This makes it difficult to debug errors that are triggered by a specific sequence of events. Moreover, the non-determinism inherent to concurrent execution of code makes this even more difficult.
The authors propose to extend the AWED system, which allows to define pointcuts (points of interest) in a disributed application, allowing code to be executed before and after the point where the pointcut is defined.
AWED provides a syntax to express remote pointcuts and stateful aspects, which apply to sequences of multiple events. However, this syntax still presents problems for many distributed applications. It is possible to get False Negatives because the pointcut may not match sequences that arrive in the wrong order. It is also possible to get Flase Positives because the pointcut may match wrong sequences that com from events at different hosts.
To solve these problems, they propose to extensions. The first is to add additional syntax to define causal sequences using the traditional causality definition defined for Lamport and Vector clocks. The second is to provide causal pointcuts with reordering. This applies to cases where a sequence of operations is well known, for example, a web cache issues a “prepare commit”. If a commit is monitored before a prepare, this would cause a spurious error.
The system is implemented using a decentalized architecture based on dynamic weaving and distributed aspects, unlike current centralized tools (e.g. eclipse, alinea). They provide two protocols: one based on causal tags for the AWED runtime, and one based on causal tags+clock increase for JGroups and Java RMI.
They evaluated the system by successfully debugging a real world bug reported for the JBoss Cache system.
They were able to use causal pointcuts to define “concrete breakpoints” matching the error condition and to define a deterministic test case.
They also compared their system with the eclipse debugger using a microbenchmark based on the JBoss Cache framework. Their distributed system outperformed the centralized remote debugging eclipse architecture in terms of requests handled per second.
Burstiness in multitier applications: symptoms, causes, and new models
Ningfang Mi et al. College of William and Mary
The traditional approach to capacity planning is to use mean value analysis (MVA) focusing on
the mean service time. This approach accounts for randomness and variability, but not burstiness.
The research question for this work is: How to characterize, quantify, measure and model burstiness?
The first part is to characterize burstiness. To do this they used TPC-W (a bookstore web application) and applied a bursty workload to it. They observed that when a burst occurs, it was typical that the bottleneck of the system switched from the application server to the database server. It is hard to take into account all things that cause this behavior (database locks, memory constraints and other low-level phenomena), so it is better to study the problem using a black-box approach.
They use an index of dispersion I (normalized variance of autocorrelation coefficients) to characterize burstiness. Their finding is that this index characterizes burstiness very well (shown by three workloads with zero, medium and high burstiness). Practical measurement of I is difficult because autocorrelation estimation is hard. They use an alternative definition: lim[t-->Inf] Var(Nt)/E[Nt]
To calculate this in practice, they repeat a measurement of the parameters for a moving time window.
After repeating this windows measurement, the variance of the parameters can be used to give a measure of the burstiness. Having a high variance means a high index of dispersion and high burstiness.
To model burstiness they use a MAP queueing network model (Markovian Arrival Processing). They parametrized this queueing model and compared it with a MVA model. They get similar results with workload with no burstiness, but when the workload has burstiness, MAPQN model is much better than MVA model.
Industry Track
A P2P Approach to Providing QoS Monitoring in Web Service Activities
Fariaz Karim - Intel
In this work the term web service activity refers to two or more web services that automate business process and use open standards such as SOAP. They can cross geographic regions and can be mission critical.
Some examples of QoS types are end-to-end latency, throughput and exception/errors. Anomaly detection in these variables must be done in real time, end to end, and over WANs. Poor QoS of activities may impact factory operations, which translates into economic losses.
One option is to use a centralized approach, for which there are many products from well-known vendors. This approach has scalability limitations such as the delay imposed by WANs, but also deployment limitations such as legal restrictions in certain countries with international network restrictions such as China.
Intel has instead adopted a P2P model in which every service monitors its own peers and QoS data flows together with activity data as SOAP headers. This also deals with deployment restrictions since firewalls inspect only the payload of SOAP headers, and if the payload is valid, the QoS headers are allowed to pass as well.
This P2P approach is combined with standard middleware technology provided by Microsoft’s .Net framework and allows for good QoS monitoring.
Demystifying Data Deduplication
Pin Zhou et al. - IBM Research Almaden
This work deals with the process of eliminating redundant copies of data. According to a study by IDC, data storage needs are outgrowing storage capacity both for individuals (due to content-rich data) and enterprises (due to digital records, regulatory compliance laws, etc.)
The idea is to reduce storage needs by removing redundant data. For example, 100 emails with the same 1MB attachment.
Some storage vendors offer deduplication products and claim 20-30X space reduction.
This work studies this problem by developing a taxonomy of available approaches studying three aspects: placement, timing and algorithm. The metrics studied for each approach are space savings, deduplication time, reconstruction time and resource consumption (CPU, network and disk I/O). These metrics are evaluated experimentally using a real-world dataset.
From the placement point of view, the available choices are to place the deduplication product at the client, to use a deduplication appliance, and to use storage arrays. In client-based solutions the client sends hashes to the deduplication server, then the server determines if duplicates exist and notifies the client. The client transmits only unique data. This option saves bandwidth, but increases client CPU usage.
Other solutions are deduplication applianced, which are special purpose systems that are generally used for easing backups; and storage arrays, which are similar to RAID arrays, but implement deduplication technology at the block, rather than file level.
With respect to timing, the options are in band and out of band. In-band deduplication is performed before actually writing. It is the only choice for client deduplication and can add latency to write operations. Out-of-bound deduplication is performed at regular intervals or when a high watermark is reached. This provides higher data ingestion speeds, but also lots of additional disk I/O since data is first written without deduplication, then read, and then rewritten with deduplication.
There are also multiple deduplication algorithms. Whole file hashing (WFH) hashes whole files with SHA1/MD5 or similar, and keeps only one instance of duplicate files. The disadvantage is that it does not help with files that differ only at small parts.
Fixed Block Hashing (FBH) divides files into fixed-sized blocks which are subsequently hashed discarding duplicate blocks. This provides better space saving, but more metadata and longer reconstruction time.
Variable Block Hashing (VBH) uses a rolling hash to divide the file in variable size blocks using algorithms such as Rabin fingerprinting.
Delta encoding is an algorithm that, given two files, generates a delta (patch) between both of them and only stores the reference file and the delta.
Data deduplication techniques were evaluated using three hash-based algorithms, a dataset consisting of backups of 16 employees in an enterprise environment totalling 2 million files and 450GB, and a one-time backup.
The folding factor (original size/deduplicated size) was about 1.5, which is not much. Varying the block size for the hashing algorithms changed the results by about 30%. It is clear that to achieve large factors (20 or 30) multiple full backups are required.
The metadata overhead is small in size compared with the data, but it can impact performance significantly. It can differ by more than 7 times for different block sizes.
Smaller block sizes provide little benefit in space saving, but cause significant degradation in reconstruction time. It can vary by more than 15 times for different block sizes.
The main conclusions obtanied by this study are that the efficiency of deduplication time depends on design choices; deduplication metadata is small in size, but can have a large impact in performance; deduplication inherent in backup data is not as large as 20-30 times as claimed by some vendors unless multiple full backups are used; and deduplication is data type dependent.
Note: an audience member argued that the motivation of this study is invalid, since it is a graph produced by IDC that claims that by 2011 storage manufacturers will be producing 800 Exabytes, while 1800EB will be required. The audience member says that this makes no sense and I agree with him. That graph says that in 2008 storage produced was going to be less than the information created during the year. How can that be true when disk prices are incredibly low?. However, I think this data deduplication business can probably still be valid for backups in enterprises, otherwise there would probably not be so many deduplication products on the market.
Darjeeling: a Java Compatible Virtual Machine for Microcontrollers
Niels Brouwers et al. - TU Delft
This work is about a Java compatible execution system suitable for wireless sensor nodes. The authors claim that middleware challenges for WSNs involve robustness, memory efficiency, multiple threads, power consumption and network heterogeneity, and that C language-based solutions fail are not robust due to lack of garbage collection, and do not support network heterogeneity. Moreover, memory efficiency and multithreading in C are mutually exclusive due to multiple stacks. Java, on the other hand, does not suffer from these limitations (according to the authors), except that it being slower makes it consume more power.
Prior work in this area is either proprietary software or not feature rich.
The authors propose a system called Darjeeling targeted at microcontrollers. It is not a JVM, since it uses static linking and a modified instruction set.
Darjeeling uses an “infuser” to, based on class files generated by an ordinary Java compiler, produce “infusion files” and “infusion headers” with linking information. It has a mark&sweep GC, multithreading, exceptions, ASCII strings and a small footprint.
Its limitations are that it supports only a Java subset, and does not support floating point, 64-bit types, class loaders, reflection, or multi-dimensional arrays (Note: Java never supported these anyway, probably refers to arrays of arrays). The class library is also limited.
Probably the most important feature of Darjeeling is its execution model, in which stack frames are heap objects. This allows stacks to grow and shrink in memory as needed without need to pre-allocate a stack with a worst-case size. Threads are lightweight and can be stopped at any time.
The main disadvantage is performance, since it is about 100 times slower than C in microbenchmarks.
Experiments with the threading system shows the advantage of the dynamic stacks.
Guido Urdaneta