SOSP09: LADIS: Session 4: Monitoring and Repair
October 11th, 2009 by Maysam Yabandeh
Toward automatic policy refinement in repair services for large distributed systems
Moises Goldszmidt (Microsoft Research), Mihai Budiu (Microsoft Research), Yue Zhang (Microsoft) and Michael Pechuk (Microsoft)
———————–
Goal: evaluate the effectiveness of a repair action and suggest alternatives
Effectiveness -> time that a machine is ‘usable’ after the repair
What counts as a successful repair: measure the number of good signals
- Use machine learning to pick the right signals
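A minimal sketch of the effectiveness idea above, assuming a hypothetical log of (repair action, hours the machine stayed usable afterwards) pairs. The action names, the numbers, and the function rank_repair_actions are illustrative only and do not come from the talk. A plain per-action average stands in for the machine-learning signal selection, which the notes do not detail.

```python
from collections import defaultdict
from statistics import mean


def rank_repair_actions(history):
    """Rank repair actions by mean usable time following the action.

    history: iterable of (action, usable_hours) pairs (hypothetical log format).
    Returns a list of (action, mean_usable_hours), best action first.
    """
    by_action = defaultdict(list)
    for action, usable_hours in history:
        by_action[action].append(usable_hours)
    scores = {action: mean(hours) for action, hours in by_action.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up observations, for illustration only.
history = [
    ("reboot", 2.0),
    ("reboot", 4.0),
    ("reimage", 30.0),
    ("reimage", 50.0),
    ("reboot", 3.0),
]
ranking = rank_repair_actions(history)
```

With such a ranking, a policy could suggest the next-best action as an alternative when the usual one keeps machines usable only briefly.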
A case for the accountable cloud
Andreas Haeberlen (MPI-SWS)
———————–
Problem of cloud computing: the administrative domain is split between the software provider and the cloud provider
- When a problem arises, how can the software provider tell whether it was caused by its own software or by the cloud?
– How can one party prove to the other that the fault lies with them?
Solution: use peer-review techniques
Completeness: can be relaxed -> probabilistic log of action
Accuracy: cannot be relaxed -> a missed action might be critical to detecting the problem
Learning from the Past for Resolving Dilemmas of Asynchrony
Paul Ezhilchelvan (Newcastle University) and Santosh Shrivastava (Newcastle University)
———————–
Cost of asynchrony: we do not have a perfect failure detector
In newly emerged managed environments (e.g., data centers), we do not need the asynchronous model
- delays are predictable
- probabilistically synchronous model
Assumption: in most cases, performance in the recent past indicates performance in the near future
design steps:
- measure delays
- design the protocol with tunable parameters
- choose the parameters at run-time
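The design steps above can be sketched as follows, assuming the tunable parameter is a message timeout and that it is chosen as an empirical quantile over a sliding window of recent delays. The class DelayEstimator, its method names, and the window size are hypothetical and not taken from the talk.

```python
import math
from collections import deque


class DelayEstimator:
    """Keeps recent delay measurements and derives a timeout from them."""

    def __init__(self, window=100):
        # Only the most recent `window` samples are kept ("recent past").
        self.samples = deque(maxlen=window)

    def record(self, delay_ms):
        """Step 1: measure delays."""
        self.samples.append(delay_ms)

    def timeout_for(self, coverage):
        """Steps 2-3: pick the timeout at run-time for a target probability.

        Returns the smallest recorded delay d such that at least a
        `coverage` fraction of the recent samples are <= d.
        """
        if not self.samples:
            raise ValueError("no delay samples recorded yet")
        if not 0 < coverage <= 1:
            raise ValueError("coverage must be in (0, 1]")
        ordered = sorted(self.samples)
        index = math.ceil(coverage * len(ordered)) - 1
        return ordered[index]


estimator = DelayEstimator(window=20)
for delay_ms in range(1, 21):  # made-up delays of 1..20 ms
    estimator.record(delay_ms)
median_timeout = estimator.timeout_for(0.5)
strict_timeout = estimator.timeout_for(1.0)
```

A higher coverage gives a longer timeout and fewer false suspicions; a lower one reacts faster but is wrong more often, which is the trade-off the run-time choice is meant to tune.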
big picture:
- assume a probability
- detection of failure is guaranteed
- react to failure in an application-specific way
- adapt the probability
Q: We already have probabilistic consensus protocols.
A: They are more expensive, because they repeat the process until consensus is reached