Direction des Relations Internationales (DRI)

Programme INRIA "Equipes Associées"

I. DEFINITION

EQUIPE ASSOCIEE

CloudComputing@home
sélection
2009

Equipe-Projet INRIA : MESCAL
Organisme étranger partenaire : UC Berkeley
Centre de recherche INRIA : Grenoble Rhône-Alpes
Thème INRIA : Num
Pays : USA
 
 
Coordinateur français / Coordinateur étranger
Nom, prénom : Kondo, Derrick / Anderson, David P.
Grade/statut : CR INRIA / Research Scientist, Director of BOINC & SETI@home
Organisme d'appartenance (département et/ou laboratoire) : MESCAL project team / U.C. Berkeley Space Sciences Laboratory
Adresse postale : Laboratoire LIG, ENSIMAG - antenne de Montbonnot, ZIRST 51, avenue Jean Kuntzmann, 38330 MONTBONNOT-SAINT-MARTIN, France / 7 Gauss Way, Berkeley, CA 94720
URL : http://mescal.imag.fr/membres/derrick.kondo/index.html / http://boinc.berkeley.edu/anderson
Téléphone : +33/0.4.76.61.20.61 / +1.510.642.4921
Télécopie : +33/0.4.76.61.20.99 / +1.510.643.7629
Courriel : derrick.kondo@inria.fr / davea@ssl.berkeley.edu

La proposition en bref

Titre de la thématique de collaboration (en français et en anglais) 

Cloud Computing over Internet Volunteer Resources. 

Cloud computing sur ressources de calcul bénévoles.

Descriptif:

Recently, a new vision of cloud computing has emerged in which the complexity of an IT infrastructure is completely hidden from its users.  At the same time, cloud computing platforms provide massive scalability, 99.999% reliability, and speedy performance at relatively low cost for complex applications and services.  In this proposed collaboration, we investigate the use of cloud computing for large-scale, demanding applications and services over the most unreliable but also most powerful resources in the world, namely volunteered resources over the Internet.  The motivation is the immense collective power of volunteer resources (evidenced by FOLDING@home's 3.9-PetaFLOPS system) and the relatively low cost of using such resources.  We will focus on three main challenges.  First, we will determine, design, and implement statistical and prediction methods for ensuring reliability, in particular, that a set of N resources remains available for T time.  Second, we will incorporate the notion of network distance among resources.  Third, we will create a resource specification language so that applications can easily leverage these mechanisms for guaranteed availability and network proximity.  We will address these challenges drawing on the experience of the BOINC team, which designed and implemented BOINC (a middleware for volunteer computing that is the underlying infrastructure for SETI@home), and the MESCAL team, which designed and implemented OAR (an industrial-strength resource management system that runs across Grid'5000, France's main 5000-node Grid).

La notion de «cloud computing» apparue récemment a pour objectif de complètement masquer la complexité des infrastructures numériques actuelles à leurs utilisateurs. Ces plates-formes de cloud computing doivent donc massivement passer à l'échelle, être extrêmement fiables et fournir d'excellentes performances à faible coût aux applications et aux services qu'elles offrent. Dans le cadre de la collaboration que nous proposons, nous regarderons comment mettre en oeuvre cette idée de cloud computing pour des applications complexes et exigeantes sur les plates-formes les plus instables qui soient, mais également les plus puissantes: les ressources de calcul bénévole accessibles à partir d'Internet. La motivation derrière ce défi est l'incroyable puissance de calcul résultant de ces ressources (3.9 PetaFLOPS, rien que pour FOLDING@home) et leur très faible coût. Nous nous concentrerons sur trois axes principaux. Premièrement, nous chercherons à concevoir et à mettre en oeuvre des méthodes statistiques et de prédiction permettant d'améliorer la fiabilité de ces ressources. En particulier, nous nous intéresserons à la probabilité qu'un ensemble de N machines soit disponible durant une période de temps T. Deuxièmement, nous couplerons ces garanties de disponibilité de ressources avec des informations de proximité réseau entre les unes et les autres. Enfin, nous étendrons un langage de spécification de ressources classique afin que les applications et les services soient à même d'exprimer des besoins en termes de disponibilité et de proximité réseau. Nous relèverons ces défis en nous appuyant sur les expériences respectives des équipes impliquées: l'équipe BOINC a conçu et mis en oeuvre le middleware BOINC qui est utilisé à travers le monde entier et est l'infrastructure sous-jacente à SETI@home; l'équipe MESCAL a conçu et mis en oeuvre OAR, un outil de gestion de ressources et de tâches de qualité industrielle et qui est notamment au coeur de Grid'5000.

Présentation détaillée de l'Équipe Associée

1. Objectifs scientifiques de la proposition

Background on Cloud Computing

Recently, a new vision of cloud computing has emerged where the complexity of an IT infrastructure is completely hidden from its users.  At the same time, cloud computing platforms are attractive because they provide massive scalability, 99.999% reliability, and speedy performance at relatively low costs.  

From the perspective of a user, there are two main types of cloud computing platforms.  The first type is Software-as-a-Service, where applications (such as Google Apps) are accessed through the web.  The second type is Platform-as-a-Service, where a third party provides and manages a computing or storage infrastructure, and users purchase time or resources in this infrastructure.  Typical Platform-as-a-Service architectures, such as Amazon's EC2 and S3, are large-scale, centralized data and compute centers.

From the perspective of developers, computing clouds consist of three types of entities [2].  The first type is Enablers, which provide the data center automation or server virtualization technologies that form the fundamental building blocks of cloud computing.  Examples of Enablers are the hypervisors from Xen or VMware.  The second type is Providers, which supply the infrastructure behind Software- or Platform-as-a-Service, and which maintain their own billing models for use of those services.  An example of a Provider is Amazon's EC2 cloud.  The third type is Consumers, which utilize cloud infrastructures for software or platform services.  An example of a Consumer is Google Apps.

In terms of the user perspective, we propose to develop a novel Platform-as-a-Service architecture that is orders of magnitude cheaper and more powerful than current systems.  In terms of the developer perspective, we will build this system using Enablers in novel ways and create a new type of Provider based on volunteered resources over the Internet.

Background on Volunteer Computing

While most cloud computing infrastructures are commercial, dedicated, and centralized architectures, most of the world's computing power is distributed across the hundreds of millions of Internet hosts on residential broadband networks.  Volunteer computing has sought to harness the free collective computational and storage resources of desktop PCs throughout the Internet.  For over a decade, volunteer computing systems (VCS's) have been among the largest and most powerful distributed computing systems on the planet, offering a high return on investment for applications from a wide range of scientific domains (including computational biology, climate prediction, and high-energy physics).  For example, the FOLDING@home project currently harnesses over 3.9 PetaFLOPS from about 344,000 active CPUs.  Since 2000, over 100 scientific publications (in the world's most prestigious scientific journals, such as Science and Nature) have documented real scientific results achieved through volunteer computing.  The primary middleware for volunteer computing is the Berkeley Open Infrastructure for Network Computing (BOINC), developed by the Berkeley team in this proposed collaboration.  BOINC has proven to scale to millions of resources, and has a number of mechanisms for dealing with resource heterogeneity.  However, the types of applications and services that can run over volunteer resources have been largely limited to trivially parallel ones.  This has been mainly due to the extreme unreliability of resources (as they are shared with and controlled by their owners) and the lack of network proximity information (as one cannot determine whether a node is close or far away in terms of network distance).  In this collaboration, we aim to broaden the range of applications and services that can leverage volunteer computing, and we detail our approach with respect to the cloud computing vision in the next section.

Goals of collaboration

The goal of this proposal is to develop a cloud computing middleware that enables the execution of complex services and applications over unreliable volunteered resources over the Internet.  In terms of reliability, these resources are unavailable on average 40% of the time, and exhibit frequent churn (several times a day).  In terms of "real, complex services and applications", we refer to large-scale service deployments, such as Amazon's EC2, the TeraGrid, and EGEE, and also applications with complex dependencies among tasks.  These commercial and scientific services and applications need guaranteed availability levels of 99.999% and knowledge of network proximity in order to have efficient and timely execution.

The main motivation for using volunteered resources is that they can be orders of magnitude cheaper and more powerful than dedicated cloud environments.  Computing clouds such as Amazon's EC2 have a pay-per-use billing model where charges are incurred per CPU hour or per GB of data transferred.  Given that FOLDING@home uses 3.9 PetaFLOPS, this would cost the project about 6.7 million USD per year if it could run over Amazon's EC2 (assuming a rate of $0.10 per small-instance hour).  Even for an "average" volunteer computing project that uses 20 TeraFLOPS, this equates to about 340,000 USD per year.  By contrast, we propose to offer a cloud computing service over volunteered resources where costs for the application are almost zero.  For example, we would offer our cloud computing platform as a computational co-op where the relatively low monetary costs to maintain the infrastructure would not increase with resource usage or the number of CPUs or volunteers.
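As a sanity check, the yearly figures above follow from simple instance-hour arithmetic. The concurrent-instance counts in this sketch are illustrative assumptions chosen to reproduce the quoted totals, not published Amazon or project data:

```python
# Back-of-the-envelope check of the pay-per-use figures quoted above.
# The concurrent-instance counts are illustrative assumptions, not data.
RATE_PER_HOUR = 0.10       # USD per small-instance hour (2008-era EC2 pricing)
HOURS_PER_YEAR = 24 * 365  # 8760

def yearly_cost(concurrent_instances, rate=RATE_PER_HOUR):
    """Cost of keeping this many small instances running for a full year."""
    return concurrent_instances * rate * HOURS_PER_YEAR

print(yearly_cost(7650))  # ~6.7 million USD/year (the FOLDING@home-scale figure)
print(yearly_cost(390))   # ~340,000 USD/year (the 20-TeraFLOPS figure)
```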

With this goal, our collaboration will focus on the following three research thrusts:

  1. To create a virtually dedicated platform built from unreliable resources
  2. To incorporate the notion of network distance among resources
  3. To enable advanced resource selection so that applications can easily leverage these mechanisms for guaranteed reliability and network proximity

The collaboration will draw upon the complementary research interests and experience of the two teams.  The Berkeley team will provide their expertise in developing the scalable and fault-tolerant middleware BOINC for use over volunteered resources.  The MESCAL team will provide their expertise in availability prediction, simulation of distributed systems, and development of resource management systems.  In particular, we will leverage the work in the MESCAL team used to develop the OAR resource management system over unreliable resources.  OAR is an industrial-strength resource management system for the main nation-wide French Grid called Grid'5000.  It is used daily to execute thousands of jobs on over 5000 nodes in Grid'5000.  Development of the OAR resource management system is led by Olivier Richard, a member of this proposed collaboration.  In essence, we aim to combine and augment the resource management mechanisms of OAR with novel mechanisms for fault-tolerance in BOINC, allowing complex applications and services to leverage the power of volunteer computing.

Approach

    Research thrust #1: Virtually dedicated platform

    Volunteer resources are arguably the most volatile computational resources worldwide.  They exhibit low availability (on average 60%) and high churn (on the order of hours).  Nevertheless, our vision is to provide the illusion of a fully dedicated platform where resources appear to be 100% available.  This in turn would help enable the deployment of complex applications over volatile resources.  Our approach is to use statistical and predictive methods combined with virtual machine technology to ensure the availability of a collection of resources.  

    Our preliminary results on prediction of collective availability described in [1] are promising.  In that joint study, we gathered real-world availability data from over 48,000 Internet hosts participating in the SETI@home project.  With this trace data, we showed how to reliably and efficiently predict that a collection of N hosts will be available for T time.  The results indicate that by using prediction combined with replication it is feasible to deploy enterprise services or applications over volatile resource pools with little overhead.

    We propose to combine this predictive method with virtual machine technology to provide a virtually-dedicated platform to applications and users.  Specifically, we envision the following architecture:

Image of architecture

    In step 1, we use our prediction model to determine which set of machines is available during some prediction interval T.  Out of those machines, we choose N+X machines, where N is the number of machines required by the application and X is the number of extra machines used to replicate the service or application.  The application or service is then deployed on this initial group for T time.  The idea is that if a certain number of predictions are wrong and several machines fail (up to X in number), the application or service can continue to execute over the remaining N machines.  In step 2, after the period T expires, we rerun the predictor to determine a new set of available machines and replace any failed machines from the selected pool with these machines.
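The effect of the X redundant machines can be quantified. Under the simplifying (and here purely illustrative) assumption that each selected host independently remains available over T with probability p, the chance that at least N of the N+X machines survive the interval is a binomial tail:

```python
from math import comb

def collective_availability(n, x, p):
    """Probability that at least n of n+x hosts remain available over the
    whole interval T, assuming (for illustration only) that each host is
    independently available with probability p."""
    total = n + x
    return sum(comb(total, k) * p ** k * (1 - p) ** (total - k)
               for k in range(n, total + 1))

# With hosts individually predicted 90% available, a modest number of extra
# machines X sharply raises the chance of keeping N machines up for all of T:
for x in (0, 5, 10, 15):
    print(x, collective_availability(50, x, 0.9))
```

Real hosts are of course not independent or identically distributed; this is only the intuition behind choosing X, not the prediction method itself.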

    One major issue is how to ensure uninterrupted execution of an application or service despite machine failures during the process described above.  We propose to use virtual machines as a means of conducting lightweight process migration across resources.  Clearly, in low-bandwidth Internet environments, one cannot afford to migrate entire virtual machines gigabytes in size to checkpoint servers.

    Instead, our approach is to migrate only the portion of the virtual machine (VM) that has changed locally on a volunteer resource.  Specifically, virtual machines deployed over volunteer machines are often stored locally as a read-only raw image.  VM instances are created from that raw image, and modifications to that image cause a copy-on-write (COW).  An overlay image stores modifications to the raw image as the application or service executes within the VM instance.  These overlay images are often much smaller than the full raw images.  As such, we propose to have overlay images stored periodically at a remote site.  The remote site could be a centralized and dedicated data server, or a fully P2P network such as BitTorrent.
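A minimal sketch of this overlay idea, in plain Python with hypothetical names (a real deployment would use an on-disk VM format such as qcow2 overlays rather than in-memory blocks):

```python
# Illustrative sketch: writes to a read-only base image are captured
# block-by-block in an overlay, and only the overlay -- typically far
# smaller than the base -- is shipped to a remote checkpoint site.
class CowImage:
    def __init__(self, base_blocks):
        self.base = base_blocks      # read-only raw image, shared by all instances
        self.overlay = {}            # block index -> modified block (copy-on-write)

    def read(self, i):
        return self.overlay.get(i, self.base[i])

    def write(self, i, block):
        self.overlay[i] = block      # the base image is never touched

    def checkpoint(self):
        """Return only the modified blocks for remote storage."""
        return dict(self.overlay)

    @classmethod
    def restore(cls, base_blocks, overlay):
        img = cls(base_blocks)       # resume on another host from base + overlay
        img.overlay = dict(overlay)
        return img

base = [b"\x00" * 4096] * 1024       # 4 MiB base image in 4 KiB blocks
vm = CowImage(base)
vm.write(7, b"\x01" * 4096)
ckpt = vm.checkpoint()               # one 4 KiB block instead of 4 MiB
vm2 = CowImage.restore(base, ckpt)
assert vm2.read(7) == b"\x01" * 4096
```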

    Research thrust #2: Network distance

    Clearly, in a distributed Internet environment, distributed and parallel applications need not only high resource availability, but also knowledge of network distance.  Fortunately, a plethora of related work has focused on methods for determining network distance in distributed systems.  As such, we will apply the most relevant results of this research rather than invent our own.  To the best of our knowledge, no volunteer computing system currently provides knowledge of network proximity, and the details of network location in commercial cloud architectures are cloudy.  Thus, our contribution will be the design and implementation of a network distance service in the context of a new cloud computing middleware, and a demonstration of its effectiveness.

    In particular we will build upon the work in [3].  In this work, the authors describe a method where resources divide and place themselves in bins;  nodes are placed in bins depending on how close they are in terms of network latency.  This is determined by pinging a small set of well-known landmarks.  

    There are three main advantages of this approach.  First, it is simple in that no special measurement infrastructure is needed, nor is inter-node communication required.  (Note that inter-node communication would be a major issue, as BOINC does not have a P2P architecture but instead relies on a distributed set of project servers.)  Nodes simply need to know their distance from a relatively small number (~15) of well-known landmarks, which can be discovered with a simple ping.  Nodes do not need to know the distance among landmarks, nor the distance of other nodes from these landmarks.  Second, it is scalable in that nodes do not need any global information to determine network proximity.  Also, the method has been shown to work with over 1 million nodes using fewer than 15 landmarks.  The BOINC client that runs on each volunteer resource is by default installed with knowledge of 25 project servers.  As such, these project servers could serve as the landmarks required for determining network proximity.  Third, it is accurate, as the authors showed that the method will place nearby nodes in the same bin.
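A sketch of how such binning could look on each node; the landmark names and latency-level boundaries below are illustrative assumptions, not values taken from [3]:

```python
# Distributed binning sketch: each node independently orders the landmarks by
# measured RTT and tags each with a coarse latency level; nodes that compute
# equal bins are likely close in network distance.  Landmarks and level
# boundaries here are illustrative assumptions.
LEVELS = [50, 150]  # ms boundaries: level 0 = <50 ms, 1 = 50-150 ms, 2 = >150 ms

def latency_level(rtt_ms, levels=LEVELS):
    for lvl, bound in enumerate(levels):
        if rtt_ms < bound:
            return lvl
    return len(levels)

def bin_of(rtts):
    """rtts: dict landmark -> measured RTT in ms.  Returns the node's bin:
    landmarks sorted by increasing RTT, each tagged with its latency level."""
    order = sorted(rtts, key=rtts.get)
    return tuple((lm, latency_level(rtts[lm])) for lm in order)

# Two nearby nodes see similar RTTs to the landmarks and land in the same bin:
a = bin_of({"L1": 20, "L2": 90, "L3": 200})
b = bin_of({"L1": 25, "L2": 100, "L3": 240})
c = bin_of({"L1": 210, "L2": 95, "L3": 30})
assert a == b and a != c
```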

    Research thrust #3: Resource Description Language

     In the previous sections, we described mechanisms for ensuring availability of machines and determining their network proximity.  Certainly, services and applications will have different requirements for resource selection when using these mechanisms.

    In particular, we would like to support resource selection requests that describe hard constraints of the form:

My job needs N nodes for T time with a maximum latency of L.

    We would also like to design this language to support soft constraints of applications of the form:

My job needs a {stable, volatile} set of resources that is {loosely-connected, tightly-connected}.  

    In this case, "stable" or "volatile" qualitatively describes the level of availability that the application can tolerate.  "Loosely-connected" and "tightly-connected" refer to the latency among nodes required by the application.  Clearly, this language should continue to support constraints on other types of resources: for example, "heterogeneous" and "homogeneous" with respect to a node's hardware.

    We will augment the OAR resource management system by integrating information about resource availability and predictions, and also network distance data.  OAR itself is built on the relational database engine MySQL.  OAR users can already specify detailed resource requirements for their jobs through a human-readable, structured, and hierarchical language.  For example, one can specify requirements for clusters, switches, nodes, CPUs, and even cores.  We will add new capabilities to the resource description language of OAR so that job requirements concerning resource availability and network proximity can easily be expressed.
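To illustrate what such extended selection might do, here is a hypothetical routine combining the availability and proximity constraints; the request parameters and host attributes are our own sketch, not OAR's actual request language:

```python
from collections import defaultdict

# Hypothetical sketch of extended resource selection.  Host attributes
# (predicted available hours, availability probability, network bin) are
# illustrative; proximity is approximated by hosts sharing a network bin,
# as produced in research thrust #2.
def select(hosts, n, walltime_h, tightly_connected=False, min_avail=None):
    """Pick n hosts predicted available for walltime_h hours; if
    tightly_connected, restrict the choice to a single network bin."""
    ok = [h for h in hosts
          if h["pred_avail_h"] >= walltime_h
          and (min_avail is None or h["avail_prob"] >= min_avail)]
    if tightly_connected:
        groups = defaultdict(list)
        for h in ok:
            groups[h["bin"]].append(h)
        ok = max(groups.values(), key=len, default=[])
    return ok[:n] if len(ok) >= n else None

hosts = [
    {"name": "h1", "pred_avail_h": 12, "avail_prob": 0.95, "bin": "A"},
    {"name": "h2", "pred_avail_h": 8,  "avail_prob": 0.90, "bin": "A"},
    {"name": "h3", "pred_avail_h": 2,  "avail_prob": 0.99, "bin": "B"},
]
# "My job needs 2 nodes for 6 hours on tightly-connected resources":
picked = select(hosts, n=2, walltime_h=6, tightly_connected=True)
assert [h["name"] for h in picked] == ["h1", "h2"]
```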

2. Présentation des partenaires

Les différentes équipes participantes.

BOINC, UC Berkeley

The Berkeley Open Infrastructure for Network Computing (BOINC) is a research project team that develops middleware for volunteer computing. The BOINC middleware is used by SETI@home, Climateprediction.net, Einstein@home, and a number of other scientific computing projects.  Since 2000, over 100 scientific publications (appearing in prestigious journals such as Nature and Science) have resulted from volunteer computing projects. Since the project's initial creation in 2002, BOINC has become one of the most powerful and most influential distributed computing systems in the world, harnessing about 1.1 PetaFLOPS of computing power from over a million Internet hosts.

MESCAL, INRIA

MESCAL (Middleware Efficiently SCALable) is a joint team with members from CNRS, INPG, INRIA and UJF, all of whom are in the LIG laboratory.

The goal of MESCAL is to design software solutions for the efficient exploitation of large distributed architectures at metropolitan, national and international scales. The main applications are intensive scientific computations.

The methodology is based on:

Of particular relevance is the work of Olivier Richard on the OAR project and of Arnaud Legrand on the SimGrid project, both of whom are in this proposed collaboration.  Olivier Richard leads the OAR project, an industrial-strength resource management system.  OAR has been used to execute over 5 million jobs across Grid'5000, France's main Grid with over 5,000 nodes.  OAR has advanced features such as support for hierarchical resource requests, Gantt-chart scheduling, and batch and interactive jobs.  We will use OAR as the basis of our resource management system.

Arnaud Legrand leads the SimGrid project.  SimGrid is a toolkit that provides core functionalities for the simulation of distributed applications in heterogeneous distributed environments. The specific goal of the project is to facilitate research in the area of distributed and parallel application scheduling on distributed computing platforms, ranging from simple networks of workstations to computational Grids.  SimGrid has proven capable of simulating volunteer computing environments, and it will serve as the main tool for evaluating the concepts described in this proposal.

La liste des chercheurs impliqués dans la proposition ainsi qu'un bref CV du responsable.

David P. Anderson (Research Scientist, UC Berkeley) is a pioneer and global leader in volunteer and distributed computing. D. Anderson received his PhD in Computer Science from the University of Wisconsin - Madison in 1985. Currently, he is director of BOINC (Berkeley Open Infrastructure for Network Computing), a research project that develops middleware for volunteer computing.  He also directs the SETI@home project, which is a research project that uses Internet-connected computers to analyze radio-telescope data in the search for extraterrestrial intelligence (SETI). Since its public launch in May 1999, over 3,000,000 people have contributed 2 million years of computer time, making SETI@home the largest computation ever performed.

In the past, he served on the faculty in the Computer Science Department at UC Berkeley, where he led research in operating systems, distributed systems and networks, and multimedia systems. He has authored 69 papers in Computer Science, with 17 papers in refereed journals (including Communications of the ACM, ACM Transactions on Graphics, ACM Transactions on Computer Systems, IEEE Computer, and IEEE Transactions on Software Engineering), 28 conference papers, and 13 unpublished technical reports. He has held a number of leadership positions in industry, including Chief Technology Officer of United Devices, an enterprise desktop grid company.  For details, please see David Anderson's CV.

Derrick Kondo (CR, INRIA MESCAL) received his PhD in Computer Science from the University of California at San Diego, and his BS from Stanford University.  His interests lie in the area of volunteer computing and desktop grids.  In particular, he leads research on the measurement and characterization of Internet distributed systems, their simulation and modelling, and resource management.   He founded and continues to serve as co-chair of the Workshop on Volunteer Computing and Desktop Grids, and also co-chaired the last BOINC workshop on volunteer computing and distributed thinking.  Currently, he is serving as guest co-editor of a special issue of the Journal of Grid Computing on volunteer computing and desktop grids.  His recent paper on characterizing computation errors in Internet-distributed systems won the best paper award at the 13th European Conference on Parallel and Distributed Computing.  For details, please see Derrick Kondo's CV.

Rom Walton (Senior Staff, UC Berkeley)

Olivier Richard (Prof. UJF, INRIA MESCAL)

Arnaud Legrand (CR, CNRS, INRIA MESCAL) 

Les étudiants impliqués dans la proposition.

L'historique de la collaboration entre les équipes.

    Joint publications:

    Joint workshop:

    4th BOINC Workshop on Volunteer Computing and Distributed Thinking
    Organized jointly by the BOINC and MESCAL teams
    Hosted by INRIA MESCAL project team at INRIA Rhône-Alpes, 
    Sponsored by INRIA, and by IBM through UC Berkeley   
    Grenoble, France
    10-12 September, 2008

    Statistics: 65 attendees (17 young researchers/students from INRIA [MOAIS, GRAAL, MESCAL, ALGORILLE], CNRS, LIP6, or other French organizations), 13 countries, 22 affiliations.

    Short visits:

Visitor (team) | Host (team) | Date | Purpose
David Anderson, Rom Walton (BOINC) | Derrick Kondo (MESCAL) | 09/2008 | Discussed use of prediction in BOINC
David Anderson (BOINC) | Derrick Kondo (MESCAL) | 04/2008 | Discussed characterization of availability traces
Derrick Kondo (MESCAL) | David Anderson (BOINC) | 12/2007 | Discussed scheduling in BOINC
Derrick Kondo (MESCAL) | David Anderson (BOINC) | 12/2006 | Discussed simulation of scheduling heuristics in BOINC
Derrick Kondo (MESCAL) | David Anderson (BOINC) | 05/2006 | Discussed trace collection and scheduling in BOINC
David Anderson (BOINC) | Derrick Kondo (MESCAL) | 11/2004 | Discussed design and implementation of BOINC
Derrick Kondo (MESCAL) | David Anderson (BOINC) | 04/2002 | Discussed statistics collected in SETI@home

    Joint and Related Professional Activities

David Anderson et al. and Derrick Kondo et al. separately founded the two volunteer computing workshops (the Pan-Galactic BOINC workshop and the PCGrid workshop, respectively). D. Anderson has served on the program committee of PCGrid for the past two years, since the workshop's creation, and he has also been an invited speaker. D. Kondo presented his most recent work at the last two BOINC workshops.

Derrick Kondo hopes to extend the collaboration between UC Berkeley and INRIA by forming an Associate Team.  Arnaud Legrand is currently co-principal investigator on a "Young Researchers" ANR-funded project on volunteer computing titled "The Design and Optimization of Collaborative Computing Architectures" (DOCCA). The main goal is to combine theoretical tools and metrics from the parallel computing and network communities to develop algorithmic and analytical solutions to specific resource management problems of volunteer computing systems. For example, one issue is the incentives needed to ensure both fair and optimal usage of resources.  Potentially, the algorithms in that project could be applied and integrated with the system detailed in this proposal.

Thus the expertise and research interests of the MESCAL team complement those of the UC Berkeley team that designed and developed BOINC.

Les pages des personnes, laboratoires, organismes.

3. Impact

In terms of scientific objectives of the two teams, the formation of this Associate Team would allow the collaboration between INRIA MESCAL and UC Berkeley to grow team-wide, facilitating a bi-directional transfer of science and technology.  The strength of the BOINC system is its mechanisms for fault-tolerance and error detection, as it was designed to run over the world's most volatile resources.  BOINC was designed to handle failures or errors anywhere in the software (application, operating system) or hardware stack (CPUs, memory, disk).  As such, it exemplifies a software system built to handle all types of failures.  The strength of OAR is its advanced mechanisms for resource management (for example, hierarchical resource requests, Gantt-chart scheduling, and batch and interactive jobs), and its robust and scalable design.  For the MESCAL team, the collaboration would orient the design and implementation of OAR to consider the extensive fault-tolerance techniques used in BOINC.  For the BOINC team, the collaboration would allow for the improvement of resource management in BOINC so that users can more easily submit jobs and express their requirements.  Moreover, with joint efforts focused on building a cloud computing platform from BOINC and OAR components with integration of new network proximity, VM, and prediction technologies, the collaboration could result in a new middleware with the "best of both worlds".

At a high level, the real-world issues affecting BOINC (one of the world's largest, most volatile distributed systems) would be used to steer applicable research in the INRIA MESCAL team and conversely, the novel scheduling and predictive algorithms, and advanced resource management systems (such as OAR) designed by INRIA MESCAL could influence and be implemented and integrated in a planetary-scale distributed system.

In terms of impact for other project teams at INRIA and UC Berkeley, the resulting platform could complement dedicated computational platforms such as Grid'5000 for computer science and other domain science experiments.  Through our experiences with Grid'5000 users and interaction with scientists at Berkeley, we believe a large class of applications would be able to use the compute power provided by a CloudComputing@home platform if there were reliability and network proximity guarantees.  Moreover, with an expressive and simple resource specification language, we believe this would make computing over volunteer resources much more accessible to application scientists.

II. PREVISIONS 2009

Programme de travail

Introduction

The goal of this proposal is to develop a cloud computing middleware that enables the execution of real, complex services and applications over unreliable resources.  We target computational resources across the Internet that are volunteered.  In terms of reliability, these resources are unavailable on average 40% of the time, and exhibit frequent churn (several times a day).  In terms of "real, complex services and applications", we refer to large-scale service deployments, such as Amazon's EC2, the TeraGrid, and EGEE, and also applications with complex dependencies among tasks.  These commercial and scientific services and applications need guaranteed availability levels of 99.999% and knowledge of network proximity in order to have efficient and timely execution.

With this goal, our collaboration will focus on the following three research thrusts:

  1. To create a virtually dedicated platform over unreliable resources
  2. To incorporate the notion of network distance among resources
  3. To enable advanced resource selection mechanisms that leverage these virtually dedicated resources and knowledge of their network proximity

To achieve these goals, we would like to leverage research and development efforts in both teams.  In particular, we would like to integrate the systems of BOINC and OAR shown in the figure below, which depicts the high-level system architecture.

System architecture

In the sections below, we detail each research thrust and new component (highlighted in yellow in the figure above), describing the challenges, our current progress in addressing them, and the proposed collaborative work. 

Research Thrust #1: Virtually-dedicated platform

Our approach to provide the illusion of a dedicated platform is as follows.  First, we will use statistical and prediction methods that can provide guarantees of availability for groups of resources.  Second, we will create a novel architecture based on virtual machines as a way of masking changes in the (predicted) state of machines.

    Statistical and predictive methods
   
    We would like to achieve collective availability of resources.  That is, we would like to ensure with high probability that, in a group of N+X hosts, at least N of the same hosts remain available over T time; X is the number of hosts used for redundancy purposes.  As described in the "Scientific objectives" section, we envision an architecture where predictions are reevaluated after each time interval T for long-running applications.  An overview of the predictive process is shown in the figure below.

Prediction process

    In step 1, we collect measurements of availability. In step 2, we separate hosts according to their predictability. Clearly, some hosts have predictable and repetitive behavior, whereas others have random and entropic behavior. We would like to identify the predictable hosts a priori and exclude the unpredictable ones. Note that predictability is different from availability. For example, a host that is available for 3 hours every Monday is predictable (and useful) but not highly available.

    In step 3, we would then like to predict which hosts will be available for some time period T.  Then, in step 4, we would like to select N + X hosts needed by the application or service, where X hosts are used for redundancy.  Finally, in step 5, if the application needs the resources for a period longer than T, we re-evaluate the prediction and add or remove members from the group accordingly.
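Step 4 can be made concrete with a small sketch. Assuming, purely for illustration, that per-host availability predictions behave as independent probabilities, we can compute the smallest redundancy X such that at least N of N + X hosts remain available over T with a target probability:

```python
def group_success_prob(probs, n_required):
    """Probability that at least n_required hosts stay available,
    given each host's (assumed independent) availability probability.
    Computes the exact distribution by dynamic programming."""
    dist = [1.0]  # dist[k] = P(exactly k hosts available so far)
    for p in probs:
        new = [0.0] * (len(dist) + 1)
        for k, q in enumerate(dist):
            new[k] += q * (1 - p)   # this host unavailable
            new[k + 1] += q * p     # this host available
        dist = new
    return sum(dist[n_required:])

def redundancy_needed(p, n, target=0.999, max_extra=1000):
    """Smallest X such that N + X hosts, each available with
    probability p, keep at least N available with probability >= target."""
    for x in range(max_extra + 1):
        if group_success_prob([p] * (n + x), n) >= target:
            return x
    return None
```

For example, hosts that are individually available only 60% of the time need substantial redundancy to reach a group guarantee, which is why step 2's predictability filtering matters.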

    In [1], a collaboration between UC Berkeley and INRIA MESCAL, we determined efficient and domain-adjusted methods for predicting CPU availability. We tested the above procedure with novel prediction methods using real traces of CPU availability. In particular, we collaborated with the BOINC team to retrieve CPU availability traces from over 48,000 Internet hosts in BOINC and SETI@home. This data set (unsurpassed in breadth and accuracy) was essential for validating our prediction hypotheses. Our initial results showed that we can predict accurately and efficiently that N resources will have CPUs available for a time period T.

        Collaborative prediction tasks for 2009

        While we have initial results obtained collaboratively, several open issues remain. First, in our initial study, we obtained traces of CPU availability only. We would also like to determine the time dynamics of available disk, network, and memory in Internet hosts. Here, the UCB team's knowledge of BOINC and SETI@home will be critical for obtaining accurate traces. Furthermore, we would like to determine how to predict accurately and efficiently the availability of collections of hardware resources. The MESCAL team's experience in characterization, modelling, and prediction will be applied to determine accurate models and predictive methods. Second, we would like to attach confidence intervals to our predictions. Clearly, of two hosts both predicted to be available, one might have a higher chance of actually being available than the other. Third, we would like to investigate different criteria for the predictability of availability, such as the number of recent state changes or the weekly average availability.
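As a sketch of the third point, candidate predictability criteria (recent state changes and average availability) can be computed directly from a binary availability trace; the threshold below is purely illustrative:

```python
def predictability_features(trace):
    """trace: list of 0/1 availability samples at a fixed interval.
    Returns (number of state changes, mean availability)."""
    changes = sum(1 for a, b in zip(trace, trace[1:]) if a != b)
    return changes, sum(trace) / len(trace)

def is_predictable(trace, max_changes=4):
    # illustrative criterion: few recent state changes
    changes, _ = predictability_features(trace)
    return changes <= max_changes
```

Note that, as in the Monday-host example above, a host can score as predictable while having low average availability; the two features are deliberately kept separate.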
   
    Virtual machines

    The set of hosts predicted to be available from one time period T to the next (as in step 5 in the previous section) may change. In this case, the middleware must ensure the transparent migration of the executing application or service from one resource to another. We will use an approach based on virtual machines (VMs). The idea is that every BOINC client running on a volunteered host stores a raw image of the VM with a static OS, libraries, and packages. As the application or service executes on the host, it writes to an overlay image (using copy-on-write) that stores modifications of the raw image. At no time is the raw image modified. Periodically, the overlay image is transferred to a remote location, such as a centralized checkpoint server or a P2P network. If a machine fails or is predicted to fail, a replacement host downloads this overlay image from the remote location and restarts the computation from that point onward.
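The copy-on-write scheme can be illustrated abstractly. This toy model treats an image as a block-to-data map rather than a real VM image file; only the overlay ever leaves the host:

```python
class OverlayImage:
    """Toy copy-on-write image: reads fall through to the raw base
    unless the block was written; writes never touch the base."""
    def __init__(self, base):
        self.base = base     # raw image: block -> data (never modified)
        self.overlay = {}    # modified blocks only
    def read(self, block):
        return self.overlay.get(block, self.base.get(block))
    def write(self, block, data):
        self.overlay[block] = data
    def checkpoint(self):
        # what would be shipped to the remote checkpoint location
        return dict(self.overlay)
```

Because the base is immutable, a replacement host that already holds the raw image needs only the (small) overlay to resume.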

        Collaborative VM tasks for 2009

        Three main tasks exist. First, one must determine the best method for storing the overlay image remotely. We will compare the costs and benefits of the following methods. One could simply store the overlay images on a centralized server, but this may be too bandwidth-intensive and may not scale. Another option is to maintain a clique network among a subset of nodes and store the overlay images on nodes in this clique. A final option is to store the overlay images in a DHT over a fully distributed P2P network.

        Second, one must determine the frequency with which overlay images are stored remotely. With an active method, one proactively and periodically transfers the overlay images to remote machines; the frequency may be fixed, or varied depending on the unavailability or churn of the machine. With a passive method, one transfers overlay images only when the machine becomes unavailable. We have found, from our monitoring of SETI@home resources, that the majority of resource unavailability is due to soft failures (for example, a user reclaiming his or her machine) rather than hard failures (for example, machine crashes). So the BOINC client could simply wait until a machine becomes unavailable for computation, and then start transferring state to a remote location.
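One hypothetical policy for the active method, shrinking the checkpoint interval on churny hosts while keeping it long on stable ones, could look like this (the formula and constants are illustrative, not measured):

```python
def checkpoint_interval(base_hours, churn_per_day, min_hours=0.5):
    """Hours between remote checkpoints: more daily unavailability
    events mean more frequent checkpoints, down to a floor."""
    return max(min_hours, base_hours / (1 + churn_per_day))
```

A stable host keeps the base interval, while a host that churns several times a day is checkpointed near the floor.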

        The third main challenge is how VMs should be integrated with the BOINC client. Several issues arise, for example, whether to run the VM on the "outside" with the BOINC client as a sub-process "inside", or vice versa. Both options have their own advantages and disadvantages, which we will investigate. For example, some VMs, such as Xen, must run as root to access specific system-management functions.
 
        Our initial analysis with the QEMU virtual machine shows that this approach is feasible for several reasons. First, volunteer computing applications tend to have small memory footprints, and checkpoints are often tens of MB in size. Thus, the amount of data to checkpoint remotely is relatively small. The raw image of the VM (~500MB in size) can be replicated and downloaded scalably via P2P protocols such as BitTorrent. In our initial experiments with QEMU, the overlay image is also tens of MB in size, and thus could reasonably be transferred to a remote location.

         In terms of collaboration, the BOINC developers are currently incorporating sandboxing techniques into the BOINC client; using VMs is one approach. At the same time, a number of Grid'5000 scientists and engineers are currently integrating VMs with the OAR resource management system installed on the Grid'5000 platform.

         To determine the best method (centralized or decentralized) for storing remote checkpoints, we will use simulation to evaluate the costs and benefits of the different strategies. Arnaud Legrand leads the SimGrid project, which provides a toolkit for simulating large platforms. We aim to use SimGrid, along with its proven flow-based network models, to determine the best checkpointing strategies.

Research Thrust #2: Network distance

The creation of topology-aware distributed (peer-to-peer) systems has been the focus of much related research. We will leverage this work and incorporate it into the design and implementation of our cloud computing middleware. In particular, we will apply the work of Ratnasamy et al. described in [3]. As discussed in the "Scientific Objectives" section, this method is attractive because it has been proven to be simple, scalable, and accurate in determining network distance among distributed Internet resources. At a high level, the method makes each node ping a small set of landmarks to determine which bin it belongs to; nodes in the same bin are within some network distance of each other.
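A minimal sketch of this binning scheme, following [3]: each node sorts the landmark set by measured RTT and annotates each RTT with a coarse latency level; nodes with identical bins are considered close. The level boundaries here are assumptions for illustration:

```python
def bin_id(rtts_ms, level_bounds=(100, 200)):
    """Bin of a node given its RTTs (in ms) to a fixed landmark set.
    The bin is the landmark ordering by increasing RTT, plus a
    coarse latency level per landmark."""
    order = tuple(sorted(range(len(rtts_ms)), key=lambda i: rtts_ms[i]))
    def level(rtt):
        for lvl, bound in enumerate(level_bounds):
            if rtt < bound:
                return lvl
        return len(level_bounds)
    return order, tuple(level(r) for r in rtts_ms)
```

Two nodes fall into the same bin exactly when they see the landmarks in the same order at the same latency scale, which is the proximity heuristic of [3].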

        Collaborative tasks on network distance beyond 2009

        In terms of applying network proximity, we must resolve the following issues. First, we must determine which set of landmarks to use. The best landmarks are separated from each other by a certain number of hops and accessible to all resources. One obvious option is to use BOINC project servers as landmarks, as their addresses are already distributed by default to the BOINC client that runs on a volunteered host. Another option is to use public DNS servers as landmarks, but this requires accessibility from all volunteer hosts, which may not always be allowed.

        Second, we must determine whether the use of these landmarks is scalable. The question is whether a BOINC server, which exists to distribute work among clients, can withstand periodic pings from the set of hosts registered with it. We will assess this feasibility using availability traces and SimGrid simulation. If a single project server cannot withstand the load, we will investigate using nodes close to it to distribute the load instead.

        Third, if we use BOINC project servers as landmarks, we must determine the frequency of pinging required to obtain accurate network distances. The tests described in [3] use a frequency of once per hour. Perhaps one could adapt the rate of pinging to perceived network load or to the type of machine (for example, the network distance of portable or mobile devices probably changes more frequently, so they need to ping more often than stationary desktop machines).

        In terms of collaboration, both teams must clearly work together to coordinate the determination of the BOINC clients' network locations, and the propagation of this information to OAR, which will be responsible for resource management.

Research Thrust #3: Resource Description Language

With the new capabilities described in the previous two sections, we would like to provide a simple resource description language by which a user can specify the requirements of an application. With this language, one should be able to easily describe the levels of resource availability required and thresholds for network distance among nodes (in addition to other resource constraints). In OAR, one can already specify requirements for clusters, switches, nodes, CPUs, and even cores using an interface similar to PBS. We would like to augment this interface so that one can give both hard and soft constraints for a collection of resources, as described in the "Scientific Objectives" section.
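As an illustration of how such a request might be expressed and matched, consider the sketch below; the request schema and field names are hypothetical (they are not OAR syntax):

```python
from collections import defaultdict

# hypothetical request: 4 nodes, >= 95% predicted availability over T,
# all nodes drawn from the same network bin
request = {"nodes": 4, "min_avail": 0.95, "same_bin": True}

def select(hosts, request):
    """hosts: dicts with 'pred_avail' and 'bin' keys (illustrative schema).
    Returns a satisfying group of hosts, or None if none exists."""
    ok = [h for h in hosts if h["pred_avail"] >= request["min_avail"]]
    if not request.get("same_bin"):
        return ok[:request["nodes"]] if len(ok) >= request["nodes"] else None
    by_bin = defaultdict(list)
    for h in ok:
        by_bin[h["bin"]].append(h)
    for group in by_bin.values():
        if len(group) >= request["nodes"]:
            return group[:request["nodes"]]
    return None  # constraints cannot be met
```

In the real system, the availability field would come from the predictors of thrust #1 and the bin from the landmark scheme of thrust #2.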

        Collaborative tasks on description language beyond 2009

        One main task is to define the elements and structure of the resource specification language. Fortunately, OAR already has advanced mechanisms for resource selection, and so the MESCAL team will work with the BOINC team to augment OAR's capabilities to include resource availability and network proximity. As a proof of concept, we have already implemented and tested an interface between BOINC and OAR so that OAR jobs can run over a BOINC platform.

        Another task is to ensure that real applications are able to use this resource specification language. We will test the effectiveness of our language and infrastructure on a real molecular dynamics application that performs periodic barrier synchronization. The application breaks the computation into multiple rounds. In each round, several hundred tasks, each on the order of minutes long, must be distributed to resources. A round completes only after the result of every task has been returned and assimilated; this is the barrier synchronization stage. Based on these results, the next round of tasks is distributed. Clearly, availability guarantees and low network latency are critical for the effective execution of such a computation.
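The round structure of this application can be sketched as a simple driver loop; the task-generation and execution functions here are stand-ins for the real molecular dynamics code:

```python
def run_rounds(initial_tasks, execute, next_round, max_rounds=100):
    """Barrier-synchronized computation: all tasks of a round must
    complete and be assimilated before the next round is generated.
    Returns the number of rounds executed."""
    tasks = initial_tasks
    for r in range(max_rounds):
        results = [execute(t) for t in tasks]  # barrier: wait for every result
        tasks = next_round(results)            # assimilate; generate next round
        if not tasks:
            return r + 1
    return max_rounds
```

A single slow or unavailable host stalls the whole round at the barrier, which is why the availability and proximity guarantees of thrusts #1 and #2 matter here.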

Conclusion

We propose to develop a new cloud computing middleware based on the BOINC software for volunteer computing and the OAR resource management system for clusters.  The goal is to be able to run complex applications and services across unreliable volunteer resources.  The approach is to use statistical and prediction methods to guarantee the availability of collections of resources, to use virtual machines to hide machine unavailability, and to use node-binning techniques to provide network proximity information.   In addition, we will provide a resource management system that will allow applications or services to easily leverage these new capabilities.  Ultimately, we will test the utility of this system for a real barrier synchronization application.

Bibliography

  1. On Correlated Availability in Internet Distributed Systems, D. Kondo, A. Andrzejak, D. P. Anderson, 9th IEEE/ACM International Conference on Grid Computing (Grid 2008), Tsukuba, Japan, September 2008.
  2. Elastic Vapor :: Life in the clouds
  3. Topologically-aware overlay construction and server selection, S. Ratnasamy, M. Handley, R. Karp, S. Shenker, IEEE INFOCOM '02, New York, NY, June 2002.

 Exchange programme with provisional budget

1. Exchanges

The goal of each of the following meetings (ordered chronologically) is to coordinate the joint scientific and professional activities of the two teams. Each participant will make one short visit to the other team's site, with visits grouped together when possible.

   In addition, we plan to organize two 3-day workshops on cloud and volunteer computing: one in month 18 at UC Berkeley, and the other in month 36 at INRIA Rhône-Alpes. The workshops, which will include tutorials, aim to stimulate new developments and activities related to cloud and volunteer computing by allowing users to share their experience and requirements, giving developers the opportunity to outline their plans, and stimulating new collaborations between participants. We expect about 65 international participants. These workshops would follow last year's successful workshop, organized jointly by the MESCAL and BOINC teams and hosted at INRIA Rhône-Alpes. We estimate the cost of organizing each workshop to be 3,000 euros (based on previous years' costs).

 1. ESTIMATED EXPENSES FOR INRIA MISSIONS TO THE PARTNER

                                   Number of people   Estimated cost
 Senior researchers                        3                9,999
 Post-docs                                 1                3,333
 PhD students                              2                6,666
 Interns                                   -                    -
 Other (specify): 2 workshops              -                5,000
 Total                                     6               25,000


 2. ESTIMATED EXPENSES FOR INVITATIONS OF PARTNERS

                                   Number of people   Estimated cost
 Senior researchers                        2                6,666
 Post-docs                                 -                    -
 PhD students                              -                    -
 Interns                                   -                    -
 Other (specify):                          -                    -
 Total                                     2                6,666

2. Co-financing

In the past (including last year), IBM, through UC Berkeley, has given 2,000 euros towards the organization of a workshop. For the two workshops, IBM should be able to cover 4,000 euros in total. Last year, the conference organization committee of INRIA Rhône-Alpes contributed 1,000 euros towards the workshop organization, and we believe it would also contribute 1,000 euros for the workshop held in France.

 PROSPECTIVE ESTIMATE OF CO-FINANCING

 Organization                            Amount
 UC Berkeley                              6,666
 IBM through UC Berkeley                  4,000
 INRIA Rhône-Alpes                        1,000
 Total                                   11,666

We will actively pursue other opportunities for co-financing this collaboration in order to strengthen ties between UC Berkeley and INRIA.  We will apply for a grant offered by the France-Berkeley Fund for supporting collaborative research between the University of California and France.

3. Budget request

 Comments                                                              Amount
 A. Total cost of the proposal (totals of tables 1 and 2:
    invitations, missions, ...)                                        31,666
 B. Co-financing used (funding other than "Equipe Associée")           11,666
 "Equipe Associée" funding requested (A - B) (maximum 20 K€)           20,000

 

 
