Keynote abstract

Virtual Cluster based Parallel Computing on Volunteer Nodes.

Prof. Jaspal Subhlok
Professor of Computer Science,
University of Houston.

Ordinary desktops and PCs have been employed successfully for large-scale scientific computing, commonly using the CONDOR scheduler on LANs and BOINC middleware for volunteered public PCs. However, this approach is currently deployed largely for sequential and bag-of-tasks parallel applications. Clusters remain the only viable option for communicating parallel programs. While low-latency dedicated clusters are essential for many applications, PC hardware and LANs are sufficient for many other parallel applications. This talk discusses the VolPEx (Parallel Execution on Volunteer Nodes) project, whose goal is to enable robust execution of parallel applications on volunteer PCs. The fundamental concept employed is autonomous redundant processes: a logical process may have multiple distributed instances that are not aware of each other; process instances can be replicated, checkpointed, or re-created independently; and individual process instances can fail without causing application failure. Overall program progress is dictated solely by the state of the fastest active instance of each process. The talk will focus on two programming frameworks that have been successfully developed and validated: a) Dataspace API: an abstract dataspace that supports asynchronous, anonymous Put/Get operations for inter-task communication and allows redundancy for fault tolerance. Codes developed in this model include Replica Exchange Molecular Dynamics and a Map-Reduce framework. b) Volpex MPI: a highly failure-resistant MPI implementation based on sender-based logging and receiver-initiated communication. Codes evaluated include the NAS Benchmarks.
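
The idea of autonomous redundant processes communicating through anonymous Put/Get operations can be illustrated with a small sketch. This is not the VolPEx API: the class `Dataspace`, its methods, and the helper `run_step` are hypothetical names chosen for illustration, and the "first write wins" rule is an assumption about how duplicate output from redundant instances might be reconciled.

```python
class Dataspace:
    """Toy in-memory dataspace with anonymous put/get (hypothetical, not VolPEx)."""

    def __init__(self):
        self._store = {}
        self.duplicate_puts = 0

    def put(self, tag, value):
        # Redundant instances of one logical process write the same tag.
        # Assumption: the first write wins and later duplicates are dropped.
        if tag in self._store:
            self.duplicate_puts += 1
            return False
        self._store[tag] = value
        return True

    def get(self, tag):
        # Non-blocking read: None means no instance has produced the tag yet.
        return self._store.get(tag)


def run_step(space, logical_id, step, replica_alive):
    """Run one step of a logical process on every live replica.

    Replicas do not know about each other; each simply computes the
    (deterministic) result and puts it under the same tag.
    """
    for alive in replica_alive:
        if alive:
            space.put((logical_id, step), logical_id * 100 + step)


if __name__ == "__main__":
    space = Dataspace()
    run_step(space, 1, 0, [False, True])  # one replica of process 1 has failed
    run_step(space, 2, 0, [True, True])   # both replicas of process 2 are alive
    print(space.get((1, 0)), space.get((2, 0)), space.duplicate_puts)
```

In this sketch a reader of tag `(1, 0)` still finds a value even though one replica of process 1 failed, and the slower replica of process 2 contributes only a discarded duplicate, which mirrors the statement that progress depends on the fastest active instance of each process.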