# Toward Better Simulation of MPI Applications on Ethernet/TCP Networks

## Sitemap

Paul Bédaride, Augustin Degomme, Stéphane Genaud, Arnaud Legrand, George S. Markomanolis, Martin Quinson, Mark Stillwell, Frédéric Suter, Brice Videau

Simulation and modeling for performance prediction and profiling is essential for developing and maintaining HPC code that is expected to scale for next-generation exascale systems, and correctly modeling network behavior is essential for creating realistic simulations. In this article we describe an implementation of a flow-based hybrid network model that accounts for factors such as network topology and contention, which are commonly ignored by other approaches. We focus on large-scale, Ethernet-connected systems, as these currently compose 37.8\% of the TOP500 index, and this share is expected to increase as higher-speed 10 and 100GbE become more available. The European Mont-Blanc project to study exascale computing by developing prototype systems with low-power embedded devices will also use Ethernet-based interconnect. Our model is implemented within SMPI, an open-source MPI implementation that connects real applications to the SimGrid simulation framework. SMPI provides implementations of collective communications based on current versions of both OpenMPI and MPICH. SMPI and SimGrid also provide methods for easing the simulation of large-scale systems, including shadow execution, memory folding, and support for both online and offline (\ie post-mortem) simulation. We validate our proposed model by comparing traces produced by SMPI with those from real world experiments, as well as with those obtained using other established network models. Our study shows that SMPI has a consistently better predictive power than classical LogP-based models for a wide range of scenarios including both established HPC benchmarks and real applications.

## 1 Introduction

In the High Performance Computing (HPC) field, accurately predicting the execution time of parallel applications is of utmost importance to assess their scalability, and this is particularly true for applications slated for deployment on next-generation exascale systems. While much effort has been put towards understanding the high-level behavior of these applications based on abstract communication primitives, real-world implementations often provide a number of confounding factors that may break basic assumptions and undermine the applicability of these higher level models. For example, implementations of the MPI standard can select different protocols and transmission mechanisms (e.g., eager vs. rendez-vous) depending on message size and network capabilities. Simple delay-based models also do not account for network realities such as network topology and contention. In this work we demonstrate how even relatively minor deviations in low-level implementation can adversely affect the ability of simulations to predict real-world performance, and propose a new network model that extends previous approaches to better account for topology and contention in high-speed TCP networks. We focus on large-scale, Ethernet-connected systems, which currently compose 37.8\% of the TOP500 index\cite{top500}. This share is only expected to increase as higher-speed 10 and 100GbE become more available. The European Mont-Blanc project to study exascale computing by developing prototype systems with low-power embedded devices will also use Ethernet-based interconnect\cite{montblanc}.

Packet-level simulation has long been considered the "gold standard" for modeling network communication, and is available for use in a number of simulation frameworks\cite{mpi-netsim_09,bigsim_04}. However, there are a number of reasons to consider alternatives when simulating parallel applications. First, such applications are likely to generate large amounts of network traffic, and packet-level simulation has high overheads, resulting in simulations that may take significantly longer to run than the corresponding physical experiments. Second, implementing packet-level simulations that accurately model real world behavior requires correctly accounting for a vast array of factors\cite{lucio2003opnet}. In practice, there is little difference between an inaccurate model and an accurate model of the wrong system.

When packet-level simulation becomes too costly or intractable, the most common approach is to resort to simpler delay models that ignore the network complexity. Among these models, the most famous are those of the LogP family\cite{LogP,LogGP,pLogP,LogGPS}. The LogP model was originally proposed by Culler et al.\cite{LogP} as a realistic model of parallel machines for algorithm design. It was claimed as more realistic than more abstract models such as PRAM or BSP. This model captures key characteristics of real communication platforms while remaining amenable to complexity analysis. Unfortunately, while this model may reflect the behavior of specialized HPC networks from the early 1990s, it ignores potentially confounding factors present in modern-day systems.

Flow-based models are a reasonable alternative to both simple analytic models and expensive and difficult-to-instantiate packet-level simulation. In a flow based model, network traffic is treated as a steady state fluid flow through interconnected pipes of varying lengths and sizes (representing delay and bandwidth). Flow-based models are computationally tractable while being able to account for factors such as network heterogeneity. We seek to capture the advantages of a flow-based approach by extending existing validated models of point-to-point communication to better account for network topology and message contention. Recent work suggests that well-tuned flow-based simulation may be able to provide reasonably accurate results for less effort than packet-based simulation, and at much lower cost\cite{Velho_TOMACS13}.

Our model is implemented within SMPI \cite{SMPI}, an open-source MPI implementation that connects real-world applications to the SimGrid\cite{simgrid} simulation framework. With SMPI, standard MPI applications written in C or FORTRAN can be compiled and run in a simulated network environment, and traces documenting computation and communication events can be captured without incurring errors from tracing overheads or improper synchronization of clocks as in physical experiments. SMPI has recently been extended so that the low-level implementations of MPI collective operations more accurately reflect current production versions of both OpenMPI and MPICH. SMPI and SimGrid also provide a number of useful features for simulating applications that may require large amounts of time or system resources to run, including shadow execution, memory folding, and off-line simulation by replay of execution traces\cite{PMBS}. We validate our results by comparing application traces produced by SMPI using our network model with those from real world environments. We also compare with traces obtained using models from the literature.

The specific contributions described in this paper are as follows:

• we propose a new flow-based network model that extends previous LogP-based approaches to better account for network topology and message contention;
• we describe SMPI, the simulation platform, and some useful extensions that we have developed to make this tool more useful to developers and researchers alike (\eg SMPI now implements all the collective algorithms and selection logics of both OpenMPI and MPICH for more faithful comparisons);
• we provide a number of experimental results demonstrating how these extensions improve the ability of SMPI to accurately model the behavior of real-world applications on existing platforms, and also show that competing frameworks using previous models from the literature are unlikely to obtain consistently good results;
• we demonstrate the effectiveness of our proposal by making a thorough study of the validity of our models against hierarchical clusters using TCP over Gigabit Ethernet.

This paper is organized as follows: In Section\ref{sec:models} we discuss essential background information in the area of network modeling. In Section\ref{sec:framework} we describe SMPI, our chosen implementation framework, along with some key features that make it suitable for simulation of large scale systems, and comparisons to competing simulation platforms. In Section\ref{sec:hybrid} we describe our proposed hybrid network model, focusing particularly on how this model captures the complexities of network topology and message contention. In Section\ref{sec:validation} we describe experiments conducted to [in]validate the model, including comparisons of traces from real-world environments as well as traces produced using other network models. In Section\ref{sec:bigdft} we demonstrate the capacity of SMPI to simulate complex MPI applications on a single machine. In Section\ref{sec:related} we discuss related work, including other frameworks for simulating parallel applications and their relevant drawbacks. We conclude in Section\ref{sec:ccl} and also provide a description of proposed areas of future work.

## 2 Network Modeling Background

\label{sec:models}

In the LogP model, a parallel machine is abstracted with four parameters: $$L$$ is an upper bound on the latency of the network, i.e., the maximum