SkaMPI measurements on griffon G5K

It's now time for me to learn how to perform my own MPI measurements on G5K (yeah, it's been a few years since I last ran real MPI code). My goal is to play a little with SkaMPI, check whether I can statistically trust the measurements of Mark and Stéphane, and possibly improve them.

Commands to access griffon

http://www.grid5000.fr/mediawiki/images/G5k_cheat_sheet.pdf I added export OMPI_MCA_plm_rsh_agent=oarsh to my .bashrc (see http://www.grid5000.fr/mediawiki/index.php/Run_MPI_On_Grid'5000).

ssh nancy.grid5000 # oddly, griffon nancy.g5K does not work; check with Mt
oarsub -I -l nodes=1,walltime=2 -p "cluster='griffon'"

Ski file used

# ----------------------------------------------------------------------
# pt2pt measurements

      set_min_repetitions(8)
      set_max_repetitions(16)
      set_max_relative_standard_error(0.03)

      set_skampi_buffer(64mb)
      datatype = MPI_CHAR

# ----------------------------------------------------------------------

      comm_pt2pt = comm2_max_latency_with_root()

      begin measurement "Pingpong_Send_Recv"
      for count = 1 to ... step *sqrt(2) do
      measure comm_pt2pt : Pingpong_Send_Recv(count, datatype, 0, 1)
      od
      end measurement
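
To get a feel for which message sizes the step *sqrt(2) loop visits, here is a minimal Python sketch. It assumes the sizes are powers of sqrt(2) truncated to integers with duplicates dropped, which agrees with the counts visible in the logs further down (8388608, 11863283, 16777216) but is not taken from the SkaMPI source. The max_count parameter is hypothetical and stands in for the "..." upper bound of the loop.

```python
def sqrt2_counts(max_count):
    """Message sizes 1, sqrt(2), 2, ... truncated to integers, without duplicates.

    Assumption: truncation toward zero; SkaMPI's exact rounding may differ.
    """
    counts = []
    k = 0
    while True:
        c = int(2 ** (k / 2))  # k-th power of sqrt(2)
        if c > max_count:
            break
        if not counts or c != counts[-1]:
            counts.append(c)
        k += 1
    return counts

# Last three sizes up to 16 MiB
print(sqrt2_counts(16777216)[-3:])
```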

Information about Griffon

https://api.grid5000.fr/2.0/ui/nodes.html

network_adapters_2_interface: InfiniBand
network_adapters_2_rate: 20000000000
network_adapters_0_device: eth0
network_adapters_0_rate: 1000000000
So we have 1 Gbit/s Ethernet and 20 Gbit/s IB.

Lancements de skampi

Stéphane's measurements

/home/sgenaud/openmpi.install/bin/mpiexec --mca btl_tcp_if_include eth0 --mca btl_tcp_if_exclude ib0 --mca btl tcp,self -mca orte_base_help_aggregate 0 -mca plm_rsh_agent oarsh -machinefile machinefile -n 2 skampi -i ski_smpi/skampi_pt2pt.ski
count= 8388608  8388608  144965.0      15.2       32  144965.0  143205.7

-> Bw = 8388608/144965.0 = 57.8664367261063 Mbytes/s
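
The division above can be scripted so that every log line is handled the same way. This is only a sketch: it assumes that, after the count= label, the first field is the message size in bytes (the datatype is MPI_CHAR) and the third field is the time in microseconds, which is how the lines are read in these notes. The function name is mine.

```python
def bandwidth_mbytes_per_s(line):
    """Bandwidth from one SkaMPI result line, computed as bytes / microseconds.

    Assumption: fields[1] is the size in bytes and fields[3] the time in
    microseconds, following the reading used in these notes.
    """
    fields = line.split()
    if not fields or fields[0] != "count=":
        raise ValueError("not a SkaMPI result line: %r" % line)
    size_bytes = int(fields[1])
    time_us = float(fields[3])
    # bytes per microsecond is the same number as 10^6 bytes per second
    return size_bytes / time_us

line = "count= 8388608  8388608  144965.0      15.2       32  144965.0  143205.7"
print(round(bandwidth_mbytes_per_s(line), 2))
```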

Use tcp

       mpirun --mca btl self,tcp -machinefile $OAR_NODEFILE ./skampi -i ski/skampi_pt2pt_alvin.ski -o p2p.log 2>/dev/null ; tail p2p.log 
       count= 8388608  8388608   19509.3      14.2        8   19448.5   19484.8
       count= 11863283 11863283   28949.4     485.2        8   27303.0   28949.4
       count= 16777216 16777216   41638.3      71.9        8   41607.3   41525.5
# end result "Pingpong_Send_Recv"
# duration = 1.71 sec
# Finished at Mon Jan 23 16:50:46 2012
# Total runtime 2 seconds

-> Bw = 8388608/19509.3 = 429.979958276309 Mbytes/s. Interestingly, this measurement is not stable: I sometimes get 435.502624351699, 464.717079386184, … Note that each run uses at most 16 repetitions, but the spread between runs still means the speed is not really stable.
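
To put a number on "not stable", here is a small sketch computing the run-to-run spread of the three bandwidths quoted above. The coefficient of variation is my choice of summary, not something SkaMPI reports, and three runs are far too few to conclude anything firm.

```python
import statistics

# The three bandwidths (Mbytes/s) observed across runs, copied from the notes
runs = [429.979958276309, 435.502624351699, 464.717079386184]

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return statistics.stdev(values) / statistics.mean(values)

cv = coefficient_of_variation(runs)
print("mean = %.1f Mbytes/s, CV = %.1f%%" % (statistics.mean(runs), 100 * cv))
```

A spread of a few percent between runs is of the same order as the 0.03 relative standard error allowed within a single run by the ski file, so comparing individual runs at this precision is delicate.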

Use openib

       mpirun --mca btl self,openib -machinefile $OAR_NODEFILE ./skampi -i ski/skampi_pt2pt_alvin.ski -o p2p.log 2>/dev/null ; tail p2p.log
       count= 8388608  8388608   13706.0     275.7        8   13696.0   13434.9
       count= 11863283 11863283   18585.0     112.0        8   18580.2   18565.8
       count= 16777216 16777216   26465.1       5.3        8   26417.7   26465.1
# end result "Pingpong_Send_Recv"
# duration = 1.30 sec
# Finished at Mon Jan 23 16:54:39 2012
# Total runtime 2 seconds

-> Bw = 8388608/13706 = 612.039106960455 Mbytes/s. This is quite stable.

Use ipoib

       mpirun --mca btl self,ipoib -machinefile $OAR_NODEFILE ./skampi -i ski/skampi_pt2pt_alvin.ski -o p2p.log 2>/dev/null ; tail p2p.log 
       count= 8388608  8388608   13706.0     275.7        8   13696.0   13434.9
       count= 11863283 11863283   18585.0     112.0        8   18580.2   18565.8
       count= 16777216 16777216   26465.1       5.3        8   26417.7   26465.1
# end result "Pingpong_Send_Recv"
# duration = 1.30 sec
# Finished at Mon Jan 23 16:54:39 2012
# Total runtime 2 seconds

Same behavior as openib here.

Sebastien's arguments

       mpirun --mca btl self,tcp --mca btl_tcp_if_include lo,br0 -machinefile $OAR_NODEFILE ./skampi -i ski/skampi_pt2pt_alvin.ski -o p2p.log 2>/dev/null ; tail p2p.log 
       count= 8388608  8388608   20787.5      13.7        8   20779.5   20711.0
       count= 11863283 11863283   29042.0      26.2        8   28986.8   28333.2
       count= 16777216 16777216   41008.8      17.0        8   40965.3   40983.5
# end result "Pingpong_Send_Recv"
# duration = 1.71 sec
# Finished at Mon Jan 23 16:58:00 2012
# Total runtime 2 seconds

-> Bw = 8388608/20787.5 = 403.540974143115 Mbytes/s. Even when asked to use br0 (eth0 is an epic failure), it keeps using IB: 400 Mbytes/s is too much for a Gigabit Ethernet link.

Stéphane's arguments

/home/sgenaud/openmpi.install/bin/mpiexec --mca btl_tcp_if_include eth0 --mca btl_tcp_if_exclude ib0 --mca btl tcp,self -mca orte_base_help_aggregate 0 -mca plm_rsh_agent oarsh -machinefile machinefile -n 2 skampi -i ski_smpi/skampi_pt2pt.ski

Stéphane was using his own Open MPI build, and I had to replace eth0 with br0 because eth0 does not work.

       mpirun --mca btl_tcp_if_include br0 --mca btl_tcp_if_exclude ib0 --mca btl tcp,self -mca orte_base_help_aggregate 0 --mca plm_rsh_agent "oarsh" -machinefile $OAR_NODEFILE ./skampi -i ski/skampi_pt2pt_alvin.ski -o p2p.log 2>/dev/null; tail p2p.log 
       count= 8388608  8388608   20786.2       8.1        8   20754.7   20694.3
       count= 11863283 11863283   29065.3      60.1        8   29065.3   28288.8
       count= 16777216 16777216   41059.5      16.5        8   41038.8   40960.0
# end result "Pingpong_Send_Recv"
# duration = 1.72 sec
# Finished at Mon Jan 23 17:06:39 2012
# Total runtime 2 seconds

So no better. And the same thing when only saying exclude ib0:

       mpirun --mca btl_tcp_if_exclude ib0 --mca btl tcp,self -mca orte_base_help_aggregate 0 --mca plm_rsh_agent "oarsh" -machinefile $OAR_NODEFILE ./skampi -i ski/skampi_pt2pt_alvin.ski -o p2p.log 2>/dev/null; tail p2p.log 
       count= 8388608  8388608   19631.0      27.1        8   19631.0   19536.7
       count= 11863283 11863283   28554.7     616.4        8   28554.7   27378.3
       count= 16777216 16777216   38772.5      49.2        8   38626.3   38730.5
# end result "Pingpong_Send_Recv"
# duration = 1.66 sec
# Finished at Mon Jan 23 17:08:08 2012
# Total runtime 1 seconds

Still the same. No way to prevent it from using IB, and no way to remove IB:

/sbin/modprobe -r `lsmod | grep '^ib' | sed 's/ .*//' `
FATAL: Error removing ib_ipoib (/lib/modules/2.6.32-5-amd64/kernel/drivers/infiniband/ulp/ipoib/ib_ipoib.ko): Operation not permitted

Deploying my own image:

       dtach -A /tmp/alegrand-dtach-socket bash
       oarsub -I -l 'nodes=1,walltime=1' -p "cluster='griffon'" -t deploy
       kadeploy3 -e squeeze-x64-nfs -f $OAR_NODE_FILE -k ~/.ssh/id_rsa.pub
       /etc/init.d/openibd stop
       /etc/init.d/mx stop

       mpirun --mca btl self,tcp -machinefile $OAR_NODEFILE ./skampi -i ski/skampi_pt2pt_alvin.ski -o p2p.log 2>/dev/null ; tail p2p.log 
       count= 8388608  8388608   19631.0      27.1        8   19631.0   19536.7
       count= 11863283 11863283   28554.7     616.4        8   28554.7   27378.3
       count= 16777216 16777216   38772.5      49.2        8   38626.3   38730.5
# end result "Pingpong_Send_Recv"
# duration = 1.66 sec

-> Bw = 8388608/19631 = 427.314349752942 Mbytes/s

       mpirun --mca btl_tcp_if_include br0 --mca btl_tcp_if_exclude ib0 --mca btl tcp,self -mca orte_base_help_aggregate 0 --mca plm_rsh_agent "oarsh" -machinefile $OAR_NODEFILE ./skampi -i ski/skampi_pt2pt_alvin.ski -o p2p.log 2>/dev/null; tail p2p.log 
       count= 8388608  8388608   19631.0      27.1        8   19631.0   19536.7
       count= 11863283 11863283   28554.7     616.4        8   28554.7   27378.3
       count= 16777216 16777216   38772.5      49.2        8   38626.3   38730.5
# end result "Pingpong_Send_Recv"
# duration = 1.66 sec

Same!

/sbin/ifconfig 
eth0      Link encap:Ethernet  HWaddr 00:e0:81:b2:c0:46  
inet addr:172.16.65.90  Bcast:172.16.79.255  Mask:255.255.240.0
inet6 addr: fe80::2e0:81ff:feb2:c046/64 Scope:Link
UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
RX packets:1640 errors:0 dropped:0 overruns:0 frame:0
TX packets:928 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000 
RX bytes:353702 (345.4 KiB)  TX bytes:139603 (136.3 KiB)
Interrupt:18 Memory:dca00000-dca20000 

lo        Link encap:Local Loopback  
inet addr:127.0.0.1  Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING  MTU:16436  Metric:1
RX packets:54 errors:0 dropped:0 overruns:0 frame:0
TX packets:54 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0 
RX bytes:4908 (4.7 KiB)  TX bytes:4908 (4.7 KiB)

Argh, what an idiot, I'm going through the loopback!!! That's why. Nothing left to do but start over.
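
This kind of mistake could be caught before launching mpirun by counting the distinct hosts in the node file. The sketch below assumes the file lists one host name per line, possibly repeated once per core, so that a nodes=1 reservation yields a single distinct host. The host names in the example are made up.

```python
def distinct_hosts(nodefile_text):
    """Distinct host names of a node file (one host per line), in order of appearance."""
    hosts = []
    for line in nodefile_text.splitlines():
        host = line.strip()
        if host and host not in hosts:
            hosts.append(host)
    return hosts

def spans_several_nodes(nodefile_text):
    """True if the MPI ranks can land on at least two different machines."""
    return len(distinct_hosts(nodefile_text)) >= 2

# Hypothetical node file contents, for illustration only
one_node = "griffon-1\ngriffon-1\ngriffon-1\n"
two_nodes = "griffon-1\ngriffon-1\ngriffon-2\ngriffon-2\n"
print(spans_several_nodes(one_node), spans_several_nodes(two_nodes))
```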

Use two production nodes (and not just one… :( )

      uniq $OAR_NODEFILE  > machinefile
      mpirun --mca btl self,tcp -machinefile machinefile ./skampi -i ski/skampi_pt2pt_alvin.ski -o p2p.log 2>/dev/null ; tail p2p.log 
      count= 8388608  8388608   74580.0      41.5        8   74580.0   73331.5
      count= 11863283 11863283  104334.2      62.9        8  104334.2  102909.4
      count= 16777216 16777216  145859.3     117.3        8  145859.3  144548.6
# end result "Pingpong_Send_Recv"
# duration = 6.46 sec

-> Bw = 8388608/74580 = 112.477983373559 Mbytes/s. YEAAAH!!!!
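
As a sanity check, a measured bandwidth can be compared with the nominal rate of the link. This is a rough sketch: it takes Mbytes/s as 10^6 bytes per second and ignores protocol overhead, and the function name is mine. A ratio above 1 means the traffic cannot have gone through that link.

```python
def link_utilization(measured_mbytes_per_s, rate_bits_per_s):
    """Fraction of the nominal link rate used by a measured bandwidth.

    Mbytes/s is taken as 10^6 bytes per second; protocol overhead is ignored.
    """
    return measured_mbytes_per_s * 8e6 / rate_bits_per_s

GIGABIT = 1_000_000_000  # network_adapters_0_rate of eth0, from the node description

# Two-node run: plausible for Gigabit Ethernet
print(round(link_utilization(112.477983373559, GIGABIT), 2))
# Single-node run: more than the link can carry, so it was not using eth0
print(round(link_utilization(403.540974143115, GIGABIT), 2))
```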

Getting Stéphane's Ski file

# ----------------------------------------------------------------------
# pt2pt measurements

      set_min_repetitions(32)
      set_max_repetitions(64)
      set_max_relative_standard_error(0.03)

      set_skampi_buffer(32768kb)
      datatype = MPI_CHAR

# ----------------------------------------------------------------------

      comm_pt2pt = comm2_max_latency_with_root()

      begin measurement "Pingpong_Send_Recv"
      for count = 1 to ... step *2  do
      measure comm_pt2pt : Pingpong_Send_Recv(count, datatype, 0, 1)
      od
      for count = 1024 to 8192 step +512  do
      measure comm_pt2pt : Pingpong_Send_Recv(count, datatype, 0, 1)
      od
      for count = 32768 to 262144 step +1024  do
      measure comm_pt2pt : Pingpong_Send_Recv(count, datatype, 0, 1)
      od

      end measurement
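
To gauge how many extra points the two additive loops contribute, here is a quick sketch. It assumes that the upper bound of a "for count = a to b step +s" loop is included, which I have not checked against the SkaMPI documentation.

```python
def additive_counts(start, stop, step):
    """Counts visited by 'for count = start to stop step +step', assuming stop is included."""
    return list(range(start, stop + 1, step))

small = additive_counts(1024, 8192, 512)
large = additive_counts(32768, 262144, 1024)
print(len(small), len(large))
```

With 32 to 64 repetitions per point, the third loop dominates the number of pingpongs performed.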

Comparing two measurements

I have performed two measurements to see whether they were stable. Everything is detailed in this Sweave document. The conclusion is that there is some noise, that a piecewise (not necessarily linear) model could be just fine, and that a sound experiment plan could be set up.

Entered on [2012-01-23 Mon 12:01]