A nice trip to Barcelona
This page is now outdated. All this work has been moved to http://simgrid.gforge.inria.fr/contrib/smpi-paraver.html. Please consider using the new, up-to-date version.
Achievements
Links to generated files
cp /usr/local/stow/wxparaver-4.5.4-linux-x86_64/bin/dimemas-wrapper.sh ./
for i in *.pl *.sh ; do echo "- [[file:./$i][$i]]" ; done
Presentation of current work from both sides
- Simulation of MPI programs (Arnaud Legrand)
- Spatial and Temporal Aggregation of Traces of Parallel Systems (Damien Dosimont)
- Evolution of the BigDFT code (Luigi Genovese)
- Presentation of the Paraver Format to improve interoperability (Juan Gonzalez)
- Clustering techniques applied to BigDFT (Harald Servat)
- Modeling and Simulation of a Dynamic Task-Based Runtime System for Heterogeneous Multi-Core Architectures (Luka Stanisic)
- Raising the Level of Abstraction: Simulation of Large Chip Multiprocessors Running Multithreaded Applications (Alejandro Rico)
TODO BigDFT simulation [2/3]
- [X] Simulate order(n) BigDFT with SMPI with no modification.
- [X] Obtained an unbalanced (paje) trace where we could observe the same kind of (paraver) trace as what Luigi, Brice and our BSC colleagues obtained on a real run. The timings obviously do not make any sense, as the platform model was completely different from the real platform, but the general unbalanced shape was the same and the same process was slowing down the whole application.
- [ ] Instrument order(n) BigDFT to speed up the simulation?
TODO Interaction between Paraver and SMPI [5/8]
- [-] Paraver conversion
  - [X] Wrote a paraver to csv/pjdump/smpi converter (in perl) that worked on an old, small, 8-node BigDFT paraver trace.
  - [ ] A few ugly things had to be done here (reduce, alltoallV, no handling of p2p operations, second/nanosecond issue, …) and need to be cleaned up.
  - [ ] Maybe it would be interesting to have an option that allows extrae to trace all the parameters?
- [X] Wrote a simple shell script to replay this trace with SMPI and generated an SMPI paje trace.
- [X] I still need to improve the shell script so that it takes arguments on the command line.
- [X] Wrote a perl script that converts an SMPI paje trace to the paraver file format.
- [-] Improve this perl script
  - [X] Improve the conversion to export events so that collective operation names are the same and things are easily comparable. (Edit: this was done in Chicago with Harald.)
  - [ ] Currently there are two scripts (pjdump2prv.pl and pjsmpi2prv.pl). The first one is for ocelotl/pjdump output while the second one is intended for the SMPI -> PRV final step. I'm currently merging them together.
  - [ ] Add links (arrows) so that bandwidth can be computed in paraver.
- [X] Managed to open the resulting paraver trace in paraver.
- [X] Have a prototype integration of SMPI within Paraver. (Edit: this was done in Chicago. If you use the dimemas-wrapper.sh below instead of the original one, it will launch smpi. Better integration, allowing the platform and deployment to be specified, would be nice.)
- [ ] Make a model of MareNostrum and of the Mont-Blanc prototype, so that BSC staff can really play with SMPI. (Edit: this was discussed in Chicago with Judit. I explained the SimGrid XML platform representation and she will try to play with SMPI and come back to me with questions.)
TODO Trace Aggregation [4/5]
All this is better summarized in the blog entry Damien wrote about this.
- [X] The paraver to pjdump converter was integrated in framesoc.
- [X] Damien managed to load several paraver traces in ocelotl and to play with aggregation.
- [X] Managed to load an SMPI replayed trace of order(n) BigDFT and could aggregate it and easily spot the disturbing process and the application phases.
- [X] Convert the real O(n) BigDFT paraver trace and aggregate it.
- [ ] Convert the 12 GB Nancy LU trace (700 processes on 3 clusters) to paraver to see whether the behavior exhibited by ocelotl can be observed in Paraver. This involves slightly modifying the paje to paraver converter, which was designed for SMPI paje traces.
  This trace was on flutin and I got it here: file:///exports/nancy_700_lu.C.700.pjdump.bz2
  - [ ] Fix the state name conversion and the event conversion.
  - [ ] The ',9' at the end of the header is the number of communicators…
  - [ ] The resulting prv starts from the pjdump and I forgot to sort it. Could we give an option to pj_dump so that it sorts records according to time?
  - [ ] Do not use state 0, as it is reserved for computation.
  - [ ] Create a state and an event for the MPI application (derived from being outside MPI calls).
  - [ ] Clock resolution issue.
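Since the sorting item above keeps coming back, here is the kind of one-liner I have in mind. The sample pjdump lines are made up for illustration (the real traces have more container levels); the point is only that the start timestamp is the 4th comma-separated field.

```shell
# Hypothetical pjdump State lines: State, container, type, start, end, duration, imbrication, name
cat > /tmp/sample.pjdump <<'EOF'
State, rank-1, STATE, 2.5, 3.0, 0.5, 0, MPI_Bcast
State, rank-0, STATE, 0.0, 1.0, 1.0, 0, Running
State, rank-1, STATE, 1.2, 2.5, 1.3, 0, Running
EOF
# Sort records numerically by their start timestamp (field 4), as the prv
# format expects chronologically ordered records.
sort -t ',' -k 4 -n /tmp/sample.pjdump > /tmp/sample.sorted.pjdump
head -n 1 /tmp/sample.sorted.pjdump
# → State, rank-0, STATE, 0.0, 1.0, 1.0, 0, Running
```

Having pj_dump emit records already sorted would of course make this post-processing step unnecessary.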
Interaction between Paraver and SMPI
A year and a half ago, I needed to write a paraver converter because, in a particular setup, I could trace BigDFT neither with TAU nor with Scalasca. My goal was simply to compute statistics on the trace using R. Today, we're in Barcelona and we're discussing whether SMPI could be used as an alternative to Dimemas within the paraver framework. To this end, we need to make sure that SMPI can simulate paraver traces and output paraver traces. Ideally, we would modify SMPI so that it can parse and generate such traces, but that is probably more work than we can achieve in two days, so we'll go for simple trace conversions, i.e., a paraver to SMPI time-independent trace format conversion and a Paje to paraver conversion.
Let's start from the traces I used at that time.
cp -r ../../../2013/04/03/paraver_trace ./
ls paraver_trace/
EXTRAE_Paraver_trace_mpich.pcf EXTRAE_Paraver_trace_mpich.prv EXTRAE_Paraver_trace_mpich.row
Paraver to CSV and SMPI format Conversion
Juan Gonzalez provided us with a description of the Paraver and Dimemas formats. The Paraver description is available here, i.e., from the Paraver documentation. Remember that the pcf file describes events, the row file defines the cpu/node/thread mapping, and the prv file is the trace with all the events. During the night, I reworked my old script to convert from paraver to csv, pjdump and the SMPI time-independent trace format. Unfortunately, in the morning, Juan explained to me that I should not trust the state records but only the event and communication records. Ideally, I should have worked from the dimemas trace instead of the paraver trace to obtain the SMPI trace, but at least this allowed me to get a converter to csv/pjdump, which is very useful to Damien for framesoc/ocelotl.
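To make the prv layout concrete before diving into the converter, here is a toy illustration (the .prv body below is hand-written and contains state records only; a real trace also has a header plus event and communication records, and the pcf/row companions). In a state record `1:cpu:appl:task:thread:begin:end:state`, field 4 is the task (MPI rank) and fields 6-7 are the begin/end timestamps:

```shell
# Hand-written toy .prv body (state records only).
cat > /tmp/toy.prv <<'EOF'
1:1:1:1:1:0:100:1
1:2:1:2:1:0:80:1
1:1:1:1:1:100:150:13
EOF
# Accumulate, per task, the time covered by its state records.
awk -F: '$1==1 { busy[$4] += $7-$6 }
         END   { for (t in busy) print "task", t, "busy", busy[t] }' /tmp/toy.prv | sort
# → task 1 busy 150
#   task 2 busy 80
```

This per-field view of the records is all the converter below really relies on.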
So I really struggled to make it work and had to make several assumptions and "ugly hacks" (indicated in the code). In particular, something really ugly at the moment is that the V collective operations, where send and receive sizes are process specific, appear as many times as there are processes, and since I translate on the fly, I do not produce a correct input for SMPI. The easiest way to handle this is probably to do two passes, but never mind for a first proof of concept.
use strict;
use Data::Dumper;

my $power_reference = 286.087E-3; # in flop/mus
my ($input, $output, $format);    # defaults may be prepended when tangling from babel

sub main {
    # Default values for $input, $output and $format may have been defined
    # when tangling from babel, but command line arguments should always
    # override them.
    my ($arg);
    while (defined($arg = shift(@ARGV))) {
        for ($arg) {
            if (/^-i$/) { $input  = shift(@ARGV); last; }
            if (/^-o$/) { $output = shift(@ARGV); last; }
            if (/^-f$/) { $format = shift(@ARGV); last; }
            print "unrecognized argument '$arg'";
        }
    }
    if (!defined($input)  || $input  eq "") { die "No valid input file provided.\n"; }
    if (!defined($output) || $output eq "") { die "No valid output file provided.\n"; }
    print "Input: '$input'\n";
    print "Output: '$output'\n";
    print "Format: '$format'\n";
    my ($state_name, $event_name) = parse_pcf($input.".pcf");
    my ($resource_name) = parse_row($input.".row");
    convert_prv($input.".prv", $state_name, $event_name, $resource_name, $output, $format);
}

sub parse_row {
    my ($row) = shift;
    my $line;
    my (%resource_name);
    open(INPUT, $row) or die "Cannot open $row. $!";
    while (defined($line = <INPUT>)) {
        chomp $line;
        if ($line =~ /^LEVEL (.*) SIZE/) {
            my $type = $1;
            $resource_name{$type} = [];
            while ((defined($line = <INPUT>)) && !($line =~ /^\s*$/g)) {
                chomp $line;
                push @{$resource_name{$type}}, $line;
            }
        }
    }
    return (\%resource_name);
}

sub parse_pcf {
    my ($pcf) = shift;
    my $line;
    my (%state_name, %event_name);
    open(INPUT, $pcf) or die "Cannot open $pcf. $!";
    while (defined($line = <INPUT>)) {
        chomp $line;
        if ($line =~ /^STATES$/) {
            while ((defined($line = <INPUT>)) && ($line =~ /^(\d+)\s+(.*)/g)) {
                $state_name{$1} = $2;
            }
        }
        if ($line =~ /^EVENT_TYPE$/) {
            while ($line = <INPUT>) {
                if ($line =~ /VALUES/g) { last; }
                $line =~ /[6|9]\s+(\d+)\s+(.*)/g or next;
                # E.g., EVENT_TYPE\n 1 50100001 Send Size in MPI Global OP
                my ($id) = $1;
                $event_name{$id}{type} = $2;
            }
            while ((defined($line = <INPUT>)) && ($line =~ /^(\d+)\s+(.*)/g)) {
                my ($id);
                foreach $id (keys %event_name) {
                    $event_name{$id}{value}{$1} = $2;
                }
            }
        }
    }
    # print Dumper(\%state_name);
    # print Dumper(\%event_name);
    return (\%state_name, \%event_name);
}

my (%pcf_coll_arg) = (
    "send"         => "50100001",
    "recv"         => "50100002",
    "root"         => "50100003",
    "communicator" => "50100004",
    "compute"      => "my_reduce_compute_amount",
);

my (%tit_translate) = (
    "Running"            => "compute",
    "Not created"        => "",   # skip me
    "I/O"                => "",   # skip me
    "Synchronization"    => "",   # skip me
    "MPI_Comm_size"      => "",   # skip me
    "MPI_Comm_rank"      => "",   # skip me
    "Outside MPI"        => "",   # skip me
    "End"                => "",   # skip me
    "MPI_Init"           => "init",
    "MPI_Bcast"          => "bcast",
    "MPI_Allreduce"      => "allReduce",
    "MPI_Alltoallv"      => "allToAllV",
    "MPI_Alltoall"       => "allToAll",
    "MPI_Reduce"         => "reduce",
    "MPI_Allgatherv"     => "",   # allGatherV: ugly hack
    "MPI_Gather"         => "gather",
    "MPI_Gatherv"        => "gatherV",
    "MPI_Reduce_scatter" => "reduceScatter",
    "MPI_Finalize"       => "finalize",
    "MPI_Barrier"        => "barrier",
);

sub convert_prv {
    my ($prv, $state_name, $event_name, $resource_name, $output, $format) = @_;
    my $line;
    my (%event);
    my (@fh) = ();
    open(INPUT, $prv) or die "Failed to open $prv:$!\n";
    # Start parsing the header to get the trace hierarchy. We should get something like:
    # #Paraver (dd/mm/yy at hh:mm):ftime:0:nAppl:applicationList[:applicationList]
    $line = <INPUT>;
    chomp $line;
    $line =~ /^\#Paraver / or die "Invalid header '$line'\n";
    my $header = $line;
    $header =~ s/^[^:\(]*\([^\)]*\):// or die "Invalid header '$line'\n";
    $header =~ s/(\d+):(\d+)([^\(\d])/$1\_$2$3/g;
    $header =~ s/,\d+$//g;
    my ($max_duration, $resource, $nb_app, @appl) = split(/:/, $header);
    $max_duration =~ s/_.*$//g;
    $resource =~ /^(.*)\((.*)\)$/ or die "Invalid resource description '$resource'\n";
    my ($nb_nodes, $cpu_list) = ($1, $2);
    $nb_app == 1 or die "I can handle only one application type at the moment\n";
    my @cpu_list = split(/,/, $cpu_list);
    # print("$max_duration --> '$nb_nodes' '@cpu_list' $nb_app @appl \n");
    my (%Appl);
    my ($nb_task);
    foreach my $app (1..$nb_app) {
        my ($task_list);
        $appl[$app-1] =~ /^(.*)\((.*)\)$/ or die "Invalid resource description '$resource'\n";
        ($nb_task, $task_list) = ($1, $2);
        my (@task_list) = split(/,/, $task_list);
        my (%mapping);
        my ($task);
        foreach $task (1..$nb_task) {
            my ($nb_thread, $node_id) = split(/_/, $task_list[$task-1]);
            if (!defined($mapping{$node_id})) { $mapping{$node_id} = []; }
            push @{$mapping{$node_id}}, [$task, $nb_thread];
        }
        $Appl{$app}{nb_task} = $nb_task;
        $Appl{$app}{mapping} = \%mapping;
    }
    for ($format) {
        if (/^csv$/) {
            $output .= ".csv";
            open(OUTPUT, "> $output") or die "Cannot open $output. $!";
            last;
        }
        if (/^pjdump$/) {
            $output .= ".pjdump";
            open(OUTPUT, "> $output");
            my @tab = split(/:/, `tail -n 1 $prv`);
            print OUTPUT "Container, 0, 0, 0.0, $max_duration, $max_duration, 0\n";
            foreach my $node (1..$nb_nodes) {
                print OUTPUT "Container, 0, N, 0.0, $max_duration, $max_duration, node_$node\n";
            }
            foreach my $app (values(%Appl)) {
                foreach my $node (keys %{$$app{mapping}}) {
                    foreach my $t (@{$$app{mapping}{$node}}) {
                        print OUTPUT "Container, node_$node, P, 0.0, $max_duration, $max_duration, MPI_Rank_$$t[0]\n";
                        foreach my $thread (1..$$t[1]) {
                            print OUTPUT "Container, MPI_Rank_$$t[0], T, 0.0, $max_duration, $max_duration, Thread_$$t[0]_$thread\n";
                        }
                    }
                }
            }
            last;
        }
        if (/^tit$/) {
            my $nb_proc = 0;
            foreach my $node (@{$$resource_name{NODE}}) {
                my $filename = $output."_$nb_proc.tit";
                open($fh[$nb_proc], "> $filename") or die "Cannot open > $filename: $!";
                $nb_proc++;
            }
            last;
        }
        die "Invalid format '$format'\n";
    }
    # Now, let's process the records.
    sub process_event {
        my (%event_list) = @_;
        my ($sname);
        my ($sname_param);
        if (defined($event_list{50000003})) {
            $sname = $$event_name{50000003}{value}{$event_list{50000003}};
            $sname_param = "";
        } elsif (defined($event_list{50000002})) {
            $sname = $$event_name{50000002}{value}{$event_list{50000002}};
            my $t;
            if ($tit_translate{$sname} =~ /V$/) {
                # Really ugly hack because of "poor" tracing of V operations.
                if ($event_list{$pcf_coll_arg{"send"}} == 251 ||
                    $event_list{$pcf_coll_arg{"recv"}} == 251) {
                }
                $event_list{$pcf_coll_arg{"send"}} = 100000;
                $event_list{$pcf_coll_arg{"recv"}} = 100000;
                $sname =~ s/v$//i;
            }
            if ($tit_translate{$sname} eq "reduce") {
                # Ugly hack because the amount of computation is not given.
                $event_list{$pcf_coll_arg{"compute"}} = 1;
            }
            if ($tit_translate{$sname} eq "gather") {
                # Ugly hack because the receive amount does not make sense here.
                $event_list{$pcf_coll_arg{"recv"}} = $event_list{$pcf_coll_arg{"send"}};
                $event_list{$pcf_coll_arg{"root"}} = 1; # Ugly hack. AAAAARGH
            }
            if ($tit_translate{$sname} eq "reduceScatter") {
                # Ugly hack because of "poor" tracing.
                $event_list{$pcf_coll_arg{"recv"}} = $event_list{$pcf_coll_arg{"send"}};
                my $foo = $event_list{$pcf_coll_arg{"recv"}};
                $event_list{$pcf_coll_arg{"recv"}} = "";
                for (1..$nb_task) {
                    $event_list{$pcf_coll_arg{"recv"}} .= $foo." ";
                }
                $event_list{$pcf_coll_arg{"compute"}} = 1;
            }
            foreach $t ("send", "recv", "compute", "root") {
                if (defined($event_list{$pcf_coll_arg{$t}}) &&
                    $event_list{$pcf_coll_arg{$t}} ne "0") {
                    if ($t eq "root") { $event_list{$pcf_coll_arg{$t}}--; }
                    $sname_param .= "$event_list{$pcf_coll_arg{$t}} ";
                }
            }
        } else {
            # This may be an application or trace flushing event,
            # a hardware counter, a user function, ...
            my ($warn) = 1;
            for (40000018, 40000003, 40000001,
                 42009999, 42001003, 42001010, 42001015, 300,
                 70000001, 70000002, 70000003, 80000001, 80000002, 80000003,
                 45000000) {
                if (defined($event_list{$_})) { $warn = 0; last; }
            }
            if ($warn) { print "Skipping event:\n"; print Dumper(%event_list); }
            next;
        }
        return ($sname, $sname_param);
    }
    while (defined($line = <INPUT>)) {
        chomp($line);
        # State records: 1:cpu:appl:task:thread:begin_time:end_time:state
        if ($line =~ /^1/) {
            my ($sname);
            my ($sname_param);
            my ($record, $cpu, $appli, $task, $thread, $begin_time, $end_time, $state) = split(/:/, $line);
            if ($$state_name{$state} =~ /Group/ || $$state_name{$state} =~ /Others/) {
                $line = <INPUT>;
                chomp $line;
                my ($event, $ecpu, $eappli, $etask, $ethread, $etime, %event_list) = split(/:/, $line);
                (($event == 2) && ($ecpu eq $cpu) && ($eappli eq $appli) &&
                 ($etask eq $task) && ($ethread eq $thread) &&
                 ($etime >= $begin_time) && ($etime <= $end_time)) or die "Invalid event!";
                ($sname, $sname_param) = process_event(%event_list);
            } else {
                $sname = $$state_name{$state};
            }
            if ($sname eq "Running") {
                $sname_param .= (($end_time - $begin_time) * $power_reference);
            }
            if ($format eq "csv") {
                print OUTPUT "State, $task, MPI_STATE, $begin_time, $end_time, ".
                             ($end_time - $begin_time).", 0, ".$sname."\n";
            }
            if ($format eq "pjdump") {
                print OUTPUT "State, Thread_${task}_$thread, STATE, $begin_time, $end_time, ".
                             ($end_time - $begin_time).", 0, ".$sname."\n";
            }
            if ($format eq "tit") {
                $task = $task - 1;
                defined($tit_translate{$sname}) or die "Unknown state '$sname' for tit\n";
                if ($tit_translate{$sname} ne "") {
                    print { $fh[$task] } "$task $tit_translate{$sname} $sname_param\n";
                }
            }
        } elsif ($line =~ /^2/) {
            # Event records: 2:cpu:appl:task:thread:time:event_type:event_value
            my ($event, $cpu, $appli, $task, $thread, $time, %event_list) = split(/:/, $line);
            my ($sname, $sname_param) = process_event(%event_list);
            if ($format eq "tit") {
                $task = $task - 1;
                defined($tit_translate{$sname}) or die "Unknown state '$sname' for tit:\n\t$line\n";
                if ($tit_translate{$sname} ne "") {
                    print { $fh[$task] } "$task $tit_translate{$sname} $sname_param\n";
                }
            }
        } elsif ($line =~ /^3/) {
            # Communication records:
            # 3:cpu_send:ptask_send:task_send:thread_send:logical_time_send:actual_time_send:
            #   cpu_recv:ptask_recv:task_recv:thread_recv:logical_time_recv:actual_time_recv:size:tag
            print STDERR "Skipping this communication event\n";
        }
        if ($line =~ /^c/) {
            # Communicator record: c:app_id:communicator_id:number_of_process:thread_list (e.g., 1:2:3:4:5:6:7:8)
            print STDERR "Skipping communicator definition\n";
        }
    }
    for ($format) {
        if (/^csv$/)    { close(OUTPUT); print "Generated [[file:$output]]\n"; last; }
        if (/^pjdump$/) { close(OUTPUT); print "Generated [[file:$output]]\n"; last; }
        if (/^tit$/) {
            foreach my $f (@fh) { close($f) or die "Failed closing file descriptor. $!\n"; }
            print "Generated [[file:${output}_0.tit]] among other ones\n";
            last;
        }
        die "Invalid format '$format'\n";
    }
}

main();
Input: './paraver_trace/EXTRAE_Paraver_trace_mpich'
Output: './paraver_trace/bigdft_8_rl'
Format: 'tit'
Generated [[file:./paraver_trace/bigdft_8_rl_0.tit]] among other ones
head paraver_trace/bigdft_8_rl.csv
State, 1, MPI_STATE, 0, 10668, 10668, 0, Not created
State, 2, MPI_STATE, 0, 5118733, 5118733, 0, Not created
State, 3, MPI_STATE, 0, 9374527, 9374527, 0, Not created
State, 4, MPI_STATE, 0, 17510142, 17510142, 0, Not created
State, 5, MPI_STATE, 0, 5989994, 5989994, 0, Not created
State, 6, MPI_STATE, 0, 5737601, 5737601, 0, Not created
State, 7, MPI_STATE, 0, 5866978, 5866978, 0, Not created
State, 8, MPI_STATE, 0, 5891099, 5891099, 0, Not created
State, 1, MPI_STATE, 10668, 25576057, 25565389, 0, Running
State, 2, MPI_STATE, 5118733, 18655258, 13536525, 0, Running
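Since computing statistics with simple tools was the original motivation for this csv output, here is a quick sanity check one could run on it: total time spent in the Running state per rank (column 2), summing the duration column (column 6). The sample file below reuses a few of the lines shown above:

```shell
# A few csv records in the converter's output format:
# State, task, MPI_STATE, begin, end, duration, imbrication, state_name
cat > /tmp/toy.csv <<'EOF'
State, 1, MPI_STATE, 0, 10668, 10668, 0, Not created
State, 1, MPI_STATE, 10668, 25576057, 25565389, 0, Running
State, 2, MPI_STATE, 0, 5118733, 5118733, 0, Not created
EOF
# Sum the duration field per rank for Running states only.
awk -F',' '$8 ~ /Running/ { run[$2] += $6 }
           END { for (r in run) print "rank" r, run[r] }' /tmp/toy.csv
# → rank 1 25565389
```

The same kind of aggregation is straightforward in R once the trace is in csv form.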
Let's try to replay this trace with SMPI.
cp /home/alegrand/Work/SimGrid/infra-songs/WP4/SC13/graphene.xml ./graphene.xml
print_usage() {
    echo "Usage: $0 [OPTIONS]"
    cat <<'End-of-message'
 -i|--input        Paraver input file
 -o|--output       output file (in the paje format)
 -p|--platform     XML platform file
 -m|--machine_file
 -h|--help         print help information
End-of-message
    exit 1
}

TEMP=`getopt -o i:o:p:m:h --long input:,output:,platform:,machine_file:,help -n 'smpi2pj.sh' -- "$@"`
eval set -- "$TEMP"
while true; do
    case "$1" in
        -i|--input)    case "$2" in "") shift 2;; *) INPUT=$2; shift 2;; esac;;
        -o|--output)   case "$2" in "") shift 2;; *) OUTPUT=$2; shift 2;; esac;;
        -p|--platform) case "$2" in "") shift 2;; *) PLATFORM=$2; shift 2;; esac;;
        -m|--machine)  case "$2" in "") shift 2;; *) MACHINE_FILE=$2; shift 2;; esac;;
        -h|--help)     print_usage; shift;;
        --) shift; break;;
        *) echo "Unknown option '$1'"; print_usage;;
    esac
done
TMP_WORKING_PATH=`mktemp -d`

# Creating input for smpi_replay
REPLAY_INPUT=$TMP_WORKING_PATH/smpi_replay.txt
ls $INPUT*.tit > $REPLAY_INPUT
# Get the number of MPI ranks
export NP=`cat $REPLAY_INPUT | wc -l`
# Generating a dumb deployment (machine_file) if needed
if [ -z "$MACHINE_FILE" ]; then
    MACHINE_FILE=$TMP_WORKING_PATH/machine_file.txt
    if [ -e "$MACHINE_FILE" ]; then echo "Oops, $MACHINE_FILE already exists. Do not want to overwrite"; exit 1; fi
    rm -f $MACHINE_FILE
    touch $MACHINE_FILE
    for i in `seq 1 144`; do echo graphene-${i}.nancy.grid5000.fr >> $MACHINE_FILE; done
    cp $MACHINE_FILE $MACHINE_FILE.sav
    cat $MACHINE_FILE.sav $MACHINE_FILE.sav $MACHINE_FILE.sav $MACHINE_FILE.sav > $MACHINE_FILE
fi
## To debug:
# $SMPIRUN -ext smpi_replay --log=replay.thresh:critical --log=smpi_replay.thresh:verbose \
#     --cfg=smpi/cpu_threshold:-1 -hostfile machine_file -platform $PLATFORM \
#     -np $NP gdb --args $REPLAY /tmp/smpi_replay.txt --log=smpi_kernel.thres:warning \
#     --cfg=contexts/factory:thread
$SMPIRUN -ext smpi_replay \
    --cfg=smpi/cpu_threshold:-1 -trace --cfg=tracing/filename:$OUTPUT \
    -hostfile $MACHINE_FILE -platform $PLATFORM -np $NP \
    $REPLAY $REPLAY_INPUT --log=smpi_kernel.thres:warning \
    --cfg=contexts/factory:thread 2>&1
# --log=replay.thresh:critical --log=smpi_replay.thresh:verbose
SMPI Paje to Paraver Conversion
This was quick and dirty and reuses the original pcf file, but in the end it kinda works… Yippee! :)
use strict;
use Env;

my ($input, $output);  # set via command line arguments (or prepended when tangling)
my ($arg);
my ($strict_option) = "";
while (defined($arg = shift(@ARGV))) {
    for ($arg) {
        print "$arg \n";
        if (/^-i$/)  { $input  = shift(@ARGV); last; }
        if (/^-o$/)  { $output = shift(@ARGV); last; }
        if (/^-ns$/) { $strict_option = "-n -z"; last; }
        print "unrecognized argument '$arg'";
    }
}
my $pjfile = $input;
$pjfile =~ s/\.trace$/.pjdump/;
$pjfile ne $input or die;
$ENV{LANG} = "C";
system("pj_dump $strict_option $input | grep State | sed 's/ //g' | sort -n -t ',' -k 4n > $pjfile");
my $duration = `tail -n 1 $pjfile`;
my @duration = split(/,/, $duration);
$duration = $duration[4];
$duration *= 1E9;
my $nb_nodes = `sed -e 's/.*rank-//' -e 's/,.*//' $pjfile | sort | uniq | wc -l`;
chomp($nb_nodes);
my (%smpi_to_pcf) = (
    "action_allReduce"     => "10",
    "action_allToAll"      => "11",
    "action_barrier"       => "8",
    "action_bcast"         => "7",
    "action_gather"        => "13",
    "action_reduce"        => "9",
    "action_reducescatter" => "80",
    # "smpi_replay_finalize" => "32",
    # "smpi_replay_init"     => "31"
);
my ($pcf_file_content) = "DEFAULT_OPTIONS

LEVEL               THREAD
UNITS               NANOSEC
LOOK_BACK           100
SPEED               1
FLAG_ICONS          ENABLED
NUM_OF_STATE_COLORS 1000
YMAX_SCALE          37

DEFAULT_SEMANTIC

THREAD_FUNC         State As Is

STATES
0 Idle
1 Running
2 Not created
3 Waiting a message
4 Blocking Send
5 Synchronization
6 Test/Probe
7 Scheduling and Fork/Join
8 Wait/WaitAll
9 Blocked
10 Immediate Send
11 Immediate Receive
12 I/O
13 Group Communication
14 Tracing Disabled
15 Others
16 Send Receive
17 Memory transfer

STATES_COLOR
0 {117,195,255}
1 {0,0,255}
2 {255,255,255}
3 {255,0,0}
4 {255,0,174}
5 {179,0,0}
6 {0,255,0}
7 {255,255,0}
8 {235,0,0}
9 {0,162,0}
10 {255,0,255}
11 {100,100,177}
12 {172,174,41}
13 {255,144,26}
14 {2,255,177}
15 {192,224,0}
16 {66,66,66}
17 {255,0,96}

EVENT_TYPE
9 50000001 MPI Point-to-point
VALUES
2 MPI_Recv
1 MPI_Send
0 Outside MPI

EVENT_TYPE
9 50000002 MPI Collective Comm
VALUES
18 MPI_Allgatherv
10 MPI_Allreduce
11 MPI_Alltoall
12 MPI_Alltoallv
8 MPI_Barrier
7 MPI_Bcast
13 MPI_Gather
14 MPI_Gatherv
80 MPI_Reduce_scatter
9 MPI_Reduce
0 Outside MPI

EVENT_TYPE
9 50000003 MPI Other
VALUES
21 MPI_Comm_create
19 MPI_Comm_rank
20 MPI_Comm_size
32 MPI_Finalize
31 MPI_Init
0 Outside MPI

EVENT_TYPE
1 50100001 Send Size in MPI Global OP
1 50100002 Recv Size in MPI Global OP
1 50100003 Root in MPI Global OP
1 50100004 Communicator in MPI Global OP

EVENT_TYPE
6 40000001 Application
VALUES
0 End
1 Begin

EVENT_TYPE
6 40000003 Flushing Traces
VALUES
0 End
1 Begin

GRADIENT_COLOR
0 {0,255,2}
1 {0,244,13}
2 {0,232,25}
3 {0,220,37}
4 {0,209,48}
5 {0,197,60}
6 {0,185,72}
7 {0,173,84}
8 {0,162,95}
9 {0,150,107}
10 {0,138,119}
11 {0,127,130}
12 {0,115,142}
13 {0,103,154}
14 {0,91,166}

GRADIENT_NAMES
0 Gradient 0
1 Grad. 1/MPI Events
2 Grad. 2/OMP Events
3 Grad. 3/OMP locks
4 Grad. 4/User func
5 Grad. 5/User Events
6 Grad. 6/General Events
7 Grad. 7/Hardware Counters
8 Gradient 8
9 Gradient 9
10 Gradient 10
11 Gradient 11
12 Gradient 12
13 Gradient 13
14 Gradient 14

EVENT_TYPE
9 40000018 Tracing mode:
VALUES
1 Detailed
2 CPU Bursts
";
my ($pcf_output) = $output;
$pcf_output =~ s/\.prv$/.pcf/;
open OUTPUT, "> $pcf_output";
print OUTPUT $pcf_file_content;
close OUTPUT;

my ($line);
open(INPUT, $pjfile) or die;
open(OUTPUT, "> $output") or die;
my (@tab);
@tab = ();
foreach (1..$nb_nodes) { push @tab, 1; }
my $node_list = join(',', @tab);
@tab = ();
foreach (1..$nb_nodes) { push @tab, "1:$_"; }
my $thread_list = join(',', @tab);
# The last field of the header is the number of communicators
# (MPI_COMM_WORLD plus one self communicator per rank).
print OUTPUT "#Paraver (generated with perl from SMPI):${duration}_ns:$nb_nodes($node_list):1:$nb_nodes($thread_list),".($nb_nodes+1)."\n";
my $comm_list = join(':', (1..$nb_nodes));
my $comm = 1;
print OUTPUT "c:1:$comm:$nb_nodes:$comm_list\n"; # MPI_COMM_WORLD
$comm++;
foreach (1..$nb_nodes) {
    print OUTPUT "c:1:$comm:1:$_\n"; # one self communicator per rank
    $comm++;
}
while (defined($line = <INPUT>)) {
    chomp($line);
    my ($Foo1, $rank, $Foo2, $start, $end, $duration, $Foo3, $type) = split(/,/, $line);
    $rank =~ s/\D*//g;
    $rank++;
    $start *= 1E9;
    $end   *= 1E9;
    if (defined($smpi_to_pcf{$type})) {
        print OUTPUT "1:$rank:1:$rank:1:$start:$end:13\n"; # group communication
        print OUTPUT "2:$rank:1:$rank:1:$start:50000002:$smpi_to_pcf{$type}\n";
        print OUTPUT "2:$rank:1:$rank:1:$end:50000002:0\n"; # Outside MPI
        # print OUTPUT "1:$rank:1:$rank:1:$start:$end:$smpi_to_pcf{$type}\n";
    } else {
        warn("Unknown type $type: Skipping $line\n");
    }
}
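The record mapping this conversion performs can be illustrated on a single hand-written pjdump State line: a collective becomes a state record (state 13, Group Communication) bracketed by a pair of type-50000002 events whose value identifies the collective (7 is the pcf value for MPI_Bcast). Here is the same logic redone in awk on toy input, just to show the shape of the output records:

```shell
# One hand-written pjdump State line (the real pj_dump output has its spaces stripped):
cat > /tmp/one.pjdump <<'EOF'
State,rank-3,STATE,0.5,0.7,0.2,0,action_bcast
EOF
# Extract the rank, shift it to 1-based, convert seconds to nanoseconds,
# and emit the state record plus the enter/exit event pair.
awk -F',' '{ rank=$2; gsub(/[^0-9]/,"",rank); rank++;
             start=$4*1e9; end=$5*1e9;
             printf "1:%d:1:%d:1:%d:%d:13\n", rank, rank, start, end;
             printf "2:%d:1:%d:1:%d:50000002:7\n", rank, rank, start;
             printf "2:%d:1:%d:1:%d:50000002:0\n", rank, rank, end }' /tmp/one.pjdump
# → 1:4:1:4:1:500000000:700000000:13
#   2:4:1:4:1:500000000:50000002:7
#   2:4:1:4:1:700000000:50000002:0
```

In the perl script, the event value 7 is of course looked up in %smpi_to_pcf rather than hardcoded.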
Gluing everything together to allow calling SMPI
The Dimemas wrapper called by paraver is file:///usr/local/stow/wxparaver-4.5.4-linux-x86_64/bin/dimemas-wrapper.sh
Let's back up the original first.
mv /usr/local/stow/wxparaver-4.5.4-linux-x86_64/bin/dimemas-wrapper.sh /usr/local/stow/wxparaver-4.5.4-linux-x86_64/bin/dimemas-wrapper.sh.backup
Basically, what I want to do is something like
perl prv2pj.pl
sh smpi2pj.sh > /dev/null
perl pjsmpi2prv.pl
Here is an equivalent version inspired by the dimemas wrapper.
#
# Simple wrapper for SMPI based on the Dimemas one
#
set -e

function usage {
    echo "Usage: $0 source_trace dimemas_cfg output_trace reuse_dimemas_trace [extra_parameters] [-n]"
    echo "  source_trace:        Paraver trace"
    echo "  dimemas_cfg:         Simulation parameters"
    echo "  output_trace:        Output trace of Dimemas; must end with '.prv'"
    echo "  reuse_dimemas_trace: 0 -> don't reuse, rerun prv2dim"
    echo "                       1 -> reuse, don't rerun prv2dim"
    echo "  extra_parameters:    See the complete list in the Dimemas help, 'Dimemas -h'"
    echo "  -n:                  prv2dim -n parameter => do not generate initial idle states"
}

# Read and check parameters
if [ $# -lt 4 ]; then
    usage
    exit 1
fi

#PARAVER_TRACE=${1}
PARAVER_TRACE=`readlink -eqs "${1}"`
DIMEMAS_CFG=${2}
OUTPUT_PARAVER_TRACE=${3}
DIMEMAS_REUSE_TRACE=${4}
if [[ ${DIMEMAS_REUSE_TRACE} != "0" && ${DIMEMAS_REUSE_TRACE} != "1" ]]; then
    usage
    exit 1
fi

echo "Go to hell!"
exit 12

echo "==============================================================================="
# Check SMPI availability
### Oh right, we should do that...

# Get tracename, without extensions
TRACENAME=$(echo "$PARAVER_TRACE" | sed "s/\.[^\.]*$//")
EXTENSION=$(echo "$PARAVER_TRACE" | sed "s/^.*\.//")

# Is it gzipped?
if [[ ${EXTENSION} = "gz" ]]; then
    echo
    echo -n "[MSG] Decompressing $PARAVER_TRACE trace..."
    gunzip ${PARAVER_TRACE}
    TRACENAME=$(echo "${TRACENAME}" | sed "s/\.[^\.]*$//")
    PARAVER_TRACE=${TRACENAME}.prv
    echo "...Done!"
fi
DIMEMAS_TRACE=${TRACENAME}.dim

# Adapt Dimemas CFG with new trace name
DIMEMAS_CFG_NAME=$(echo "$DIMEMAS_CFG" | sed "s/\.[^\.]*$//")
DIMEMAS_COPY_CFG_NAME=`basename ${DIMEMAS_CFG_NAME}`
OLD_DIMEMAS_TRACENAME=`grep "mapping information" ${DIMEMAS_CFG} | grep ".dim" | awk -F'"' {'print $4'}`
NEW_DIMEMAS_TRACENAME=`basename ${DIMEMAS_TRACE}`
DIMEMAS_CFG_PATH=`dirname ${DIMEMAS_TRACE}`

# Append extra parameters if they exist
shift; shift; shift; shift
EXTRA_PARAMETERS=""
PRV2DIM_N=""
while [ -n "$1" ]; do
    if [[ ${1} == "-n" ]]; then
        # Caution! This works because no -n parameter exists in Dimemas.
        PRV2DIM_N="-n"
    else
        EXTRA_PARAMETERS="$EXTRA_PARAMETERS $1"
    fi
    shift
done

# Change directory to see .dim
DIMEMAS_TRACE_DIR=`dirname ${DIMEMAS_TRACE}`/
pushd . > /dev/null
cd ${DIMEMAS_TRACE_DIR}

# Translate from .prv to SMPI time-independent trace
if [[ ${DIMEMAS_REUSE_TRACE} = "0" || \
      ${DIMEMAS_REUSE_TRACE} = "1" && ! -f ${DIMEMAS_TRACE} ]]; then
    if [[ ${DIMEMAS_REUSE_TRACE} = "1" ]]; then
        echo
        echo "[WARN] Unable to find ${DIMEMAS_TRACE}"
        echo "[WARN] Generating it."
    fi
    PARAVER_TRACE_TRIMED=`echo ${PARAVER_TRACE} | sed 's/.prv$//'`
    echo
    echo "[COM] prv2pj.pl -i ${PARAVER_TRACE_TRIMED} -o ${DIMEMAS_TRACE} -f tit"
    echo
    prv2pj.pl -i ${PARAVER_TRACE_TRIMED} -o ${DIMEMAS_TRACE} -f tit
    echo
fi

# Simulate
echo
echo "*** Running SMPI :) ***"
echo
OUTPUT_PAJE_TRACE=`echo ${OUTPUT_PARAVER_TRACE} | sed 's/.prv$/.trace/'`
echo "[COM] smpi2pj.sh -i ${DIMEMAS_TRACE} -o ${OUTPUT_PAJE_TRACE}"
echo
smpi2pj.sh -i ${DIMEMAS_TRACE} -o ${OUTPUT_PAJE_TRACE}

# Convert back to paraver
echo
echo "[COM] pjsmpi2prv.pl -i ${OUTPUT_PAJE_TRACE} -o ${OUTPUT_PARAVER_TRACE}"
echo
pjsmpi2prv.pl -i ${OUTPUT_PAJE_TRACE} -o ${OUTPUT_PARAVER_TRACE}
echo "==============================================================================="
popd > /dev/null
For this to work "system wide", I need to put the previous perl and sh scripts in the PATH. Eventually, they will be shipped with SMPI.
TMP_FILENAME=`mktemp`
for i in *.pl ; do
    mv $i $TMP_FILENAME
    echo "#!/usr/bin/perl" > $i
    cat $TMP_FILENAME >> $i
    rm $TMP_FILENAME
    chmod +x $i
    cp $i ~/bin/
done
for i in smpi2pj.sh ; do
    mv $i $TMP_FILENAME
    echo "#!/bin/sh" > $i
    cat $TMP_FILENAME >> $i
    rm $TMP_FILENAME
    chmod +x $i
    cp $i ~/bin/
done
for i in /usr/local/stow/wxparaver-4.5.4-linux-x86_64/bin/dimemas-wrapper.sh ; do
    mv $i $TMP_FILENAME
    echo "#!/bin/bash" > $i
    cat $TMP_FILENAME >> $i
    rm $TMP_FILENAME
    chmod +x $i
    cp $i ~/bin/
done
Specific Pjdump to Paraver Conversion for Damien
#!/usr/bin/perl
my $output = q(/exports/nancy_700_lu.C.700.prv);
my $input  = q(/exports/nancy_700_lu.C.700.pjdump.bz2);
use strict;
use Env;

my ($duration, $nb_nodes);
my ($strict_option) = "";
my ($arg);
while (defined($arg = shift(@ARGV))) {
    for ($arg) {
        if (/^-i$/)  { $input  = shift(@ARGV); last; }
        if (/^-o$/)  { $output = shift(@ARGV); last; }
        if (/^-d$/)  { $duration = shift(@ARGV); last; }
        if (/^-n$/)  { $nb_nodes = shift(@ARGV); last; }
        if (/^-ns$/) { $strict_option = "-n -z"; last; }
        print "unrecognized argument '$arg'";
    }
}
print " ---> $input \n";
my ($pjfile);
if ($input =~ /\.trace$/) {
    $ENV{LANG} = "C";
    $pjfile = $input;
    $pjfile =~ s/\.trace$/.pjdump/;
    my $command = "pj_dump $strict_option $input | grep State | sed 's/ //g' | sort -n -t ',' -k 4n > $pjfile";
    print "---> $command\n";
    system($command);
} elsif ($input =~ /\.pjdump/) {
    $pjfile = $input;
} else {
    die "Unknown input format '$input'\n";
}
print " ---> $pjfile \n";
if (!defined($duration)) {
    $duration = `tail -n 1 $pjfile`;
    my @duration = split(/,/, $duration);
    $duration = $duration[4];
    $duration *= 1E9;
}
if (!defined($nb_nodes)) {
    $nb_nodes = `sed -e 's/.*rank-//' -e 's/,.*//' $pjfile | sort | uniq | wc -l`;
    chomp($nb_nodes);
}
my ($pcf_file_content) = "DEFAULT_OPTIONS

LEVEL               THREAD
UNITS               NANOSEC
LOOK_BACK           100
SPEED               1
FLAG_ICONS          ENABLED
NUM_OF_STATE_COLORS 1000
YMAX_SCALE          37

DEFAULT_SEMANTIC

THREAD_FUNC         State As Is

STATES
0 Idle
1 Running
2 Not created
3 Waiting a message
4 Blocking Send
5 Synchronization
6 Test/Probe
7 Scheduling and Fork/Join
8 Wait/WaitAll
9 Blocked
10 Immediate Send
11 Immediate Receive
12 I/O
13 Group Communication
14 Tracing Disabled
15 Others
16 Send Receive
17 Memory transfer

STATES_COLOR
0 {117,195,255}
1 {0,0,255}
2 {255,255,255}
3 {255,0,0}
4 {255,0,174}
5 {179,0,0}
6 {0,255,0}
7 {255,255,0}
8 {235,0,0}
9 {0,162,0}
10 {255,0,255}
11 {100,100,177}
12 {172,174,41}
13 {255,144,26}
14 {2,255,177}
15 {192,224,0}
16 {66,66,66}
17 {255,0,96}

EVENT_TYPE
9 50000001 MPI Point-to-point
VALUES
2 MPI_Recv
1 MPI_Send
0 Outside MPI

EVENT_TYPE
9 50000002 MPI Collective Comm
VALUES
18 MPI_Allgatherv
10 MPI_Allreduce
11 MPI_Alltoall
12 MPI_Alltoallv
8 MPI_Barrier
7 MPI_Bcast
13 MPI_Gather
14 MPI_Gatherv
80 MPI_Reduce_scatter
9 MPI_Reduce
0 Outside MPI

EVENT_TYPE
9 50000003 MPI Other
VALUES
21 MPI_Comm_create
19 MPI_Comm_rank
20 MPI_Comm_size
32 MPI_Finalize
31 MPI_Init
0 Outside MPI

EVENT_TYPE
1 50100001 Send Size in MPI Global OP
1 50100002 Recv Size in MPI Global OP
1 50100003 Root in MPI Global OP
1 50100004 Communicator in MPI Global OP

EVENT_TYPE
6 40000001 Application
VALUES
0 End
1 Begin

EVENT_TYPE
6 40000003 Flushing Traces
VALUES
0 End
1 Begin

GRADIENT_COLOR
0 {0,255,2}
1 {0,244,13}
2 {0,232,25}
3 {0,220,37}
4 {0,209,48}
5 {0,197,60}
6 {0,185,72}
7 {0,173,84}
8 {0,162,95}
9 {0,150,107}
10 {0,138,119}
11 {0,127,130}
12 {0,115,142}
13 {0,103,154}
14 {0,91,166}

GRADIENT_NAMES
0 Gradient 0
1 Grad. 1/MPI Events
2 Grad. 2/OMP Events
3 Grad. 3/OMP locks
4 Grad. 4/User func
5 Grad. 5/User Events
6 Grad. 6/General Events
7 Grad. 7/Hardware Counters
8 Gradient 8
9 Gradient 9
10 Gradient 10
11 Gradient 11
12 Gradient 12
13 Gradient 13
14 Gradient 14

EVENT_TYPE
9 40000018 Tracing mode:
VALUES
1 Detailed
2 CPU Bursts
";
my ($pcf_output) = $output;
$pcf_output =~ s/\.prv$/.pcf/;
open OUTPUT, "> $pcf_output";
print OUTPUT $pcf_file_content;
close OUTPUT;

my (%mpi_to_pcf) = (
    "MPI_Running" => "1",
    "MPI_Send"    => "10",
    "MPI_Recv"    => "11",
    "Collective"  => "13",
    "Others"      => "15",
);
my (%mpi_coll_to_pcf) = (
    "MPI_Allgatherv"     => "18",
    "MPI_Allreduce"      => "10",
    "MPI_Alltoall"       => "11",
    "MPI_Alltoallv"      => "12",
    "MPI_Barrier"        => "8",
    "MPI_Bcast"          => "7",
    "MPI_Gather"         => "13",
    "MPI_Gatherv"        => "14",
    "MPI_Reduce_Scatter" => "80",
    "MPI_Reduce"         => "9",
);
my (%mpi_others_to_pcf) = (
    "MPI_Comm_create" => "21",
    "MPI_Comm_rank"   => "19",
    "MPI_Comm_size"   => "20",
    "MPI_Finalize"    => "32",
    "MPI_Init"        => "31",
);
my (%smpi_to_mpi) = (
    "action_allReduce"     => "MPI_Allreduce",
    "action_allToAll"      => "MPI_Alltoall",
    "action_barrier"       => "MPI_Barrier",
    "action_bcast"         => "MPI_Bcast",
    "action_gather"        => "MPI_Gather",
    "action_reduce"        => "MPI_Reduce",
    "action_reducescatter" => "MPI_Reduce_Scatter",
    "smpi_replay_finalize" => "MPI_Finalize",
    "smpi_replay_init"     => "MPI_Init",
    "PMPI_Init"            => "MPI_Init",
    "PMPI_Send"            => "MPI_Send",
    "PMPI_Recv"            => "MPI_Recv",
    "PMPI_Finalize"        => "MPI_Finalize"
);
my ($line);
open(INPUT, $pjfile) or die;
open(OUTPUT, "> $output") or die;
my (@tab);
@tab = ();
foreach (1..$nb_nodes) { push @tab, 1; }
my $node_list = join(',', @tab);
@tab = ();
foreach (1..$nb_nodes) { push @tab, "1:$_"; }
my $thread_list = join(',', @tab);
# The last field of the header is the number of communicators
# (MPI_COMM_WORLD plus one self communicator per rank).
print OUTPUT "#Paraver (generated with perl from SMPI):${duration}_ns:$nb_nodes($node_list):1:$nb_nodes($thread_list),".($nb_nodes+1)."\n";
my $comm_list = join(':', (1..$nb_nodes));
my $comm = 1;
print OUTPUT "c:1:$comm:$nb_nodes:$comm_list\n"; # MPI_COMM_WORLD
$comm++;
foreach (1..$nb_nodes) {
    print OUTPUT "c:1:$comm:1:$_\n"; # one self communicator per rank
    $comm++;
}
while (defined($line = <INPUT>)) {
    chomp($line);
    my ($Foo1, $rank, $Foo2, $start, $end, $duration, $Foo3, $type) = split(/,/, $line);
    $rank =~ s/\D*//g;
    $rank++;
    $start *= 1E9;
    $end   *= 1E9;
    if ($type =~ /action_/ or $type =~ /smpi_/ or $type =~ /PMPI_/) {
        my ($key);
        foreach $key (keys(%smpi_to_mpi)) {
            if ($type eq $key) { $type = $smpi_to_mpi{$key}; last; }
        }
    }
    if (defined($mpi_to_pcf{$type})) {
        print "$type $mpi_to_pcf{$type}\n";
        print OUTPUT "1:$rank:1:$rank:1:$start:$end:$mpi_to_pcf{$type}\n";
    } elsif (defined($mpi_coll_to_pcf{$type})) {
        print OUTPUT "1:$rank:1:$rank:1:$start:$end:$mpi_to_pcf{Collective}\n"; # group communication
        print OUTPUT "2:$rank:1:$rank:1:$start:50000002:$mpi_coll_to_pcf{$type}\n";
        print OUTPUT "2:$rank:1:$rank:1:$end:50000002:0\n"; # Outside MPI
    } elsif (defined($mpi_others_to_pcf{$type})) {
        print OUTPUT "1:$rank:1:$rank:1:$start:$end:$mpi_to_pcf{Others}\n";
        print OUTPUT "2:$rank:1:$rank:1:$start:50000003:$mpi_others_to_pcf{$type}\n";
        print OUTPUT "2:$rank:1:$rank:1:$end:50000003:0\n"; # Outside MPI
    } else {
        warn("Unknown type $type: Skipping $line\n");
    }
}