R 101 for MESCAL and MOAIS

Recently, I felt I had to give my colleagues an overview of what is possible with R. There are lots of presentations out there, but I wanted something better suited to this audience. Here is what I talked about.

Why R?

R is a great language for data analysis and statistics

  • Open-source and multi-platform
  • Very expressive with high-level constructs
  • Excellent graphics
  • Widely used in academia and business
  • Very active community
    • Documentation, FAQ on Stack Overflow
  • Great integration with other tools

Why is R a pain for computer scientists?

  • R is not really a programming language:
    • You should always avoid coding in it
    • The semantics and syntax are ugly
    • It is interpreted, hence slow
  • Documentation is for statisticians
  • Default plots are cumbersome (meaningful)
  • Summaries are cryptic (precise)
  • Steep learning curve, even for us computer scientists, although we generally switch seamlessly from one language to another! That's frustrating! ;)

Dos and don'ts

R is high level, I'll do everything myself

  • CTAN comprises 4,334 TeX, LaTeX, and related packages and tools. Most of you do not use plain TeX.
  • Currently, the CRAN package repository features 4,030 available packages.
  • How do you know which one to use? Many of them are highly exotic (not to say useless to you). See http://www.r-bloggers.com/
  • There are lots of introductions, but not necessarily what you are looking for, so I'll give you a short tour.

You should quickly realize, though, that you need proper training in statistics and data analysis if you do not want to talk nonsense.

Install and run R on Debian

  apt-cache search r

Err, that is not very useful :) It's the same when searching on Google, but once the filter bubble is set up, it gets better…

  sudo apt-get install r-base
  R

Install a few cool packages

  • plyr (which provides ddply), reshape and ggplot2 (Hadley Wickham) http://had.co.nz/

        install.packages("plyr")
        install.packages("reshape")
        install.packages("ggplot2")
    
  • knitr (Yihui Xie) http://yihui.name/knitr/

        install.packages("knitr")
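
If you rerun your setup often, it can be handy to install only what is missing. The following is a minimal sketch, not part of the original talk: `missing_packages` is a hypothetical helper name, and the guard on `interactive()` is just one possible way to avoid touching the network from a script.

```r
# Hypothetical helper (not from the talk): return the packages from `pkgs`
# that cannot be loaded in the current R installation.
missing_packages <- function(pkgs) {
  available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!available]
}

wanted <- c("plyr", "reshape", "ggplot2", "knitr")
to_install <- missing_packages(wanted)

# Only install in an interactive session, and only when something is missing.
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```

`requireNamespace` checks that a package can be loaded without attaching it, so this does not change your search path.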
    

IDE

Interactive use is nice but quickly becomes painful so at some point, you'll want an IDE.

Emacs is great, but you'll need Emacs Speaks Statistics (ESS).

  sudo apt-get install ess

Soon you may fall in love with org-mode. Although part of it comes with the base Emacs installation, you will need the full package to get Babel working.

  sudo apt-get install org-mode

You may want to give RStudio or R Commander a try.

  sudo apt-get install rstudio r-cran-rcmdr

Roadmap

  • Read data into a dataframe
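library(plyr)     # provides ddply
library(ggplot2)  # provides ggplot and the geom_* layers
# The trace has no header line, so the columns are named by hand.
df <- read.csv("http://mescal.imag.fr/membres/arnaud.legrand/uss_simgrid/tau.csv", header=FALSE, strip.white=TRUE)
names(df) <- c("Nature","ResourceId","Type","Start","End","Duration","Level","Value")
# Drop the columns that are not used in the rest of the analysis.
df <- df[!(names(df) %in% c("Nature","Type","Level"))]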
  • Explore the content:
    • names
    • ?names to get the help on a function
    • head, tail, df[2:5,] (do not forget the ","), dim, length, df$toto, df[df$toto=="t",] (again, do not forget the final ","); here toto stands for any column name
    • str, summary
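
To make the exploration functions above concrete, here is a small sketch on a toy data frame. The rows are made up for illustration (they are not taken from tau.csv), and `toy` and `csv_text` are hypothetical names; only the column layout mirrors the trace.

```r
# Toy data in the same column layout as the trace (all values are made up).
csv_text <- "State, 1, T, 0.0, 1.5, 1.5, 0, send
State, 1, T, 2.0, 2.5, 0.5, 0, recv
State, 2, T, 0.0, 1.0, 1.0, 0, recv
State, 2, T, 3.0, 4.0, 1.0, 0, send"
toy <- read.csv(text = csv_text, header = FALSE, strip.white = TRUE)
names(toy) <- c("Nature", "ResourceId", "Type", "Start", "End",
                "Duration", "Level", "Value")

names(toy)                    # column names
dim(toy)                      # number of rows, number of columns
length(toy)                   # number of columns (a data frame is a list of columns)
head(toy, 2)                  # first two rows
toy[2:3, ]                    # rows 2 to 3, all columns (note the trailing comma)
toy$Value                     # a single column, selected by name
toy[toy$Value == "send", ]    # only the rows matching a condition
str(toy)                      # compact description of the structure
summary(toy)                  # per-column summary
```

Without the trailing comma, `toy[2:3]` selects columns 2 to 3 instead of rows, which is the usual source of confusion.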
  • Reshape it a little
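# Build the "computing" intervals of one process: each one fills the gap
# between the end of a traced communication and the start of the next one.
# Assumes proc has at least two rows, in chronological order.
block <- function(proc) {
  end <- proc$Start                    # a gap ends where the next communication starts
  end <- end[2:length(end)]
  start <- proc$End                    # and starts where the previous one ends
  start <- start[1:(length(start)-1)]
  d <- data.frame(ResourceId=unique(proc$ResourceId),
                  Start = start, End=end, Duration = end-start,
                  Value= "computing")
  d
}
compute_durations <- function(df) {
  d <- data.frame()
  for(rank in unique(df$ResourceId)) {
    d <- rbind(d,block(df[df$ResourceId==rank,]))
  }
  # Let's merge these two frames together but keep a Type column to
  # distinguish between them.
  d$Type <- "Computation"
  df$Type <- "Communication"
  rbind(df,d)
}
df_tau <- compute_durations(df)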
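
The same "gaps between consecutive intervals" idea can also be written without an explicit loop. This is an alternative sketch in base R, not the code used in the talk: `gaps` is a hypothetical helper, the toy values are made up, and sorting by Start and skipping single-row groups are assumptions added here for robustness.

```r
# Hypothetical helper: gaps between consecutive intervals of one resource.
gaps <- function(proc) {
  n <- nrow(proc)
  if (n < 2) return(NULL)              # a single interval has no gap
  proc <- proc[order(proc$Start), ]    # make sure intervals are chronological
  data.frame(ResourceId = proc$ResourceId[1],
             Start = proc$End[-n],     # a gap starts when an interval ends
             End   = proc$Start[-1])   # and ends when the next one starts
}

toy <- data.frame(ResourceId = c(1, 1, 1, 2),
                  Start = c(0, 2, 5, 0),
                  End   = c(1, 3, 6, 4))

# split() gives one data frame per resource; do.call(rbind, ...) stacks the results.
all_gaps <- do.call(rbind, lapply(split(toy, toy$ResourceId), gaps))
all_gaps$Duration <- all_gaps$End - all_gaps$Start
all_gaps
```

Building the list first and binding once avoids growing a data frame inside a loop, which gets slow on large traces.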
  • Plot a bunch of things
ggplot(df_tau,aes(xmin=Start,xmax=End, ymin=ResourceId, ymax=ResourceId+1,fill=Value))+theme_bw()+geom_rect()

plot1.png

p <- ggplot(df_tau,aes(x=Duration,fill=Value))+theme_bw()+scale_x_log10()
p+geom_histogram(position="identity",alpha=.5,binwidth=.1)+facet_wrap(~Type,ncol=1)

plot2.png

  • Aggregate and replot
df_sum_tau <- ddply(df_tau, c("Value","ResourceId"),summarize, Duration = sum(Duration))
p <- ggplot(df_sum_tau, aes(x=Value, y=Duration,color=Value)) 
p + geom_boxplot() + geom_jitter(alpha=.4)

plot3.png
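
To see what the aggregation step computes, here is a sketch of the same kind of per-group sum using base R's `aggregate` on made-up data. The `toy` values are hypothetical, and this is an equivalent shown for illustration, not a replacement for the ddply call above.

```r
# Made-up durations for two resources and two kinds of events.
toy <- data.frame(Value = c("send", "send", "recv", "send", "recv"),
                  ResourceId = c(1, 1, 1, 2, 2),
                  Duration = c(1.5, 0.5, 2.0, 1.0, 3.0))

# One row per (Value, ResourceId) pair, with the durations summed.
toy_sum <- aggregate(Duration ~ Value + ResourceId, data = toy, FUN = sum)
toy_sum
```

Each row of `toy_sum` is then one point in the boxplot: the total time a given resource spent in a given kind of event.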

Take away

  • R is a great tool but is only a tool. There is no magic. You need to understand what you are doing and get minimal training in statistics.
  • It is one of the building blocks of reproducible research (the reproducible analysis block) and will save you a lot of time.
  • Read at least Jain's book: The Art of Computer Systems Performance Analysis. There are many other references on Jean-Marc's shelves or on mine.

Entered on [2012-09-12 Wed. 13:00]