
This document covers coding practices for setting up and running experiments, tools that help with statistical analysis and plotting, and ways to simplify code deployments.

Designing Experiments

There are several purposes for running systems experiments. These can include:

  1. to show a metric is possible (e.g., it is possible for technique A to run >1M operations per second)
  2. to validate a hypothesis (e.g., using X instead of Y has an effect and the effect size matters)
  3. to understand how a technique relates to prior work
  4. to understand the design trade-offs that affect a system's performance (however it is measured)

Running Experiments

 tl;dr: put all the raw data into something like a database.

You want your experiments to be reproducible, because they will fail the first (and second, and third) time, so you will want an easy way to examine and debug them. The following are some basic instructions.

Put things in a database (say, SQLite3). You probably want a schema similar to the following:

    experiment_name   // what experiment are you running?  
    run_id            // you probably want to run multiple times and compute mean/std/CI statistics
    dataset(s)        // if you are using different datasets 
    seed              // if there is ANY source of randomness in your experiment, use a seed so you can reproduce it later
    parameter(s)      // what are you varying?  Each should be an attribute.  If unused, set the value to NULL
    measure(s)        // what you are measuring.  precision, recall, latency etc.
                      // store the RAW measures that can be used to compute aggregate measures
    timestamp(s)      // in systems work, you should be collecting timestamps.  Store them!

Some rules of thumb:

  • Parameters go on the x axis
  • Metrics are computed from the measures, and go on the y axis
    • Never ever compute aggregate statistics in your code and log only those. Always log all of the data and compute statistics later!
  • Always use a seed and record what it is
  • If the above table gets too wide, it's ok to denormalize.
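Putting this together, here is a minimal sketch using Python's built-in sqlite3 module. The database path, table name, and columns are placeholders you should adapt to your own experiment; the point is that every run appends one row of raw data.

import sqlite3

def setup_db(path="results.db"):
    # "results.db" and the columns below are placeholders; adapt them to your experiment
    con = sqlite3.connect(path)
    con.execute("""
        CREATE TABLE IF NOT EXISTS results (
            experiment_name TEXT,     -- what experiment are you running?
            run_id          INTEGER,  -- repeat runs so you can compute mean/std/CI later
            dataset         TEXT,
            seed            INTEGER,  -- record ANY source of randomness
            window_size     INTEGER,  -- one attribute per parameter; NULL if unused
            latency_ms      REAL,     -- the RAW measure, never a pre-computed aggregate
            ts_start        TEXT,     -- timestamps
            ts_end          TEXT)""")
    return con

def log_run(con, **row):
    # call this from inside your experiment loop, once per run/measurement
    cols = ", ".join(row)
    qs = ", ".join("?" * len(row))
    con.execute(f"INSERT INTO results ({cols}) VALUES ({qs})", list(row.values()))
    con.commit()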

If you want to go whole hog, try ReproZip

Why Me?

Think that you're better than SQL? That you're a scripting expert who can quickly whip up analysis code? Some quotes from real grad students who have learned the hard way:

Student 1:

[7:19 PM]  
the worst thing is everytime i write analysis code, its useless the next day
i think Sanjay said that’s what n00b data scientists do
writing throw-away code
that i cant even parse the day after....

A Concrete Example

My student ran experiments and created two files containing the results:

    ============ File 1 ===========                        ============ File 2 ===========

# tableau dataset, 50 queries, using prune middle    # tableau dataset, 50 queries, not using middle prune
window_size edges_num mining_time                    window_size edges_num mining_time mapping_time
30 36384 15818ms 32149ms                             30 53053 27660ms 57623ms
10 13797 5138ms 12889ms                              10 19986 9136ms  24352ms
2 883 419ms 867ms                                    2 1251 544ms     1318ms

There are several issues that make it very hard to write maintainable plotting and analysis scripts:

  • it's the same experiment, but it is saved in two separate files that he now needs to manage manually
  • some experiment parameters, such as the dataset, number of queries, and pruning strategy, are encoded as comments
  • the timing values are suffixed with "ms", so they can't directly be interpreted as numbers
  • file 1 is missing the "mapping_time" attribute
  • the files are in an arbitrary space-separated format

Following the principles above, we first restructure the data file. It's more verbose, but you're generating the data automatically anyway.

exp,dataset,nqueries,prune,window_size,edges_num,mining_time,mapping_time
1,tableau,50,prune,30,36384,15818,32149
1,tableau,50,prune,10,13797,5138,12889
1,tableau,50,prune,2,883,419,867
1,tableau,50,noprune,30,53053,27660,57623
1,tableau,50,noprune,10,19986,9136,24352
1,tableau,50,noprune,2,1251,544,1318

Now we know exactly which experiment it is and all of its parameters, and we can run queries over the data and plot it.
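To produce rows in this shape directly from the experiment harness, here is a minimal sketch using csv.DictWriter. The run_experiment function is a hypothetical stand-in for your actual experiment code, assumed to return a dict of raw measures.

import csv

FIELDS = ["exp", "dataset", "nqueries", "prune",
          "window_size", "edges_num", "mining_time", "mapping_time"]

with open("results.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    writer.writeheader()
    for prune in ["prune", "noprune"]:
        for window_size in [30, 10, 2]:
            # run_experiment is a hypothetical stand-in for your harness; it should
            # return {"edges_num": ..., "mining_time": ..., "mapping_time": ...}
            measures = run_experiment(dataset="tableau", nqueries=50,
                                      prune=prune, window_size=window_size)
            writer.writerow(dict(exp=1, dataset="tableau", nqueries=50, prune=prune,
                                 window_size=window_size, **measures))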

Analyzing Experiments

Again, I suggest using a database like DuckDB and writing analyses as queries. Here's a simple starter script based on analyzing some TPC-H benchmark results:

import duckdb

setupqs = [
  # LOAD YOUR EXPERIMENT RESULTS
  "create table data as select * from read_csv_auto('./tpch_benchmark_notes_feb18_sf1.csv');",

  # Setup tables
  "create table b as select * from data where lineage_type = 'Baseline';",
  "create table s as select * from data where lineage_type = 'SD_Capture';",
]

db = duckdb.connect(":memory:")
for q in setupqs:
  db.execute(q)


#
# Run analyses
#
q1 = """SELECT s.query, (s.runtime-b.runtime)/b.runtime as overhead, s.runtime as s, b.runtime as b 
from s join b on s.query = b.query where s.query in (1,3,10, 12);"""
df = db.execute(q1).df()
print(q1)
print(df)

# plotting scripts.  example uses pygg
from pygg import *
p = ggplot(df, aes(x="query", y="overhead")) + geom_point()
ggsave("plot.png", p, width=6, height=4, scale=0.9)

Plotting Experiments

Simple advice

  • Did you compute aggregate values (e.g., mean latency)? Show the standard deviation or, more meaningfully, bootstrapped confidence intervals. See the appendix for UDF code that computes them in PostgreSQL; a Python sketch follows below.
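If your analysis lives in Python instead, here is a minimal sketch of a percentile bootstrap confidence interval for the mean, using numpy. The function name and defaults are my own choices, not from any particular library.

import numpy as np

def bootstrap_ci(values, n_boot=10000, alpha=0.05, seed=0):
    # percentile-bootstrap CI for the mean of the raw measurements
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # resample with replacement n_boot times and compute the mean of each resample
    means = rng.choice(values, size=(n_boot, len(values))).mean(axis=1)
    return np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])

# e.g., bootstrap_ci(latencies) -> array([lower, upper])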

Flow chart for picking plots

  • TBD

Tools

  • For the vast majority of plots, ggplot2 in R is the way to go. Find a tutorial and follow it.

  • If you use Python for everything like I do, try the pygg library. It gives you ggplot2 syntax in Python. It can't handle multiple layers, which you probably shouldn't be doing anyway. I use it in combination with some helper functions defined in wuutils. You can pip install them using

    pip install pygg wuutils

An Example

Let's assume we are using the cleaned experiment data from the example above:

exp,dataset,nqueries,prune,window_size,edges_num,mining_time,mapping_time
1,tableau,50,prune,30,36384,15818,32149
1,tableau,50,prune,10,13797,5138,12889
1,tableau,50,prune,2,883,419,867
1,tableau,50,noprune,30,53053,27660,57623
1,tableau,50,noprune,10,19986,9136,24352
1,tableau,50,noprune,2,1251,544,1318

Take a look at the pygg and wuutils documentation for installation instructions. Pygg assumes you have also installed R and ggplot2. After that, you can run:

import pandas
import duckdb
from pygg import *
from wuutils import *

# read csv and return a list of dictionaries.  One dict for each row
data = pandas.read_csv("data.csv").to_dict("records")

## Alternatively, if you stored data in a duckdb file:
# con = duckdb.connect("res.db")
# df = con.execute("SQL-BASED ANALYSIS QUERY").df()
## turn the pandas data frame into a list of dictionaries
# data = df.to_dict("records")

# from wuutils, fold the y metrics:
# The metric names will become values in the attribute "key", 
# and their values are in attribute val
data = fold(data, ["edges_num", "mining_time", "mapping_time"])

# plot it
p = ggplot(data, aes(x="window_size", y="val", color="prune"))
p += geom_line() + geom_point()
p += facet_wrap("~key", scales=esc("free_y"))
ggsave("pruning.png", p, width=8, height=3)

fold() is one of the utility functions in wuutils that I wrote for myself; it does the following. This is why we mapped y to val and are able to facet on the metrics (key):

# let the data be:    # you run:                # it outputs:
i,a,b                 fold(data, ['a', 'b'])    i,key,val
1,2,3                                           1,a,2
2,4,5                                           1,b,3
                                                2,a,4
                                                2,b,5
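The real wuutils implementation may differ, but a minimal fold-like sketch over a list of dictionaries looks roughly like this:

def fold(rows, metrics):
    # unpivot: each metric column becomes a (key, val) pair; other columns are kept
    out = []
    for row in rows:
        base = {k: v for k, v in row.items() if k not in metrics}
        for m in metrics:
            out.append(dict(base, key=m, val=row[m]))
    return out

# fold([{"i": 1, "a": 2, "b": 3}], ["a", "b"])
# -> [{"i": 1, "key": "a", "val": 2}, {"i": 1, "key": "b", "val": 3}]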

The final result is the plot below (pruning.png), which you can then tweak to look nice.

Presenting Experiments

These simple steps will help you better understand what's going on; they are also the steps for presenting an experiment. Do the following:

  1. List your hypothesis
  2. What ideal plots will help validate or debunk your hypothesis?
    1. Draw out what the plots should look like if your hypothesis is correct and if incorrect
    2. include x and y axes, shape of curves
  3. What will you do to generate these plots?
  4. Why do the plots look this way?
    1. Which parts confirm your hypothesis?
    2. What's surprising/does not confirm your hypothesis? Why?
  5. What are the next steps to
    1. answer 4ii
    2. use what you learned in your system
  6. Go to step 1

Deployments

You'll likely set up many application/server deployments as part of your research. Automate your deployments.

SSH

Fabric is great for orchestrating experiments and collecting results in parallel across many machines.
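For example, here is a minimal sketch using Fabric 2's ThreadingGroup; the host names, user, and paths are placeholders.

from fabric import ThreadingGroup

# host names, user, and paths are placeholders
hosts = ["node1.example.com", "node2.example.com"]
group = ThreadingGroup(*hosts, user="ubuntu")

# launch the experiment on every machine in parallel
group.run("cd ~/myexp && python run_experiments.py --out results.csv")

# pull the raw result files back, one per host
for cxn in group:
    cxn.get("myexp/results.csv", f"results_{cxn.host}.csv")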

Releasing Code

Releasing working code is a very good idea. If you are writing a Python application, make it pip installable. The following repo provides a skeleton and simple example code for structuring a Python package so it can be uploaded to PyPI.
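As a bare-bones illustration (the package and module names are placeholders, and a real project will want more metadata), a minimal setuptools setup.py looks like:

from setuptools import setup, find_packages

setup(
    name="mytool",                          # placeholder package name
    version="0.1.0",
    packages=find_packages(),
    install_requires=["duckdb", "pandas"],  # your runtime dependencies
    entry_points={
        # optional: expose a command-line entry point
        "console_scripts": ["mytool=mytool.cli:main"],
    },
)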

Appendix

Useful Code

PostgreSQL code for defining 95% confidence intervals

    DROP LANGUAGE IF EXISTS plpythonu;
    CREATE LANGUAGE plpythonu;

    DROP FUNCTION IF EXISTS ci_final(numeric[]) CASCADE;
    CREATE FUNCTION ci_final(vs numeric[])
    RETURNS numeric[]
    AS $$
      # the named argument is available directly as the Python variable "vs"
      from scikits import bootstrap
      lower, upper = bootstrap.ci([float(v) for v in vs], alpha=0.05)
      return [lower, upper]
    $$ LANGUAGE plpythonu;

    DROP AGGREGATE IF EXISTS ci(numeric);
    CREATE AGGREGATE ci (numeric) (
      SFUNC = array_append,
      STYPE = numeric[],
      initcond = '{}',
      FINALFUNC = ci_final
    );


    -- an example query (PostgreSQL arrays are 1-indexed, and the aggregate
    -- call must be parenthesized before subscripting)
    SELECT (ci(measure))[1] AS lower_bound,
           (ci(measure))[2] AS upper_bound
    FROM dataset;
