Code
This document covers coding for research: setting up and running experiments, tools that help with statistical analysis and plotting, and simplifying code deployments.
There are several purposes for running systems experiments. These can include:
- to show a metric is possible (e.g., it is possible for technique A to run >1M operations per second)
- to validate a hypothesis (e.g., using X instead of Y has an effect and the effect size matters)
- to understand how a technique relates to prior work
- to understand the design trade-offs that affect a system's performance (however it is measured)
tl;dr: put all the raw data into something like a database
You want your experiments to be reproducible because they will fail the first (and second, third) times, so you will want an easy way to examine and debug them. The following are some basic instructions.
Put things in a database (say, SQLite3). You probably want a schema similar to the following:
experiment_name // what experiment are you running?
run_id // you probably want to run multiple times and compute mean/std/CI statistics
dataset(s) // if you are using different datasets
seed // if there is ANY source of randomness in your experiment, use a seed so you can reproduce it later
parameter(s) // what are you varying? Each should be an attribute. If unused, set the value to NULL
measure(s) // what you are measuring. precision, recall, latency etc.
// store the RAW measures that can be used to compute aggregate measures
timestamp(s) // in systems work, you should be collecting timestamps. Store them!
Some rules of thumb:
- Parameters go on the x axis
- Metrics are computed from the measures, and go on the y axis
- Never ever compute aggregate statistics in your code and log only those. Always log all of the data and compute statistics later!
- Always use a seed and record what it is
- If the above table gets too wide, it's ok to denormalize.
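A minimal sketch of what logging raw results this way can look like, using Python's built-in sqlite3 (the table follows the schema above; the column names and values here are illustrative, not from a real experiment):

```python
import sqlite3, time

con = sqlite3.connect("results.db")
con.execute("""
    CREATE TABLE IF NOT EXISTS results (
        experiment_name TEXT,
        run_id          INTEGER,
        dataset         TEXT,
        seed            INTEGER,
        window_size     INTEGER,  -- parameter; NULL if unused
        prune           TEXT,     -- parameter; NULL if unused
        latency_ms      REAL,     -- raw measure: one row per measurement
        ts              REAL      -- timestamp the measurement was taken
    )""")

def log(experiment_name, run_id, dataset, seed, window_size, prune, latency_ms):
    con.execute("INSERT INTO results VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
                (experiment_name, run_id, dataset, seed,
                 window_size, prune, latency_ms, time.time()))
    con.commit()

# log every raw measurement; compute means/stds/CIs later with SQL
log("prune_vs_noprune", run_id=0, dataset="tableau", seed=42,
    window_size=30, prune="prune", latency_ms=15.8)
```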
If you want to go whole hog, try ReproZip
Think you're better than SQL? That you're a scripting expert who can quickly whip up analysis code? Some quotes from real grad students who have learned the hard way:
Student 1:
> [7:19 PM]
> the worst thing is everytime i write analysis code, its useless the next day
> i think Sanjay said that’s what n00b data scientists do
> writing throw-away code
> that i cant even parse the day after....
My student ran experiments and created two files containing the results:
============ File 1 ===========
# tableau dataset, 50 queries, using prune middle
window_size edges_num mining_time
30 36384 15818ms 32149ms
10 13797 5138ms 12889ms
2 883 419ms 867ms

============ File 2 ===========
# tableau dataset, 50 queries, not using middle prune
window_size edges_num mining_time mapping_time
30 53053 27660ms 57623ms
10 19986 9136ms 24352ms
2 1251 544ms 1318ms
There are several issues that make it very hard to write plotting and analysis scripts that are maintainable:
- it's the same experiment, but saved in two separate files, that he now needs to manage manually
- some experiment parameters, such as the dataset, number of queries, and pruning strategy are encoded as comments
- the timing values are suffixed with "ms", so they can't directly be interpreted as numbers
- file 1 is missing the "mapping_time" attribute
- the files use an arbitrary space-separated format
Following the principles above, we first restructure the data file. It's more verbose, but you're generating the data automatically anyway.
exp,dataset,nqueries,prune,window_size,edges_num,mining_time,mapping_time
1,tableau,50,prune,30,36384,15818,32149
1,tableau,50,prune,10,13797,5138,12889
1,tableau,50,prune,2,883,419,867
1,tableau,50,noprune,30,53053,27660,57623
1,tableau,50,noprune,10,19986,9136,24352
1,tableau,50,noprune,2,1251,544,1318
Now we know exactly which experiment it is and all of its parameters, and we can run queries over the data and plot it.
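For example, with the data in this shape, comparing the two pruning strategies is a single query. A sketch using DuckDB, assuming the CSV above is saved as `results.csv` (the file name is an assumption):

```python
import duckdb

con = duckdb.connect()
df = con.execute("""
    SELECT p.window_size,
           p.mining_time  AS prune_mining_ms,
           np.mining_time AS noprune_mining_ms,
           np.mining_time * 1.0 / p.mining_time AS speedup
    FROM read_csv_auto('results.csv') p
    JOIN read_csv_auto('results.csv') np
      USING (exp, dataset, nqueries, window_size)
    WHERE p.prune = 'prune' AND np.prune = 'noprune'
    ORDER BY p.window_size
""").df()
print(df)
```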
Again, I suggest using a database like DuckDB and writing analyses as queries. Here's a simple starter script based on analyzing some TPC-H benchmark results:
import duckdb
setupqs = [
# LOAD YOUR EXPERIMENT RESULTS
"create table data as select * from read_csv_auto('./tpch_benchmark_notes_feb18_sf1.csv');",
# Setup tables
"create table b as select * from data where lineage_type = 'Baseline';",
"create table s as select * from data where lineage_type = 'SD_Capture';",
]
db = duckdb.connect(":memory:")
for q in setupqs:
db.execute(q)
#
# Run analyses
#
q1 = """SELECT s.query, (s.runtime-b.runtime)/b.runtime as overhead, s.runtime as s, b.runtime as b
from s join b on s.query = b.query where s.query in (1,3,10, 12);"""
df = db.execute(q1).df()
print(q1)
print(df)
# plotting script; this example uses pygg
from pygg import *
p = ggplot(df, aes(x="query", y="overhead"))
ggsave("plot.png", p, width=6, height=4, scale=0.9)

Simple advice
- Did you compute aggregate values (e.g., mean latency)? Show the standard deviation or, more meaningfully, bootstrapped confidence intervals. See the appendix for UDF code to compute them in PostgreSQL.
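If you do your analysis in Python instead of the database, here is a minimal sketch using SciPy's bootstrap (available in SciPy 1.7+); the latency values are made up for illustration:

```python
import numpy as np
from scipy.stats import bootstrap

# raw per-request latencies (illustrative values); log all of these, not just their mean
latencies = np.array([12.1, 13.4, 11.8, 15.2, 12.9, 14.1, 13.7, 12.5])

res = bootstrap((latencies,), np.mean, confidence_level=0.95,
                n_resamples=10_000, random_state=0)
lo, hi = res.confidence_interval.low, res.confidence_interval.high
print(f"mean={latencies.mean():.2f}  95% CI=[{lo:.2f}, {hi:.2f}]")
```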
Flow chart for picking plots
- TBD
Tools
- For the vast majority of plots, `ggplot2` in R is the way to go. Find a tutorial and follow it.
- If you use python for everything like I do, try the pygg library. It gives you ggplot2 syntax in python. It can't handle multiple layers, which you probably shouldn't be doing anyways. I use it in combination with some helper functions defined in wuutils. You can pip install them using `pip install pygg wuutils`.
Let's assume we are using the cleaned experiment data from the example above:
exp,dataset,nqueries,prune,window_size,edges_num,mining_time,mapping_time
1,tableau,50,prune,30,36384,15818,32149
1,tableau,50,prune,10,13797,5138,12889
1,tableau,50,prune,2,883,419,867
1,tableau,50,noprune,30,53053,27660,57623
1,tableau,50,noprune,10,19986,9136,24352
1,tableau,50,noprune,2,1251,544,1318
Take a look at the documentation of pygg and wuutils for installation instructions. Pygg assumes you have also installed R and ggplot. After that, you can run:
import pandas
import duckdb
from pygg import *
from wuutils import *
# read csv and return a list of dictionaries. One dict for each row
data = pandas.read_csv("data.csv").to_dict("records")
## Alternatively, if you stored data in a duckdb file:
# con = duckdb.connect("res.db")
# data = con.execute("SQL-BASED ANALYSIS QUERY").df()
## turn the pandas data frame into a list of dictionaries
# data = data.to_dict("records")
# from wuutils, fold the y metrics:
# The metric names will become values in the attribute "key",
# and their values are in attribute val
data = fold(data, ["edges_num", "mining_time", "mapping_time"])
# plot it
p = ggplot(data, aes(x="window_size", y="val", color="prune"))
p += geom_line() + geom_point()
p += facet_wrap("~key", scales=esc("free_y"))
ggsave("pruning.png", p, width=8, height=3)
fold() is a function in wuutils, a set of utility functions I wrote for myself. It does the following, which is why we map y to val and can facet on the metric names (key):
# let the data be:
i,a,b
1,2,3
2,4,5

# you run:
fold(data, ['a', 'b'])

# it outputs:
i,key,val
1,a,2
1,b,3
2,a,4
2,b,5
The final result is below, which you can then tweak to look nice:

There are simple steps to better understand what's going on. These are also the steps for presenting an experiment. Do the following:
1. List your hypothesis
2. What ideal plots will help validate or debunk your hypothesis?
   - Draw out what the plots should look like if your hypothesis is correct and if it is incorrect
   - include x and y axes, shape of curves
3. What will you do to generate these plots?
4. Why do the plots look this way?
   - i. Which parts confirm your hypothesis?
   - ii. What's surprising/does not confirm your hypothesis? Why?
5. What are the next steps to
   - answer 4.ii
   - use what you learned in your system
6. Go to step 1
You'll likely set up many application/server deployments as part of research. Automate your deployments.
SSH
- google for "passwordless ssh"
- http://www.linuxproblem.org/art_9.html
Fabric is great for orchestrating and collecting experiments in parallel across many machines.
- Create a `fabfile.py` in your project directory:

        from fabric.api import run, env, local

        env.hosts = ['clic.cs.columbia.edu']

        def taskA():
            run('ls')     # runs on your env.hosts machines

        def taskB():
            local('ls')   # runs on your local machine

- Type the following in the same directory as `fabfile.py` to list the available commands: `fab -l`
- See the Fabric documentation for more.
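For experiments, a fabfile can launch runs on every machine and pull the raw results back. A sketch assuming Fabric 1.x (`fabric.api`); the host names, paths, and script name are placeholders:

```python
# fabfile.py
from fabric.api import run, env, get, parallel

env.hosts = ['host1.example.edu', 'host2.example.edu']  # placeholder hosts

@parallel
def run_experiment():
    # run the (hypothetical) experiment script remotely, then fetch its raw results
    run('cd myproject && python run_exp.py --seed 42')
    get('myproject/results.csv', 'results/%(host)s.csv')  # one local file per host
```

Run it with `fab run_experiment`; Fabric executes the task on every host in `env.hosts`.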
Releasing working code is a very good idea. If you are writing a Python application, make it pip installable. The following repo provides a skeleton and simple example code for structuring a Python package so it can be uploaded to PyPI.
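As a sketch of the minimal pieces involved (the package name, version, and dependencies below are placeholders, not from the repo):

```python
# setup.py: a minimal, pip-installable package definition
from setuptools import setup, find_packages

setup(
    name="mytool",                      # placeholder package name
    version="0.1.0",
    packages=find_packages(),
    install_requires=["duckdb", "pandas"],
    entry_points={
        # optional: expose a command-line entry point (module path is hypothetical)
        "console_scripts": ["mytool=mytool.cli:main"],
    },
)
```

With this in place, `pip install .` installs the package locally, and tools like `twine` upload it to PyPI.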
PostgreSQL code for defining 95% confidence intervals
DROP LANGUAGE IF EXISTS plpythonu;
CREATE LANGUAGE plpythonu;
DROP FUNCTION IF EXISTS ci_final(numeric[]) cascade;
CREATE FUNCTION ci_final(vs numeric[])
RETURNS numeric[]
as $$
    # vs is the array of raw measurements accumulated by the aggregate
    from scikits import bootstrap
    # bootstrap.ci returns the (lower, upper) bounds of a bootstrapped 95% CI
    return list(bootstrap.ci(sorted(vs), alpha=0.05))
$$ language plpythonu;
DROP AGGREGATE IF EXISTS ci(numeric);
CREATE AGGREGATE ci (numeric) (
SFUNC = array_append,
STYPE = numeric[],
initcond = '{}',
FINALFUNC = ci_final
);
-- an example query (note: PostgreSQL arrays are 1-indexed)
SELECT (ci(measure))[1] AS lower_bound,
       (ci(measure))[2] AS upper_bound
FROM dataset;