ocaml-multicore · christinerose · Aug 7, 2022 · Aug 8, 2022 · Sep 6, 2022 · Sep 6, 2022
diff --git a/README.md b/README.md
@@ -5,7 +5,7 @@ was run on the 19th of May 2022 at the [Tarides retreat](https://tarides.com/blo
 
 ## Installation
 
-This tutorial works on x86-64 and Arm64 architectures on Linux and macOS. 
+This tutorial works on x86-64 and Arm64 architectures on Linux and macOS. If you're new to OCaml, you will first need to [install opam](https://opam.ocaml.org/doc/Install.html), OCaml's package manager.
 
 Before we move on to the instructions, check your version of opam with `opam --version`, then follow the instructions below for your version. You can also quickly update to the latest version of opam (currently 2.1.2) by running:
 
@@ -32,7 +32,19 @@ eval $(opam env)
 ```
 
 Since we will be doing performance measurements, it is recommended that you also
-install [`hyperfine`](https://github.com/sharkdp/hyperfine). 
+install [`hyperfine`](https://github.com/sharkdp/hyperfine). Scroll down toward the bottom of its README to find the 
+installation instructions for various systems.
+
+----
+
+**Please note**: Many of the following exercises require access to files in this repo, so please clone it now 
+using `git clone https://github.com/kayceesrk/ocaml5-tutorial.git`, then `cd ocaml5-tutorial`.
+
+Throughout, we'll be using Dune, OCaml's build system, to convert our programs into executables. For more information on 
+Dune, please reference [Dune's documentation](https://dune.readthedocs.io/en/stable/).
+
+----
+
 
 ## Domains for Parallelism
 
@@ -63,9 +75,9 @@ I ran in parallel
 
 Use `Ctrl+D` to exit.
 
-(If you get the error "Cannot find file topfind," run `opam install ocamlfind`, part of the `findlib` package.)
+(If you get the error "Cannot find file topfind," run `opam install ocamlfind`, part of the `findlib` package, then run `eval $(opam env)` again.)
 
-The same example is also in [src/par.ml](src/par.ml):
+The same example is also in [src/par.ml](src/par.ml): 
 
 ```bash
 $ cat src/par.ml
@@ -79,7 +91,8 @@ $ dune exec src/par.exe
 I ran in parallel
 ```
 
-In this section of the tutorial, we will be running parallel programs. The
+
+In this next section of the tutorial, we will be running parallel programs. The
 results observed will be dependent on the number of cores that you have on your
 machine. I am writing this tutorial on an 2.3 GHz Quad-Core Intel Core i7
 MacBook Pro with 4 cores and 8 hardware threads. It is reasonable to expect a
@@ -88,8 +101,9 @@ Hyper-Threading gods are kind to us).
 
 ### Fibonacci Number
 
-We shall use the program to compute the nth Fibonacci number as the running
-example. The program is in [src/fib.ml](src/fib.ml).
+We shall use a program that computes the nth Fibonacci number as the running example.
+It is already in [src/fib.ml](src/fib.ml) and is reproduced here
+for your convenience:
 
 ```ocaml
 let n = try int_of_string Sys.argv.(1) with _ -> 40
@@ -103,7 +117,9 @@ let main () =
 let _ = main ()
 ```
 
-The program is a vanilla implementation of the Fibonacci function.
+The program is a vanilla implementation of the Fibonacci function. 
+First, we'll use Dune to turn it into an executable, then choose a number for n (40 or less), 
+and finally run the program with Hyperfine.
 
 ```bash
 $ dune build src/fib.exe
@@ -116,7 +132,7 @@ Benchmark 1: dune exec src/fib.exe 40
 On my machine, it takes 500ms to compute the 40th Fibonacci number.
 
 Spawned domains can be joined to get their results. The program
-[src/fib_twice.ml](src/fib_twice.ml) computes the nth Fibonacci number twice in
+[src/fib_twice.ml](src/fib_twice.ml), shown below, computes the nth Fibonacci number twice in
 parallel.
 
 ```ocaml
@@ -149,7 +165,7 @@ Benchmark 1: dune exec src/fib_twice.exe 40
 ```
 
 You can see that computing the nth Fibonacci number twice almost took the same
-time as computing it once thanks to parallelism.
+time as computing it once, thanks to parallelism.
 
 ### Nature of Domains
 
@@ -160,25 +176,25 @@ particular, each domain has its own minor heap area and major heap pools. Due to
 the overhead of domains, **the recommendation is that you spawn exactly one
 domain per available core.**
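+
+As a rough guide for choosing that number, the standard library's `Domain.recommended_domain_count` reports how many domains the runtime suggests running at once on the current machine. The snippet below is a minimal sketch and is not one of the files in this repo:
+
+```ocaml
+(* Sketch: ask the runtime how many domains it recommends running
+   simultaneously, counting the main domain. *)
+let () =
+  let n = Domain.recommended_domain_count () in
+  Printf.printf "Recommended number of domains: %d\n" n
+```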
 
-OCaml 5 GC is designed to be a low-latency garbage collector with short
+The OCaml 5 garbage collector (GC) is designed to be a low-latency GC with short
 stop-the-world pauses. Whenever a domain exhausts its minor heap arena, it calls
-for a stop-the-world, parallel minor GC, where all the domains collect their
-minor heaps. The domains also perform concurrent (not stop-the-world) collection
+for a stop-the-world, parallel minor GC. Here, all the domains collect their
+minor heaps. They also perform concurrent (not stop-the-world) collection
 of the major heap. The major collection cycle involves a number of very short
 stop-the-world pauses.
 
-Overall, the behaviour of OCaml 5 GC should match that of the OCaml 4 GC for
-sequential programs, and remains scalable and low-latency for parallel programs.
+Overall, the behaviour of the OCaml 5 GC should match that of the OCaml 4 GC for
+sequential programs, and it remains scalable and low-latency for parallel programs.
 For more information, please have a look at the [ICFP 2020 paper and talk on
 "Retrofitting Parallelism onto
 OCaml"](https://icfp20.sigplan.org/details/icfp-2020-papers/21/Retrofitting-Parallelism-onto-OCaml).
 
 ### Exercise ★★☆☆☆
 
 Compute the nth Fibonacci number in parallel by parallelising recursive calls.
-For this exercise, only spawn new domains for the top two recursive calls. You
+For this exercise, only spawn new domains for the top two recursive calls. Your
 program will only spawn two additional domains. The skeleton is in the file
-[src/fib_par.ml](src/fib_par.ml):
+[src/fib_par.ml](src/fib_par.ml), as shown below, to get you started:
 
 ```ocaml
 let n = try int_of_string Sys.argv.(1) with _ -> 40
@@ -199,7 +215,7 @@ let _ = main ()
 ```
 
 When you finish the exercise, you will notice that with 2 cores, the speed up is
-nowhere close to 2x. 
+nowhere close to 2x. Compare the timings of the two programs:
 
 ```bash
 % hyperfine 'dune exec src/fib.exe 42'
@@ -224,33 +240,33 @@ fib(n) = (fib(n-2) + fib(n-3)) + fib(n-2)
 The left recursive call does more work than the right branch. We shall get to 2x
 speedup eventually. First, we need to take a detour.
 
-## Inter-domain communication
+## Inter-Domain Communication
 
 `Domain.join` is a way to synchronize with the domain. OCaml 5 also provides
 other features for inter-domain communication.
 
-### DRF-SC guarantee
+### DRF-SC Guarantee
 
 OCaml has mutable reference cells and arrays. Can we share ref cells and arrays
 between multiple domains and access them in parallel? The answer is yes. But the
-value that may be returned by a read may not be the latest one written to that
-memory location due to the influence of compiler and hardware optimizations. The
+value that a read returns may not be the latest one written to that
+memory location, due to the influence of compiler and hardware optimizations. The
 description of the exact value returned by such racy accesses is beyond the
-scope of the tutorial. For more information on this, you should refer to the
+scope of the tutorial. For more information on this, refer to the
 [PLDI 2018 paper on "Bounding Data Races in Space and
 Time"](https://kcsrk.info/papers/pldi18-memory.pdf).
 
 OCaml reference cells and arrays are known as **non-atomic** data structures.
-Whenever two domains race to access a non-atomic memory location, and one of the
-access is a write, then we say that there is a **data race**. When your program
+Whenever two domains race to access a non-atomic memory location, and at least one of the
+accesses is a write, then we say that there is a **data race**. When your program
 does not have a data race, then the behaviours observed are **sequentially
-consistent** -- the observed behaviour can simply be understood as the
+consistent**. The observed behaviour can simply be understood as the
 interleaved execution of different domains. This guarantee is known as
 data-race-freedom sequential-consistency (DRF-SC).
 
 An important aspect of the OCaml 5 memory model is that, even if you program has
 data races, your program will not crash (memory safety). The recommendation for
-the OCaml user is that **avoid data races for ease of reasoning**.
+the OCaml user is to **avoid data races for ease of reasoning**.
 
 ### Atomics
 
@@ -269,7 +285,7 @@ Non-atomic ref count: 1101799
 Atomic ref count: 2000000
 ```
 
-Atomic module is used for low-level inter-domain communication. They are used
+The `Atomic` module provides low-level inter-domain communication. Atomics are used
 for implementing lock-free data structures. For example, the program
 [src/msg_passing.ml](src/msg_passing.ml) shows an implementation of message
 passing between domains. The program uses `get` and `set` on the atomic
@@ -281,9 +297,9 @@ since `r` is an atomic variable, it is not a data race.
 Hello
 ```
 
-### Compare-and-set
+### Compare-and-Set
 
-Atomic module also has `compare_and_set` primitive. `compare_and_set r old new`
+The Atomic module also has a `compare_and_set` primitive. `compare_and_set r old new`
 atomically compares the current value of the atomic reference `r` with the `old`
 value and replaces that with the `new` value. The program
 [src/incr_cas.ml](src/incr_cas.ml) shows how to implement atomic increment
@@ -312,15 +328,15 @@ is [src/prod_cons_nb.ml](src/prod_cons_nb.ml). Remember that
 physically match the current value of the atomic reference for the comparison to
 succeed.
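+
+As a minimal sketch of the retry-loop pattern (written for illustration and not necessarily identical to `src/incr_cas.ml`; the name `incr_cas` is hypothetical), an increment built on `compare_and_set` might look like this:
+
+```ocaml
+(* Re-read and retry until no other domain changed [r] in between.
+   Immediate integers are physically equal whenever they are equal,
+   so the comparison behaves as expected for an int counter. *)
+let rec incr_cas r =
+  let old = Atomic.get r in
+  if not (Atomic.compare_and_set r old (old + 1)) then incr_cas r
+
+let () =
+  let r = Atomic.make 0 in
+  let work () = for _ = 1 to 100_000 do incr_cas r done in
+  let d1 = Domain.spawn work and d2 = Domain.spawn work in
+  Domain.join d1;
+  Domain.join d2;
+  Printf.printf "count: %d\n" (Atomic.get r)
+```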
 
-### Blocking synchronization
+### Blocking Synchronisation
 
 The only primitive that we have seen so far that blocks a domain is
-`Domain.join`. OCaml 5 also provides blocking synchronization through
+`Domain.join`. OCaml 5 also provides blocking synchronisation through
 [`Mutex`](https://github.com/ocaml/ocaml/blob/trunk/stdlib/mutex.mli),
-[`Condition`](https://github.com/ocaml/ocaml/blob/trunk/stdlib/condition.mli)
+[`Condition`](https://github.com/ocaml/ocaml/blob/trunk/stdlib/condition.mli),
 and
 [`Semaphore`](https://github.com/ocaml/ocaml/blob/trunk/stdlib/semaphore.mli)
-modules. These are the same modules that are present in OCaml 4 to synchronize
+modules. These are the same modules present in OCaml 4 to synchronise
 between `Threads`. These modules have been lifted up to the level of domains.
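+
+To illustrate, here is a minimal sketch (not one of the files in this repo; `counter`, `m`, and `bump` are hypothetical names) in which two domains update a shared, non-atomic counter while holding a `Mutex`. Because every access happens under the lock, there is no data race:
+
+```ocaml
+(* A plain ref cell shared by two domains, protected by a mutex. *)
+let counter = ref 0
+let m = Mutex.create ()
+
+(* Increment the counter [n] times, taking the lock for each update. *)
+let bump n =
+  for _ = 1 to n do
+    Mutex.lock m;
+    incr counter;
+    Mutex.unlock m
+  done
+
+let () =
+  let d1 = Domain.spawn (fun () -> bump 100_000) in
+  let d2 = Domain.spawn (fun () -> bump 100_000) in
+  Domain.join d1;
+  Domain.join d2;
+  Printf.printf "count: %d\n" !counter
+```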
 
 #### Exercise ★★★☆☆
@@ -349,9 +365,9 @@ to learn about effect handlers, please do check out the [effect handlers
 tutorial in the OCaml 5 manual](https://kcsrk.info/webman/manual/effects.html).
 
 [Domainslib](https://github.com/ocaml-multicore/domainslib) is a library that
-provides support for nested-parallel programming, which is epitomized by
+provides support for nested-parallel programming, which is epitomised by
 the parallelism available in the recursive Fibonacci computation. At its core,
-`domainslib` has an efficient implementation of work-stealing queue in order to
+`domainslib` has an efficient implementation of a work-stealing queue in order to
 efficiently share tasks with other domains. 
 
 Let's first install `domainslib`:
@@ -360,7 +376,7 @@ Let's first install `domainslib`:
 % opam install domainslib
 ```
 
-### Async/await
+### Async/Await
 
 At its core, `domainslib` provides an
 [async/await](https://github.com/ocaml-multicore/domainslib/blob/b8de1f718804f64b158dd3bffda1b1c15ea90f29/lib/task.mli#L38-L49)
@@ -370,8 +386,8 @@ iterators](https://github.com/ocaml-multicore/domainslib/blob/b8de1f718804f64b15
 
 ### Parallel Fibonacci 
 
-Let us now parallelise Fibonacci using domainslib. The program is in the file
-[src/fib_domainslib.ml](src/fib_domainslib.ml):
+Let us now parallelise Fibonacci using `domainslib`. The program is in the file
+[src/fib_domainslib.ml](src/fib_domainslib.ml) and is shown below for convenience:
 
 ```ocaml
 module T = Domainslib.Task
@@ -401,7 +417,7 @@ The program takes the number of domains to use as the first argument and the
 input as the second argument. 
 
 Let's start with the main function. The first
-thing to do in order to use domainslib is to set up a pool of domains on which
+thing to do in order to use `domainslib` is to set up a pool of domains on which
 the nested parallel tasks will run. The domain invoking the `run` function will
 also participate in executing the tasks submitted to the pool. We invoke the
 parallel Fibonacci function `fib_par` in the `run` function. Finally, we
@@ -410,14 +426,14 @@ teardown the pool and print the result.
 For sufficiently large inputs (`n > 20`), the `fib_par` function spawns the left
 and the right recursive calls asynchronously in the pool using `async` function.
 `async` function returns a promise for the result. The result of an `async` is
-obtained by `await`ing on the promise, which may block if the promise is not
+obtained by `await`ing the promise, which may block if the promise is not
 resolved. 
 
 For small inputs, the function simply calls the sequential Fibonacci function.
 It is important to switch to sequential mode for small problem sizes. If not,
 the cost of parallelisation will outweigh the work available.
 
-Let's see how this program scales compared to our earlier implementations.
+Let's see how this program scales compared to our earlier implementations. Run the following:
 
 ```bash
 % hyperfine 'dune exec src/fib.exe 42'
@@ -436,7 +452,7 @@ Benchmark 1: dune exec src/fib_domainslib.exe 2 42
   Range (min … max):   662.0 ms … 692.1 ms    10 runs
 ```
 
-The domainslib version scales extremely well. This holds true even as the core
+The `domainslib` version scales extremely well. This holds true even as the core
 count increases. On a machine with 24 cores, for `fib(48)`,
 
 | Cores	| Time (Seconds)	| Vs Serial	| Vs Self |
@@ -459,9 +475,9 @@ let rec tak x y z =
   else z
 ```
 
-The skeleton file is in [src/tak_par.ml](src/tak_par.ml). Calculating the time
+The skeleton file shown above is in [src/tak_par.ml](src/tak_par.ml). Calculating the time
 complexity of `tak` function turns out to be tricky. Use `x < 20 && y < 20` as
-the sequential cutoff -- if the condition holds, call the sequential version of
+the sequential cutoff. If the condition holds, call the sequential version of
 `tak`.
 
 ```bash
@@ -480,17 +496,17 @@ Benchmark 3: dune exec solutions/tak_par.exe 4 36 24 12
 ```
 
 Observe that there is super-linear speedup going from the sequential version to
-the 2 core version! Why?
+the two-core version! Why?
 
 
 #### Exercise ★★★★★
 
 Implement a parallel version of merge sort. It easy to implement a version that
-doesn't scale :-) If you use a list for holding the intermediate results, GC
+doesn't scale :-). If you use a list for holding the intermediate results, the GC
 impact will kill scalability. 
 
-You should use an array for holding the elements to be sorted. The observation
-is that during the merge step, the length of the merged result is exactly the
+You should use an [array](https://v2.ocaml.org/api/Array.html) for holding the elements to be sorted. Observe
+that during the merge step, the length of the merged result is exactly the
 sum of the input arrays. Hence, one may use an additional array of the same size
 as the input array to hold the merge results.
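+
+As a hint only, the sequential merge step into such an auxiliary array could be sketched as follows. The helper `merge` and its argument order are assumptions made for illustration; they are not taken from the repo or its solutions:
+
+```ocaml
+(* Hypothetical helper: merge the sorted ranges src.[lo, mid) and
+   src.[mid, hi) into dst.[lo, hi). No allocation happens here. *)
+let merge src dst lo mid hi =
+  let i = ref lo and j = ref mid in
+  for k = lo to hi - 1 do
+    if !i < mid && (!j >= hi || src.(!i) <= src.(!j)) then begin
+      dst.(k) <- src.(!i);
+      incr i
+    end else begin
+      dst.(k) <- src.(!j);
+      incr j
+    end
+  done
+
+let () =
+  let src = [| 1; 4; 7; 2; 3; 9 |] in
+  let dst = Array.make (Array.length src) 0 in
+  merge src dst 0 3 6;
+  Array.iter (Printf.printf "%d ") dst;
+  print_newline ()
+```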
 
@@ -502,7 +518,7 @@ straight-forward way to parallelize such code. Lets take the
 benchmark from the computer language benchmarks game. The sequential version of
 the benchmark is available at [src/spectralnorm.ml](src/spectralnorm.ml).
 
-We can see that the program has several for loops. How do we which part of the
+We can see that the program has several for loops. How do we know which part of the
 program is amenable to parallelism? We can profile the program using `perf` to
 answer this. `perf` only works on Linux.
 
@@ -573,20 +589,20 @@ Benchmark 2: dune exec src/spectralnorm_par.exe 4 4096
 
 Implement parallel version of [Game of
 Life](https://en.wikipedia.org/wiki/Conway%27s_Game_of_Life) simulation. The
-sequential version is in [src/game_of_life.ml](src/game_of_life.ml). The
-sequential version takes the number of iterations and the board size as the
+sequential version is in [src/game_of_life.ml](src/game_of_life.ml). It 
+takes the number of iterations and the board size as the
 first and second arguments.
 
 You should modify [src/game_of_life_par.ml](src/game_of_life_par.ml) with the
 parallel version. Currently, this file is the same as the sequential version
 except that it takes the number of domains as the first argument, the number
-iterations as the second argument and the board size as the third argument.
+of iterations as the second argument, and the board size as the third argument.
 
 
-#### Parallelising mandelbrot
+#### Parallelising Mandelbrot
 
 Let's parallelise something more tricky -- the [sequential version of
-mandelbrot](https://benchmarksgame-team.pages.debian.net/benchmarksgame/program/mandelbrot-ocaml-6.html)
+Mandelbrot](https://benchmarksgame-team.pages.debian.net/benchmarksgame/program/mandelbrot-ocaml-6.html)
 from the computer language benchmarks game. The sequential version is available
 in [src/mandelbrot.ml](src/mandelbrot.ml).