Advanced-R-Solutions/2-09-Functionals.Rmd at master · anouel/Advanced-R-Solutions · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
```{r, include=FALSE}
source("common.R")
```

# (PART) Functional programming {-}

# Functionals

## Prerequisites

```{r, message=FALSE}
library(purrr)
```

## My first functional: `map()`

1. __<span style="color:red">Q</span>__: Use `as_mapper()` to explore how purrr generates anonymous functions for the integer, character, and list helpers. What helper allows you to extract attributes? Read the documentation to find out.

   __<span style="color:green">A</span>__: `map()` offers multiple ways (functions, formulas and extractor functions) to specify the function argument (`.f`). Initially, the various inputs have to be transformed into a valid function, which is then applied. The creation of this valid function is the job of `as_mapper()` and it is called every time `map()` is used.

   Given character, numeric or list input `as_mapper()` will create an extractor function. Characters select by name, while numeric input selects by positions and a list allows a mix of these two approaches. This extractor interface can be very useful, when working with nested data.

   The extractor function is implemented as a call to `purrr::pluck()`, which accepts a list of accessors (accessors "access" some part of your data object).

    ```{r}
    as_mapper(c(1, 2))
    as_mapper(c("a", "b"))
    as_mapper(list(1, "b"))
    ```

   Besides mixing positions and names, it is also possible to pass along an accessor function. This is basically an anonymous function, that gets information about some aspect of the input data. You are free to define your own accessor functions.

   If you need to access certain attributes, the helper `attr_getter(y)` is already predefined and will create the appropriate accessor function for you.

    ```{r}
    # define custom accessor function
    get_class <- function(x) attr(x, "class")
    pluck(mtcars, get_class)

    # use attr_getter() as a helper
    pluck(mtcars, attr_getter("class"))
    ```


2. __<span style="color:red">Q</span>__: `map(1:3, ~ runif(2))` is a useful pattern for generating random numbers, but `map(1:3, runif(2))` is not. Why not? Can you explain why it returns the result that it does?

   __<span style="color:green">A</span>__: The first pattern creates random numbers, because `~ runif(2)` successfully uses the formula interface. Internally `map()` applies `as_mapper()` to this formula, which converts `~ runif(2)` into an anonymous function. Afterwards `runif(2)` is applied three times (one time during each iteration), leading to three different pairs of random numbers.

   In the second pattern `runif(2)` is supplied as an atomic vector. Consequently `as_mapper()` creates an extractor function based on the return values from `runif(2)` (via `pluck()`). This leads to three `NULL`s (`pluck()`'s `.default` return), because no values corresponding to the index can be found.

    ```{r}
    map(1:3, ~ runif(2))  # uses formular interface
    map(1:3, runif(2))  # uses extractor interface
    ```

3. __<span style="color:red">Q</span>__: Use the appropriate `map()` function to:

   a) Compute the standard deviation of every column in a numeric data frame.

   a) Compute the standard deviation of every numeric column in a mixed data frame. (Hint: you'll need to do it in two steps.)

   a) Compute the number of levels for every factor in a data frame.

   __<span style="color:green">A</span>__: To solve this exercise we take advantage of calling the type stable variants of `map()`, which give us more concise output. We also use `purrr::keep()` to initially select the matching columns of the data frames. (`keep()` is introduced in the [predicate functionals](https://adv-r.hadley.nz/functionals.html#predicate-functionals) section of the "Functionals"-chapter).

    ```{r}
    map_dbl(mtcars, sd)
    map_dbl(keep(iris, is.numeric), sd)
    map_int(keep(iris, is.factor), ~ length(levels(.x)))
    ```

4. __<span style="color:red">Q</span>__: The following code simulates the performance of a t-test for non-normal data. Extract the p-value from each test, then visualise.

    ```{r}
    trials <- map(1:100, ~ t.test(rpois(10, 10), rpois(7, 10)))
    ```

   __<span style="color:green">A</span>__: `pluck()` allows us to elegantly extract the p-values.  We can then pass a data frame directly to `ggplot()` for the visualisation. For randomly generated data, we can expect a uniform distribution of the p-values.

    ```{r, message = FALSE}
    library(ggplot2)

    map_dbl(trials, "p.value") %>%
      tibble::tibble(`p_value` = .) %>%
      ggplot(aes(x = p_value, fill = p_value < 0.05)) +
      geom_dotplot(binwidth = .025) +
      ggtitle("Distribution of p-values for random poisson data.")
    ```

<!-- I would like to use `map_dfr` in the example above, but the required names for dplyr::bind_cols are missing...-->

5. __<span style="color:red">Q</span>__: The following code uses a map nested inside another map to apply a function to every element of a nested list. Why does it fail, and what do you need to do to make it work?

    ```{r, error = TRUE}
    x <- list(
      list(1, c(3, 9)),
      list(c(3, 6), 7, c(4, 7, 6))
    )

    triple <- function(x) x * 3
    map(x, map, .f = triple)
    ```

   __<span style="color:green">A</span>__: This function call fails, because `triple()` is specified as the `.f` argument and consequently belongs to the outer `map()`. The unnamed argument `map` is treated as an argument of `triple()`, which causes the error.

   If we switch the naming of the argument, the nested transformations work as expected:

    ```{r}
    map(x, .f = map, triple)
    ```

6. __<span style="color:red">Q</span>__: Use `map()` to fit linear models to the `mtcars` using the formulas stored in this list:

    ```{r}
    formulas <- list(
      mpg ~ disp,
      mpg ~ I(1 / disp),
      mpg ~ disp + wt,
      mpg ~ I(1 / disp) + wt
    )
    ```

   __<span style="color:green">A</span>__: The data (`mtars`) is constant for all these models and we iterate over the `formulas` provided. Because the formula is the first argument of a `lm()`-call, it doesn't need to be specified explicitly.

    ```{r}
    map(formulas, lm, data = mtcars) %>%
      map(coef)  # shortens output
    ```

7. __<span style="color:red">Q</span>__: Fit the model `mpg ~ disp` to each of the bootstrap replicates of `mtcars` in the list below, then extract the $R^2$ of the model fit (Hint: you can compute the $R^2$ with `summary()`)

    ```{r}
    bootstrap <- function(df) {
      df[sample(nrow(df), replace = TRUE), , drop = FALSE]
    }

    bootstraps <- map(1:10, ~ bootstrap(mtcars))
    ```

   __<span style="color:green">A</span>__: To accomplish this task, we take advantage of the "list in, list out"-functionality of `map()`. This allows us to chain multiple transformation together. We start by fitting the models. We then calculate the summaries and extract the $R^2$ values. For the last call we use `map_dbl`, which provides convenient output.

    ```{r}
    bootstraps %>%
      map(~ lm(mpg ~ disp, data = .x)) %>%
      map(summary) %>%
      map_dbl("r.squared")
    ```

## Map variants

1. __<span style="color:red">Q</span>__: Explain the results of `modify(mtcars, 1)`.

   __<span style="color:green">A</span>__: `modify()` is based on `map()`, and in this case, the extractor interface will be used. It extracts the first element of each column in `mtcars`. `modify()` always returns the same structure as its input:  in this case it forces the first row to be recycled 32 times. (Internally `modify()` uses `.x[] <- map(.x, .f, ...)` for assignment.)

2. __<span style="color:red">Q</span>__: Rewrite the following code to use `iwalk()` instead of `walk2()`. What are the advantages and disadvantages?

    ```{r, eval = FALSE}
    cyls <- split(mtcars, mtcars$cyl)
    paths <- file.path(temp, paste0("cyl-", names(cyls), ".csv"))
    walk2(cyls, paths, write.csv)
    ```

   __<span style="color:green">A</span>__: With `iwalk()` it is possible combine the name creation and file saving with one function call. It is not necessary to create intermediate objects, which won't be used any further. Unfortunately, the function turns out quite long and it may be difficult to understand it immediately.

    ```{r, eval = FALSE}
    temp_dir <- tempfile()
    dir.create(temp_dir)

    split(mtcars, mtcars$cyl) %>%
      iwalk(~ write.csv(.x, file.path(temp_dir, paste0("cyl-", .y, ".csv"))))

    list.files(temp_dir)
    #> [1] "cyl-4.csv" "cyl-6.csv" "cyl-8.csv"
    ```

3. __<span style="color:red">Q</span>__: Explain how the following code transforms a data frame using functions stored in a list.

    ```{r}
    trans <- list(
      disp = function(x) x * 0.0163871,
      am = function(x) factor(x, labels = c("auto", "manual"))
    )

    vars <- names(trans)
    mtcars[vars] <- map2(trans, mtcars[vars], function(f, var) f(var))
    ```

   Compare and contrast the `map2()` approach to this `map()` approach:

    ```{r, eval = FALSE}
    mtcars[vars] <- map(vars, ~ trans[[.x]](mtcars[[.x]]))
    ```

   __<span style="color:green">A</span>__: In the first approach the list of functions and the appropriately selected data frame columns are supplied to `map2()`. `map2()` creates an anonymous function `f(var)` which applies the functions to the variables when `map2()` iterates over their (similar) index. On the left hand side the regarding elements of `mtcars` are being replaced by their new transformations.

   The `map()` variant does basically the same. However, it directly iterates over the names of the transformations. Therefore, the data frame columns are selected during the iteration.

   Besides the iteration pattern, the approaches differ in the possibilities for appropriate argument naming in the `.f` argument. In the `map2()` approach we iterate over the elements of `x` and `y`. Therefore, it is possible to choose appropriate placeholders like `f` and `var`. This can make the body of the anonymous function quite expressive. A small downside is that this is less compact than the usage of a formula. However, a formula would only allow the usage of `.x` and `.y` shortcuts, which can be - again - less expressive: `mtcars[vars] <- map2(trans, mtcars[vars], ~ .x(.y))`. In the `map()` approach we map over the variable names. It is therefore not possible to introduce placeholders for the function and variable names. The formula syntax together with the `.x` shortcut is pretty compact. The object names and the brackets indicate clearly the application of transformaions to specific columns of `mtcars`. In this case the iteration over the variable names comes in handy, as it highlights the importance of matching between `trans` and `mtcars` element names. Together with the replacement form on the left hand side, this lines is relatively easy to inspect. To summarise, in situations where `map()` and `map2()` provide solutions for an iteration problem, several points are to consider before deciding for one or the other approach.

4. __<span style="color:red">Q</span>__: What does `write.csv()` return? i.e. what happens if you use it with `map2()` instead of `walk2()`?

   __<span style="color:green">A</span>__: `write.csv()` returns `NULL`. In the example above we iterated over a list of data frames and file names a named list of `NULL`s would be returned.

    ```{r, eval=FALSE}
    cyls <- split(mtcars, mtcars$cyl)
    paths <- file.path(temp_dir, paste0("cyl-", names(cyls), ".csv"))

    map2(cyls, paths, write.csv)
    #> $`4`
    #> NULL
    #>
    #> $`6`
    #> NULL
    #>
    #> $`8`
    #> NULL
    ```

## Predicate Functionals

1. __<span style="color:red">Q</span>__: Why isn't `is.na()` a predicate function? What base R function is closest to being a predicate version of `is.na()`?

   __<span style="color:green">A</span>__: `is.na` is not a predicate function, because a predicate function may only return `TRUE` or `FALSE`. This is not strictly the case for `is.na` (e.g. `is.na(NULL)` returns `logical(0)`). It may be, that `anyNA()`, if applied elementwise, may be closest to a predicate-`is.na()` in base R.

<!-- @hadley: is this correct? -->

2. __<span style="color:red">Q</span>__: `simple_reduce()` has a problem when `x` is length 0 or length 1. Describe the source of the problem and how you might go about fixing it.

    ```{r}
    simple_reduce <- function(x, f) {
      out <- x[[1]]
      for (i in seq(2, length(x))) {
        out <- f(out, x[[i]])
      }
      out
    }
    ```

   __<span style="color:green">A</span>__: The loop inside `simple_reduce()` always starts with the index 2. Therefore, subsetting length-0 and length-1 vectors via `[[` will lead to the error *subscript out of bounds*. To avoid this, we allow `simple_reduce()` to `return()` before the for-loop is started and include default argument for 0-length vectors.

    ```{r}
    simple_reduce <- function(x, f, default) {
      if(length(x) == 0L) return(default)
      if(length(x) == 1L) return(x[[1L]])

      out <- x[[1]]
      for (i in seq(2, length(x))) {
        out <- f(out, x[[i]])
      }
      out
    }
    ```

   Our new new `simple_reduce()` now works as intended:

    ```{r, error = TRUE}
    simple_reduce(integer(0), `+`)
    simple_reduce(integer(0), `+`, default = 0L)
    simple_reduce(1, `+`)
    simple_reduce(1:3, `+`)
    ```

3. __<span style="color:red">Q</span>__: Implement the `span()` function from Haskell: given a list `x` and a predicate function `f`, `span(x, f)` returns the location of the longest sequential run of elements where the predicate is true. (Hint: you might find `rle()` helpful.)

   __<span style="color:green">A</span>__: Our `span_r()` function returns the list indices of the (first occuring) longest sequential run of elements where the predicate is true. In case the predicate is not true for any list element, `NA_integer` gets returned.

    ```{r}
    span_r <- function(x, f) {
      index_lgl <- map_lgl(x, ~ f(.x))
      index_lgl <- unname(index_lgl)

      # The interesting part of rle is in $lengths and $values
      sequences <- rle(index_lgl)

      # In case of no true $values, we return NA_integer
      if (!any(sequences$values)) {return(NA_integer_)}

      # For further calculations we need to find the $length index of the (first
      # appearing) longest sequence of trues
      index_seq <- which.max(
        sequences$lengths == max(sequences$lengths[sequences$values]) &
        sequences$values)

      # This allows us to calculate the start and end index of the longest sequence
      index_start <- sum(sequences$lengths[seq_len(index_seq - 1)]) + 1L
      index_end   <- sum(sequences$lengths[seq_len(index_seq)])

      # Now, it's straight forward to return the regarding sequence
      index_start:index_end
    }

    # Tests
    span_r(iris, is.numeric)
    span_r(iris, is.factor)
    span_r(iris, is.character)
    ```

4. __<span style="color:red">Q</span>__: Implement `arg_max()`. It should take a function and a vector of inputs, and return the elements of the input where the function returns the highest value. For example, `arg_max(-10:5, function(x) x ^ 2)` should return -10. `arg_max(-5:5, function(x) x ^ 2)` should return `c(-5, 5)`. Also implement the matching `arg_min()` function.

   __<span style="color:green">A</span>__: Both functions take a vector of inputs and a function as an argument. The functions output are then used to subset the input accordingly.

    ```{r}
    arg_max <- function(x, f){
      x[f(x) == max(f(x))]
    }

    arg_min <- function(x, f){
      x[f(x) == min(f(x))]
    }

    arg_max(-10:5, function(x) x ^ 2)
    arg_min(-10:5, function(x) x ^ 2)
    ```

   Both functions are actually quite similar, so it would have also been possible to pass an option (`max` or `min`) to an argument.

    ```{r}
    arg_ <- function(g, x, f){
      x[f(x) == g(f(x))]
    }

    arg_(max, -10:5, function(x) x ^ 2)
    arg_(min, -10:5, function(x) x ^ 2)
    ```

5. __<span style="color:red">Q</span>__: The function below scales a vector so it falls in the range [0, 1]. How would you apply it to every column of a data frame? How would you apply it to every numeric column in a data frame?

    ```{r}
    scale01 <- function(x) {
      rng <- range(x, na.rm = TRUE)
      (x - rng[1]) / (rng[2] - rng[1])
    }
    ```

   __<span style="color:green">A</span>__: To apply a function to every function of a data frame, we can use `purrr::modify`, which also conveniently returns a data frame. To limit the application to numeric columns, the scoped versions `modify_if()` can be used.

    ```{r, eval = FALSE}
    modify_if(iris, is.numeric, scale01)
    ```

## Base functionals

1. __<span style="color:red">Q</span>__: How does `apply()` arrange the output? Read the documentation and perform some experiments.

   __<span style="color:green">A</span>__: Basically `apply()` applies a function over the margins of an array. In the two dimensional case, the margins are just the rows and columns of a matrix. Let's make this concrete.

    ```{r}
    arr2 <- array(1:9, dim = c(3, 3),
                  dimnames = list(row = paste0("row", 1:3),
                                  col = paste0("col", 1:3)))
    arr2
    ```

   When we apply the `head()` function over a margin of `arr2`, i.e. the rows, the results are contained in the columns of the output:

    ```{r}
    apply(X = arr2, MARGIN = "row", FUN = head, 2)
    ```

   Obviously the rows are orderd by the other input's margin (the original columns). For higher dimensional arrays this could become quite ambiguous. However, with a bit of experimantation, one can find out that `apply()` arranges the order of the of it's output directly in the order of the other array dimensions.

2. __<span style="color:red">Q</span>__: What do `eapply()` and `rapply()` do? Does purrr have equivalents?

   __<span style="color:green">A</span>__: `eapply()` is a variant of `lapply()`, which iterates over the (named) elements of an environment. In purrr there is no equivalent for `eapply()` as purrr mainly provides functions that operate on vectors and functions, but not on environments.

   `rapply()` applies a function to all elements of a list recursively. This function makes it possible to limit the application of the function to specified classes (default `classes = ANY`). One may also specify how elements of other classes should remain: i.e. as their identity (`how = replace`) or another value (`deflt = NULL`). Again purrr doesn't provide an equivalent to this function.

3. __<span style="color:red">Q</span>__: Challenge: read about the [fixed point algorithm](https://mitpress.mit.edu/sites/default/files/sicp/full-text/book/book-Z-H-12.html#%25_idx_1096). Complete the exercises using R.

   __<span style="color:green">A</span>__: A number $x$ is called a fixed point of a function $f$, if it satisfies the equation $f(x) = x$. For some functions we may find a fixed point by beginning with a starting value and applying $f$ repeatedly. Here `find_fixed_point()` acts as a functional, because it takes a function as an argument.

    ```{r}
    find_fixed_point <- function(f, x_start = 1, n_max = 10000, tol = 0.0001) {
      # Initialize
      n <- 1
      x <- x_start

      while (n < n_max) {
        # Compute function and test for fixed point quality
        y <- f(x)
        is_fixed_point <- all.equal(x, y, tolerance = tol)  == TRUE

        if(is_fixed_point){
          # Success case
          message("Fixed point was found, after ", n, " iterations.")
          return(x)
        } else {
          # Recursive case
          x <- y
          n <- n + 1
        }
      }
      if(!is_fixed_point){
        # Non-converging case
        message("No fixed point found.")
      }
    }
    ```

    ```{r}
    # Functions with fixed points
    find_fixed_point(sin, x_start = 1)
    find_fixed_point(cos, x_start = 1)

    # Functions without fixed points
    add_one <- function(x) x + 1
    find_fixed_point(add_one, x_start = 1)
    ```