6 Introduction to simulation
6.1 Motivation
Creating simulated data set is a common habit to test your methodology and get familiar with it. Papers based on simulated data are usually the most profitable ones.However, writing code for simulation is much more effective if you do not use million of loops, because the script can easily become messy and hard to read.
That is the reason, while we learn some theoretical concepts about how to effectively write simulations.
6.2 The for loop
Sadly, a common way to write iteration with loops.
for (i in 1:5) {
print(i)
}
#> [1] 1
#> [1] 2
#> [1] 3
#> [1] 4
#> [1] 5
This is a simple syntax for printing the numbers from 1 to 5. for
loop works as a snippet, so you have to only type “for” and press TAB
after. This will create the frame work (spaces, inputs, parentheses) of the loop. You should see then the following:
for (variable in vector) {
}
#> Error in for (variable in vector) {: invalid for() loop sequence
Now you have to specify the input: the variable, the vector and the code to evaluate.
6.3 Replicate
replicate
is a base R function, but can be very useful. An important difference is that with loop, you can refer to an iterator (i = which iteration runs?). But that is something you do not need in simulative studies, so i
would be only a redundant variable in your environment. Lets generate a random set from normal distribution.
rnorm(n = 10, mean = 0, sd = 1)
#> [1] 1.53071072 1.11572128 -1.75680548 0.72440956 -0.42208816 -0.81778805
#> [7] 0.24025204 -0.03659535 1.64694641 0.59957410
Lets generate 5 sets with replicate.
replicate(n = 5, rnorm(n = 10, mean = 0, sd = 1))
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] -1.0724047 0.1045381 -0.6821441 0.50552179 -2.1724049
#> [2,] -0.5170869 -1.3029887 -0.3407339 -0.26928148 0.7668037
#> [3,] 0.5090504 -0.2540999 0.2150593 -0.68754642 -0.2515032
#> [4,] 0.8061063 2.6600309 -0.5483644 -1.44132060 0.1812080
#> [5,] -0.1964774 1.6424674 1.2690177 0.57432540 -0.8727565
#> [6,] 1.0609074 -0.3389844 0.8434366 0.16891814 0.5702774
#> [7,] 0.4528373 0.2114825 -0.5460979 -0.51564255 -0.5184491
#> [8,] 0.1612659 -0.8790304 0.3836906 0.75262964 0.4892118
#> [9,] -0.3785246 -1.1249163 1.5422579 -0.42230031 -1.2927494
#> [10,] -1.2660102 -0.1263657 -1.0911051 -0.02975666 1.6548887
The result is a matrix, but if you wish to keep the output in list format you can write the following:
replicate(n = 5, rnorm(n = 10, mean = 0, sd = 1), simplify = FALSE)
#> [[1]]
#> [1] 1.2736958 -0.6688812 -1.3302128 0.5530845 -0.1607211 0.1988171
#> [7] 0.1283606 0.3499840 0.7122512 0.8923841
#>
#> [[2]]
#> [1] -0.47361369 0.74955592 -0.39576259 -0.94361428 -0.46896246 0.02996324
#> [7] 0.21263376 0.66298590 0.24973523 -0.59405192
#>
#> [[3]]
#> [1] 3.22318891 0.86463075 -0.45963259 1.60003864 -1.17521562 1.30286863
#> [7] 0.12279515 -0.22744636 1.00398994 -0.07896539
#>
#> [[4]]
#> [1] -2.39293563 -0.45604575 0.15624406 -0.72389236 0.49026891 1.57235957
#> [7] -2.39689102 0.91221690 -0.01008771 -1.01214036
#>
#> [[5]]
#> [1] -2.8694232 2.5789621 0.5585526 1.0734009 -0.3474956 0.8562965
#> [7] 0.5296800 0.0107040 0.5775843 0.9275607
But how to use the simulated data, what we have created? Lets recall, that we can use a function on each elements of a list or vector with the map
function. (map_dbl
if the output contains only one number per element)
replicate(n = 5, rnorm(n = 10, mean = 0, sd = 1), simplify = FALSE) %>%
map(mean) # mean of trajectories as a list
#> [[1]]
#> [1] 0.4386977
#>
#> [[2]]
#> [1] -0.4110464
#>
#> [[3]]
#> [1] -0.2545167
#>
#> [[4]]
#> [1] -0.1256227
#>
#> [[5]]
#> [1] 0.4292195
replicate(n = 5, rnorm(n = 10, mean = 0, sd = 1), simplify = FALSE) %>%
map_dbl(mean) # mean of trajectories as a vector
#> [1] -0.77554025 -0.12528062 0.05073575 -0.09084027 0.15321429
The advantage of ignoring loops from your script for simulation is the increase in speed. In this aspect, it is important to note that you can use parallel computing to make your code even faster5.
6.4 The tidy framework
From the previous chapters we already know that we nest even data frames (and any other object) into a data frame.
Lets create a data frame first with the corresponding parameters.
simulation_df <- tibble(
m = 100, # mean
s = 20, # sd
)
simulation_df
#> # A tibble: 1 x 2
#> m s
#> <dbl> <dbl>
#> 1 100 20
Lets add a set of random data from normal distribution, with the given mean.
simulation_df %>%
mutate(
data = map(m, ~ rnorm(n = 10, mean = .)) # . refers to the m which is mean
)
#> # A tibble: 1 x 3
#> m s data
#> <dbl> <dbl> <list>
#> 1 100 20 <dbl [10]>
We see that a numeric vector (double) is nested into a cell of the data frame. Are we interested in the minimum of the simulated data? Lets use map again.
simulation_df %>%
mutate(
data = map(m, ~ rnorm(n = 10, mean = .)),
minimum = map(data, min) # minimum of the value on `data` column
)
#> # A tibble: 1 x 4
#> m s data minimum
#> <dbl> <dbl> <list> <list>
#> 1 100 20 <dbl [10]> <dbl [1]>
But the new column still contains a vector, we do not see the value even if it is just a number. Remember, if we know that the result is single number, then we can convert it to numeric vector with map_dbl
.
simulation_df %>%
mutate(
data = map(m, ~ rnorm(n = 10, mean = .)),
minimum = map_dbl(data, min) # minimum of the value on `data` column
)
#> # A tibble: 1 x 4
#> m s data minimum
#> <dbl> <dbl> <list> <dbl>
#> 1 100 20 <dbl [10]> 97.4
Now the expected mean of the generated data is 100, but the standard deviation is still 1 (the default value). We can control for 2 input with map2
.
simulation_df %>%
mutate(
data = map2(m, s, ~ rnorm(n = 10, mean = .x, sd = .y)), # .x refers to m, .y to s
minimum = map_dbl(data, min)
)
#> # A tibble: 1 x 4
#> m s data minimum
#> <dbl> <dbl> <list> <dbl>
#> 1 100 20 <dbl [10]> 61.2
What about the size of the sample? We need a new frame and a new function: pmap
. Use this map type function (same functionality) when you have more than 2 inputs.
simulation_df <- tibble(
m = 100, # mean
s = 20, # sd
n = 1000, # sample size
)
simulation_df
#> # A tibble: 1 x 3
#> m s n
#> <dbl> <dbl> <dbl>
#> 1 100 20 1000
simulation_df %>%
mutate(
data = pmap(list(m, s, n), function(m, s, n) rnorm(n = n, mean = m, sd = s))
)
#> # A tibble: 1 x 4
#> m s n data
#> <dbl> <dbl> <dbl> <list>
#> 1 100 20 1000 <dbl [1,000]>
But why do we worked so much to generate a simply trajectory? We could do this simply with the rnorm
function. It is true, but most of the time you have to generate multiple random data sets with different arguments. Lets see how to manage this with the current frame work.
simulation_df <- tibble(
m = c(1, 10, 100), # mean
s = c(3, 10, 50), # sd
n = c(10, 100, 10000) # sample size
)
simulation_df
#> # A tibble: 3 x 3
#> m s n
#> <dbl> <dbl> <dbl>
#> 1 1 3 10
#> 2 10 10 100
#> 3 100 50 10000
simulation_df %>%
mutate(
data = pmap(list(m, s, n), function(m, s, n) rnorm(n = n, mean = m, sd = s)),
minimum = map_dbl(data, min),
maximum = map_dbl(data, max)
)
#> # A tibble: 3 x 6
#> m s n data minimum maximum
#> <dbl> <dbl> <dbl> <list> <dbl> <dbl>
#> 1 1 3 10 <dbl [10]> -2.41 8.37
#> 2 10 10 100 <dbl [100]> -15.9 28.5
#> 3 100 50 10000 <dbl [10,000]> -103. 278.
So we see that the variables definied in the mutate
function will be calculated for every input combination in the frame. How about to evaluate it multiple times? We have to duplicate the lines.
slice
can be used to select a given row with number.
simulation_df %>%
slice(2) # select the 2nd line
#> # A tibble: 1 x 3
#> m s n
#> <dbl> <dbl> <dbl>
#> 1 10 10 100
But what if the number appears multiple times?
simulation_df %>%
slice(c(2, 2)) # select the 2nd line 2 times
#> # A tibble: 2 x 3
#> m s n
#> <dbl> <dbl> <dbl>
#> 1 10 10 100
#> 2 10 10 100
So we need to select each line multiple times. We can do it with the rep
function.
rep(c(1, 2, 3), times = 3)
#> [1] 1 2 3 1 2 3 1 2 3
rep(c(1, 2, 3), each = 3)
#> [1] 1 1 1 2 2 2 3 3 3
simulation_df %>%
slice(rep(1:n(), each = 2)) # select all the lines 2 times
#> # A tibble: 6 x 3
#> m s n
#> <dbl> <dbl> <dbl>
#> 1 1 3 10
#> 2 1 3 10
#> 3 10 10 100
#> 4 10 10 100
#> 5 100 50 10000
#> 6 100 50 10000
simulation_df %>%
slice(rep(1:n(), each = 2)) %>% # select all the lines 2 times
mutate(
data = pmap(list(m, s, n), function(m, s, n) rnorm(n = n, mean = m, sd = s)),
minimum = map_dbl(data, min),
maximum = map_dbl(data, max)
)
#> # A tibble: 6 x 6
#> m s n data minimum maximum
#> <dbl> <dbl> <dbl> <list> <dbl> <dbl>
#> 1 1 3 10 <dbl [10]> -4.69 4.59
#> 2 1 3 10 <dbl [10]> -4.56 5.78
#> 3 10 10 100 <dbl [100]> -12.6 31.6
#> 4 10 10 100 <dbl [100]> -16.3 29.8
#> 5 100 50 10000 <dbl [10,000]> -110. 304.
#> 6 100 50 10000 <dbl [10,000]> -75.0 274.
6.5 Case Study
How does the kurtosis change if the sample size or the standard deviation changes?
tibble(n = 10^(0:5)) %>%
crossing(s = 2^(0:5)) # match all possible combination
#> # A tibble: 36 x 2
#> n s
#> <dbl> <dbl>
#> 1 1 1
#> 2 1 2
#> 3 1 4
#> 4 1 8
#> 5 1 16
#> 6 1 32
#> 7 10 1
#> 8 10 2
#> 9 10 4
#> 10 10 8
#> # ... with 26 more rows
tibble(n = 10^(1:5)) %>%
crossing(s = 2^(0:5)) %>% # match all possible combination
mutate(
data = map2(n, s, ~ rnorm(n = .x, sd = .y)) # generate the data
)
#> # A tibble: 30 x 3
#> n s data
#> <dbl> <dbl> <list>
#> 1 10 1 <dbl [10]>
#> 2 10 2 <dbl [10]>
#> 3 10 4 <dbl [10]>
#> 4 10 8 <dbl [10]>
#> 5 10 16 <dbl [10]>
#> 6 10 32 <dbl [10]>
#> 7 100 1 <dbl [100]>
#> 8 100 2 <dbl [100]>
#> 9 100 4 <dbl [100]>
#> 10 100 8 <dbl [100]>
#> # ... with 20 more rows
tibble(n = 10^(1:5)) %>%
crossing(s = 2^(0:5)) %>% # match all possible combination
mutate(
data = map2(n, s, ~ rnorm(n = .x, sd = .y)),
kurt = map_dbl(data, moments::kurtosis) # calculate the kurtosis of the data
)
#> # A tibble: 30 x 4
#> n s data kurt
#> <dbl> <dbl> <list> <dbl>
#> 1 10 1 <dbl [10]> 1.61
#> 2 10 2 <dbl [10]> 3.43
#> 3 10 4 <dbl [10]> 2.01
#> 4 10 8 <dbl [10]> 1.75
#> 5 10 16 <dbl [10]> 2.56
#> 6 10 32 <dbl [10]> 2.27
#> 7 100 1 <dbl [100]> 2.67
#> 8 100 2 <dbl [100]> 3.02
#> 9 100 4 <dbl [100]> 2.60
#> 10 100 8 <dbl [100]> 2.65
#> # ... with 20 more rows
tibble(n = 10^(1:5)) %>%
crossing(s = 2^(0:5)) %>%
slice(rep(1:n(), each = 100)) %>% # increase the iterations to 100
mutate(
data = map2(n, s, ~ rnorm(n = .x, sd = .y)),
kurt = map_dbl(data, moments::kurtosis)
) %>%
ggplot() + # plot
aes(kurt) +
geom_density() +
facet_grid(n ~ s)
Conclusion #1: The kurtosis of the sample size did not change if we increase the standard deviation or the sample size. But is has became concentreted to 3, when the size of the sample increased.
Conclusion #2: We could write a complex simulation with 7 simple codes (while 4 lines are related to the plot).