2 Markdown & Tidyverse
2.1 Markdown
Last week we wrote our code in the Source pane, and if you save it and revisit the file in your file explorer, you can see that the extension of the file is .R. You can open the file with any other text editor (like Notepad) and see the code. In a .R file you can write your code and comments, and if you hit ctrl + shift + enter, all lines will be evaluated. Most of the time this is the goal of using .R files: we want to reuse the code frequently (like downloading data from a website).
Another possible extension for your files is .Rmd, which stands for R Markdown. In this file format you can combine text, R code and its output into one single document. From now on, we will use R Markdown during the class. Please visit the following website to find useful examples: https://rmarkdown.rstudio.com/articles_intro.html
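A minimal .Rmd file looks something like this (just a sketch; the title and the chunk content are placeholders):

---
title: "My first R Markdown document"
output: html_document
---

Some explanatory text, and then an R code chunk:

```{r}
1 + 1
```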
2.2 Introduction to the tidyverse
We will work with the tidyverse today. You can simply install it from CRAN (go to the Packages pane and click the Install button). But the tidyverse is not a simple package. “The tidyverse is a set of packages that work in harmony because they share common data representations […]” It contains the most important packages that you’re likely to use in everyday data analyses:
ggplot2, for data visualisation.
dplyr, for data manipulation.
tidyr, for data tidying.
readr, for data import.
purrr, for functional programming.
tibble, for tibbles, a modern re-imagining of data frames.
stringr, for strings.
forcats, for factors.
After you have installed the packages, you can load the core packages with the `library` command. You will receive some messages (about the attached packages and conflicts), but do not worry, this is fine.
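For example:

library(tidyverse) # attaches ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, forcats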
2.3 The %>% operator
This operator can be typed with ctrl + shift + m; it is denoted as %>%, but we pronounce it “pipe”. To understand its relevance, you should think about it as “… and then”. This operator forwards the value of the previous expression into the next function as its first unspecified input. Let’s see an example:
We first generate a numeric vector of draws from a standard normal distribution.
norm_sample <- rnorm(n = 1000, mean = 0, sd = 1) # standard normal
AND THEN we visualize its distribution with a histogram.
hist(norm_sample)
Notice that we had to assign the norm_sample object just to use it for a single graph. If you are working on a project, it becomes confusing to have tons of single-use objects in your environment (like DataFrameAfterCleaningStep1, DataFrameAfterCleaningStep2, DataFrameAfterCleaningStep3, etc.).
With the pipe operator we can embed several steps into one single workflow. This way we do not have to assign the norm_sample object: we simply generate the random values AND THEN draw their distribution.
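The same example with the pipe, without creating an intermediate object:

rnorm(n = 1000, mean = 0, sd = 1) %>% # generate 1000 standard normal values...
  hist() # ...AND THEN draw their histogram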
You may see that the title of the plot has changed (the graph itself probably changed a bit as well, since the chances are low that we generated exactly the same 1000 random values). This is because the pipe and many other tidyverse functions work with a lambda-like framework. This means that you can refer to the input value (as long as there is only one single input) with the . placeholder.
Example:
(2 + 2) %>%
{. * .}
#> [1] 16
The result is 16, since 2 + 2 = 4, and 4 * 4 equals 16.
The pipe may seem irrelevant at first, but it is one of the most powerful tools in R when you have complex data manipulation steps. We will cover the core functions of dplyr in this chapter, which appear in the following video (37:30 - 38:30); once you are familiar with them, you will see the motivation behind the pipe.
2.4 Tibble, count, filter, select, arrange
2.4.1 Tibble
Let’s import a dataset (fertility rates) from the OECD webpage. Download the data in csv format and copy it into your working directory.
fertility_df <- read_csv("DP_LIVE_22092021161631568.csv")
read_csv is imported from the readr package, so its output follows the tidyverse principles. fertility_df is a data.frame, but it is also a tibble, which means that it will be printed nicely on the console: only the first 10 rows appear, and only as many columns as fit without a line break (similar to a pandas DataFrame in Python).
fertility_df
#> # A tibble: 3,193 x 8
#> LOCATION INDICATOR SUBJECT MEASURE FREQUENCY TIME Value `Flag Codes`
#> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <lgl>
#> 1 AUS FERTILITY TOT CHD_WOMAN A 1960 3.45 NA
#> 2 AUS FERTILITY TOT CHD_WOMAN A 1961 3.55 NA
#> 3 AUS FERTILITY TOT CHD_WOMAN A 1962 3.43 NA
#> 4 AUS FERTILITY TOT CHD_WOMAN A 1963 3.34 NA
#> 5 AUS FERTILITY TOT CHD_WOMAN A 1964 3.15 NA
#> 6 AUS FERTILITY TOT CHD_WOMAN A 1965 2.97 NA
#> 7 AUS FERTILITY TOT CHD_WOMAN A 1966 2.89 NA
#> 8 AUS FERTILITY TOT CHD_WOMAN A 1967 2.85 NA
#> 9 AUS FERTILITY TOT CHD_WOMAN A 1968 2.89 NA
#> 10 AUS FERTILITY TOT CHD_WOMAN A 1969 2.89 NA
#> # ... with 3,183 more rows
2.4.2 Count
With the count function you can simply count the number of appearances of each level of a given column. Example:
count(fertility_df, LOCATION)
#> # A tibble: 54 x 2
#> LOCATION n
#> <chr> <int>
#> 1 ARG 60
#> 2 AUS 60
#> 3 AUT 61
#> 4 BEL 60
#> 5 BGR 60
#> 6 BRA 60
#> 7 CAN 60
#> 8 CHE 61
#> 9 CHL 60
#> 10 CHN 60
#> # ... with 44 more rows
Now we can see that we have 60 observations for ARG. If you are interested in which country we have the most data points for, you can use the sort = TRUE option.
count(fertility_df, LOCATION, sort = TRUE)
#> # A tibble: 54 x 2
#> LOCATION n
#> <chr> <int>
#> 1 AUT 61
#> 2 CHE 61
#> 3 CZE 61
#> 4 DNK 61
#> 5 EST 61
#> 6 FIN 61
#> 7 FRA 61
#> 8 LUX 61
#> 9 NOR 61
#> 10 NZL 61
#> # ... with 44 more rows
You can also use the pipe here (it would not be a good solution to assign a new data frame to each count table):
fertility_df %>% # %>% = the 1st argument of count function is the fertility_df data.frame
count(LOCATION)
#> # A tibble: 54 x 2
#> LOCATION n
#> <chr> <int>
#> 1 ARG 60
#> 2 AUS 60
#> 3 AUT 61
#> 4 BEL 60
#> 5 BGR 60
#> 6 BRA 60
#> 7 CAN 60
#> 8 CHE 61
#> 9 CHL 60
#> 10 CHN 60
#> # ... with 44 more rows
You may specify multiple columns; this way the frequency of every value combination appears.
fertility_df %>%
count(LOCATION, TIME, sort = T)
#> # A tibble: 3,193 x 3
#> LOCATION TIME n
#> <chr> <dbl> <int>
#> 1 ARG 1960 1
#> 2 ARG 1961 1
#> 3 ARG 1962 1
#> 4 ARG 1963 1
#> 5 ARG 1964 1
#> 6 ARG 1965 1
#> 7 ARG 1966 1
#> 8 ARG 1967 1
#> 9 ARG 1968 1
#> 10 ARG 1969 1
#> # ... with 3,183 more rows
Now we see that we have only one observation for each country and year, so this data frame was originally tidy (this is usually not the case). Let’s look at, for example, the quarterly GDP data from the OECD webpage.
gdp_df <- read_csv("DP_LIVE_22092021163414835.csv")
Using the count function, we find that the MEASURE, SUBJECT & FREQUENCY columns have multiple levels.
gdp_df %>%
count(MEASURE, SUBJECT, FREQUENCY)
#> # A tibble: 5 x 4
#> MEASURE SUBJECT FREQUENCY n
#> <chr> <chr> <chr> <int>
#> 1 IDX VOLIDX A 1764
#> 2 IDX VOLIDX Q 7127
#> 3 PC_CHGPP TOT A 2329
#> 4 PC_CHGPP TOT Q 9472
#> 5 PC_CHGPY TOT Q 9385
Let’s say we are interested only in the observations where SUBJECT is equal to TOT (total), MEASURE is equal to PC_CHGPY (percentage change on the same period of the previous year) and FREQUENCY is not equal to A (only quarterly observations). You can select rows with the filter function. Its 1st argument is the data frame you want to modify, and the other arguments are conditional statements. A row is kept only if all the statements are TRUE.
filter(gdp_df, SUBJECT == "TOT", MEASURE == "PC_CHGPY", FREQUENCY != "A")
#> # A tibble: 9,385 x 8
#> LOCATION INDICATOR SUBJECT MEASURE FREQUENCY TIME Value `Flag Codes`
#> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr>
#> 1 DEU QGDP TOT PC_CHGPY Q 1961-Q1 7.45 E
#> 2 DEU QGDP TOT PC_CHGPY Q 1961-Q2 5.00 E
#> 3 DEU QGDP TOT PC_CHGPY Q 1961-Q3 3.58 E
#> 4 DEU QGDP TOT PC_CHGPY Q 1961-Q4 3.00 E
#> 5 DEU QGDP TOT PC_CHGPY Q 1962-Q1 3.47 E
#> 6 DEU QGDP TOT PC_CHGPY Q 1962-Q2 5.11 E
#> 7 DEU QGDP TOT PC_CHGPY Q 1962-Q3 5.47 E
#> 8 DEU QGDP TOT PC_CHGPY Q 1962-Q4 4.76 E
#> 9 DEU QGDP TOT PC_CHGPY Q 1963-Q1 -0.372 E
#> 10 DEU QGDP TOT PC_CHGPY Q 1963-Q2 3.02 E
#> # ... with 9,375 more rows
Just like count, we can use this function with the %>% operator. Within filter you can use & instead of , between the conditions, and if you wish to express “OR”, you can use | between two statements.
gdp_df %>%
filter(SUBJECT == "TOT" & MEASURE == "PC_CHGPY" & FREQUENCY != "A")
#> # A tibble: 9,385 x 8
#> LOCATION INDICATOR SUBJECT MEASURE FREQUENCY TIME Value `Flag Codes`
#> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr>
#> 1 DEU QGDP TOT PC_CHGPY Q 1961-Q1 7.45 E
#> 2 DEU QGDP TOT PC_CHGPY Q 1961-Q2 5.00 E
#> 3 DEU QGDP TOT PC_CHGPY Q 1961-Q3 3.58 E
#> 4 DEU QGDP TOT PC_CHGPY Q 1961-Q4 3.00 E
#> 5 DEU QGDP TOT PC_CHGPY Q 1962-Q1 3.47 E
#> 6 DEU QGDP TOT PC_CHGPY Q 1962-Q2 5.11 E
#> 7 DEU QGDP TOT PC_CHGPY Q 1962-Q3 5.47 E
#> 8 DEU QGDP TOT PC_CHGPY Q 1962-Q4 4.76 E
#> 9 DEU QGDP TOT PC_CHGPY Q 1963-Q1 -0.372 E
#> 10 DEU QGDP TOT PC_CHGPY Q 1963-Q2 3.02 E
#> # ... with 9,375 more rows
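For example, a quick sketch of an “OR” condition with the same columns (keep the rows where the measure is either of the two percentage-change types):

gdp_df %>%
  filter(MEASURE == "PC_CHGPP" | MEASURE == "PC_CHGPY")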
2.4.3 Mutate
In the gdp_df data frame you can find the TIME column, whose class is character. Let’s say we want to change this to date. Functions related to dates are in the lubridate package. Since the values of the TIME column consist of a year and a quarter, we will use the yq function to convert them to dates. But we should not forget that this only works for the data points where the FREQUENCY is not annual, so let’s keep the previous filter. If we want to modify a given column of a data frame, we use the mutate function. With the pipe operator we can now combine the filter and mutate commands.
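Before combining them, a quick check of what yq does (a year-quarter string is parsed into the first day of that quarter):

lubridate::yq("1961-Q1")
#> [1] "1961-01-01"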
gdp_df %>%
filter(SUBJECT == "TOT" & MEASURE == "PC_CHGPY" & FREQUENCY != "A") %>%
mutate(TIME = lubridate::yq(TIME))
#> # A tibble: 9,385 x 8
#> LOCATION INDICATOR SUBJECT MEASURE FREQUENCY TIME Value `Flag Codes`
#> <chr> <chr> <chr> <chr> <chr> <date> <dbl> <chr>
#> 1 DEU QGDP TOT PC_CHGPY Q 1961-01-01 7.45 E
#> 2 DEU QGDP TOT PC_CHGPY Q 1961-04-01 5.00 E
#> 3 DEU QGDP TOT PC_CHGPY Q 1961-07-01 3.58 E
#> 4 DEU QGDP TOT PC_CHGPY Q 1961-10-01 3.00 E
#> 5 DEU QGDP TOT PC_CHGPY Q 1962-01-01 3.47 E
#> 6 DEU QGDP TOT PC_CHGPY Q 1962-04-01 5.11 E
#> 7 DEU QGDP TOT PC_CHGPY Q 1962-07-01 5.47 E
#> 8 DEU QGDP TOT PC_CHGPY Q 1962-10-01 4.76 E
#> 9 DEU QGDP TOT PC_CHGPY Q 1963-01-01 -0.372 E
#> 10 DEU QGDP TOT PC_CHGPY Q 1963-04-01 3.02 E
#> # ... with 9,375 more rows
Now that TIME is a date column, we can add an additional condition: we want to analyze only the latest values (where TIME is at its maximum).
gdp_df %>%
filter(SUBJECT == "TOT" & MEASURE == "PC_CHGPY" & FREQUENCY != "A") %>%
mutate(TIME = lubridate::yq(TIME)) %>%
filter(TIME == max(TIME))
#> # A tibble: 49 x 8
#> LOCATION INDICATOR SUBJECT MEASURE FREQUENCY TIME Value `Flag Codes`
#> <chr> <chr> <chr> <chr> <chr> <date> <dbl> <chr>
#> 1 DEU QGDP TOT PC_CHGPY Q 2021-04-01 9.41 P
#> 2 IND QGDP TOT PC_CHGPY Q 2021-04-01 20.9 <NA>
#> 3 TUR QGDP TOT PC_CHGPY Q 2021-04-01 21.4 <NA>
#> 4 IDN QGDP TOT PC_CHGPY Q 2021-04-01 7.19 <NA>
#> 5 LVA QGDP TOT PC_CHGPY Q 2021-04-01 10.8 <NA>
#> 6 CZE QGDP TOT PC_CHGPY Q 2021-04-01 8.18 <NA>
#> 7 BRA QGDP TOT PC_CHGPY Q 2021-04-01 12.4 <NA>
#> 8 POL QGDP TOT PC_CHGPY Q 2021-04-01 11.0 <NA>
#> 9 MEX QGDP TOT PC_CHGPY Q 2021-04-01 19.5 P
#> 10 CAN QGDP TOT PC_CHGPY Q 2021-04-01 12.7 <NA>
#> # ... with 39 more rows
Let’s see where we find the highest values. We can change the order of the rows with the arrange function. If you want a decreasing order, you should wrap the column name in the desc function.
gdp_df %>%
filter(SUBJECT == "TOT" & MEASURE == "PC_CHGPY" & FREQUENCY != "A") %>%
mutate(TIME = lubridate::yq(TIME)) %>%
filter(TIME == max(TIME)) %>%
arrange(desc(Value))
#> # A tibble: 49 x 8
#> LOCATION INDICATOR SUBJECT MEASURE FREQUENCY TIME Value `Flag Codes`
#> <chr> <chr> <chr> <chr> <chr> <date> <dbl> <chr>
#> 1 GBR QGDP TOT PC_CHGPY Q 2021-04-01 22.2 <NA>
#> 2 TUR QGDP TOT PC_CHGPY Q 2021-04-01 21.4 <NA>
#> 3 IRL QGDP TOT PC_CHGPY Q 2021-04-01 21.1 <NA>
#> 4 IND QGDP TOT PC_CHGPY Q 2021-04-01 20.9 <NA>
#> 5 ESP QGDP TOT PC_CHGPY Q 2021-04-01 19.8 P
#> 6 MEX QGDP TOT PC_CHGPY Q 2021-04-01 19.5 P
#> 7 FRA QGDP TOT PC_CHGPY Q 2021-04-01 18.7 <NA>
#> 8 HUN QGDP TOT PC_CHGPY Q 2021-04-01 17.7 <NA>
#> 9 NZL QGDP TOT PC_CHGPY Q 2021-04-01 17.4 <NA>
#> 10 ITA QGDP TOT PC_CHGPY Q 2021-04-01 17.3 <NA>
#> # ... with 39 more rows
Finally, we want to remove the unused columns. You can select (and rename) columns with the select function.
gdp_df %>%
filter(SUBJECT == "TOT" & MEASURE == "PC_CHGPY" & FREQUENCY != "A") %>%
mutate(TIME = lubridate::yq(TIME)) %>%
filter(TIME == max(TIME)) %>%
arrange(desc(Value)) %>%
select(geo = LOCATION, gdp_change = Value) # keep LOCATION & Value, renamed to geo & gdp_change
#> # A tibble: 49 x 2
#> geo gdp_change
#> <chr> <dbl>
#> 1 GBR 22.2
#> 2 TUR 21.4
#> 3 IRL 21.1
#> 4 IND 20.9
#> 5 ESP 19.8
#> 6 MEX 19.5
#> 7 FRA 18.7
#> 8 HUN 17.7
#> 9 NZL 17.4
#> 10 ITA 17.3
#> # ... with 39 more rows
Alternative notation with select:
gdp_df %>%
select(1) # select the 1st column
#> # A tibble: 30,077 x 1
#> LOCATION
#> <chr>
#> 1 OECD
#> 2 OECD
#> 3 OECD
#> 4 OECD
#> 5 OECD
#> 6 OECD
#> 7 OECD
#> 8 OECD
#> 9 OECD
#> 10 OECD
#> # ... with 30,067 more rows
gdp_df %>%
select(-1) # omit the 1st column
#> # A tibble: 30,077 x 7
#> INDICATOR SUBJECT MEASURE FREQUENCY TIME Value `Flag Codes`
#> <chr> <chr> <chr> <chr> <chr> <dbl> <chr>
#> 1 QGDP TOT PC_CHGPP A 1962 5.70 <NA>
#> 2 QGDP TOT PC_CHGPP A 1963 5.20 <NA>
#> 3 QGDP TOT PC_CHGPP A 1964 6.38 <NA>
#> 4 QGDP TOT PC_CHGPP A 1965 5.35 <NA>
#> 5 QGDP TOT PC_CHGPP A 1966 5.75 <NA>
#> 6 QGDP TOT PC_CHGPP A 1967 3.96 <NA>
#> 7 QGDP TOT PC_CHGPP A 1968 5.92 <NA>
#> 8 QGDP TOT PC_CHGPP A 1969 5.57 <NA>
#> 9 QGDP TOT PC_CHGPP A 1970 3.94 <NA>
#> 10 QGDP TOT PC_CHGPP A 1971 3.70 <NA>
#> # ... with 30,067 more rows
gdp_df %>%
select(1:2) # select all the columns between the 1st and the 2nd
#> # A tibble: 30,077 x 2
#> LOCATION INDICATOR
#> <chr> <chr>
#> 1 OECD QGDP
#> 2 OECD QGDP
#> 3 OECD QGDP
#> 4 OECD QGDP
#> 5 OECD QGDP
#> 6 OECD QGDP
#> 7 OECD QGDP
#> 8 OECD QGDP
#> 9 OECD QGDP
#> 10 OECD QGDP
#> # ... with 30,067 more rows
gdp_df %>%
select(LOCATION:TIME) # select all the columns between the LOCATION and the TIME column
#> # A tibble: 30,077 x 6
#> LOCATION INDICATOR SUBJECT MEASURE FREQUENCY TIME
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 OECD QGDP TOT PC_CHGPP A 1962
#> 2 OECD QGDP TOT PC_CHGPP A 1963
#> 3 OECD QGDP TOT PC_CHGPP A 1964
#> 4 OECD QGDP TOT PC_CHGPP A 1965
#> 5 OECD QGDP TOT PC_CHGPP A 1966
#> 6 OECD QGDP TOT PC_CHGPP A 1967
#> 7 OECD QGDP TOT PC_CHGPP A 1968
#> 8 OECD QGDP TOT PC_CHGPP A 1969
#> 9 OECD QGDP TOT PC_CHGPP A 1970
#> 10 OECD QGDP TOT PC_CHGPP A 1971
#> # ... with 30,067 more rows
gdp_df %>%
select(TIME, LOCATION, everything()) # select all the columns, but put TIME & LOCATION first
#> # A tibble: 30,077 x 8
#> TIME LOCATION INDICATOR SUBJECT MEASURE FREQUENCY Value `Flag Codes`
#> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr>
#> 1 1962 OECD QGDP TOT PC_CHGPP A 5.70 <NA>
#> 2 1963 OECD QGDP TOT PC_CHGPP A 5.20 <NA>
#> 3 1964 OECD QGDP TOT PC_CHGPP A 6.38 <NA>
#> 4 1965 OECD QGDP TOT PC_CHGPP A 5.35 <NA>
#> 5 1966 OECD QGDP TOT PC_CHGPP A 5.75 <NA>
#> 6 1967 OECD QGDP TOT PC_CHGPP A 3.96 <NA>
#> 7 1968 OECD QGDP TOT PC_CHGPP A 5.92 <NA>
#> 8 1969 OECD QGDP TOT PC_CHGPP A 5.57 <NA>
#> 9 1970 OECD QGDP TOT PC_CHGPP A 3.94 <NA>
#> 10 1971 OECD QGDP TOT PC_CHGPP A 3.70 <NA>
#> # ... with 30,067 more rows
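You can also select columns with the tidyselect helper functions; for example, starts_with() keeps the columns whose names begin with a given string (“S” here is just an illustration):

gdp_df %>%
  select(starts_with("S")) # keeps the SUBJECT column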
2.5 Group_by, summarise
Let’s look at another data source. You can easily access Eurostat tables with the eurostat package.
eurostat::search_eurostat("birth")
#> # A tibble: 284 x 8
#> title code type `last update of~ `last table str~ `data start` `data end`
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 Live b~ demo~ data~ 30.06.2021 23.02.2021 1990 2019
#> 2 Live b~ demo~ data~ 01.07.2021 01.07.2021 2013 2019
#> 3 Live b~ demo~ data~ 30.06.2021 23.02.2021 1990 2019
#> 4 Popula~ cens~ data~ 26.08.2015 08.02.2021 2011 2011
#> 5 Popula~ cens~ data~ 26.08.2015 08.02.2021 2011 2011
#> 6 Popula~ cens~ data~ 26.08.2015 08.02.2021 2011 2011
#> 7 Popula~ cens~ data~ 26.08.2015 08.02.2021 2011 2011
#> 8 Popula~ cens~ data~ 26.08.2015 08.02.2021 2011 2011
#> 9 Popula~ lfst~ data~ 10.09.2021 27.04.2021 1999 2020
#> 10 Activi~ lfst~ data~ 10.09.2021 27.04.2021 1999 2020
#> # ... with 274 more rows, and 1 more variable: values <chr>
Let’s choose the “Live births by mother’s age and NUTS 2 region” dataset.
livebirth_eu_df <- eurostat::get_eurostat("demo_r_fagec")
livebirth_eu_df
#> # A tibble: 456,956 x 5
#> unit age geo time values
#> <chr> <chr> <chr> <date> <dbl>
#> 1 NR TOTAL AL 2019-01-01 28561
#> 2 NR TOTAL AL0 2019-01-01 28438
#> 3 NR TOTAL AL01 2019-01-01 8909
#> 4 NR TOTAL AL02 2019-01-01 12089
#> 5 NR TOTAL AL03 2019-01-01 7440
#> 6 NR TOTAL ALX 2019-01-01 123
#> 7 NR TOTAL ALXX 2019-01-01 123
#> 8 NR TOTAL AT 2019-01-01 84952
#> 9 NR TOTAL AT1 2019-01-01 36819
#> 10 NR TOTAL AT11 2019-01-01 2232
#> # ... with 456,946 more rows
First of all, we are interested in NUTS 2 regional data. But this dataset also contains national aggregate values (where the length of the geo code is only 2 characters). Let’s remove these.
livebirth_eu_df %>%
filter(str_length(geo) != 2)
#> # A tibble: 416,961 x 5
#> unit age geo time values
#> <chr> <chr> <chr> <date> <dbl>
#> 1 NR TOTAL AL0 2019-01-01 28438
#> 2 NR TOTAL AL01 2019-01-01 8909
#> 3 NR TOTAL AL02 2019-01-01 12089
#> 4 NR TOTAL AL03 2019-01-01 7440
#> 5 NR TOTAL ALX 2019-01-01 123
#> 6 NR TOTAL ALXX 2019-01-01 123
#> 7 NR TOTAL AT1 2019-01-01 36819
#> 8 NR TOTAL AT11 2019-01-01 2232
#> 9 NR TOTAL AT12 2019-01-01 14652
#> 10 NR TOTAL AT13 2019-01-01 19935
#> # ... with 416,951 more rows
We should also remove the aggregated and unknown age categories and keep only the latest year.
livebirth_eu_df %>%
filter(str_length(geo) != 2) %>%
filter(age != "TOTAL" & time == "2019-01-01") %>%
filter(!(age %in% c("UNK", "Y_GE45", "Y_GE48", "Y_GE50", "Y_LT16")))
#> # A tibble: 14,309 x 5
#> unit age geo time values
#> <chr> <chr> <chr> <date> <dbl>
#> 1 NR Y10-14 AL0 2019-01-01 24
#> 2 NR Y10-14 AL01 2019-01-01 4
#> 3 NR Y10-14 AL02 2019-01-01 13
#> 4 NR Y10-14 AL03 2019-01-01 7
#> 5 NR Y10-14 ALX 2019-01-01 0
#> 6 NR Y10-14 ALXX 2019-01-01 0
#> 7 NR Y10-14 AT1 2019-01-01 1
#> 8 NR Y10-14 AT11 2019-01-01 0
#> 9 NR Y10-14 AT12 2019-01-01 0
#> 10 NR Y10-14 AT13 2019-01-01 1
#> # ... with 14,299 more rows
Now our dataset is clean: we have one observation for each geo and age category. Let’s suppose we are interested in the total number of births by the mother’s age in the EU (I know we could find a ready-made table for this). For this we want to sum the values column by age category.
livebirth_eu_df %>%
filter(str_length(geo) != 2) %>%
filter(age != "TOTAL" & time == "2019-01-01") %>%
filter(!(age %in% c("UNK", "Y_GE45", "Y_GE48", "Y_GE50", "Y_LT16"))) %>%
group_by(age) %>%
summarise(values = sum(values))
age | values |
---|---|
Y10-14 | 5337 |
Y15 | 13440 |
Y16 | 30049 |
Y17 | 60282 |
Y18 | 106563 |
Y19 | 172655 |
Y20 | 244849 |
Y21 | 308539 |
Y22 | 378984 |
Y23 | 450972 |
Y24 | 542341 |
Y25 | 649112 |
Y26 | 757241 |
Y27 | 867143 |
Y28 | 971229 |
Y29 | 1053152 |
Y30 | 1089894 |
Y31 | 1099205 |
Y32 | 1044856 |
Y33 | 979962 |
Y34 | 903304 |
Y35 | 811897 |
Y36 | 707466 |
Y37 | 597897 |
Y38 | 489777 |
Y39 | 385349 |
Y40 | 281512 |
Y41 | 192645 |
Y42 | 124356 |
Y43 | 73296 |
Y44 | 43848 |
Y45 | 23717 |
Y46 | 12628 |
Y47 | 7172 |
Y48 | 4354 |
Y49 | 2922 |
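Note that group_by does not change the data itself, it only marks the groups; summarise then collapses each group into a single row. Without group_by, summarise would collapse the whole table into one row, as in this sketch:

livebirth_eu_df %>%
  summarise(values = sum(values, na.rm = TRUE)) # one grand total instead of one row per age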
2.6 Pivot longer/wider
Let’s assume we are interested in the growth rate of fertility in each country. First, we write a function to calculate growth rates (a chain index).
chain_index <- function(x) {
scales::percent(x/lag(x)-1, accuracy = .01)
# lag: previous observation
}
Example:
x <- c(100, 107, 105, 110)
chain_index(x)
#> [1] NA "7.00%" "-1.87%" "4.76%"
fertility_df %>%
select(LOCATION, TIME, Value) %>%
mutate(GrowthRate = chain_index(Value))
(The rest of the table is hidden.)
LOCATION | TIME | Value | GrowthRate |
---|---|---|---|
AUS | 1960 | 3.45 | NA |
AUS | 1961 | 3.55 | 2.90% |
AUS | 1962 | 3.43 | -3.38% |
AUS | 1963 | 3.34 | -2.62% |
AUS | 1964 | 3.15 | -5.69% |
AUS | 1965 | 2.97 | -5.71% |
AUS | 1966 | 2.89 | -2.69% |
AUS | 1967 | 2.85 | -1.38% |
AUS | 1968 | 2.89 | 1.40% |
AUS | 1969 | 2.89 | 0.00% |
AUS | 1970 | 2.86 | -1.04% |
AUS | 1971 | 2.95 | 3.15% |
AUS | 1972 | 2.74 | -7.12% |
AUS | 1973 | 2.49 | -9.12% |
AUS | 1974 | 2.32 | -6.83% |
AUS | 1975 | 2.15 | -7.33% |
AUS | 1976 | 2.06 | -4.19% |
AUS | 1977 | 2.01 | -2.43% |
AUS | 1978 | 1.95 | -2.99% |
AUS | 1979 | 1.91 | -2.05% |
AUS | 1980 | 1.89 | -1.05% |
AUS | 1981 | 1.94 | 2.65% |
AUS | 1982 | 1.93 | -0.52% |
AUS | 1983 | 1.92 | -0.52% |
AUS | 1984 | 1.84 | -4.17% |
AUS | 1985 | 1.92 | 4.35% |
AUS | 1986 | 1.87 | -2.60% |
AUS | 1987 | 1.85 | -1.07% |
AUS | 1988 | 1.83 | -1.08% |
AUS | 1989 | 1.84 | 0.55% |
AUS | 1990 | 1.90 | 3.26% |
AUS | 1991 | 1.85 | -2.63% |
AUS | 1992 | 1.89 | 2.16% |
AUS | 1993 | 1.86 | -1.59% |
AUS | 1994 | 1.84 | -1.08% |
AUS | 1995 | 1.82 | -1.09% |
AUS | 1996 | 1.80 | -1.10% |
AUS | 1997 | 1.78 | -1.11% |
AUS | 1998 | 1.76 | -1.12% |
AUS | 1999 | 1.76 | 0.00% |
AUS | 2000 | 1.76 | 0.00% |
AUS | 2001 | 1.73 | -1.70% |
AUS | 2002 | 1.77 | 2.31% |
AUS | 2003 | 1.77 | 0.00% |
AUS | 2004 | 1.78 | 0.56% |
AUS | 2005 | 1.85 | 3.93% |
AUS | 2006 | 1.88 | 1.62% |
AUS | 2007 | 1.99 | 5.85% |
AUS | 2008 | 2.02 | 1.51% |
AUS | 2009 | 1.97 | -2.48% |
AUS | 2010 | 1.95 | -1.02% |
AUS | 2011 | 1.92 | -1.54% |
AUS | 2012 | 1.93 | 0.52% |
AUS | 2013 | 1.88 | -2.59% |
AUS | 2014 | 1.79 | -4.79% |
AUS | 2015 | 1.79 | 0.00% |
AUS | 2016 | 1.79 | 0.00% |
AUS | 2017 | 1.74 | -2.79% |
AUS | 2018 | 1.74 | 0.00% |
AUS | 2019 | 1.66 | -4.60% |
AUT | 1960 | 2.69 | 62.05% |
AUT | 1961 | 2.78 | 3.35% |
AUT | 1962 | 2.80 | 0.72% |
AUT | 1963 | 2.82 | 0.71% |
AUT | 1964 | 2.79 | -1.06% |
AUT | 1965 | 2.70 | -3.23% |
AUT | 1966 | 2.66 | -1.48% |
AUT | 1967 | 2.62 | -1.50% |
AUT | 1968 | 2.58 | -1.53% |
AUT | 1969 | 2.49 | -3.49% |
But a problem appears with the GrowthRate of AUT in 1960, since now (check the table):
\[\text{Growth}_{\text{AUT}, 1960} = \frac{\text{Fertility}_{\text{AUT}, 1960}}{\text{Fertility}_{\text{AUS}, 2019}} - 1\]
The last observation of AUS is treated as the “previous” value of the first observation of AUT.
We can easily solve this by transforming the structure of the table. We need to make the table wider in this case, with pivot_wider. This function creates a new column for each level of a selected variable.
fertility_df %>%
select(geo = LOCATION, time = TIME, fertility = Value) %>%
pivot_wider(names_from = "geo", values_from = "fertility")
#> # A tibble: 61 x 55
#> time AUS AUT BEL CAN CZE DNK FIN FRA DEU GRC HUN ISL
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1960 3.45 2.69 2.54 3.9 2.11 2.54 2.71 2.74 2.37 2.23 2.02 4.26
#> 2 1961 3.55 2.78 2.63 3.84 2.13 2.55 2.65 2.82 2.44 2.13 1.94 3.88
#> 3 1962 3.43 2.8 2.59 3.76 2.14 2.54 2.66 2.8 2.44 2.16 1.79 3.98
#> 4 1963 3.34 2.82 2.68 3.67 2.33 2.64 2.66 2.9 2.51 2.14 1.82 3.98
#> 5 1964 3.15 2.79 2.72 3.5 2.36 2.6 2.58 2.91 2.53 2.24 1.8 3.86
#> 6 1965 2.97 2.7 2.62 3.15 2.18 2.61 2.46 2.85 2.5 2.25 1.81 3.71
#> 7 1966 2.89 2.66 2.52 2.81 2.01 2.62 2.4 2.8 2.51 2.32 1.88 3.58
#> 8 1967 2.85 2.62 2.41 2.6 1.9 2.35 2.32 2.67 2.45 2.45 2.01 3.28
#> 9 1968 2.89 2.58 2.31 2.45 1.83 2.12 2.15 2.59 2.36 2.42 2.06 3.07
#> 10 1969 2.89 2.49 2.28 2.4 1.86 2 1.94 2.53 2.21 2.36 2.04 2.99
#> # ... with 51 more rows, and 42 more variables: IRL <dbl>, ITA <dbl>,
#> # JPN <dbl>, KOR <dbl>, LUX <dbl>, MEX <dbl>, NLD <dbl>, NZL <dbl>,
#> # NOR <dbl>, POL <dbl>, PRT <dbl>, SVK <dbl>, ESP <dbl>, SWE <dbl>,
#> # CHE <dbl>, TUR <dbl>, GBR <dbl>, USA <dbl>, BRA <dbl>, CHL <dbl>,
#> # CHN <dbl>, EST <dbl>, IND <dbl>, IDN <dbl>, ISR <dbl>, RUS <dbl>,
#> # SVN <dbl>, ZAF <dbl>, COL <dbl>, LVA <dbl>, LTU <dbl>, ARG <dbl>,
#> # BGR <dbl>, HRV <dbl>, CYP <dbl>, MLT <dbl>, ROU <dbl>, SAU <dbl>, ...
If we now apply the chain_index function to all the columns, we can avoid the previous bug. You can apply a function to all columns (except the first one) with the mutate_at function.
fertility_df %>%
select(geo = LOCATION, time = TIME, fertility = Value) %>%
pivot_wider(names_from = "geo", values_from = "fertility") %>%
mutate_at(-1, chain_index)
#> # A tibble: 61 x 55
#> time AUS AUT BEL CAN CZE DNK FIN FRA DEU GRC HUN ISL
#> <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 1960 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> 2 1961 2.90% 3.35% 3.54% -1.5~ 0.95% 0.39% -2.2~ 2.92% 2.95% -4.4~ -3.9~ -8.9~
#> 3 1962 -3.3~ 0.72% -1.5~ -2.0~ 0.47% -0.3~ 0.38% -0.7~ 0.00% 1.41% -7.7~ 2.58%
#> 4 1963 -2.6~ 0.71% 3.47% -2.3~ 8.88% 3.94% 0.00% 3.57% 2.87% -0.9~ 1.68% 0.00%
#> 5 1964 -5.6~ -1.0~ 1.49% -4.6~ 1.29% -1.5~ -3.0~ 0.34% 0.80% 4.67% -1.1~ -3.0~
#> 6 1965 -5.7~ -3.2~ -3.6~ -10.~ -7.6~ 0.38% -4.6~ -2.0~ -1.1~ 0.45% 0.56% -3.8~
#> 7 1966 -2.6~ -1.4~ -3.8~ -10.~ -7.8~ 0.38% -2.4~ -1.7~ 0.40% 3.11% 3.87% -3.5~
#> 8 1967 -1.3~ -1.5~ -4.3~ -7.4~ -5.4~ -10.~ -3.3~ -4.6~ -2.3~ 5.60% 6.91% -8.3~
#> 9 1968 1.40% -1.5~ -4.1~ -5.7~ -3.6~ -9.7~ -7.3~ -3.0~ -3.6~ -1.2~ 2.49% -6.4~
#> 10 1969 0.00% -3.4~ -1.3~ -2.0~ 1.64% -5.6~ -9.7~ -2.3~ -6.3~ -2.4~ -0.9~ -2.6~
#> # ... with 51 more rows, and 42 more variables: IRL <chr>, ITA <chr>,
#> # JPN <chr>, KOR <chr>, LUX <chr>, MEX <chr>, NLD <chr>, NZL <chr>,
#> # NOR <chr>, POL <chr>, PRT <chr>, SVK <chr>, ESP <chr>, SWE <chr>,
#> # CHE <chr>, TUR <chr>, GBR <chr>, USA <chr>, BRA <chr>, CHL <chr>,
#> # CHN <chr>, EST <chr>, IND <chr>, IDN <chr>, ISR <chr>, RUS <chr>,
#> # SVN <chr>, ZAF <chr>, COL <chr>, LVA <chr>, LTU <chr>, ARG <chr>,
#> # BGR <chr>, HRV <chr>, CYP <chr>, MLT <chr>, ROU <chr>, SAU <chr>, ...
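(If you use dplyr 1.0.0 or newer, the same can be written with across, which is the recommended successor of mutate_at; a sketch under that assumption:)

fertility_df %>%
  select(geo = LOCATION, time = TIME, fertility = Value) %>%
  pivot_wider(names_from = "geo", values_from = "fertility") %>%
  mutate(across(-time, chain_index)) # apply chain_index to every column except time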
And now let’s transform the table back to its original structure. You can do this with the pivot_longer function.
fertility_df %>%
select(geo = LOCATION, time = TIME, fertility = Value) %>%
pivot_wider(names_from = "geo", values_from = "fertility") %>%
mutate_at(-1, chain_index) %>%
pivot_longer(-1, names_to = "geo", values_to = "fertility")
#> # A tibble: 3,294 x 3
#> time geo fertility
#> <dbl> <chr> <chr>
#> 1 1960 AUS <NA>
#> 2 1960 AUT <NA>
#> 3 1960 BEL <NA>
#> 4 1960 CAN <NA>
#> 5 1960 CZE <NA>
#> 6 1960 DNK <NA>
#> 7 1960 FIN <NA>
#> 8 1960 FRA <NA>
#> 9 1960 DEU <NA>
#> 10 1960 GRC <NA>
#> # ... with 3,284 more rows
2.7 Group_modify
Alternatively, we could solve this problem by splitting the data frame into 54 individual data frames (one for each country). If you use the following syntax, you will get the correct results:
fertility_df %>%
select(geo = LOCATION, time = TIME, fertility = Value) %>%
arrange(time) %>%
group_by(geo) %>%
group_modify(~ mutate(.x, fertility_growth = chain_index(fertility)), .keep = F)
geo | time | fertility | fertility_growth |
---|---|---|---|
ARG | 1960 | 3.11 | NA |
ARG | 1961 | 3.10 | -0.32% |
ARG | 1962 | 3.09 | -0.32% |
ARG | 1963 | 3.08 | -0.32% |
ARG | 1964 | 3.07 | -0.32% |
ARG | 1965 | 3.06 | -0.33% |
ARG | 1966 | 3.05 | -0.33% |
ARG | 1967 | 3.05 | 0.00% |
ARG | 1968 | 3.05 | 0.00% |
ARG | 1969 | 3.06 | 0.33% |
ARG | 1970 | 3.08 | 0.65% |
ARG | 1971 | 3.11 | 0.97% |
ARG | 1972 | 3.15 | 1.29% |
ARG | 1973 | 3.20 | 1.59% |
ARG | 1974 | 3.25 | 1.56% |
ARG | 1975 | 3.30 | 1.54% |
ARG | 1976 | 3.34 | 1.21% |
ARG | 1977 | 3.36 | 0.60% |
ARG | 1978 | 3.36 | 0.00% |
ARG | 1979 | 3.34 | -0.60% |
ARG | 1980 | 3.30 | -1.20% |
ARG | 1981 | 3.25 | -1.52% |
ARG | 1982 | 3.20 | -1.54% |
ARG | 1983 | 3.16 | -1.25% |
ARG | 1984 | 3.12 | -1.27% |
ARG | 1985 | 3.10 | -0.64% |
ARG | 1986 | 3.08 | -0.65% |
ARG | 1987 | 3.06 | -0.65% |
ARG | 1988 | 3.04 | -0.65% |
ARG | 1989 | 3.02 | -0.66% |
ARG | 1990 | 3.00 | -0.66% |
ARG | 1991 | 2.97 | -1.00% |
ARG | 1992 | 2.93 | -1.35% |
ARG | 1993 | 2.88 | -1.71% |
ARG | 1994 | 2.83 | -1.74% |
ARG | 1995 | 2.77 | -2.12% |
ARG | 1996 | 2.72 | -1.81% |
ARG | 1997 | 2.67 | -1.84% |
ARG | 1998 | 2.62 | -1.87% |
ARG | 1999 | 2.58 | -1.53% |
ARG | 2000 | 2.54 | -1.55% |
ARG | 2001 | 2.51 | -1.18% |
ARG | 2002 | 2.49 | -0.80% |
ARG | 2003 | 2.46 | -1.20% |
ARG | 2004 | 2.44 | -0.81% |
ARG | 2005 | 2.42 | -0.82% |
ARG | 2006 | 2.40 | -0.83% |
ARG | 2007 | 2.38 | -0.83% |
ARG | 2008 | 2.37 | -0.42% |
ARG | 2009 | 2.36 | -0.42% |
ARG | 2010 | 2.35 | -0.42% |
ARG | 2011 | 2.34 | -0.43% |
ARG | 2012 | 2.33 | -0.43% |
ARG | 2013 | 2.32 | -0.43% |
ARG | 2014 | 2.31 | -0.43% |
ARG | 2015 | 2.30 | -0.43% |
ARG | 2016 | 2.29 | -0.43% |
ARG | 2017 | 2.28 | -0.44% |
ARG | 2018 | 2.26 | -0.88% |
ARG | 2019 | 2.25 | -0.44% |
AUS | 1960 | 3.45 | NA |
AUS | 1961 | 3.55 | 2.90% |
AUS | 1962 | 3.43 | -3.38% |
AUS | 1963 | 3.34 | -2.62% |
AUS | 1964 | 3.15 | -5.69% |
AUS | 1965 | 2.97 | -5.71% |
AUS | 1966 | 2.89 | -2.69% |
AUS | 1967 | 2.85 | -1.38% |
AUS | 1968 | 2.89 | 1.40% |
AUS | 1969 | 2.89 | 0.00% |
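A note on the design: group_modify is not strictly necessary here, because mutate is also group-aware. A grouped mutate gives the same growth rates (a sketch):

fertility_df %>%
  select(geo = LOCATION, time = TIME, fertility = Value) %>%
  arrange(time) %>%
  group_by(geo) %>%
  mutate(fertility_growth = chain_index(fertility)) %>%
  ungroup()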
2.8 Exercise
You have to get to know your classmates to solve this exercise.
Create a new .Rmd file.
Import the da_q.csv data into R with relative referencing (see Chapter 1.5.1).
Write down the information you know about at least 3 of your teammates (names are necessary) as text in the .Rmd.
Write R code to find the teammates (also write comments in the code) using only base R functions (tools you have learned in the 1st seminar).
Write R code to find the teammates using dplyr functions (tools you have learned in the 2nd seminar).
You have found a teammate if the output of your R code is the correct ID number of that person.
Example:
df
#> # A tibble: 21 x 11
#> ID `What is your zod~ `Do you prefer ~ `What experience~ `How many slices~
#> <dbl> <chr> <chr> <chr> <chr>
#> 1 1 leo dogs I am familier wi~ 8
#> 2 2 taurus Dogd I have some expe~ 12
#> 3 3 pisces Both I do not have an~ Depends on size.~
#> 4 4 taurus neither I do not have an~ 2
#> 5 5 aquarius dogs I have some expe~ 3
#> 6 6 leo Cats I am familier wi~ 4
#> 7 7 virgo Cats I am familier wi~ 4
#> 8 8 virgo Dogs I learnt R in un~ 4
#> 9 9 taurus dogs I learnt R in un~ 3
#> 10 10 cancer dogs I do not have an~ 2
#> # ... with 11 more rows, and 6 more variables: Do you wear glasses?2 <chr>,
#> # How many countries have you been to so far? <dbl>,
#> # How many instagram followers do you have? (zero, if you do not have an account) <chr>,
#> # How many brothers and sisters do you have? <chr>,
#> # What is your batteries current charge level? (0-100) <chr>,
#> # What is the traditional food in your country? (you can mention more, if you wish) <chr>
I am from Hungary, and the national food in Hungary is gulash.
df %>%
filter(str_detect(`What is the traditional food in your country? (you can mention more, if you wish)`, "gulash")) %>%
pull(ID)
#> [1] 1
So my ID number is 1.
Of course, you cannot choose me. :)