class: center, middle, title-slide

.title[
# Econometrics III
]
.subtitle[
## Big Data and Data Visualisation
]
.author[
### Marcell Granát
]
.institute[
### Central Bank of Hungary & .blue[John von Neumann University]
]
.date[
### 2022
]

---

<style type="text/css">
.red { color: red; }
.blue { color: #378C95; }
strong { color: red; }
a { color: #378C95; font-weight: bold; }
.remark-inline-code { font-weight: 900; background-color: #a7d5e7; }
.caption { color: #378C95; font-style: italic; text-align: center; }
.content-box {
  box-sizing: content-box;
  background-color: #378C95;
  /* Total width: 160px + (2 * 20px) + (2 * 8px) = 216px
     Total height: 80px + (2 * 20px) + (2 * 8px) = 136px
     Content box width: 160px
     Content box height: 80px */
}
.content-box-green { background-color: #d9edc2; }
.content-box-red { background-color: #f9dbdb; }
.fullprice { text-decoration: line-through; }
</style>

# Today's .blue[Agenda]

### Logit

---
class: inverse, middle, center

# Logit

---
## Motivation

- It has already been discussed how to incorporate categorical variables as regressors into the model
- Today we will talk about what to do if the outcome variable is categorical (for simplicity, we only look at cases where **two outcomes** are possible)
- An illustrative example (in Hungarian): [Empirical investigation of yield curves' ability to predict recession](https://marcellgranat.github.io/yieldcurve/)

---
## Introduction

> "For a number of reasons, the government bond yield curve is proving to be an accurate predictor of recessions in the US"

<p align="center"><iframe width="720" height="405" src="https://www.youtube.com/embed/oW4hfaiXKG8" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe></p>

---
## Setup - Recession data

The NBER officially publishes the quarters of US recessions: [https://www.nber.org/research/data/us-business-cycle-expansions-and-contractions](https://www.nber.org/research/data/us-business-cycle-expansions-and-contractions)

<img
src="econometrics3/nber.png" width="500px" height="350px" style="display: block; margin: auto;" />

Let's take that table!

---
## Setup - Recession data

```r
library(tidyverse)
library(magrittr) # for the `%$%` exposition pipe
library(rvest)

recession_usa_df <- read_html("https://www.nber.org/research/data/us-business-cycle-expansions-and-contractions") %>% 
  html_table() %>% 
  first() %>% 
  janitor::row_to_names(1) %>% 
  janitor::clean_names() %>% 
  transmute(
    start_date = str_c(peak_year, "-", peak_quarter),
    end_date = str_c(trough_year, "-", trough_quarter),
    start_date = ifelse(str_detect(start_date, "occurred"), str_extract(start_date, "\\d\\d\\d\\dQ\\d"), start_date),
    end_date = ifelse(str_detect(end_date, "occurred"), str_extract(end_date, "\\d\\d\\d\\dQ\\d"), end_date),
    start_date = str_replace(start_date, "-", "Q"),
    end_date = str_replace(end_date, "-", "Q"),
    start_date = lubridate::yq(start_date),
    end_date = lubridate::yq(end_date)
  ) %>% 
  drop_na() %$% 
  map2(start_date, end_date, ~ seq.Date(.x, .y, by = "quarter")) %>% 
  reduce(c) %>% 
  tibble(date = ., recession = TRUE) %>% 
  write_csv("recession_usa.csv")
```

---
## Setup - Recession data

```r
recession_usa_df <- recession_usa_df %>% 
  full_join(tibble(date = seq.Date(as.Date("1857-01-01"), as.Date("2021-01-01"), by = "quarter"))) %>% 
  arrange(date) %>% 
  mutate(recession = ifelse(is.na(recession), FALSE, TRUE))

recession_usa_df
```

```
## # A tibble: 657 × 2
##    date       recession
##    <date>     <lgl>    
##  1 1857-01-01 FALSE    
##  2 1857-04-01 TRUE     
##  3 1857-07-01 TRUE     
##  4 1857-10-01 TRUE     
##  5 1858-01-01 TRUE     
##  6 1858-04-01 TRUE     
##  7 1858-07-01 TRUE     
##  8 1858-10-01 TRUE     
##  9 1859-01-01 FALSE    
## 10 1859-04-01 FALSE    
## # … with 647 more rows
```

---
## Setup - Yield data

- Yield data from the FRED website

--

Yes, this one is already prepared 😄

```r
yield_usa_df <- read_csv("https://raw.githubusercontent.com/MarcellGranat/big_data2022/main/econometrics3/yield_usa.csv", col_select = -1)
```

--

But the recession data is quarterly...
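The join on the next slide aligns the two tables by keeping only the matching dates. If you instead want a genuine quarterly average of a monthly series, one way is to stamp every month with the first day of its quarter and average within it. A minimal sketch with a toy series standing in for the yields (the `GS10` values below are made up):

```r
library(dplyr)
library(lubridate)

# toy monthly series standing in for the FRED 10-year yield (made-up values)
monthly_df <- tibble(
  date = seq.Date(as.Date("2020-01-01"), as.Date("2020-06-01"), by = "month"),
  GS10 = c(1.8, 1.5, 0.9, 0.7, 0.7, 0.7)
)

# collapse to quarterly frequency: stamp each month with the first day
# of its quarter, then average within the quarter
quarterly_df <- monthly_df %>% 
  mutate(date = floor_date(date, unit = "quarter")) %>% 
  group_by(date) %>% 
  summarise(GS10 = mean(GS10), .groups = "drop")

quarterly_df # 2 rows: one per quarter
```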
---
## Setup - Join the two data sets

```r
inner_join(
  x = yield_usa_df,
  y = recession_usa_df
) %>% 
  transmute(
    date,
    recession,
    spread = GS10 - GS1,
    spread = lag(spread, n = 4)
  ) %>% 
  drop_na()
```

```
## # A tibble: 268 × 3
##    date       recession spread
##    <date>     <lgl>      <dbl>
##  1 1954-04-01 TRUE       0.567
##  2 1954-07-01 FALSE      0.630
##  3 1954-10-01 FALSE      0.937
##  4 1955-01-01 FALSE      1.21 
##  5 1955-04-01 FALSE      1.47 
##  6 1955-07-01 FALSE      1.43 
##  7 1955-10-01 FALSE      1.3  
##  8 1956-01-01 FALSE      1.13 
##  9 1956-04-01 FALSE      0.91 
## 10 1956-07-01 FALSE      0.697
## # … with 258 more rows
```

---
## Final setup

```r
recession_to_spread <- function(short = "GS1", long = "GS10", n_lag = 4) {
  inner_join(
    x = yield_usa_df,
    y = recession_usa_df
  ) %>% 
    select(date, recession, all_of(short), all_of(long)) %>% 
    set_names("date", "recession", "short", "long") %>% 
    transmute(
      date,
      recession,
      spread = long - short,
      spread = lag(spread, n = n_lag)
    ) %>% 
    drop_na()
}
```

--

.blue[Now, let's see the model]

---
## The logit

Let `\(P_i\)` denote the probability of being `TRUE`. The estimated function is the following:

`$$\text{log}\frac{P_i}{1-P_i}=\beta_0+\sum_{j=1}^{k}\beta_jx_{i,j}$$`

- The left side of the equation is called the **log odds-ratio**
- This model is not estimated by OLS!

--

- We must introduce the likelihood function:

`$$L=\prod_{y_i=1}P_i\prod_{y_i=0}(1-P_i)$$`

.blue[How would you interpret it?]

--

> Maximization of the likelihood function for either the probit or the logit model is accomplished by .blue[iterative] nonlinear estimation methods. There are now several computer programs available for probit and logit analysis, and these programs are very inexpensive to run.

---
### The Logit in R

```r
fit <- glm(recession ~ spread, binomial(link = "logit"), recession_to_spread())
```

--

The broom functions also work here!
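Before reading off the output, it is worth seeing how a coefficient translates into a probability: the model is linear in the log odds-ratio, so inverting it gives the logistic (sigmoid) transform, which is what `plogis()` computes. A small sketch with made-up coefficients (illustrative numbers, not the fitted estimates):

```r
# made-up coefficients for illustration (not the fitted estimates)
beta0 <- -0.7
beta1 <- -1.7
spread <- 0.5

# linear predictor = log odds-ratio
eta <- beta0 + beta1 * spread

# invert log(P / (1 - P)) = eta  =>  P = 1 / (1 + exp(-eta)) = plogis(eta)
p <- plogis(eta)
p
```

A one-unit increase in the spread multiplies the odds of recession by `exp(beta1)`, so a negative coefficient means that a larger spread lowers the estimated probability of recession.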
Extract the coefficients:

```r
broom::tidy(fit)
```

```
## # A tibble: 2 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   -0.719     0.201     -3.57 3.60e- 4
## 2 spread        -1.72      0.278     -6.20 5.61e-10
```

---
### The Logit in R

Predictions & errors for the observations:

```r
broom::augment(fit, type.predict = "response")
```

```
## # A tibble: 268 × 8
##    recession spread .fitted .resid .std.resid    .hat .sigma  .cooksd
##    <lgl>      <dbl>   <dbl>  <dbl>      <dbl>   <dbl>  <dbl>    <dbl>
##  1 TRUE       0.567  0.155   1.93       1.94  0.00590  0.808 0.0163  
##  2 FALSE      0.630  0.141  -0.552     -0.554 0.00590  0.816 0.000491
##  3 FALSE      0.937  0.0884 -0.430     -0.432 0.00602  0.816 0.000296
##  4 FALSE      1.21   0.0568 -0.342     -0.343 0.00593  0.816 0.000181
##  5 FALSE      1.47   0.0372 -0.275     -0.276 0.00555  0.816 0.000108
##  6 FALSE      1.43   0.0398 -0.285     -0.286 0.00562  0.816 0.000118
##  7 FALSE      1.3    0.0493 -0.318     -0.319 0.00583  0.816 0.000153
##  8 FALSE      1.13   0.0650 -0.367     -0.368 0.00599  0.816 0.000211
##  9 FALSE      0.91   0.0922 -0.440     -0.441 0.00601  0.816 0.000309
## 10 FALSE      0.697  0.128  -0.523     -0.525 0.00592  0.816 0.000439
## # … with 258 more rows
```

---
### Reproduce the table from Estrella and Mishkin [1996b]

Estrella and Mishkin [1996b] report a table that contains the probability of recession for certain values of the spread.

--

To reproduce this table we use our own model to estimate the probability of recession.footnote[The results will differ a bit. This is because the time series of the spread is longer in the article & they use a probit model (the link function is different, but it does not cause a huge difference)].

--

The `augment` function has a `newdata` input parameter, so we can easily generate the predictions with it. (This also works with other linear models.)

---
### Reproduce the table from Estrella and Mishkin [1996b]

```r
recession_to_spread() %>% 
  filter(date < "1995-01-01") %>% 
  glm(recession ~ spread, binomial(link = "probit"), data = .) %>% 
  broom::augment(type.predict = "response", newdata = tibble(spread = seq(from = -2.5, to = 1.4, by = .4)))
```

```
## # A tibble: 10 × 2
##    spread .fitted
##     <dbl>   <dbl>
##  1 -2.5    0.968 
##  2 -2.1    0.929 
##  3 -1.7    0.862 
##  4 -1.3    0.760 
##  5 -0.9    0.627 
##  6 -0.5    0.477 
##  7 -0.100  0.330 
##  8  0.300  0.206 
##  9  0.7    0.115 
## 10  1.1    0.0566
```

---
### ROC curve

#### When do we say that "the winter is coming"?

If it exceeds a certain probability... = **Cut-off value**

```r
cut_off <- 0.5

broom::augment(fit, type.predict = "response") %>% 
  mutate(
    estimation = .fitted > cut_off
  ) %>% 
  select(recession, spread, .fitted, estimation)
```

```
## # A tibble: 268 × 4
##    recession spread .fitted estimation
##    <lgl>      <dbl>   <dbl> <lgl>     
##  1 TRUE       0.567  0.155  FALSE     
##  2 FALSE      0.630  0.141  FALSE     
##  3 FALSE      0.937  0.0884 FALSE     
##  4 FALSE      1.21   0.0568 FALSE     
##  5 FALSE      1.47   0.0372 FALSE     
##  6 FALSE      1.43   0.0398 FALSE     
##  7 FALSE      1.3    0.0493 FALSE     
##  8 FALSE      1.13   0.0650 FALSE     
##  9 FALSE      0.91   0.0922 FALSE     
## 10 FALSE      0.697  0.128  FALSE     
## # … with 258 more rows
```

---
### ROC curve

#### When do we say that "the winter is coming"?

If it exceeds a certain probability... = **Cut-off value**

```r
cut_off <- 0.5

broom::augment(fit, type.predict = "response") %>% 
  mutate(
    estimation = .fitted > cut_off,
    correct = recession == estimation
  ) %>% 
  count(correct, estimation) %>% 
  mutate(estimation = ifelse(estimation == TRUE, "positive", "negative"))
```

```
## # A tibble: 4 × 3
##   correct estimation     n
##   <lgl>   <chr>      <int>
## 1 FALSE   negative      35
## 2 FALSE   positive       6
## 3 TRUE    negative     216
## 4 TRUE    positive      11
```

---
### ROC curve

#### When do we say that "the winter is coming"?

If it exceeds a certain probability...
= **Cut-off value**

```r
confusion_matrix <- function(.fit, cut_off = .5) {
  broom::augment(.fit, type.predict = "response") %>% 
    mutate(
      estimation = .fitted > cut_off,
      correct = recession == estimation
    ) %>% 
    count(correct, estimation) %>% 
    mutate(estimation = ifelse(estimation == TRUE, "positive", "negative"))
}

confusion_matrix(fit, .7)
```

```
## # A tibble: 4 × 3
##   correct estimation     n
##   <lgl>   <chr>      <int>
## 1 FALSE   negative      39
## 2 FALSE   positive       5
## 3 TRUE    negative     217
## 4 TRUE    positive       7
```

---
### ROC curve

#### When do we say that "the winter is coming"?

**If it exceeds a certain probability... = Cut-off value**

Some possible cut-off values:

- .5
- return the same number of `TRUE` predictions as `TRUE` observations
- maximizing the accuracy
- minimizing cost (if you know that)

---
### ROC curve

Let's construct the confusion matrix for a given cut-off value.

Now, the following indicators can be derived:

`$$\text{Sensitivity}=\frac{TP}{TP+FN}$$`

`$$\text{Specificity}=\frac{TN}{TN+FP}$$`

These two values can be calculated for any cut-off value... This is the ROC curve.

---
### ROC curve in R

```r
roc_curve <- broom::augment(fit, type.predict = "response") %$% 
  pROC::roc(recession, .fitted)
```

Area under the curve:

```r
roc_curve$auc
```

```
## Area under the curve: 0.8713
```

.content-box-green[
AUC is the suggested indicator to evaluate your model and compare it with others!
]

```r
roc_curve_df <- roc_curve %$% 
  tibble(thresholds, sensitivities, specificities)
```

---
### Visualization of the ROC curve

<img src="econometrics3_files/figure-html/unnamed-chunk-21-1.png" width="700px" height="450px" style="display: block; margin: auto;" />

---
# References

<p><cite>Maddala, G. S. (1992). <em>Introduction to econometrics</em>. Macmillan.</cite></p>
<p><cite>Kahneman, D. (2011). <em>Thinking, fast and slow</em>. Macmillan.</cite></p>
<p><cite>Ramanathan, R. (1992). <em>Introductory econometrics with applications</em>. Dryden Press.</cite></p>
<p><cite>Estrella, A.
and F. S. Mishkin (1996). “The yield curve as a predictor of US recessions”. In: <em>Current issues in economics and finance</em> 2.7.</cite></p> --- class: center, middle # Thank you for your attention! Slides are available at [www.marcellgranat.com](https://www.marcellgranat.com)