class: center, middle, title-slide

.title[
# Econometrics III
]
.subtitle[
## Big Data and Data Visualisation
]
.author[
### Marcell Granát
]
.institute[
### Central Bank of Hungary & .blue[John von Neumann University]
]
.date[
### 2022
]

---

<style type="text/css">
.red { color: red; }
.blue { color: #378C95; }
strong { color: red; }
a { color: #378C95; font-weight: bold; }
.remark-inline-code { font-weight: 900; background-color: #a7d5e7; }
.caption { color: #378C95; font-style: italic; text-align: center; }
.content-box {
  box-sizing: content-box;
  background-color: #378C95;
  /* Total width: 160px + (2 * 20px) + (2 * 8px) = 216px
     Total height: 80px + (2 * 20px) + (2 * 8px) = 136px
     Content box width: 160px
     Content box height: 80px */
}
.content-box-green { background-color: #d9edc2; }
.content-box-red { background-color: #f9dbdb; }
.fullprice { text-decoration: line-through; }
</style>

# Today's .blue[Agenda]

### Logit

---
class: inverse, middle, center

# Logit

---
## Motivation

- It has already been discussed how to incorporate categorical variables as regressors into the model
- Today we will talk about what to do if the outcome variable is categorical (for simplicity, we only look at cases where **two outcomes** are possible)
- An illustrative example (in Hungarian): [Empirical investigation of yield curves' ability to predict recession](https://marcellgranat.github.io/yieldcurve/)

---
## Introduction

> "For a number of reasons, the government bond yield curve is proving to be an accurate predictor of recessions in the US"

<p align="center"><iframe width="720" height="405" src="https://www.youtube.com/embed/oW4hfaiXKG8" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe></p>

---
## Setup - Recession data

The NBER officially publishes the quarters of US recessions: [https://www.nber.org/research/data/us-business-cycle-expansions-and-contractions](https://www.nber.org/research/data/us-business-cycle-expansions-and-contractions)

<img
src="econometrics3/nber.png" width="500px" height="350px" style="display: block; margin: auto;" />

Let's take that table!

---
## Setup - Recession data

```r
library(tidyverse)
library(magrittr) # for the `%$%` exposition pipe
library(rvest)

recession_usa_df <- read_html("https://www.nber.org/research/data/us-business-cycle-expansions-and-contractions") %>% 
  html_table() %>% 
  first() %>% 
  janitor::row_to_names(1) %>% 
  janitor::clean_names() %>% 
  transmute(
    start_date = str_c(peak_year, "-", peak_quarter),
    end_date = str_c(trough_year, "-", trough_quarter),
    start_date = ifelse(str_detect(start_date, "occurred"), str_extract(start_date, "\\d\\d\\d\\dQ\\d"), start_date),
    end_date = ifelse(str_detect(end_date, "occurred"), str_extract(end_date, "\\d\\d\\d\\dQ\\d"), end_date),
    start_date = str_replace(start_date, "-", "Q"),
    end_date = str_replace(end_date, "-", "Q"),
    start_date = lubridate::yq(start_date),
    end_date = lubridate::yq(end_date)
  ) %>% 
  drop_na() %$% 
  map2(start_date, end_date, ~ seq.Date(.x, .y, by = "quarter")) %>% 
  reduce(c) %>% 
  tibble(date = ., recession = TRUE) %>% 
  write_csv("recession_usa.csv")
```

---
## Setup - Recession data

```r
recession_usa_df <- recession_usa_df %>% 
  full_join(tibble(date = seq.Date(as.Date("1857-01-01"), as.Date("2021-01-01"), by = "quarter"))) %>% 
  arrange(date) %>% 
  mutate(recession = ifelse(is.na(recession), FALSE, TRUE))

recession_usa_df
```

```
## # A tibble: 657 × 2
##    date       recession
##    <date>     <lgl>    
##  1 1857-01-01 FALSE    
##  2 1857-04-01 TRUE     
##  3 1857-07-01 TRUE     
##  4 1857-10-01 TRUE     
##  5 1858-01-01 TRUE     
##  6 1858-04-01 TRUE     
##  7 1858-07-01 TRUE     
##  8 1858-10-01 TRUE     
##  9 1859-01-01 FALSE    
## 10 1859-04-01 FALSE    
## # … with 647 more rows
```

---
## Setup - Yield data

- Yield data from the FRED website

--

Yes, this one is already prepared 😄

```r
yield_usa_df <- read_csv("https://raw.githubusercontent.com/MarcellGranat/big_data2022/main/econometrics3/yield_usa.csv", col_select = -1)
```

--

But the recession data is quarterly...
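The join on the next slide aligns the two tables by keeping only the matching dates. If you instead want a genuine quarterly average of a monthly series, one way is to stamp every month with the first day of its quarter and average within it. A minimal sketch with a toy series standing in for the yields (the `GS10` values below are made up):

```r
library(dplyr)
library(lubridate)

# toy monthly series standing in for the FRED 10-year yield (made-up values)
monthly_df <- tibble(
  date = seq.Date(as.Date("2020-01-01"), as.Date("2020-06-01"), by = "month"),
  GS10 = c(1.8, 1.5, 0.9, 0.7, 0.7, 0.7)
)

# collapse to quarterly frequency: stamp each month with the first day
# of its quarter, then average within the quarter
quarterly_df <- monthly_df %>% 
  mutate(date = floor_date(date, unit = "quarter")) %>% 
  group_by(date) %>% 
  summarise(GS10 = mean(GS10), .groups = "drop")

quarterly_df # 2 rows: one per quarter
```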
---
## Setup - Join the two data sets

```r
inner_join(
  x = yield_usa_df,
  y = recession_usa_df
) %>% 
  transmute(
    date,
    recession,
    spread = GS10 - GS1,
    spread = lag(spread, n = 4)
  ) %>% 
  drop_na()
```

```
## # A tibble: 268 × 3
##    date       recession spread
##    <date>     <lgl>      <dbl>
##  1 1954-04-01 TRUE       0.567
##  2 1954-07-01 FALSE      0.630
##  3 1954-10-01 FALSE      0.937
##  4 1955-01-01 FALSE      1.21 
##  5 1955-04-01 FALSE      1.47 
##  6 1955-07-01 FALSE      1.43 
##  7 1955-10-01 FALSE      1.3  
##  8 1956-01-01 FALSE      1.13 
##  9 1956-04-01 FALSE      0.91 
## 10 1956-07-01 FALSE      0.697
## # … with 258 more rows
```

---
## Final setup

```r
recession_to_spread <- function(short = "GS1", long = "GS10", n_lag = 4) {
  inner_join(
    x = yield_usa_df,
    y = recession_usa_df
  ) %>% 
    select(date, recession, all_of(short), all_of(long)) %>% 
    set_names("date", "recession", "short", "long") %>% 
    transmute(
      date,
      recession,
      spread = long - short,
      spread = lag(spread, n = n_lag)
    ) %>% 
    drop_na()
}
```

--

.blue[Now, let's see the model]

---
## The logit

Let `\(P_i\)` denote the probability of being `TRUE`. The estimated function is the following:

`$$\text{log}\frac{P_i}{1-P_i}=\beta_0+\sum_{j=1}^{k}\beta_jx_{i,j}$$`

- The left side of the equation is called the **log odds-ratio**
- This model is not estimated by OLS!

--

- We must introduce the likelihood function:

`$$L=\prod_{y_i=1}P_i\prod_{y_i=0}(1-P_i)$$`

.blue[How would you interpret it?]

--

> Maximization of the likelihood function for either the probit or the logit model is accomplished by .blue[iterative] nonlinear estimation methods. There are now several computer programs available for probit and logit analysis, and these programs are very inexpensive to run.

---
### The Logit in R

```r
fit <- glm(recession ~ spread, binomial(link = "logit"), recession_to_spread())
```

--

The broom functions also work here!
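Before reading off the output, it is worth seeing how a coefficient translates into a probability: the model is linear in the log odds-ratio, so inverting it gives the logistic (sigmoid) transform, which is what `plogis()` computes. A small sketch with made-up coefficients (illustrative numbers, not the fitted estimates):

```r
# made-up coefficients for illustration (not the fitted estimates)
beta0 <- -0.7
beta1 <- -1.7
spread <- 0.5

# linear predictor = log odds-ratio
eta <- beta0 + beta1 * spread

# invert log(P / (1 - P)) = eta  =>  P = 1 / (1 + exp(-eta)) = plogis(eta)
p <- plogis(eta)
p
```

A one-unit increase in the spread multiplies the odds of recession by `exp(beta1)`, so a negative coefficient means that a larger spread lowers the estimated probability of recession.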
Extract the coefficients:

```r
broom::tidy(fit)
```

```
## # A tibble: 2 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   -0.719     0.201     -3.57 3.60e- 4
## 2 spread        -1.72      0.278     -6.20 5.61e-10
```

---
### The Logit in R

Predictions & errors for the observations:

```r
broom::augment(fit, type.predict = "response")
```

```
## # A tibble: 268 × 8
##    recession spread .fitted .resid .std.resid    .hat .sigma  .cooksd
##    <lgl>      <dbl>   <dbl>  <dbl>      <dbl>   <dbl>  <dbl>    <dbl>
##  1 TRUE       0.567  0.155   1.93       1.94  0.00590  0.808 0.0163  
##  2 FALSE      0.630  0.141  -0.552     -0.554 0.00590  0.816 0.000491
##  3 FALSE      0.937  0.0884 -0.430     -0.432 0.00602  0.816 0.000296
##  4 FALSE      1.21   0.0568 -0.342     -0.343 0.00593  0.816 0.000181
##  5 FALSE      1.47   0.0372 -0.275     -0.276 0.00555  0.816 0.000108
##  6 FALSE      1.43   0.0398 -0.285     -0.286 0.00562  0.816 0.000118
##  7 FALSE      1.3    0.0493 -0.318     -0.319 0.00583  0.816 0.000153
##  8 FALSE      1.13   0.0650 -0.367     -0.368 0.00599  0.816 0.000211
##  9 FALSE      0.91   0.0922 -0.440     -0.441 0.00601  0.816 0.000309
## 10 FALSE      0.697  0.128  -0.523     -0.525 0.00592  0.816 0.000439
## # … with 258 more rows
```

---
### Reproduce the table from Estrella and Mishkin [1996b]

Estrella and Mishkin [1996b] report a table that contains the probability of recession for certain values of the spread.

--

To reproduce this table we use our own model to estimate the probability of recession.footnote[The results will differ a bit. This is because the time series of the spread is longer in the article & they use a probit model (the link function is different, but it does not cause a huge difference)].

--

The `augment` function has a `newdata` input parameter, so we can easily generate the predictions with it. (This also works with other linear models.)

---
### Reproduce the table from Estrella and Mishkin [1996b]

```r
recession_to_spread() %>% 
  filter(date < "1995-01-01") %>% 
  glm(recession ~ spread, binomial(link = "probit"), data = .) %>% 
  broom::augment(type.predict = "response", newdata = tibble(spread = seq(from = -2.5, to = 1.4, by = .4)))
```

```
## # A tibble: 10 × 2
##    spread .fitted
##     <dbl>   <dbl>
##  1 -2.5    0.968 
##  2 -2.1    0.929 
##  3 -1.7    0.862 
##  4 -1.3    0.760 
##  5 -0.9    0.627 
##  6 -0.5    0.477 
##  7 -0.100  0.330 
##  8  0.300  0.206 
##  9  0.7    0.115 
## 10  1.1    0.0566
```

---
### ROC curve

#### When do we say that "the winter is coming"?

If it exceeds a certain probability... = **Cut-off value**

```r
cut_off <- 0.5

broom::augment(fit, type.predict = "response") %>% 
  mutate(
    estimation = .fitted > cut_off
  ) %>% 
  select(recession, spread, .fitted, estimation)
```

```
## # A tibble: 268 × 4
##    recession spread .fitted estimation
##    <lgl>      <dbl>   <dbl> <lgl>     
##  1 TRUE       0.567  0.155  FALSE     
##  2 FALSE      0.630  0.141  FALSE     
##  3 FALSE      0.937  0.0884 FALSE     
##  4 FALSE      1.21   0.0568 FALSE     
##  5 FALSE      1.47   0.0372 FALSE     
##  6 FALSE      1.43   0.0398 FALSE     
##  7 FALSE      1.3    0.0493 FALSE     
##  8 FALSE      1.13   0.0650 FALSE     
##  9 FALSE      0.91   0.0922 FALSE     
## 10 FALSE      0.697  0.128  FALSE     
## # … with 258 more rows
```

---
### ROC curve

#### When do we say that "the winter is coming"?

If it exceeds a certain probability... = **Cut-off value**

```r
cut_off <- 0.5

broom::augment(fit, type.predict = "response") %>% 
  mutate(
    estimation = .fitted > cut_off,
    correct = recession == estimation
  ) %>% 
  count(correct, estimation) %>% 
  mutate(estimation = ifelse(estimation == TRUE, "positive", "negative"))
```

```
## # A tibble: 4 × 3
##   correct estimation     n
##   <lgl>   <chr>      <int>
## 1 FALSE   negative      35
## 2 FALSE   positive       6
## 3 TRUE    negative     216
## 4 TRUE    positive      11
```

---
### ROC curve

#### When do we say that "the winter is coming"?

If it exceeds a certain probability...
= **Cut-off value**

```r
confusion_matrix <- function(.fit, cut_off = .5) {
  broom::augment(.fit, type.predict = "response") %>% 
    mutate(
      estimation = .fitted > cut_off,
      correct = recession == estimation
    ) %>% 
    count(correct, estimation) %>% 
    mutate(estimation = ifelse(estimation == TRUE, "positive", "negative"))
}

confusion_matrix(fit, .7)
```

```
## # A tibble: 4 × 3
##   correct estimation     n
##   <lgl>   <chr>      <int>
## 1 FALSE   negative      39
## 2 FALSE   positive       5
## 3 TRUE    negative     217
## 4 TRUE    positive       7
```

---
### ROC curve

#### When do we say that "the winter is coming"?

**If it exceeds a certain probability... = Cut-off value**

Some possible cut-off values:

- .5
- return the same number of `TRUE` predictions as `TRUE` observations
- maximizing the accuracy
- minimizing cost (if you know that)

---
### ROC curve

Let's construct the confusion matrix for a given cut-off value.

Now, the following indicators can be derived:

`$$\text{Sensitivity}=\frac{TP}{TP+FN}$$`

`$$\text{Specificity}=\frac{TN}{TN+FP}$$`

These two values can be calculated for any cut-off value... This is the ROC curve.

---
### ROC curve in R

```r
roc_curve <- broom::augment(fit, type.predict = "response") %$% 
  pROC::roc(recession, .fitted)
```

Area under the curve:

```r
roc_curve$auc
```

```
## Area under the curve: 0.8713
```

.content-box-green[
AUC is the suggested indicator to evaluate your model and compare it with others!
]

```r
roc_curve_df <- roc_curve %$% 
  tibble(thresholds, sensitivities, specificities)
```

---
### Visualization of the ROC curve

<img src="econometrics3_files/figure-html/unnamed-chunk-21-1.png" width="700px" height="450px" style="display: block; margin: auto;" />

---
# References

<p><cite>Maddala, G. S. (1992). <em>Introduction to econometrics</em>. Macmillan.</cite></p>
<p><cite>Kahneman, D. (2011). <em>Thinking, fast and slow</em>. Macmillan.</cite></p>
<p><cite>Ramanathan, R. (1992). <em>Introductory econometrics with applications</em>. Dryden Press.</cite></p>
<p><cite>Estrella, A.
and F. S. Mishkin (1996). “The yield curve as a predictor of US recessions”. In: <em>Current issues in economics and finance</em> 2.7.</cite></p> --- class: center, middle # Thank you for your attention! Slides are available at [www.marcellgranat.com](https://www.marcellgranat.com)