Model building

Following the findings in the previous chapters, I highlight the main ideas for the model building: (1) Youth unemployment is a significant factor in explaining the variance of fertility, but including it in the model reduces the number of complete observations, so I report one model that includes youth unemployment and one that does not. (2) Extending the design matrix with the interactions of youth unemployment rates and family benefit is justified. (This extension concerns only the framework that contains unemployment.)

Additionally, given the nature of fertility (the duration of pregnancy) and of childbearing decisions (parents may extrapolate their expected socio-economic situation from their past), extending the model with lagged values of the predictors is also reasonable. This raises the issue that the model contains too many predictors, which makes variable selection difficult. I therefore base variable selection on the lasso.
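A minimal sketch of how lagged predictors can be built within each territorial unit. The region codes, years and values are hypothetical, and the helper `add_lags` is introduced here only for illustration; it is not the code used in this study. The sketch also shows why adding lags reduces the number of complete observations.

```python
def add_lags(series_by_unit, max_lag):
    """Return one row per (unit, year) holding the current value and its lags.

    series_by_unit maps a unit code to a {year: value} dict. A lag that falls
    outside the observed years is stored as None, which makes the row incomplete.
    """
    rows = []
    for unit, series in series_by_unit.items():
        for year in sorted(series):
            row = {"unit": unit, "year": year, "lag0": series[year]}
            for k in range(1, max_lag + 1):
                # Lags are taken within the same unit only, never across units.
                row[f"lag{k}"] = series.get(year - k)
            rows.append(row)
    return rows


# Hypothetical family-benefit values for two made-up regions.
family_benefit = {
    "R1": {2001: 1.8, 2002: 2.0, 2003: 2.1},
    "R2": {2002: 1.1, 2003: 1.3},
}

rows = add_lags(family_benefit, max_lag=2)
complete_rows = [r for r in rows if None not in r.values()]
print(len(rows), "rows,", len(complete_rows), "complete")
```

With two lags, only the last year of the longer series keeps a complete row in this toy example.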

The first step in estimating panel regression models is to identify the appropriate model type. Choosing among the pooled, within and random-effects models requires performing the Chow test and the Hausman test. The null hypothesis of the former is that there is no significant difference among the individual intercepts, and the test is based on an F-test. If this \(\text{H}_{0}\) cannot be rejected at any standard significance level, estimating a pooled model is suggested. Otherwise, the outcome depends on the result of the Hausman test. The Hausman test (or Durbin-Wu-Hausman test) is mathematically more complex, but its interpretation is simple: if its \(\text{H}_{0}\) cannot be rejected, the random-effects estimator is consistent and more efficient than the within (fixed-effects) estimator, so estimating a random-effects model is suggested. If the null hypothesis is rejected at the standard significance levels, the fixed-effects model is suitable.
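For reference, the two test statistics can be written in their standard textbook forms. The notation is an assumption introduced here and does not come from the text above: \(n\) is the number of observations, \(N\) the number of territorial units, \(K\) the number of predictors, RSS a residual sum of squares, and FE and RE the fixed-effects and random-effects estimators.

\[\begin{aligned}
F &= \frac{(\text{RSS}_{\text{pooled}} - \text{RSS}_{\text{within}})/(N - 1)}{\text{RSS}_{\text{within}}/(n - N - K)} \sim F_{N-1,\; n-N-K} \\
H &= (\hat{\beta}_{\text{FE}} - \hat{\beta}_{\text{RE}})^{\top} \left[\widehat{\text{Var}}(\hat{\beta}_{\text{FE}}) - \widehat{\text{Var}}(\hat{\beta}_{\text{RE}})\right]^{-1} (\hat{\beta}_{\text{FE}} - \hat{\beta}_{\text{RE}}) \sim \chi^2_{K}
\end{aligned}\]

Both distributions hold under the respective null hypotheses. A large \(F\) speaks against the pooled model, and a large \(H\) speaks against the random-effects model.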

The optimal model type may differ if the set of predictors changes, but as a starting point, I estimate the simplest model for each of the two frameworks (with and without unemployment statistics). These models contain no interactions, but lagged effects and quadratic terms are included. The tests show that the fixed-effects model is optimal (see the results in Table 4), and the estimated coefficients from the fixed-effects models are presented in Figure 10.

Figure 10: Panel models on the total fertility rates

Table 4: Models

| Indicator | Model I. | Model II. |
|---|---|---|
| Pooltest (p-value) | 0.00% | 0.00% |
| Phtest (p-value) | 0.00% | 0.00% |
| Adjusted \(R^2\) | 20.20% | 29.23% |
| Observations | 699 | 3257 |

In the following, I perform lasso variable selection on these model frameworks to find the most relevant effects.

Framework I: with unemployment

As mentioned repeatedly in the previous section, including unemployment statistics in the regression analysis drastically reduces the number of complete observations. Moreover, there is good reason to believe that this selection is not random, so omitting the incomplete data points may bias the results. Figure 11 shows the number of complete observations by territorial unit in the two frameworks.

Figure 11: Number of complete observations by countries when the model includes or excludes unemployment statistics

Figure 11 reveals imbalances among the regions in their representation in the dataset. A large share of the incomplete and thus unusable observations concerns the Central European region and the UK. In this paper, I do not impute these missing values, so the limited generalizability of the results must be taken into account.

To gauge the bias, I calculate the mean of each variable in the total sample (including incomplete observations) and in the sample used in this framework (excluding incomplete observations). The results are reported in Table 5. The table shows that the health index and the income index are higher, while family benefit is lower, in the used sample than in the total sample.

Table 5: Comparison of average values of the variables for incomplete and complete observations (Framework I)

| Variable | Mean in total sample | Mean in used sample | Number of observations in the total sample |
|---|---|---|---|
| TFR | 1.5575 | 1.4801 | 7249 |
| Education index | 0.5159 | 0.5093 | 5373 |
| Health index | 0.9089 | 0.9375 | 6983 |
| Income index | 0.8120 | 0.8298 | 4291 |
| Family benefit | 2.0759 | 1.6302 | 6155 |
| UR (15-19 y) | 36.6887 | 44.4312 | 1713 |
| UR (20-24 y) | 24.3992 | 25.5202 | 2935 |
| UR (25-29 y) | 17.0111 | 17.0333 | 2664 |
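The comparison in the table can be sketched as follows. The observations are hypothetical and the variable names are chosen only for illustration. The sketch assumes that the total-sample mean of a variable uses every row where that variable is observed, which would explain why the number of observations differs by variable, while the used-sample mean uses only the rows where all variables are observed.

```python
def mean_available(rows, variable):
    """Mean of `variable` over the rows where it is observed, plus the count."""
    values = [r[variable] for r in rows if r[variable] is not None]
    return sum(values) / len(values), len(values)


def complete_cases(rows):
    """Keep only the rows in which every variable is observed."""
    return [r for r in rows if all(v is not None for v in r.values())]


# Hypothetical observations; None marks a missing value.
rows = [
    {"tfr": 1.40, "family_benefit": 2.5, "ur_15_19": None},
    {"tfr": 1.60, "family_benefit": 2.1, "ur_15_19": 30.0},
    {"tfr": 1.80, "family_benefit": None, "ur_15_19": 40.0},
    {"tfr": 1.50, "family_benefit": 1.5, "ur_15_19": 50.0},
]
used = complete_cases(rows)

for variable in ("tfr", "family_benefit", "ur_15_19"):
    total_mean, n_total = mean_available(rows, variable)
    used_mean, _ = mean_available(used, variable)
    print(f"{variable}: total={total_mean:.4f} used={used_mean:.4f} n_total={n_total}")
```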

Methodology of model estimation

As described above, lasso regression is a suitable statistical tool for a high-dimensional dataset. Mathematically, lasso regression adds a \(\lambda\sum_{j = 1}^{p}|\beta_{j}|\) term to the objective function of the regression. Intuitively, with this additional term, the value of the objective function (which has to be minimized) is lower if more parameters are set to zero, provided that the prediction error does not increase significantly.
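Written out in full, the penalized objective takes the following standard form. The symbols \(y_i\) (outcome), \(x_{ij}\) (predictor \(j\) for observation \(i\)) and \(n\) (number of observations) are notation assumed here, and the intercept terms are omitted for brevity.

\[\hat{\beta}^{\text{lasso}} = \underset{\beta}{\arg\min} \left\{ \sum_{i=1}^{n} \Big(y_i - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right\}\]

Software implementations often rescale the first term (for example by \(1/(2n)\)), so values of \(\lambda\) are not directly comparable across implementations.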

Lasso regression has a hyperparameter \(\lambda\). Setting its value to zero leads to the unmodified OLS model. In contrast, a sufficiently large \(\lambda\) shrinks every coefficient to zero and leads to an empty model. Finding the optimal \(\lambda\) requires estimating the model with different values of the hyperparameter; in each case, the model contains a different number of variables. To determine the optimal value of \(\lambda\), 10-fold cross-validation is performed, and the model with the lowest mean squared error on the validation sets is chosen. This process is visualized in Figure 12.
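The selection of \(\lambda\) by cross-validation can be illustrated with a small self-contained sketch. It is not the estimation code of this study: the data are synthetic, the candidate grid is arbitrary, the coordinate-descent solver `lasso_cd` is a simplified stand-in for a library routine, and the objective is assumed to be scaled as RSS/(2n) plus the penalty.

```python
import random


def soft_threshold(z, gamma):
    """Shrink z toward zero by gamma; anything within [-gamma, gamma] becomes 0."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def lasso_cd(X, y, lam, n_iter=50):
    """Coordinate descent for RSS / (2n) + lam * sum(|beta_j|), without intercept."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residuals when every coefficient is zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            # Average cross-product of predictor j with the partial residual
            # (the residual computed as if predictor j were left out).
            rho = sum(X[i][j] * (resid[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            for i in range(n):
                resid[i] -= X[i][j] * (new_bj - beta[j])
            beta[j] = new_bj
    return beta


def cv_mse(X, y, lam, k=10, seed=0):
    """Mean squared validation error of the lasso, averaged over k folds."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)  # same folds for every lambda
    fold_errors = []
    for f in range(k):
        held_out = set(idx[f::k])
        train = [i for i in idx if i not in held_out]
        beta = lasso_cd([X[i] for i in train], [y[i] for i in train], lam)
        sq_err = [(y[i] - sum(b * x for b, x in zip(beta, X[i]))) ** 2 for i in held_out]
        fold_errors.append(sum(sq_err) / len(sq_err))
    return sum(fold_errors) / k


# Synthetic data: only the first two of the five predictors matter.
rng = random.Random(1)
X = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(100)]
y = [2.0 * row[0] - 1.5 * row[1] + rng.gauss(0, 0.5) for row in X]

lambdas = [0.01, 0.05, 0.1, 0.5, 1.0, 5.0]
cv_errors = {lam: cv_mse(X, y, lam) for lam in lambdas}
best_lam = min(cv_errors, key=cv_errors.get)
print("best lambda:", best_lam)
print("coefficients at best lambda:", lasso_cd(X, y, best_lam))
print("coefficients at lambda = 5:", lasso_cd(X, y, 5.0))
```

In this toy setting the largest candidate value already produces an empty model, while the cross-validated choice keeps the informative predictors.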

Figure 12: Performance of lasso regression models with different parameters

It is important to note that the algorithm described above eliminated the insignificant individual intercept terms from the model. To correct for this, I re-estimate the within model including all the individual intercept terms and the predictor variables from the best-performing lasso regression model.
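Keeping every individual intercept is equivalent to demeaning each variable by territorial unit (the within transformation) before running least squares. A minimal one-predictor sketch with hypothetical data follows; the unit labels, the values and both helper functions are assumptions made for illustration.

```python
from collections import defaultdict


def within_transform(units, values):
    """Subtract each unit's own mean, which absorbs the individual intercepts."""
    totals, counts = defaultdict(float), defaultdict(int)
    for u, v in zip(units, values):
        totals[u] += v
        counts[u] += 1
    return [v - totals[u] / counts[u] for u, v in zip(units, values)]


def ols_slope(x, y):
    """Slope of a one-predictor least-squares fit with an intercept."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


# Hypothetical panel: y = unit intercept + 0.5 * x, intercept 10 for A and 0 for B.
units = ["A", "A", "A", "B", "B", "B"]
x = [1.0, 2.0, 3.0, 11.0, 12.0, 13.0]
y = [10.5, 11.0, 11.5, 5.5, 6.0, 6.5]

pooled_slope = ols_slope(x, y)
within_slope = ols_slope(within_transform(units, x), within_transform(units, y))
print("pooled:", round(pooled_slope, 3), "within:", round(within_slope, 3))
```

In this example the pooled slope is negative because it confuses the difference between the units with the effect of the predictor, while the within slope recovers the common slope of 0.5.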

In addition, differences in the measurement scales of the variables complicate the interpretation. Interpreting the direct effect of a predictor is simple, but the coefficients alone do not describe the importance of the variables, because the variables are measured on different scales (a one percentage point change in the family benefit-to-GDP ratio would be extreme, whereas a one percentage point change in the youth unemployment rate is far less remarkable). To handle this, I re-estimate the model with standardized variables. A further benefit of this model is that it allows the explained variance of the regression to be decomposed:

\[\text{R}^2 = \sum_{j=1}^{p} r_j \times \beta_{\text{standardized},j} \tag{4}\]

Based on equation 4, each standardized coefficient multiplied by the corresponding linear correlation coefficient (\(r_j\)) between the predictor and the outcome can be interpreted as that predictor's contribution to the explained variance. The results of this and of the computations mentioned above are reported in Figure 13.
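The decomposition can be checked numerically. The sketch below uses synthetic data with two predictors and a plain (not panel) regression, so it only illustrates the identity and does not reproduce the reported model; all names and values are assumptions.

```python
import random
from statistics import mean, pstdev


def standardize(v):
    """Rescale a variable to mean 0 and (population) standard deviation 1."""
    m, s = mean(v), pstdev(v)
    return [(a - m) / s for a in v]


def corr(a, b):
    """Pearson correlation of two already standardized variables."""
    return sum(x * y for x, y in zip(a, b)) / len(a)


rng = random.Random(42)
x1 = [rng.gauss(0, 1) for _ in range(200)]
x2 = [0.5 * a + rng.gauss(0, 1) for a in x1]
y = [1.0 * a - 0.7 * b + rng.gauss(0, 1) for a, b in zip(x1, x2)]

z1, z2, zy = standardize(x1), standardize(x2), standardize(y)
r_1y, r_2y, r_12 = corr(z1, zy), corr(z2, zy), corr(z1, z2)

# Standardized OLS coefficients from the two-predictor normal equations.
b1 = (r_1y - r_12 * r_2y) / (1 - r_12 ** 2)
b2 = (r_2y - r_12 * r_1y) / (1 - r_12 ** 2)

contributions = {"x1": r_1y * b1, "x2": r_2y * b2}
r2_decomposed = sum(contributions.values())

# Direct R^2 = 1 - RSS / TSS from the fitted values, as a cross-check.
rss = sum((t - (b1 * u + b2 * v)) ** 2 for t, u, v in zip(zy, z1, z2))
tss = sum(t ** 2 for t in zy)
r2_direct = 1 - rss / tss

print(contributions, r2_decomposed, r2_direct)
```

The individual terms always sum to \(\text{R}^2\), but a single term can be negative when a predictor's coefficient and its correlation with the outcome have opposite signs, so the contributions should be read with that caveat in mind.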

Figure 13: Estimated coefficient of the fixed panel model controlling for youth unemployment indicators

Interpreting the results

Figure 13 reveals the high contribution of the income index and family benefit to the explained variance. Both seem to have a positive effect on fertility. According to the model parameters, increasing income per capita leads to a higher total fertility rate (a positive coefficient corresponds to each lagged variable). In contrast, the coefficients of the different lagged values of family support indicate a more complex mechanism: family support has a high instantaneous effect on fertility, but the negative lagged effect implies that this birth surplus disappears in the following years.

Based on the results, the answer to our research question is that in the developed world (1) income is by far the most important component of human development influencing fertility (highest contribution to \(\text{R}^2\)) and (2) family support also has a significant instantaneous effect on childbearing willingness, but it seems weaker in the long run.

The youth unemployment rate among 25-29-year-olds has a negative effect on fertility, but the full effect appears only with a lag. Among 15-19-year-olds the effect is different: an increase in unemployment causes an instantaneous increase in fertility, but if the increase in unemployment is permanent, this surplus disappears (ceteris paribus). A comparison of the standardized effects of the unemployment rates and of the interactions shows that extending the model with the interactions of youth unemployment and family benefit was beneficial.

The education index is also identified as a significant predictor of the variance of fertility, but its interpretation is more complex. The lagged quadratic terms enter the model with negative coefficients. This indicates that a higher education index leads to lower fertility, and that this effect is stronger in regions where the education index is higher. This confirms earlier findings in the literature that highly educated women tend to have fewer children, although recent research found empirical evidence that “higher educated German women, who already decided to have a child despite their high opportunity costs are more family oriented”.

Framework II: without unemployment

I continue by reporting the results from the model that excludes unemployment statistics. This framework omits significantly fewer data points, so the risk of selection bias is reduced. Table 6 compares the variable means in the total sample with those in the used sample.

The methodology of the model estimation is the same as the one described for the first framework. The results are presented in Figure 14. The \(\text{H}_0\) of the Chow test (\(p = 0.00\%\)) and of the Hausman test (\(p = 0.00\%\)) are both rejected, so the fixed-effects model is adequate. The \(\text{R}^2\) of this model is 14.85%¹⁴.

Table 6: Comparison of average values of the variables for incomplete and complete observations (Framework II)

| Variable | Mean in total sample | Mean in used sample | Number of observations in the total sample |
|---|---|---|---|
| TFR | 1.5575 | 1.4935 | 7249 |
| Education index | 0.5159 | 0.4976 | 5373 |
| Health index | 0.9089 | 0.9202 | 6983 |
| Income index | 0.8120 | 0.8228 | 4291 |
| Family benefit | 2.0759 | 1.9699 | 6155 |

Figure 14: Estimated coefficient of the fixed panel model omitting youth unemployment indicators

Interpreting the results

Figure 14 shows that the outstandingly high contribution of the income index to the \(\text{R}^2\) did not change, so the answer to my first research question is robust to the framework: the income index explains significantly more of the variance of fertility than the other components of human development, and it has a positive effect on childbearing willingness.

The estimated effect of family benefit differs in this framework compared to the one including unemployment statistics (and excluding observations where youth unemployment is not available). The new observations come from Central European regions, and the estimated structure of the effect of family benefit became significantly different. In this framework, the lagged reversal is close in magnitude to the instantaneous effect of family spending. This leads to the interpretation that family support has only an instantaneous impact on fertility and cannot significantly influence it in the long run.
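The distinction between the instantaneous and the long-run effect can be made explicit with a simple distributed-lag form. This is a simplified sketch that ignores the quadratic and interaction terms of the estimated model; the symbols \(x_t\) (family benefit), \(y_t\) (fertility), \(\alpha\), \(\varepsilon_t\) and the maximum lag \(L\) are notation assumed here.

\[y_t = \alpha + \sum_{k=0}^{L} \beta_k x_{t-k} + \varepsilon_t, \qquad \text{instantaneous effect} = \beta_0, \qquad \text{long-run effect} = \sum_{k=0}^{L} \beta_k\]

A permanent unit increase in \(x\) changes \(y\) by \(\beta_0\) in the first year and by \(\sum_{k=0}^{L}\beta_k\) once all lags have taken effect. If the lagged coefficients sum to roughly \(-\beta_0\), the long-run effect is close to zero even though the instantaneous effect is large.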

The main difference between the results of the two frameworks is that the health index is identified as a significant variable in the second one. Since all of its transformed terms have negative coefficients, a higher health index leads to lower fertility. The frequently mentioned reason for this is the changing lifestyle: women in the EU are having their first child later. One possible explanation for why the health index was not relevant in the previous framework is that the share of observations from Central European countries is much higher in this model (due to the lack of unemployment statistics from the early 2000s). The rise in the average childbearing age led to falling fertility rates in this area¹⁵. Many articles suggest, however, that an adjusted TFR should be considered¹⁶, because the TFR is drastically low while this postponement is under way. These indices are currently not available for regional datasets, but computing them could be a direction for further research.


  14. The heterogeneity of the countries with complete observations is higher in this setup, so comparing the \(R^2\) of the two frameworks is not advisable.

  15. Berde, É. and Németh, P. (2014). Az alacsony magyarországi termékenység új megközelítésben. Statisztikai Szemle, 92(3):253–274.

  16. Berde, É. and Németh, P. (2014). Az alacsony magyarországi termékenység új megközelítésben. Statisztikai Szemle, 92(3):253–274.