Feel free to try the exercises below at your leisure. Solutions will be posted later in the week!

Prepping Data and Running Basic Models

Using the data set linked here, we will run some basic linear regression models that predict gdp08 from dem_score14, pop_urban, and oecd (i.e., regressing gdp08 on dem_score14, pop_urban, and oecd).

  1. Load and prepare the data. Create the following transformed variables: 1) scaled versions of dem_score14 and pop_urban; 2) a dummy variable for OECD membership.
library(dplyr)
world_data <- read.csv("https://github.com/apodkul/ppol6803_03/raw/main/Data/world_data.csv")

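world_data <- world_data %>%
  mutate(
    # scale() returns a one-column matrix; as.numeric() keeps a plain vector
    dem_score14_s = as.numeric(scale(dem_score14)),
    pop_urban_s   = as.numeric(scale(pop_urban)),
    # 1 = OECD member, 0 = non-member, NA where membership is missing
    oecd_dummy = case_when(
      oecd == 'OECD Member state' ~ 1,
      oecd == 'Not member'        ~ 0,
      is.na(oecd)                 ~ NA_real_
    ))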
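As a quick sanity check on the transformations in this step, the sketch below applies the same ideas to a small made-up data frame. The `toy` object and its values are hypothetical and stand in for `world_data`; `ifelse()` is used as a base-R shortcut that matches the `case_when()` recode only because the toy column has just two non-missing levels.

```r
# Toy stand-in for world_data (values are made up for illustration)
toy <- data.frame(
  pop_urban = c(20, 40, 60, 80),
  oecd      = c("Not member", "OECD Member state", NA, "Not member")
)

# scale() centers on the mean and divides by the sample standard deviation;
# as.numeric() drops the one-column matrix wrapper that scale() returns
toy$pop_urban_s <- as.numeric(scale(toy$pop_urban))

# Two-level recode: member -> 1, non-member -> 0, missing stays NA
toy$oecd_dummy <- ifelse(toy$oecd == "OECD Member state", 1, 0)

mean(toy$pop_urban_s)  # 0, up to floating-point error
sd(toy$pop_urban_s)    # 1
toy$oecd_dummy         # 0 1 NA 0
```

A scaled variable should always have mean 0 and standard deviation 1, so the same two checks can be run on `dem_score14_s` and `pop_urban_s` (with `na.rm = TRUE` if the columns contain missing values).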
  2. Randomly split the data set with 50% of the rows in Group A and 50% in Group B. (Try to make this step replicable. For a hint see here.)
set.seed(1234)
library(caret)

trainIndex <- createDataPartition(1:nrow(world_data),
                                  p = .5, list = FALSE,
                                  times = 1)

world_data_A <- world_data[trainIndex,]
world_data_B  <- world_data[-trainIndex,]
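To see why setting the seed makes the split replicable, here is a minimal base-R sketch that swaps `createDataPartition()` for `sample()`. The row count `n` and the object names are hypothetical; the point is only that resetting the seed replays the same random draw.

```r
n <- 10  # hypothetical number of rows

set.seed(1234)
idx_first <- sample(seq_len(n), size = floor(0.5 * n))

set.seed(1234)  # resetting the seed replays the same random draw
idx_second <- sample(seq_len(n), size = floor(0.5 * n))

identical(idx_first, idx_second)  # TRUE

# Negative indexing sends every remaining row to the second group
group_a <- seq_len(n)[idx_first]
group_b <- seq_len(n)[-idx_first]
length(intersect(group_a, group_b))  # 0: no row lands in both groups
```

Without the `set.seed()` calls, each run would generally produce a different split and therefore different model estimates.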
  3. Run a basic linear model for each group created in the previous step using lm(). Compare the \(R^2\) and RMSE values of each model.
mod_a <- lm(gdp08~dem_score14_s+pop_urban_s+oecd_dummy, 
            data = world_data_A)
mod_b <- lm(gdp08~dem_score14_s+pop_urban_s+oecd_dummy, 
            data = world_data_B)

# summary() reports R-squared and the residual standard error (an RMSE-like measure);
# the values depend on the random split
summary(mod_a)
## 
## Call:
## lm(formula = gdp08 ~ dem_score14_s + pop_urban_s + oecd_dummy, 
##     data = world_data_A)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -788.4 -159.9  -88.4   23.0 3619.3 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept)     142.22      74.32   1.914  0.05968 . 
## dem_score14_s   -84.49      86.43  -0.978  0.33162   
## pop_urban_s      95.30      72.27   1.319  0.19152   
## oecd_dummy      695.63     242.48   2.869  0.00542 **
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 555.2 on 71 degrees of freedom
##   (9 observations deleted due to missingness)
## Multiple R-squared:  0.1607, Adjusted R-squared:  0.1252 
## F-statistic: 4.531 on 3 and 71 DF,  p-value: 0.005791
summary(mod_b)
## 
## Call:
## lm(formula = gdp08 ~ dem_score14_s + pop_urban_s + oecd_dummy, 
##     data = world_data_B)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1600.2  -331.8  -258.8  -139.5 12581.7 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept)     316.64     255.11   1.241   0.2185  
## dem_score14_s   -48.60     281.58  -0.173   0.8634  
## pop_urban_s      54.09     251.32   0.215   0.8302  
## oecd_dummy     1298.68     635.75   2.043   0.0447 *
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1865 on 73 degrees of freedom
##   (6 observations deleted due to missingness)
## Multiple R-squared:  0.08694,    Adjusted R-squared:  0.04941 
## F-statistic: 2.317 on 3 and 73 DF,  p-value: 0.08267
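To compare the two groups on RMSE and \(R^2\) directly, rather than reading them off the printed summaries, something like the sketch below could be used. `fit_metrics()` is a hypothetical helper, not part of any package, and `mtcars` (which ships with R) stands in for the exercise data so that the block runs on its own.

```r
# Hypothetical helper: pull in-sample RMSE and R-squared out of a fitted lm
fit_metrics <- function(model) {
  c(rmse      = sqrt(mean(residuals(model)^2)),
    r_squared = summary(model)$r.squared)
}

# mtcars ships with R and stands in for the exercise data here
demo_fit <- lm(mpg ~ wt + hp, data = mtcars)
fit_metrics(demo_fit)

# The "Residual standard error" printed by summary() divides by the residual
# degrees of freedom rather than n, so it is slightly larger than this RMSE
sigma(demo_fit)
```

Under these assumptions, `fit_metrics(mod_a)` and `fit_metrics(mod_b)` would give the two pairs of numbers to compare.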
  4. Re-estimate a linear model using caret::train with method = 'lm' (keep the other arguments at their default values) and compare to the models estimated in step #3. How do the outputs in caret::train and lm() differ?
mod <- caret::train(gdp08~dem_score14_s+pop_urban_s+oecd_dummy,
             method = 'lm',
             data = world_data,
             # train() stops on missing data by default; na.pass hands the NAs to lm(), which drops them
             na.action = na.pass
             )
# summary() shows the final model fit on all rows; the resampled metrics (print(mod)) vary with the seed
summary(mod)
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1326.0  -252.3  -169.1   -67.5 12848.3 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept)     224.53     130.86   1.716  0.08828 . 
## dem_score14_s   -75.60     148.00  -0.511  0.61022   
## pop_urban_s      75.87     128.55   0.590  0.55596   
## oecd_dummy     1131.73     365.74   3.094  0.00236 **
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1378 on 148 degrees of freedom
##   (15 observations deleted due to missingness)
## Multiple R-squared:  0.09398,    Adjusted R-squared:  0.07561 
## F-statistic: 5.117 on 3 and 148 DF,  p-value: 0.002136
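One way to see how the two outputs differ is to inspect the pieces of the object that `train()` returns. The sketch below again uses `mtcars` as a stand-in, with a hypothetical object name `demo_train`. With default settings, `train()` evaluates the model on bootstrap resamples and then refits it once on all rows.

```r
library(caret)

set.seed(1234)  # the bootstrap resamples are random, so fix the seed
demo_train <- train(mpg ~ wt + hp, data = mtcars, method = "lm")

# Resampled performance: RMSE, Rsquared, MAE and their standard deviations
demo_train$results

# The final model is an ordinary lm fit on every row,
# so its coefficients match a direct lm() call
coef(demo_train$finalModel)
coef(lm(mpg ~ wt + hp, data = mtcars))
```

So `lm()` reports in-sample fit only, while `train()` additionally reports out-of-sample estimates of RMSE and \(R^2\) averaged over the resamples. Those resampled estimates are typically somewhat less optimistic than the in-sample values.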