Feel free to try the exercises below at your leisure. Solutions will be posted later in the week!
Using the data set linked here,
we will fit some basic linear regression models that
predict gdp08 from dem_score14,
pop_urban, and oecd (i.e., regressing
gdp08 on dem_score14, pop_urban,
and oecd).
Before estimating the models: 1) standardize
dem_score14 and pop_urban;
2) create a dummy variable for OECD membership.
library(dplyr)
world_data <- read.csv("https://github.com/apodkul/ppol6803_03/raw/main/Data/world_data.csv")
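# Standardize the two continuous predictors and recode oecd as a 0/1 dummy
world_data <- world_data %>%
  mutate(dem_score14_s = as.numeric(scale(dem_score14)), # as.numeric() drops scale()'s matrix attributes
         pop_urban_s = as.numeric(scale(pop_urban)),
         oecd_dummy = case_when(
           oecd == 'OECD Member state' ~ 1,
           oecd == 'Not member' ~ 0,
           is.na(oecd) ~ NA_real_
         ))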
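As a quick sanity check on the two transformations above, the sketch below applies them to a small made-up vector instead of world_data. The toy values and the names x, z, z_manual, membership, and dummy are illustrative assumptions, and ifelse() stands in for case_when() only to keep the example in base R.

```r
# Toy values (hypothetical), not taken from world_data
x <- c(2, 4, 6, NA)

# scale() centers on the mean and divides by the standard deviation,
# ignoring missing values when computing both
z <- as.numeric(scale(x))
z_manual <- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
print(z)         # -1 0 1 NA
print(z_manual)  # same values

# Dummy coding: 1 for members, 0 for non-members, NA stays NA
membership <- c("OECD Member state", "Not member", NA)
dummy <- ifelse(membership == "OECD Member state", 1, 0)
print(dummy)     # 1 0 NA
```

Unlike case_when() with explicit labels, this ifelse() shortcut would code any unexpected label as 0, so the explicit version used in the exercise is the safer choice on real data.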
# Randomly split the rows into two halves (A and B); the seed makes the split reproducible
set.seed(1234)
library(caret)
trainIndex <- createDataPartition(1:nrow(world_data),
                                  p = .5, list = FALSE,
                                  times = 1)
world_data_A <- world_data[trainIndex, ]
world_data_B <- world_data[-trainIndex, ]
Estimate the model separately on each half using lm(). Compare the \(R^2\) and RMSE values of each model.
mod_a <- lm(gdp08 ~ dem_score14_s + pop_urban_s + oecd_dummy,
            data = world_data_A)
mod_b <- lm(gdp08 ~ dem_score14_s + pop_urban_s + oecd_dummy,
            data = world_data_B)
# extract RMSE and R-squared here -- values will vary due to randomization
summary(mod_a)
##
## Call:
## lm(formula = gdp08 ~ dem_score14_s + pop_urban_s + oecd_dummy,
## data = world_data_A)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -788.4 -159.9  -88.4   23.0 3619.3
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)     142.22      74.32   1.914  0.05968 .
## dem_score14_s   -84.49      86.43  -0.978  0.33162
## pop_urban_s      95.30      72.27   1.319  0.19152
## oecd_dummy      695.63     242.48   2.869  0.00542 **
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 555.2 on 71 degrees of freedom
## (9 observations deleted due to missingness)
## Multiple R-squared: 0.1607, Adjusted R-squared: 0.1252
## F-statistic: 4.531 on 3 and 71 DF, p-value: 0.005791
summary(mod_b)
##
## Call:
## lm(formula = gdp08 ~ dem_score14_s + pop_urban_s + oecd_dummy,
## data = world_data_B)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -1600.2  -331.8  -258.8  -139.5 12581.7
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)     316.64     255.11   1.241   0.2185
## dem_score14_s   -48.60     281.58  -0.173   0.8634
## pop_urban_s      54.09     251.32   0.215   0.8302
## oecd_dummy     1298.68     635.75   2.043   0.0447 *
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1865 on 73 degrees of freedom
## (6 observations deleted due to missingness)
## Multiple R-squared: 0.08694, Adjusted R-squared: 0.04941
## F-statistic: 2.317 on 3 and 73 DF, p-value: 0.08267
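The summaries above report \(R^2\) directly but not RMSE. A minimal sketch of one way to pull both out of a fitted lm object follows. The helper name fit_stats is hypothetical, and the stand-in model uses the built-in mtcars data so the block runs on its own; in the exercise you would pass mod_a and mod_b instead.

```r
# Helper (hypothetical name): in-sample R-squared and RMSE from an lm fit
fit_stats <- function(model) {
  res <- residuals(model)  # rows dropped for missingness are already excluded
  c(r_squared = summary(model)$r.squared,
    rmse = sqrt(mean(res^2)))
}

# Stand-in model on a built-in data set; in the exercise, pass mod_a and mod_b
fit <- lm(mpg ~ wt + hp, data = mtcars)
out <- fit_stats(fit)
print(out)

# The residual standard error divides by the residual degrees of freedom
# rather than n, so it is slightly larger than the RMSE
print(sigma(fit))
```

Calling fit_stats(mod_a) and fit_stats(mod_b) side by side would give the comparison the exercise asks for.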
Estimate the same model using caret::train with the
method argument set to 'lm' (keep the other arguments at their default
values) and compare to the models estimated in step #3. How do the
outputs in caret::train and lm() differ?
mod <- caret::train(gdp08 ~ dem_score14_s + pop_urban_s + oecd_dummy,
                    method = 'lm',
                    data = world_data,
                    na.action = na.pass # model won't run without dealing with missing data
)
summary(mod) # results will vary due to randomization
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -1326.0  -252.3  -169.1   -67.5 12848.3
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)     224.53     130.86   1.716  0.08828 .
## dem_score14_s   -75.60     148.00  -0.511  0.61022
## pop_urban_s      75.87     128.55   0.590  0.55596
## oecd_dummy     1131.73     365.74   3.094  0.00236 **
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1378 on 148 degrees of freedom
## (15 observations deleted due to missingness)
## Multiple R-squared: 0.09398, Adjusted R-squared: 0.07561
## F-statistic: 5.117 on 3 and 148 DF, p-value: 0.002136
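One way to see how the two outputs differ is to look inside the object that caret::train returns. The sketch below assumes caret's default resampling settings and uses the built-in mtcars data as a stand-in so it runs on its own. The names fit_caret and fit_lm are illustrative, not part of the exercise.

```r
library(caret)

set.seed(1234)  # resampling is random, so fix the seed

# Stand-in model on a built-in data set (assumption; not world_data)
fit_caret <- train(mpg ~ wt + hp, data = mtcars, method = "lm")
fit_lm <- lm(mpg ~ wt + hp, data = mtcars)

# Resampled performance estimates (bootstrap by default): RMSE, Rsquared, MAE
print(fit_caret$results)

# The final model is an ordinary lm fit on all rows, so coefficients match lm()
print(coef(fit_caret$finalModel))
print(coef(fit_lm))

# In-sample R-squared from lm(), for comparison with the resampled Rsquared
print(summary(fit_lm)$r.squared)
```

The coefficients agree because train() refits the chosen model on the full data set. What it adds is the resampled RMSE, Rsquared, and MAE, which estimate out-of-sample performance, whereas the lm() summary is computed in-sample.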