Problem 1: Lavine Question 2.8 (slightly modified) The book Data by Andrews and Herzberg [1985] contains lots of data sets that have been used for various purposes in statistics. One famous data set records the annual number of deaths by horse kicks in the Prussian Army from 1875-1894 for each of 14 corps. The data are included below. (The data come from Table 4.1 in the Andrews and Herzberg [1985] book.)
4 1 1 1875 0 0 0 0 0 0 0 1 1 0 0 0 1 0 3
4 1 2 1876 2 0 0 0 1 0 0 0 0 0 0 0 1 1 5
4 1 3 1877 2 0 0 0 0 0 1 1 0 0 1 0 2 0 7
4 1 4 1878 1 2 2 1 1 0 0 0 0 0 1 0 1 0 9
4 1 5 1879 0 0 0 1 1 2 2 0 1 0 0 2 1 0 10
4 1 6 1880 0 3 2 1 1 1 0 0 0 2 1 4 3 0 18
4 1 7 1881 1 0 0 2 1 0 0 1 0 1 0 0 0 0 6
4 1 8 1882 1 2 0 0 0 0 1 0 1 1 2 1 4 1 14
4 1 9 1883 0 0 1 2 0 1 2 1 0 1 0 3 0 0 11
4 1 10 1884 3 0 1 0 0 0 0 1 0 0 2 0 1 1 9
4 1 11 1885 0 0 0 0 0 0 1 0 0 2 0 1 0 1 5
4 1 12 1886 2 1 0 0 1 1 1 0 0 1 0 1 3 0 11
4 1 13 1887 1 1 2 1 0 0 3 2 1 1 0 1 2 0 15
4 1 14 1888 0 1 1 0 0 1 1 0 0 0 0 1 1 0 6
4 1 15 1889 0 0 1 1 0 1 1 0 0 1 2 2 0 2 11
4 1 16 1890 1 2 0 2 0 1 1 2 0 2 1 1 2 2 17
4 1 17 1891 0 0 0 1 1 1 0 1 1 0 3 3 1 0 12
4 1 18 1892 1 3 2 0 1 1 3 0 1 1 0 1 1 0 15
4 1 19 1893 0 1 0 0 0 1 0 2 0 0 1 3 0 0 8
4 1 20 1894 1 0 0 0 0 0 0 0 1 0 1 1 0 0 4
Let Y_ij be the number of deaths in year i, corps j, for i = 1875, ..., 1894 and j = 1, ..., 14. The Y_ij values are in columns 5-18 of the table.
(a) What are the first four columns of the table?
The first four columns are bookkeeping fields from the Andrews and Herzberg table: the first (4) is the chapter number, the second (1) is the table number within the chapter, the third (1-20) is the observation (row) number, and the fourth is the year.
(b) What is the last column of the table?
The last column is the total number of deaths across all 14 corps for that year.
(c) What is a good model for the data? Explain.
A good model for these data is the Poisson distribution: the observations are counts of rare, independent events per corps-year, and many of the observed values are 0.
(d) Suppose you model the data as i.i.d. (independent and identically distributed) Poisson(lambda).
i. Plot the likelihood function for lambda.
The Poisson Distribution in R: dpois(x, lambda, log = FALSE)
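As a quick illustration of dpois() (a sketch using 0.7, the overall per-corps-per-year rate computed later in this problem):

```r
# probability of 0, 1, or 2 deaths in one corps-year if lambda = 0.7
dpois(0:2, lambda = 0.7)
# ~ 0.497 0.348 0.122, so zeros should be the most common value, as in the table
```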
kickDeaths <- read.csv(file = "horsedeaths.csv", header = T)
str(kickDeaths)
## 'data.frame': 20 obs. of 19 variables:
## $ ï..chap: int 4 4 4 4 4 4 4 4 4 4 ...
## $ section: int 1 1 1 1 1 1 1 1 1 1 ...
## $ obvs : int 1 2 3 4 5 6 7 8 9 10 ...
## $ year : int 1875 1876 1877 1878 1879 1880 1881 1882 1883 1884 ...
## $ corp1 : int 0 2 2 1 0 0 1 1 0 3 ...
## $ corp2 : int 0 0 0 2 0 3 0 2 0 0 ...
## $ corp3 : int 0 0 0 2 0 2 0 0 1 1 ...
## $ corp4 : int 0 0 0 1 1 1 2 0 2 0 ...
## $ corp5 : int 0 1 0 1 1 1 1 0 0 0 ...
## $ corp6 : int 0 0 0 0 2 1 0 0 1 0 ...
## $ corp7 : int 0 0 1 0 2 0 0 1 2 0 ...
## $ corp8 : int 1 0 1 0 0 0 1 0 1 1 ...
## $ corp9 : int 1 0 0 0 1 0 0 1 0 0 ...
## $ corp10 : int 0 0 0 0 0 2 1 1 1 0 ...
## $ corp11 : int 0 0 1 1 0 1 0 2 0 2 ...
## $ corp12 : int 0 0 0 0 2 4 0 1 3 0 ...
## $ corp13 : int 1 1 2 1 1 3 0 4 0 1 ...
## $ corp14 : int 0 1 0 0 0 0 0 1 0 1 ...
## $ total : int 3 5 7 9 10 18 6 14 11 9 ...
noDeathsByCorp<-apply(kickDeaths[,5:18],2,sum) # Returns a vector of values obtained by applying sum to columns 5-18 to get # of deaths by corp
noDeathsByCorp
## corp1 corp2 corp3 corp4 corp5 corp6 corp7 corp8 corp9 corp10
## 16 16 12 12 8 11 17 12 7 13
## corp11 corp12 corp13 corp14
## 15 25 24 8
# writing a negative log likelihood function for the Poisson:
nllForPoisson<-function(lamda,data){
nll<- -sum(dpois(x=data,lambda=lamda,log=TRUE))
return(nll)
}
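As a sanity check (a sketch on a hypothetical toy vector, not the full data): for i.i.d. Poisson data the MLE has the closed form lambda-hat = mean(y), so the negative log-likelihood should be minimized at the sample mean.

```r
# toy data (made-up values); the Poisson negative log-likelihood
# should bottom out at the sample mean
toyData <- c(16, 16, 12, 12, 8)
nllToy <- function(lam) -sum(dpois(toyData, lambda = lam, log = TRUE))
optimize(nllToy, interval = c(0.1, 100))$minimum # ~ mean(toyData) = 12.8
```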
# evaluate the negative log-likelihood over a grid of candidate lambda values
lamdaSequence<-seq(from=1,to=100,length.out=1000)
negLogLik<-vapply(X=lamdaSequence, FUN=nllForPoisson, FUN.VALUE=1, data=noDeathsByCorp)
# vapply() returns a vector of the same length as X, each element of which is the
# result of applying FUN (here, nllForPoisson) to the corresponding element of X:
# vapply(X, FUN, FUN.VALUE, ..., USE.NAMES = TRUE)
plot(x=lamdaSequence, y=negLogLik, type='l', main = "'Just Do It'", xlab= "lambda", ylab="negative log-likelihood") # the Nike swoosh of likelihoods
ii. Find lambda-hat (i.e., the MLE of lambda) by corps and by year.
library(Rvmmin)
## Warning: package 'Rvmmin' was built under R version 3.3.3
# ?Rvmmin
# Rvmmin(par, fn, gr, lower, upper, bdmsk, control = list(), ...)
lamda<-1 # set initial parameter value for lamda. This can be arbitrary.
poissonOutput<-Rvmmin(par=lamda,fn=nllForPoisson,lower=0.1,upper=Inf,data=noDeathsByCorp)
#par = A numeric vector of starting estimates.
#fn = a function that returns the value of the objective at the supplied set of parameters
#lower, upper = vectors of lower and upper bounds on the parameters
poissonOutput
## $par
## [1] 14
##
## $value
## [1] 44.03165
##
## $counts
## function gradient
## 17 13
##
## $convergence
## [1] 2
##
## $message
## [1] "Rvmminb appears to have converged"
##
## $bdmsk
## [1] 1
poissonOutput$value # minimized negative log-likelihood: "The value of the objective at the best set of parameters found."
## [1] 44.03165
poissonOutput$par # the MLE of lambda: "The best set of parameters found." This is the estimated mean number of deaths by horse kick per corps over the 20-year period
## [1] 14
iii. What can you say about the rate of death by horse kick in the Prussian cavalry at the end of the 19th century?
The MLE gives lambda-hat = 14: the average number of deaths by horse kick per corps over the 20-year period (1875-1894). Equivalently, the mean number of deaths per corps per year is 14/20 = 0.7.
mean(as.vector(as.matrix(kickDeaths[,5:18]))) # mean number of deaths per corps per year
## [1] 0.7
mean(noDeathsByCorp) # same as MLE of lamda
## [1] 14
(e) Is there any evidence that different corps had different death rates? Explain.
Yes. The negative log-likelihood curves for the individual corps, plotted below, reach their minima at noticeably different values of lambda: the per-corps MLEs (the 20-year totals) range from 7 to 25, so the corps do not all appear to share a single death rate.
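A simple numerical check (a sketch using the corps totals printed earlier): if every corps shared one rate, the 20-year totals would themselves be roughly Poisson, so their variance should be close to their mean.

```r
# corps totals copied from the noDeathsByCorp output above
corpsTotals <- c(16, 16, 12, 12, 8, 11, 17, 12, 7, 13, 15, 25, 24, 8)
mean(corpsTotals) # 14
var(corpsTotals)  # ~29.4, roughly double the mean, consistent with rates differing by corps
```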
lamdaSequence<-seq(from=1,to=100,length.out = 1000)
# use a for loop to plot the negative log-likelihood for each corps to examine differences in lambda
for(i in seq_along(noDeathsByCorp)){
Likelihood<-vapply(lamdaSequence, nllForPoisson, FUN.VALUE=1, data=noDeathsByCorp[i])
if(i==1){
plot(lamdaSequence, Likelihood, type='l', col=i, xlab = "lambda", ylab = "negative log-likelihood")
}else{
lines(lamdaSequence, Likelihood, type='l', col=i)
}
}
# label <- paste("Corp", 1:14)
# legend("topright", legend = label, col = seq_along(label), lty = 1, cex = 0.6)
Problem 2: This question uses the height vs. weight data from 6 Sep. The data can be found on the class website from that week.
weightData <- read.csv(file = "weightData.csv", header = TRUE)
head(weightData)
## gender height_in weight_lbs
## 1 m 72 165
## 2 m 67 160
## 3 m 68 160
## 4 f 64 114
## 5 f 66 135
## 6 m 69 148
linearModel <- lm(weight_lbs~height_in, weightData)
plot(weightData$height_in, weightData$weight_lbs)
abline(linearModel)
linearModel
##
## Call:
## lm(formula = weight_lbs ~ height_in, data = weightData)
##
## Coefficients:
## (Intercept) height_in
## -255.179 6.022
summary(linearModel)
##
## Call:
## lm(formula = weight_lbs ~ height_in, data = weightData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34.341 -11.363 -1.274 11.170 35.659
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -255.179 110.894 -2.301 0.03858 *
## height_in 6.022 1.637 3.678 0.00278 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.89 on 13 degrees of freedom
## Multiple R-squared: 0.5099, Adjusted R-squared: 0.4722
## F-statistic: 13.53 on 1 and 13 DF, p-value: 0.002785
# intercept = -255.179
# slope (effect of height) = 6.022
# standard error of the height slope = 1.637
# note: the residual standard error (17.89) is the direct estimate of sigma;
# se * sqrt(n) below is only used as a rough starting value for sigma in the MLE fit
sd <- 1.637 * sqrt(15)
sd # 6.34
## [1] 6.340074
# p = 0.00278 **
This is how you fit a regression using a likelihood function: model wt ~ N(mu, sigma) with mu = beta0 + beta1 * ht, i.e. wt ~ Normal(mean = beta0 + beta1*height, sd = sigma).
parameterVector<- c(-255.179,6.022,6.34) # vector to be called in parVec. These parameters were taken directly from linear model results above. First = beta0 (intercept), second = beta1 (slope for height), third = stdev
nllNorm <- function(parVec, data) {
beta0 <- parVec[1] # intercept
beta1 <- parVec[2] # slope
stdev <- parVec[3] # standard deviation
predictedWeight <- beta0+beta1*data$height_in
nllik <- -sum(dnorm(x=data$weight_lbs, mean = predictedWeight, sd = stdev, log = TRUE))
return(nllik)
}
nllNormResults <- nllNorm(parVec = parameterVector, data = weightData)
# pass data by name: optim's third positional argument is the gradient gr, not data
mleResult <- optim(parameterVector, nllNorm, data = weightData)
mleResult$par # RESULTS FROM MLE MODEL:
## [1] -255.128498 6.021578 16.646750
# intercept = -255.128498
# slope = 6.021578
# stdev = 16.646750
plot(weightData$height_in, weightData$weight_lbs, xlab = "height (in)", ylab = "weight (lbs)", main = "Plot with self-calculated regression using MLE")
abline(a = mleResult$par[1], b = mleResult$par[2]) # adds the MLE-fitted regression line (intercept and slope from optim) to the existing plot
# Create two 0/1 indicator (dummy) columns for gender: genderNoFemale is 1 for females and 0 for males, and genderNoMale is the reverse. The likelihood functions below need a numeric gender variable, and fitting the model once with each indicator shows the gender offset measured from either baseline.
weightData$genderNoFemale[weightData$gender == "m"] <- 0
weightData$genderNoFemale[weightData$gender == "f"] <- 1
weightData$genderNoMale[weightData$gender == "m"] <- 1
weightData$genderNoMale[weightData$gender == "f"] <- 0
head(weightData)
## gender height_in weight_lbs genderNoFemale genderNoMale
## 1 m 72 165 0 1
## 2 m 67 160 0 1
## 3 m 68 160 0 1
## 4 f 64 114 1 0
## 5 f 66 135 1 0
## 6 m 69 148 0 1
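As an aside, lm() builds these 0/1 indicator columns automatically when a factor is supplied, so the manual dummy columns are equivalent to passing gender directly (a sketch with made-up numbers, not the class data):

```r
# tiny hypothetical data set; lm() dummy-codes the factor g by itself
d <- data.frame(h = c(60, 64, 68, 70),
                w = c(110, 130, 150, 165),
                g = factor(c("f", "f", "m", "m")))
coef(lm(w ~ h + g, data = d)) # the "gm" coefficient is the male offset, like genderNoMale
```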
# linear model with an indicator for females
linearModelGenderFemale <- lm(weight_lbs~height_in+genderNoFemale, weightData)
linearModelGenderFemale
##
## Call:
## lm(formula = weight_lbs ~ height_in + genderNoFemale, data = weightData)
##
## Coefficients:
## (Intercept) height_in genderNoFemale
## -0.3987 2.4572 -29.0075
summary(linearModelGenderFemale)
##
## Call:
## lm(formula = weight_lbs ~ height_in + genderNoFemale, data = weightData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.146 -9.104 -1.518 7.276 23.482
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.3987 128.6798 -0.003 0.9976
## height_in 2.4572 1.8500 1.328 0.2088
## genderNoFemale -29.0075 10.4595 -2.773 0.0169 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.53 on 12 degrees of freedom
## Multiple R-squared: 0.7014, Adjusted R-squared: 0.6516
## F-statistic: 14.09 on 2 and 12 DF, p-value: 0.0007095
# intercept = -0.40
# slope for height = 2.46
# gender (female) offset = -29.01
# standard error of the height slope = 1.85 (residual standard error = 14.53)
1.85 * sqrt(15) # 7.16, a rough starting value for sigma below (se * sqrt(n))
## [1] 7.165019
# linear model with an indicator for males
linearModelGenderMale <- lm(weight_lbs~height_in+genderNoMale, weightData)
linearModelGenderMale
##
## Call:
## lm(formula = weight_lbs ~ height_in + genderNoMale, data = weightData)
##
## Coefficients:
## (Intercept) height_in genderNoMale
## -29.406 2.457 29.008
summary(linearModelGenderMale)
##
## Call:
## lm(formula = weight_lbs ~ height_in + genderNoMale, data = weightData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.146 -9.104 -1.518 7.276 23.482
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -29.406 121.433 -0.242 0.8127
## height_in 2.457 1.850 1.328 0.2088
## genderNoMale 29.008 10.460 2.773 0.0169 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.53 on 12 degrees of freedom
## Multiple R-squared: 0.7014, Adjusted R-squared: 0.6516
## F-statistic: 14.09 on 2 and 12 DF, p-value: 0.0007095
# intercept = -29.41
# slope for height = 2.46
# gender (male) offset = 29.01
# standard error of the height slope = 1.85 (residual standard error = 14.53)
1.85 * sqrt(15) # 7.16, a rough starting value for sigma below (se * sqrt(n))
## [1] 7.165019
# Plotting both lm regression lines (either indicator model implies the same two lines)
plot(weightData$height_in, weightData$weight_lbs, col = weightData$gender, xlab="Height", ylab="Weight", main="Height vs. weight regressed by effect of gender")
cf <- coef(linearModelGenderFemale)
abline(a=cf[1]+cf[3], b=cf[2], col="black") # female line (genderNoFemale = 1)
abline(a=cf[1], b=cf[2], col="red") # male line (genderNoFemale = 0)
# b) likelihood model with gender included: weight ~ N(beta0 + beta1*height + beta3*gender, sigma)
# beta0 is the intercept; beta1 and beta3 are the height slope and gender offset
########
# Calculating the MLE with the female indicator, using parameters from the lm above
# intercept = -0.399
# slope for height = 2.45
# gender (female) offset = -29.00
# sigma starting value = 7.16 (se * sqrt(n))
parameterVectorFemale <- c(-0.399,2.45,7.16,-29.0) # vector passed as parVec2: first = beta0 (intercept), second = beta1 (slope for height), third = stdev, fourth = beta3 (gender offset for females)
parameterVectorFemale
## [1] -0.399 2.450 7.160 -29.000
nllNormGenderFemale <- function(parVec2, data) {
beta0 <- parVec2[1] # intercept
beta1 <- parVec2[2] # slope for height
stdev <- parVec2[3] # standard deviation
beta3 <- parVec2[4] # gender offset (females)
predictedWeight <- beta0+beta1*data$height_in+beta3*data$genderNoFemale
nllik <- -sum(dnorm(x=data$weight_lbs, mean = predictedWeight, sd = stdev, log = TRUE))
return(nllik)
}
# pass data by name so optim does not mistake it for the gradient argument gr
mleResultFemale <- optim(parameterVectorFemale, nllNormGenderFemale, data = weightData)
mleResultFemale$par # RESULTS FROM MLE MODEL FOR FEMALES
## [1] -1.163321 2.468135 12.996923 -28.958684
# intercept = -1.16
# slope for height = 2.47
# stdev = 13.00
# gender (female) offset = -28.96
#########
# Calculating the MLE with the male indicator, using parameters from the lm above
# intercept = -29.41
# slope for height = 2.46
# gender (male) offset = 29.01
# sigma starting value = 7.16 (se * sqrt(n))
parameterVectorMale <- c(-29.45,2.45,7.16,-29.01) # vector passed as parVec3: first = beta0 (intercept), second = beta1 (slope for height), third = stdev, fourth = beta3 (gender offset; the sign of this starting value is off, but optim still recovers the positive male offset below)
parameterVectorMale
## [1] -29.45 2.45 7.16 -29.01
nllNormGenderMale <- function(parVec3, data) {
beta0 <- parVec3[1] # intercept
beta1 <- parVec3[2] # slope for height
stdev <- parVec3[3] # standard deviation
beta3 <- parVec3[4] # gender offset (males)
predictedWeight <- beta0+beta1*data$height_in+beta3*data$genderNoMale
nllik <- -sum(dnorm(x=data$weight_lbs, mean = predictedWeight, sd = stdev, log = TRUE))
return(nllik)
}
# pass data by name so optim does not mistake it for the gradient argument gr
mleResultMale <- optim(parameterVectorMale, nllNormGenderMale, data = weightData)
mleResultMale$par # RESULTS FROM MLE MODEL FOR MALES
## [1] -29.562301 2.459573 13.002862 28.997590
# intercept = -29.56
# slope for height = 2.46
# stdev = 13.00
# gender (male) offset = 29.00
# Plotting both regression lines calculated by MLE
plot(weightData$height_in, weightData$weight_lbs, col = weightData$gender, xlab="Height", ylab="Weight", main="Height vs. weight regressed by effect of gender")
abline(a=-1.16, b=2.47, col="red") # male line: female-model fit with gender indicator = 0
abline(a=-29.56, b=2.46, col="black") # female line: male-model fit with gender indicator = 0
##AIC for the linear models
##CODE:
#AIC(linearModel, linearModelGenderFemale)
#RESULT
#AIC for linear model: 132.94
#AIC for linear model with gender effect: 127.51
##AIC for the MLE fits
#AIC = 2k - 2log(likelihood), where k is the number of fitted parameters. optim
#returns the minimized NEGATIVE log-likelihood in $value, so AIC = 2k + 2*$value.
#AIC for the MLE without gender (k = 3: intercept, height slope, sigma)
2*3 + 2*mleResult$value # 132.94
## [1] 132.9438
#AIC for the female-indicator model (k = 4: adds the gender offset)
2*4 + 2*mleResultFemale$value # 127.51
## [1] 127.5149
#AIC for the male-indicator model (k = 4; the same model, reparameterized)
2*4 + 2*mleResultMale$value # 127.51
## [1] 127.5148
#The lowest AIC is favored, so the model including gender fits better. These values
#match the AIC() results for the corresponding linear models above.
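The AIC arithmetic can be wrapped in a small helper (a sketch; it assumes nll is a minimized negative log-likelihood, as in optim's $value, so AIC = 2k + 2*nll):

```r
# nll = minimized negative log-likelihood, k = number of fitted parameters
aicFromNll <- function(nll, k) 2 * k + 2 * nll
aicFromNll(nll = 63.472, k = 3) # ~132.94, the no-gender fit
aicFromNll(nll = 59.757, k = 4) # ~127.51, the gender fit (lower, so preferred)
```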