Problem 1: Lavine Question 2.8 (slightly modified) The book Data by Andrews and Herzberg [1985] contains lots of data sets that have been used for various purposes in statistics. One famous data set records the annual number of deaths by horse kicks in the Prussian Army from 1875-1894 for each of 14 corps. The data are included below. (The data come from Table 4.1 in the Andrews and Herzberg [1985] book.)

4 1 1 1875 0 0 0 0 0 0 0 1 1 0 0 0 1 0 3
4 1 2 1876 2 0 0 0 1 0 0 0 0 0 0 0 1 1 5
4 1 3 1877 2 0 0 0 0 0 1 1 0 0 1 0 2 0 7
4 1 4 1878 1 2 2 1 1 0 0 0 0 0 1 0 1 0 9
4 1 5 1879 0 0 0 1 1 2 2 0 1 0 0 2 1 0 10
4 1 6 1880 0 3 2 1 1 1 0 0 0 2 1 4 3 0 18
4 1 7 1881 1 0 0 2 1 0 0 1 0 1 0 0 0 0 6
4 1 8 1882 1 2 0 0 0 0 1 0 1 1 2 1 4 1 14
4 1 9 1883 0 0 1 2 0 1 2 1 0 1 0 3 0 0 11
4 1 10 1884 3 0 1 0 0 0 0 1 0 0 2 0 1 1 9
4 1 11 1885 0 0 0 0 0 0 1 0 0 2 0 1 0 1 5
4 1 12 1886 2 1 0 0 1 1 1 0 0 1 0 1 3 0 11
4 1 13 1887 1 1 2 1 0 0 3 2 1 1 0 1 2 0 15
4 1 14 1888 0 1 1 0 0 1 1 0 0 0 0 1 1 0 6
4 1 15 1889 0 0 1 1 0 1 1 0 0 1 2 2 0 2 11
4 1 16 1890 1 2 0 2 0 1 1 2 0 2 1 1 2 2 17
4 1 17 1891 0 0 0 1 1 1 0 1 1 0 3 3 1 0 12
4 1 18 1892 1 3 2 0 1 1 3 0 1 1 0 1 1 0 15
4 1 19 1893 0 1 0 0 0 1 0 2 0 0 1 3 0 0 8
4 1 20 1894 1 0 0 0 0 0 0 0 1 0 1 1 0 0 4

Let Y_ij be the number of deaths in year i, corps j, for i = 1875,…,1894 and j = 1,…,14. The Y_ij's are in columns 5-18 of the table.

(a) What are the first four columns of the table?

The first three columns are bookkeeping fields referencing the book's table, not data: 4 = chapter number, 1 = section number, and 1-20 are row (observation) numbers. The fourth column is the year.

(b) What is the last column of the table?

The last column is the total number of deaths across all 14 corps for that year.

(c) What is a good model for the data? Explain.

A good model for these data is the Poisson distribution: each observation is a count of rare, roughly independent events in a fixed interval (deaths per corps per year), and the many "0" values observed are consistent with a small Poisson rate.
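As a minimal sketch (using hypothetical counts, not the table above), the i.i.d. Poisson log-likelihood is maximized at the sample mean, which is why count data with many zeros can still pin down a small rate:

```r
# Minimal sketch: the Poisson MLE of lambda is the sample mean
y <- c(0, 1, 0, 2, 0, 0, 1)                      # hypothetical count data
loglik <- function(lambda) sum(dpois(y, lambda, log = TRUE))
grid <- seq(0.1, 3, by = 0.001)
lambdaHat <- grid[which.max(sapply(grid, loglik))]
c(lambdaHat, mean(y))                            # both ~0.571
```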

(d) Suppose you model the data as i.i.d. Poi(lambda) (independent and identically distributed). i. Plot the likelihood function for lambda.

The Poisson Distribution in R: dpois(x, lambda, log = FALSE)

kickDeaths <- read.csv(file = "horsedeaths.csv", header = T)
str(kickDeaths)
## 'data.frame':    20 obs. of  19 variables:
##  $ ï..chap: int  4 4 4 4 4 4 4 4 4 4 ...
##  $ section: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ obvs   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ year   : int  1875 1876 1877 1878 1879 1880 1881 1882 1883 1884 ...
##  $ corp1  : int  0 2 2 1 0 0 1 1 0 3 ...
##  $ corp2  : int  0 0 0 2 0 3 0 2 0 0 ...
##  $ corp3  : int  0 0 0 2 0 2 0 0 1 1 ...
##  $ corp4  : int  0 0 0 1 1 1 2 0 2 0 ...
##  $ corp5  : int  0 1 0 1 1 1 1 0 0 0 ...
##  $ corp6  : int  0 0 0 0 2 1 0 0 1 0 ...
##  $ corp7  : int  0 0 1 0 2 0 0 1 2 0 ...
##  $ corp8  : int  1 0 1 0 0 0 1 0 1 1 ...
##  $ corp9  : int  1 0 0 0 1 0 0 1 0 0 ...
##  $ corp10 : int  0 0 0 0 0 2 1 1 1 0 ...
##  $ corp11 : int  0 0 1 1 0 1 0 2 0 2 ...
##  $ corp12 : int  0 0 0 0 2 4 0 1 3 0 ...
##  $ corp13 : int  1 1 2 1 1 3 0 4 0 1 ...
##  $ corp14 : int  0 1 0 0 0 0 0 1 0 1 ...
##  $ total  : int  3 5 7 9 10 18 6 14 11 9 ...
noDeathsByCorp<-apply(kickDeaths[,5:18],2,sum) # applies sum to each of columns 5-18, giving each corps' total deaths across the 20 years
noDeathsByCorp
##  corp1  corp2  corp3  corp4  corp5  corp6  corp7  corp8  corp9 corp10 
##     16     16     12     12      8     11     17     12      7     13 
## corp11 corp12 corp13 corp14 
##     15     25     24      8
# writing a negative log likelihood function for the Poisson:
nllForPoisson<-function(lamda,data){
  nll<- -sum(dpois(x=data,lambda=lamda,log=TRUE))
  return(nll)
  }
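As a quick spot-check of the function (hard-coding the per-corps totals from above so the snippet stands alone), evaluating the negative log-likelihood at lambda = 14 should reproduce the minimized `$value` that the optimizer reports below:

```r
# Spot-check: negative log-likelihood at lambda = 14 for the per-corps totals
totals <- c(16, 16, 12, 12, 8, 11, 17, 12, 7, 13, 15, 25, 24, 8)
nll14 <- -sum(dpois(x = totals, lambda = 14, log = TRUE))
round(nll14, 2)   # ~44.03
```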

# making a model to run 
lamdaSequence<-seq(from=1,to=100,length.out=1000) # length.out = 1000 gives 1000 grid points; length(1000) evaluates to 1 and silently becomes the "by" argument
Likelihood<-vapply(X=lamdaSequence, FUN=nllForPoisson, FUN.VALUE=1, data=noDeathsByCorp)
# vapply returns a vector of the same length as X, each element of which is the result of applying FUN (here, nllForPoisson) to the corresponding element of X
# vapply(X, FUN, FUN.VALUE, ..., USE.NAMES = TRUE)

plot(x=lamdaSequence, y=Likelihood, type='l', main = "'Just Do It'", xlab = "lambda", ylab = "negative log-likelihood") # the Nike swoosh of likelihoods

ii. Find lambda-hat (i.e., the MLE of lambda) by corps and by year.

library(Rvmmin)
## Warning: package 'Rvmmin' was built under R version 3.3.3
# ?Rvmmin
# Rvmmin(par, fn, gr, lower, upper, bdmsk, control = list(), ...)

lamda<-1 # set initial parameter value for lamda. This can be arbitrary.

poissonOutput<-Rvmmin(par=lamda,fn=nllForPoisson,lower=0.1,upper=Inf,data=noDeathsByCorp)
  #par = a numeric vector of starting estimates
  #fn = a function that returns the value of the objective at the supplied set of parameters
  #lower, upper = vectors of lower and upper bounds on the parameters

poissonOutput
## $par
## [1] 14
## 
## $value
## [1] 44.03165
## 
## $counts
## function gradient 
##       17       13 
## 
## $convergence
## [1] 2
## 
## $message
## [1] "Rvmminb appears to have converged"
## 
## $bdmsk
## [1] 1
  poissonOutput$value   # minimized negative log-likelihood (y axis). "The value of the objective at the best set of parameters found."
## [1] 44.03165
  poissonOutput$par # the MLE of lambda (x axis). par = "The best set of parameters found." I.e., the estimated mean number of deaths by horse kick per corps over the 20-year period
## [1] 14

iii. What can you say about the rate of death by horse kick in the Prussian cavalry at the end of the 19th century?

The MLE gives lambda-hat = 14, the average number of deaths by horse kick per corps over the 20-year period (1875-1894). Dividing by 20 years gives 14/20 = 0.7 deaths per corps per year, which matches the overall mean of the corps-by-year counts.

mean(as.vector(as.matrix(kickDeaths[,5:18]))) # mean number of deaths per corps per year
## [1] 0.7
mean(noDeathsByCorp) # same as the MLE of lambda
## [1] 14

(e) Is there any evidence that different corps had different death rates? Explain.

Yes. The per-corps negative log-likelihood curves plotted below have minima (the per-corps MLEs of lambda) at noticeably different locations, ranging from roughly 7 to 25 deaths over the 20 years, which suggests the corps had different death rates.

lamdaSequence<-seq(from=1,to=100,length.out = 1000)

# use a for loop to plot the negative log-likelihood curves for each corps to examine differences in lambda
for(i in seq_along(noDeathsByCorp)){ # seq_along, not seq_len(noDeathsByCorp), which misuses the whole vector as a length
    Likelihood<-vapply(lamdaSequence, nllForPoisson, FUN.VALUE=1, data=noDeathsByCorp[i])
    if(i==1){
        plot(lamdaSequence, Likelihood, type='l', col=i, xlab = "lambda", ylab = "negative log-likelihood")
    }else{
        lines(lamdaSequence, Likelihood, col=i)
    }
}

# label <- paste("Corp", 1:14)
# legend("topright", legend = label, fill = 1:14, cex = 0.6)
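Beyond eyeballing the curves, a rough quantitative check is possible (a sketch, with the 20-year per-corps totals hard-coded; `totals`, `expected`, `chisq`, and `pval` are made-up names): under a shared rate, each corps' total is approximately Poisson with mean equal to the overall average, so a chi-square statistic on the totals gauges whether the spread is larger than that model predicts.

```r
# Rough check: spread of the per-corps 20-year totals around a common mean
totals <- c(16, 16, 12, 12, 8, 11, 17, 12, 7, 13, 15, 25, 24, 8)
expected <- mean(totals)                          # 14
chisq <- sum((totals - expected)^2 / expected)    # ~27.3
pval <- pchisq(chisq, df = length(totals) - 1, lower.tail = FALSE)
```

With 13 degrees of freedom this statistic is moderately large (p below 0.05), which is consistent with, though not proof of, some corps-to-corps variation in rates.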

Problem 2: This question uses the height vs. weight data from 6 Sep. The data can be found on the class website from that week.

a) Fit a linear regression to these data using the ‘lm’ command within R. The model should predict the observed weight from height and should include an intercept and an effect of height. Plot the estimated linear fit superimposed on the data and report the parameter estimates for the intercept, effect of height, and sigma.

weightData <- read.csv(file = "weightData.csv", header = TRUE)
head(weightData)
##   gender height_in weight_lbs
## 1      m        72        165
## 2      m        67        160
## 3      m        68        160
## 4      f        64        114
## 5      f        66        135
## 6      m        69        148
linearModel <- lm(weight_lbs~height_in, weightData)
plot(weightData$height_in, weightData$weight_lbs)
abline(linearModel)

linearModel
## 
## Call:
## lm(formula = weight_lbs ~ height_in, data = weightData)
## 
## Coefficients:
## (Intercept)    height_in  
##    -255.179        6.022
summary(linearModel)
## 
## Call:
## lm(formula = weight_lbs ~ height_in, data = weightData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -34.341 -11.363  -1.274  11.170  35.659 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -255.179    110.894  -2.301  0.03858 * 
## height_in      6.022      1.637   3.678  0.00278 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.89 on 13 degrees of freedom
## Multiple R-squared:  0.5099, Adjusted R-squared:  0.4722 
## F-statistic: 13.53 on 1 and 13 DF,  p-value: 0.002785
# intercept = -255.179
# slope (effect of height) = 6.022
# sigma = residual standard error = 17.89
# (1.637 is the standard error of the height coefficient, not sigma;
# lm reports sigma directly as the residual standard error)
# p-value for height = 0.00278 **
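A side note on sigma (a sketch using the values reported above): lm's residual standard error divides the residual sum of squares by n − p, while the Gaussian MLE divides by n, which is why the likelihood-based fit below returns a slightly smaller sigma than lm.

```r
# Relationship between lm's residual standard error and the MLE of sigma
n <- 15; p <- 2                     # observations and regression coefficients
sigma_lm  <- 17.89                  # residual standard error reported by lm
sigma_mle <- sigma_lm * sqrt((n - p) / n)
round(sigma_mle, 2)                 # ~16.65, close to what optim returns
```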
b) Repeat a) except fitting your model using your own likelihood function rather than R’s built-in ‘lm’. Plot the estimated linear fit superimposed on the data and report the parameter estimates for the intercept, effect of height, and sigma. Are they the same as from ‘lm’?

To fit the regression with a likelihood function, model wt ~ N(mu, sigma) with mu = beta0 + beta1 * ht, i.e., wt ~ Normal(mean = beta0 + beta1*height, sd = sigma).

parameterVector<- c(-255.179,6.022,6.34) # starting values, to be passed as parVec: first = beta0 (intercept), second = beta1 (height slope), third = sigma (an arbitrary starting guess; the first two are taken from the lm fit above)

nllNorm <- function(parVec, data) {
  beta0 <- parVec[1] # INTERCEPT
  beta1 <- parVec[2] # SLOPE
  stdev <- parVec[3] # stdev
  # note: this uses weightData from the global environment, so the data argument is effectively unused
  predictedWeight <- beta0+beta1*weightData$height_in
  nllik <- -sum(dnorm(x=weightData$weight_lbs, mean = predictedWeight, sd = stdev, log = TRUE))
  return(nllik)
}
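Since the Gaussian negative log-likelihood with sigma held fixed is just a monotone transform of the residual sum of squares, the likelihood fit should recover lm's least-squares estimates. A small self-contained illustration with hypothetical data (`x`, `y`, `rss`, `nll` are made-up names):

```r
# With sigma fixed, Gaussian NLL = const + RSS/(2*sigma^2), so both
# criteria are minimized by the same intercept and slope
x <- c(1, 2, 3, 4); y <- c(2.1, 3.9, 6.2, 7.8)    # hypothetical data
rss <- function(b) sum((y - (b[1] + b[2]*x))^2)
nll <- function(b) -sum(dnorm(y, mean = b[1] + b[2]*x, sd = 1, log = TRUE))
opt_rss <- optim(c(0, 1), rss)$par
opt_nll <- optim(c(0, 1), nll)$par
# both should be close to the least-squares solution (about 0.15, 1.94)
```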

nllNormResults <- nllNorm(parVec = parameterVector, data = weightData)

mleResult <- optim(par = parameterVector, fn = nllNorm, data = weightData) # pass data by name: the third positional argument of optim is gr (a gradient function), not the data

mleResult$par # RESULTS FROM MLE MODEL:
## [1] -255.128498    6.021578   16.646750
# intercept = -255.128498    
# slope = 6.021578
# stdev = 16.646750

plot(weightData$height_in, weightData$weight_lbs, xlab = "height", ylab = "weight", main = "Plot with self-calculated regression using MLE")
lines(weightData$height_in, (mleResult$par[2]*weightData$height_in + mleResult$par[1]))

# lines code adds my calculated regression line to an existing plot. 
c) Repeat a) and b) with the addition of an effect of gender in the linear model. Plot the model fit (i.e., with separate lines for males and females) along with the data. Report parameter estimates for the intercept, effect of height, gender, and sigma.
# Create two 0/1 dummy columns for gender: genderNoFemale codes females as 1 (male baseline), and genderNoMale codes males as 1 (female baseline). The two codings are reparameterizations of the same model; fitting both simply makes it easy to read off each gender's intercept.
weightData$genderNoFemale[weightData$gender == "m"] <- 0
weightData$genderNoFemale[weightData$gender == "f"] <- 1
weightData$genderNoMale[weightData$gender == "m"] <- 1
weightData$genderNoMale[weightData$gender == "f"] <- 0
head(weightData)
##   gender height_in weight_lbs genderNoFemale genderNoMale
## 1      m        72        165              0            1
## 2      m        67        160              0            1
## 3      m        68        160              0            1
## 4      f        64        114              1            0
## 5      f        66        135              1            0
## 6      m        69        148              0            1
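A quick aside on the dummy coding above: the two 0/1 codings fit the same model, so the intercept and dummy coefficient change but the fitted values are identical. A sketch with hypothetical data (`g`, `h`, `w`, `f1`, `f2` are made-up names):

```r
# The two 0/1 codings of a binary factor are reparameterizations:
# coefficients differ, fitted values agree exactly
g  <- c(0, 0, 1, 1)          # e.g. female = 1
h  <- c(70, 68, 64, 66)      # hypothetical heights
w  <- c(160, 155, 120, 130)  # hypothetical weights
f1 <- lm(w ~ h + g)
f2 <- lm(w ~ h + I(1 - g))   # reversed coding
max(abs(fitted(f1) - fitted(f2)))   # zero up to rounding
```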
# linear model with the female dummy (male baseline)
linearModelGenderFemale <- lm(weight_lbs~height_in+genderNoFemale, weightData)
linearModelGenderFemale
## 
## Call:
## lm(formula = weight_lbs ~ height_in + genderNoFemale, data = weightData)
## 
## Coefficients:
##    (Intercept)       height_in  genderNoFemale  
##        -0.3987          2.4572        -29.0075
summary(linearModelGenderFemale)
## 
## Call:
## lm(formula = weight_lbs ~ height_in + genderNoFemale, data = weightData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -21.146  -9.104  -1.518   7.276  23.482 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)  
## (Intercept)     -0.3987   128.6798  -0.003   0.9976  
## height_in        2.4572     1.8500   1.328   0.2088  
## genderNoFemale -29.0075    10.4595  -2.773   0.0169 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.53 on 12 degrees of freedom
## Multiple R-squared:  0.7014, Adjusted R-squared:  0.6516 
## F-statistic: 14.09 on 2 and 12 DF,  p-value: 0.0007095
# intercept = -0.399
# slope height = 2.457
# effect of female dummy = -29.01
# sigma = residual standard error = 14.53
# (1.85 is the standard error of the height coefficient, not sigma)
# linear model with the male dummy (female baseline)
linearModelGenderMale <- lm(weight_lbs~height_in+genderNoMale, weightData)
linearModelGenderMale
## 
## Call:
## lm(formula = weight_lbs ~ height_in + genderNoMale, data = weightData)
## 
## Coefficients:
##  (Intercept)     height_in  genderNoMale  
##      -29.406         2.457        29.008
summary(linearModelGenderMale)
## 
## Call:
## lm(formula = weight_lbs ~ height_in + genderNoMale, data = weightData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -21.146  -9.104  -1.518   7.276  23.482 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   -29.406    121.433  -0.242   0.8127  
## height_in       2.457      1.850   1.328   0.2088  
## genderNoMale   29.008     10.460   2.773   0.0169 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.53 on 12 degrees of freedom
## Multiple R-squared:  0.7014, Adjusted R-squared:  0.6516 
## F-statistic: 14.09 on 2 and 12 DF,  p-value: 0.0007095
# intercept = -29.41
# slope height = 2.457
# effect of male dummy = +29.01
# sigma = residual standard error = 14.53
# Plotting both lm regression lines. abline(coef=) only uses the first two
# coefficients, so build each line's intercept explicitly from the female-dummy fit
plot(weightData$height_in, weightData$weight_lbs, col = weightData$gender, xlab="Height", ylab="Weight", main="Height vs. weight with an effect of gender")
cf <- coef(linearModelGenderFemale)
abline(a = cf[1], b = cf[2], col = "red")           # male line (genderNoFemale = 0)
abline(a = cf[1] + cf[3], b = cf[2], col = "black") # female line (genderNoFemale = 1)

#b) likelihood model with gender included: weight ~ beta0 + beta1*height + beta3*gender + Normal(0, sigma) error
# remember: beta0 is the intercept and the other betas are slopes
########
# Calculating the MLE for the female-dummy coding, starting near the lm estimates above:
# intercept = -0.399
# slope height = 2.457
# effect of female dummy = -29.01
# (the sigma starting value below is an arbitrary guess)

parameterVectorFemale <- c(-0.399,2.45,7.16,-29.0) # starting values, to be passed as parVec2: first = beta0 (intercept), second = beta1 (height slope), third = sigma (arbitrary guess), fourth = beta3 (female dummy effect)
parameterVectorFemale
## [1]  -0.399   2.450   7.160 -29.000
nllNormGenderFemale <- function(parVec2, data) {
  beta0 <- parVec2[1] # INTERCEPT
  beta1 <- parVec2[2] # SLOPE FOR HEIGHT
  stdev <- parVec2[3] # STDEV
  beta3 <- parVec2[4] # SLOPE FOR GENDER (FEMALES)
  predictedWeight <- beta0+beta1*weightData$height_in+beta3*weightData$genderNoFemale
  nllik <- -sum(dnorm(x=weightData$weight_lbs, mean = predictedWeight, sd = stdev, log = T))
  return(nllik)
    }

#nllNormResultsFemale <- nllNormGenderFemale(parameterVectorFemale, weightData)

mleResultFemale <- optim(par = parameterVectorFemale, fn = nllNormGenderFemale, data = weightData) # pass data by name; optim's third positional argument is gr

mleResultFemale$par # RESULTS FROM MLE MODEL FOR FEMALES
## [1]  -1.163321   2.468135  12.996923 -28.958684
# intercept = -1.16
# slope height = 2.46
# stdev = 12.99
# slope for females = -28.95

#########
# Calculating the MLE for the male-dummy coding, starting near the lm estimates above:
# intercept = -29.41
# slope height = 2.457
# effect of male dummy = +29.01
# (the sigma starting value below is an arbitrary guess)

parameterVectorMale <- c(-29.45,2.45,7.16,-29.01) # starting values, to be passed as parVec3: first = beta0 (intercept), second = beta1 (height slope), third = sigma (arbitrary guess), fourth = beta3 (male dummy effect; the lm estimate is +29.01, and optim recovers the positive sign below)
parameterVectorMale
## [1] -29.45   2.45   7.16 -29.01
nllNormGenderMale <- function(parVec3, data) {
  beta0 <- parVec3[1] # INTERCEPT
  beta1 <- parVec3[2] # SLOPE FOR HEIGHT
  stdev <- parVec3[3] # STDEV
  beta3 <- parVec3[4] # SLOPE FOR GENDER (MALES)
  predictedWeight <- beta0+beta1*weightData$height_in+beta3*weightData$genderNoMale
  nllik <- -sum(dnorm(x=weightData$weight_lbs, mean = predictedWeight, sd = stdev, log = T))
  return(nllik)
    }

#nllNormResultsMale <- nllNormGenderMale(parameterVectorMale, weightData)

mleResultMale <- optim(par = parameterVectorMale, fn = nllNormGenderMale, data = weightData) # pass data by name; optim's third positional argument is gr

mleResultMale$par # RESULTS FROM MLE MODEL FOR MALES
## [1] -29.562301   2.459573  13.002862  28.997590
# intercept = -29.56
# slope height = 2.46
# stdev = 13.00
# effect of male dummy = +29.00

# Plotting both regression lines calculated by MLE (using the female-dummy fit)
plot(weightData$height_in, weightData$weight_lbs, col = weightData$gender, xlab="Height", ylab="Weight", main="Height vs. weight with an effect of gender (MLE fit)")
abline(a = mleResultFemale$par[1], b = mleResultFemale$par[2], col = "red")   # male line: gender dummy = 0
abline(a = mleResultFemale$par[1] + mleResultFemale$par[4], b = mleResultFemale$par[2], col = "black") # female line: the gender effect shifts the intercept
d) Report AIC values for the linear model with and without the gender term based on your own likelihood functions and the ‘lm’ fits. Which model is better supported based on AIC values?
## AIC for the lm fits

# The original call failed because the fitted object is named
# linearModelGenderFemale, not linearModelGender:
# AIC(linearModel, linearModelGenderFemale)

# RESULT
# AIC for linear model: 132.94
# AIC for linear model with gender effect: 127.51

## AIC for the MLE fits
# AIC = 2k + 2*NLL, where k is the number of fitted parameters and NLL is the
# minimized negative log-likelihood returned by optim ($value).

# AIC without gender (k = 3: intercept, height slope, sigma)
2*3 + 2*mleResult$value
## [1] 132.9438
# AIC with gender (k = 4: intercept, height slope, gender effect, sigma);
# the female- and male-dummy fits are the same model, so their AICs agree
2*4 + 2*mleResultFemale$value
## [1] 127.5149
2*4 + 2*mleResultMale$value
## [1] 127.5148
# The model including gender has the lower AIC and is better supported,
# matching the lm-based AIC comparison.
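One common way to read the size of the AIC gap (a sketch using the values reported above; `delta` and `rel_lik` are made-up names) is the relative likelihood exp(−Δ/2) of the weaker model:

```r
# Relative likelihood of the no-gender model given the AIC difference
delta <- 132.94 - 127.51       # AIC(no gender) - AIC(gender), ~5.4
rel_lik <- exp(-delta / 2)     # roughly 0.07
```

A relative likelihood well below 1 indicates the gender model is substantially better supported, consistent with the significant gender coefficient in the lm summaries.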