Friday, 26 February 2021

Trying to find test and training errors for ridge regression as a function of sample size

I am using the Hitters dataset in R. So far I have fit a linear regression predicting Salary from all other covariates, with training sample sizes varying from 20 to 75, and calculated the average test/training errors for each size:

data("Hitters", package = 'ISLR')
Hitters = na.omit(Hitters)
set.seed(1)
train.idx = sample(1:nrow(Hitters), 75,replace=FALSE)
train = Hitters[train.idx,-20]
test = Hitters[-train.idx,-20]

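train.errs <- rep(NA, 56)
test.errs <- rep(NA, 56)
for (ii in 20:75){
  # draw a fresh training sample of size ii; the remaining rows form the test set
  train.idx = sample(1:nrow(Hitters), ii, replace=FALSE)
  train = Hitters[train.idx,-20]
  test = Hitters[-train.idx,-20]
  train.lm <- lm(Salary ~ ., data = train)
  train.pred <- predict(train.lm, newdata = train)
  # predict() takes `newdata`; a `data` argument is ignored and yields training predictions
  test.pred <- predict(train.lm, newdata = test)
  train.errs[ii-19] <- mean((train.pred - train$Salary)^2)
  test.errs[ii-19] <- mean((test.pred - test$Salary)^2)
}
train.errs
test.errs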

Now I am trying to do the same with ridge regression, using the same samples as before and a regularization parameter of 20. I tried:

library(glmnet)
x_train = model.matrix(Salary~., train)[,-1]
x_test = model.matrix(Salary~., test)[,-1]

y_train = train$Salary
y_test = test$Salary

#cv.out = cv.glmnet(x_train,y_train, alpha = 0)
#lam = cv.out$lambda.min


errs.train <- rep(NA, 56)
for (ii in 20:75){
  ridge_mod = glmnet(x_train, y_train, alpha=0, lambda = 20)
  ridge_pred = predict(ridge_mod, newx = x_test)
  #errs.test[ii] <- mean((ridge_pred - y_test)^2)
  errs.train[ii-19] <- mean((ridge_pred - y_train)^2)
}

errs.train

But all the errors are coming out the same. How can I fix this?
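
A likely cause, judging only from the code above: `x_train`, `y_train`, `x_test` and `y_test` are built once, before the loop, so every iteration refits the same ridge model on the same data. The error line also compares predictions for `x_test` against `y_train`, which have different lengths. A minimal sketch of one way to restructure it follows, assuming the split should be redrawn for each sample size as in the linear regression loop. The names `dat`, `x_all`, `y_all`, `sizes`, `pred_train`, `pred_test` and `errs.test` are introduced here for illustration and are not from the original code.

```r
library(glmnet)

data("Hitters", package = "ISLR")
Hitters <- na.omit(Hitters)
dat <- Hitters[, -20]  # same column drop as in the question

# Build the design matrix once on the full data so every split has the same columns
x_all <- model.matrix(Salary ~ ., dat)[, -1]
y_all <- dat$Salary

set.seed(1)
sizes <- 20:75
errs.train <- rep(NA, length(sizes))
errs.test  <- rep(NA, length(sizes))

for (ii in sizes) {
  # redraw the split inside the loop; otherwise every fit is identical
  train.idx <- sample(1:nrow(dat), ii, replace = FALSE)
  x_train <- x_all[train.idx, ]
  y_train <- y_all[train.idx]
  x_test  <- x_all[-train.idx, ]
  y_test  <- y_all[-train.idx]

  # ridge fit (alpha = 0) at the fixed penalty lambda = 20
  ridge_mod  <- glmnet(x_train, y_train, alpha = 0, lambda = 20)
  pred_train <- predict(ridge_mod, newx = x_train)
  pred_test  <- predict(ridge_mod, newx = x_test)

  # compare each set of predictions with the responses of the same rows
  errs.train[ii - 19] <- mean((pred_train - y_train)^2)
  errs.test[ii - 19]  <- mean((pred_test - y_test)^2)
}

errs.train
errs.test
```

Each entry here comes from a single random split. If an average over several splits per sample size is wanted, the loop body could be repeated and the errors averaged, which is left out of this sketch.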
