Aggregate models with caretEnsemble

Introduction

Suppose you have a dataset and you have narrowed the possible machine learning models down to two or three, but you still can't choose: will the interpretability benefit of my CART cost me too much compared to a random forest or some boosting?

Well, you don't necessarily have to choose: just aggregate the models you have to make a better one. Typically, if you have models that don't use the same features of the dataset, or give very different answers but are still all good in terms of a pre-selected metric (say, RMSE for regression, or area under the ROC curve for classification), ensembling them could be a good idea.

If you do your regressions with the caret package, which I recommend, you should take a look at the caretEnsemble package. After introducing the dataset we'll work with, I'll talk a little more about ensembles.

Tidying the data

Let's load the cu.summary dataset (shipped with rpart), the caret and caretEnsemble libraries, plus rpart and ranger for the models.
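The post never shows the library calls themselves; here is a minimal setup sketch, assuming the tidyverse helpers that the pipelines below rely on:

library(rpart)          # ships the cu.summary dataset and the CART method
library(ranger)         # fast random forests
library(caret)          # training framework
library(caretEnsemble)  # model ensembling
library(magrittr)       # %<>% and %T>% pipes
library(stringr)        # str_split
library(dplyr)          # filter, select, mutate
library(tidyr)          # drop_na
library(tibble)         # as_tibble, add_row
library(purrr)          # map, map_dfr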

data("cu.summary")
df <- cu.summary
rm(cu.summary)
head(df)
                Price Country Reliability Mileage  Type
Acura Integra 4 11950   Japan Much better      NA Small
Dodge Colt 4     6851   Japan          NA      NA Small
Dodge Omni 4     6995     USA  Much worse      NA Small
Eagle Summit 4   8895     USA      better      33 Small
Ford Escort 4    7402     USA       worse      33 Small
Ford Festiva 4   6319   Korea      better      37 Small

This dataset looks a lot like the classic mtcars dataset, but it has more rows. The first thing to notice here is that the row names contain information: the brand and model of each car. Usually, information should not live in row.names, so let's tidy this dataset a little:

df$name <- row.names(df)  # move the row names into a proper column
row.names(df) <- NULL
df %<>%
  {str_split(.$name, " ", n = 2)} %>%  # split "Acura Integra 4" into brand / model
  {do.call(rbind, .)} %>%
  as.data.frame %>%
  set_names(c("brand", "car")) %>%
  cbind(df) %>%                        # glue the two new columns onto the original data
  set_names(tolower(names(.))) %>%
  as_tibble %>%
  drop_na %>%                          # keep complete cases only
  select(-name) %T>%
  print
  brand   car       price country   reliability mileage type 
  Eagle   Summit 4   8895 USA       better           33 Small
  Ford    Escort 4   7402 USA       worse            33 Small
  Ford    Festiva 4  6319 Korea     better           37 Small
  Honda   Civic 4    6635 Japan/USA Much better      32 Small
  Mazda   Protege 4  6599 Japan     Much better      32 Small
  Mercury Tracer 4   8672 Mexico    better           26 Small

OK, that's better: we now have two more regressors to play with.

Partitioning the data

We need to partition the dataset into a training and a testing set to be able to assess the performance of our models. While this is theoretically not always needed, it's good practice to keep some rows of the original data untouched during the analysis.

We'll use createDataPartition from caret, which does exactly that: it sets 20% of the data apart. The function needs to know the response variable (here, the price) so it can condition the split on the response levels (if the response is categorical) or on its quantiles (if it's continuous, which is the case here), so that the training and testing sets look alike.

inTrain <- createDataPartition(y = df$price, p = .80, list = FALSE)
df.train <- df[ inTrain,]
df.test <-  df[-inTrain,]
df.pred <- df.test %>% select(price)
rm(inTrain)
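As a quick check that the quantile-conditioned split worked, we can compare the response in both sets (a sketch not in the original post; exact numbers depend on the random split):

summary(df.train$price)  # the two distributions should look alike
summary(df.test$price)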

Fitting the different models

First, let's declare the formula we'll use for the regression. Then we'll declare the controls we want for the caret fits. For the sake of simplicity, I just used 10 bootstrap resamples.

Here, we also have to specify the savePredictions and index parameters, so that all models are fitted on the same resamples and their predictions can be compared later.

formula <- price ~ brand + country + reliability + mileage + type

controls <- trainControl(
  method = "boot",  # fit the models on bootstrap samples of the training set
  number = 10,      # number of bootstrap samples
  savePredictions = "final",
  index = createResample(df.train$price, 10),  # shared resamples across models
  verboseIter = TRUE
)

Then we'll use caretEnsemble::caretList to fit several models at once. Here, I chose a CART (rpart) and a random forest (ranger), but you could also try glmStepAIC or anything else. The procedure takes around 30 seconds to run.

models <-
  caretEnsemble::caretList(
    formula,
    data = df.train,
    trControl = controls,
    tuneList = list(
      rpart = caretModelSpec(method = "rpart"),  # a single CART
      ranger = caretModelSpec(
        method = "ranger",                       # a random forest
        tuneGrid = expand.grid(
          mtry = 1:20,             # number of predictors sampled at each split
          splitrule = "variance",
          min.node.size = 1
        ),
        verbose = TRUE,
        importance = 'impurity'
      )
    )
  )

Aggregating the models

Now that the models have run, we can look at the correlation between them:

modelCor(resamples(models))
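The post does not show this output. To inspect the resampling results further, caret's methods for resamples objects can help; a sketch:

results <- resamples(models)
summary(results)  # RMSE, R-squared and MAE of each model across the 10 resamples
xyplot(results)   # scatter one model's resampled performance against the other's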

OK, the models are somewhat correlated, which makes sense since one is a CART and the other is a random forest built on CARTs. Maybe aggregating them will still help make a better prediction? Let's use the main function of the caretEnsemble package, which computes a linear combination of the models, weighted by their quality of prediction:

merge.glm <- caretEnsemble(
  models,
  trControl = trainControl(
    method = "boot",  # the stacking model is itself fitted on bootstrap resamples
    number = 10,
    verboseIter = TRUE
  ))

Let's print the resulting ensemble:

merge.glm
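Printing the object gives the stack's performance. If you also want the variable importance and the weights given to each base model, caretEnsemble provides dedicated methods (a sketch; output not shown):

summary(merge.glm)  # which models were ensembled, their weights, and the resulting RMSE
varImp(merge.glm)   # variable importance, combined across the weighted base models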

Now let's compare the RMSEs of the individual models with that of the merged one:

merge.glm$models %>%
  map("results") %>%
  map_dfr(~ filter(.x, RMSE == min(RMSE)) %>% select(RMSE)) %>%  # best RMSE of each base model
  mutate(name = names(merge.glm$models)) %>%
  add_row(name = "merged", RMSE = merge.glm$ens_model$results$RMSE) %>%
  select(name, RMSE)

OK, we lost a little compared to the ranger model alone. But that's because we didn't choose our base models wisely enough. Maybe if we try something else, it'll work? [Not finished…]

If we had chosen better inputs, less correlated but similarly accurate, the result would have been much better.
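One last step the post sets up but never runs: scoring the ensemble on the held-out test set. A minimal sketch, assuming the objects defined earlier (depending on the caretEnsemble version, predict may return a plain vector or a one-column table, hence the unlist):

df.pred$merged <- unlist(predict(merge.glm, newdata = df.test))
RMSE(df.pred$merged, df.pred$price)  # caret's RMSE helper, on data the models never saw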
