Aggregate models with caretEnsemble

Introduction

Suppose you have a dataset and you have narrowed the possible machine learning models down to two or three, but you still can't choose: will the interpretability benefit of my CART cost me too much compared to a random forest or some boosting?

Well, you don't necessarily have to choose: just aggregate the models you have to make a better one. Typically, if you have models that don't use the same features of the dataset, or give very different answers but are still all good in terms of a pre-selected metric (say, RMSE for regression, or area under the ROC curve for classification), ensembling them could be a good idea.

If you do your regressions with the caret package, which I recommend, you should take a look at the caretEnsemble package. After introducing the dataset we'll work with, I'll talk a little more about ensembles.

Tidying the data

Let's load the cu.summary dataset (shipped with rpart), the caret and caretEnsemble libraries, plus rpart and ranger for the models.
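The post never shows the library calls themselves; here is a minimal setup sketch, assuming the tidyverse helpers that the pipelines below rely on:

library(rpart)          # ships the cu.summary dataset and the CART method
library(ranger)         # fast random forests
library(caret)          # training framework
library(caretEnsemble)  # model ensembling
library(magrittr)       # %<>% and %T>% pipes
library(stringr)        # str_split
library(dplyr)          # filter, select, mutate
library(tidyr)          # drop_na
library(tibble)         # as_tibble, add_row
library(purrr)          # map, map_dfr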

data("cu.summary")
df <- cu.summary
rm(cu.summary)
head(df)
                Price Country Reliability Mileage  Type
Acura Integra 4 11950   Japan Much better      NA Small
Dodge Colt 4     6851   Japan          NA      NA Small
Dodge Omni 4     6995     USA  Much worse      NA Small
Eagle Summit 4   8895     USA      better      33 Small
Ford Escort 4    7402     USA       worse      33 Small
Ford Festiva 4   6319   Korea      better      37 Small

This dataset looks a lot like the classic mtcars dataset, but it has more rows. The first thing to notice here is that the row names contain information: the brand and model of each car. Usually, information should not live in row.names, so let's tidy this dataset a little:

df$name <- row.names(df)  # move the row names into a proper column
row.names(df) <- NULL
df %<>%
  {str_split(.$name, " ", n = 2)} %>%  # split "Acura Integra 4" into brand / model
  {do.call(rbind, .)} %>%
  as.data.frame %>%
  set_names(c("brand", "car")) %>%
  cbind(df) %>%                        # glue the two new columns onto the original data
  set_names(tolower(names(.))) %>%
  as_tibble %>%
  drop_na %>%                          # keep complete cases only
  select(-name) %T>%
  print
  brand   car       price country   reliability mileage type 
  Eagle   Summit 4   8895 USA       better           33 Small
  Ford    Escort 4   7402 USA       worse            33 Small
  Ford    Festiva 4  6319 Korea     better           37 Small
  Honda   Civic 4    6635 Japan/USA Much better      32 Small
  Mazda   Protege 4  6599 Japan     Much better      32 Small
  Mercury Tracer 4   8672 Mexico    better           26 Small

OK, that's better: we now have two more regressors to play with.

Partitioning the data

We need to partition the dataset into a training and a testing set to be able to assess the performance of our models. While this is theoretically not always needed, it's good practice to keep some rows of the original data untouched during the analysis.

We'll use createDataPartition from caret, which does exactly that: it sets 20% of the data apart. The function needs to know the response variable (here, the price) so it can condition the split on the response levels (if the response is categorical) or on its quantiles (if it's continuous, which is the case here), so that the training and testing sets look alike.

inTrain <- createDataPartition(y = df$price, p = .80, list = FALSE)
df.train <- df[ inTrain,]
df.test <-  df[-inTrain,]
df.pred <- df.test %>% select(price)
rm(inTrain)
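As a quick check that the quantile-conditioned split worked, we can compare the response in both sets (a sketch not in the original post; exact numbers depend on the random split):

summary(df.train$price)  # the two distributions should look alike
summary(df.test$price)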

Fitting the different models

First, let's declare the formula we'll use for the regression. Then we'll declare the controls we want for the caret fits. For the sake of simplicity, I just used 10 bootstrap resamples.

Here, we also have to specify the savePredictions and index parameters, so that all models are fitted on the same resamples and their predictions can be compared later.

formula <- price ~ brand + country + reliability + mileage + type

controls <- trainControl(
  method = "boot",  # fit the models on bootstrap samples of the training set
  number = 10,      # number of bootstrap samples
  savePredictions = "final",
  index = createResample(df.train$price, 10),  # shared resamples across models
  verboseIter = TRUE
)

Then we'll use caretEnsemble::caretList to fit several models at once. Here, I chose a CART (rpart) and a random forest (ranger), but you could also try glmStepAIC or anything else. The procedure takes around 30 seconds to run.

models <-
  caretEnsemble::caretList(
    formula,
    data = df.train,
    trControl = controls,
    tuneList = list(
      rpart = caretModelSpec(method = "rpart"),  # a single CART
      ranger = caretModelSpec(
        method = "ranger",                       # a random forest
        tuneGrid = expand.grid(
          mtry = 1:20,             # number of predictors sampled at each split
          splitrule = "variance",
          min.node.size = 1
        ),
        verbose = TRUE,
        importance = 'impurity'
      )
    )
  )

Aggregating the models

Now that the models have run, we can look at the correlation between them:

modelCor(resamples(models))
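The post does not show this output. To inspect the resampling results further, caret's methods for resamples objects can help; a sketch:

results <- resamples(models)
summary(results)  # RMSE, R-squared and MAE of each model across the 10 resamples
xyplot(results)   # scatter one model's resampled performance against the other's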

OK, the models are somewhat correlated, which makes sense since one is a CART and the other is a random forest built on CARTs. Maybe aggregating them will still help make a better prediction? Let's use the main function of the caretEnsemble package, which computes a linear combination of the models, weighted by their quality of prediction:

merge.glm <- caretEnsemble(
  models,
  trControl = trainControl(
    method = "boot",  # the stacking model is itself fitted on bootstrap resamples
    number = 10,
    verboseIter = TRUE
  ))

Let's print the resulting ensemble:

merge.glm
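Printing the object gives the stack's performance. If you also want the variable importance and the weights given to each base model, caretEnsemble provides dedicated methods (a sketch; output not shown):

summary(merge.glm)  # which models were ensembled, their weights, and the resulting RMSE
varImp(merge.glm)   # variable importance, combined across the weighted base models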

Now let's compare the RMSEs of the individual models with that of the merged one:

merge.glm$models %>%
  map("results") %>%
  map_dfr(~ filter(.x, RMSE == min(RMSE)) %>% select(RMSE)) %>%  # best RMSE of each base model
  mutate(name = names(merge.glm$models)) %>%
  add_row(name = "merged", RMSE = merge.glm$ens_model$results$RMSE) %>%
  select(name, RMSE)

OK, we lost a little compared to the ranger model alone. But that's because we didn't choose our base models wisely enough. Maybe if we try something else, it'll work? [Not finished…]

If we had chosen better inputs, less correlated but similarly accurate, the result would have been much better.
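One last step the post sets up but never runs: scoring the ensemble on the held-out test set. A minimal sketch, assuming the objects defined earlier (depending on the caretEnsemble version, predict may return a plain vector or a one-column table, hence the unlist):

df.pred$merged <- unlist(predict(merge.glm, newdata = df.test))
RMSE(df.pred$merged, df.pred$price)  # caret's RMSE helper, on data the models never saw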
