
Splitting Data To Training, Testing And Validation When Making Keras Model

I'm a little confused about splitting the dataset when I'm making and evaluating Keras machine learning models. Let's say that I have a dataset of 1000 rows. features = df.iloc[:,:-1]

Solution 1:

Generally, at training time (model.fit) you work with two sets: the training set and the validation (also called tuning or development) set. You train the model on the training set and use the validation set to find the best hyperparameters. When you're done, you test the model on an unseen data set, one that was kept completely hidden from the model, unlike the training and validation sets.
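As a rough sketch of the three sets described above, one common approach is to call train_test_split twice: once to carve off the test set, then again to carve a validation set out of what remains. The synthetic arrays, the 60/20/20 proportions and the random_state values below are assumptions for illustration, not something taken from the question.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a 1000-row dataset (assumption for illustration).
features = np.arange(1000 * 3).reshape(1000, 3)
results = np.arange(1000) % 2

# First split: hold out 20% of all rows as the final test set.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    features, results, test_size=0.2, random_state=42
)

# Second split: 25% of the remaining 800 rows become the validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42
)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```

X_train and X_val would then go to model.fit, while X_test stays untouched until the final evaluation.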


Now, when you used

X_train, X_test, y_train, y_test = train_test_split(features, results, test_size=0.33)

This splits features and results into 33% of the data for testing and 67% for training. Now you can do one of two things:

  1. use X_test and y_test as the validation set in model.fit(...), or
  2. use them for the final prediction in model.predict(...).

So, if you choose to use the test set as a validation set (option 1), you would do as follows:

model.fit(x=X_train, y=y_train,
          validation_data=(X_test, y_test), ...)

In the training log, you will get the validation results along with the training score for every epoch. The validation results of the final epoch should match what you get if you later compute model.evaluate(X_test, y_test).
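To check that claim on a toy problem, here is a minimal sketch. The synthetic data, the tiny network and the simple slice-based split are all assumptions made up for the example; only the fit and evaluate calls mirror the ones above.

```python
import numpy as np
from tensorflow import keras

# Synthetic binary-classification data (assumption for illustration).
rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 4)).astype("float32")
results = (features.sum(axis=1) > 0).astype("float32")

# Simple 67% / 33% split by slicing; train_test_split would work the same way.
X_train, X_test = features[:670], features[670:]
y_train, y_test = results[:670], results[670:]

# Tiny illustrative model; the architecture is arbitrary.
model = keras.Sequential([
    keras.Input(shape=(4,)),
    keras.layers.Dense(8, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

history = model.fit(
    X_train, y_train,
    validation_data=(X_test, y_test),
    epochs=3, batch_size=50, verbose=0,
)

# Validation loss logged for the last epoch vs. a fresh evaluate() call.
last_val_loss = history.history["val_loss"][-1]
eval_loss = float(model.evaluate(X_test, y_test, batch_size=50, verbose=0))
print(last_val_loss, eval_loss)
```

The two printed numbers should agree up to floating-point noise, since both are computed on the same data with the same final weights.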


Now, if you keep the test set for the final prediction or final evaluation (option 2), then you need to create a separate validation set or use the validation_split argument as follows:

model.fit(x=X_train, y=y_train,
          validation_split=0.2, ...)

The Keras API will hold out 20% of the training data (the last 20% of X_train and y_train, taken before any shuffling) and use it for validation. And lastly, for the final evaluation of your model, you can do as follows:
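The Keras documentation states that validation_split selects the validation data from the last samples of the arrays passed to fit, before shuffling. The helper below, emulate_validation_split, is a hypothetical NumPy-only sketch of that behaviour; it is not Keras code, and the exact rounding Keras applies is an assumption here.

```python
import math
import numpy as np

def emulate_validation_split(x, y, validation_split):
    """Hypothetical helper: hold out the LAST fraction of the rows, unshuffled."""
    split_at = int(math.floor(len(x) * (1.0 - validation_split)))
    return (x[:split_at], y[:split_at]), (x[split_at:], y[split_at:])

# 670 rows, i.e. the training portion of a 1000-row dataset after a 33% test split.
X_train = np.arange(670).reshape(670, 1)
y_train = np.arange(670)

(train_x, train_y), (val_x, val_y) = emulate_validation_split(X_train, y_train, 0.2)
print(len(train_x), len(val_x))  # 536 134
```

Because the held-out rows are simply the tail of the arrays, data that is sorted (by class or by time, say) is usually shuffled before being passed to fit, which train_test_split already does by default.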

y_pred = model.predict(X_test, batch_size=50)

Now you can compare y_pred with y_test using whichever metrics are relevant to your problem.
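As one possible way to do that comparison, the sketch below assumes a binary classifier with a single sigmoid output, so that model.predict returns probabilities of shape (n, 1). The arrays are made-up stand-ins for y_test and y_pred, and the 0.5 threshold is an assumption.

```python
import numpy as np

# Made-up stand-ins for the true labels and the output of model.predict(X_test).
y_test = np.array([0, 1, 1, 0, 1])
y_pred = np.array([[0.1], [0.8], [0.4], [0.3], [0.9]])

# Turn probabilities into hard 0/1 labels with a 0.5 threshold (assumption).
y_label = (y_pred.ravel() >= 0.5).astype(int)

# Fraction of test rows where the predicted label matches the true label.
accuracy = float(np.mean(y_label == y_test))
print(accuracy)  # 0.8
```

For regression or multi-class problems the conversion step and the metric would differ (for example, argmax over softmax outputs, or mean squared error).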

Solution 2:

Generally, you'd want to use your X_train, y_train data that you have split as arguments in the fit method. So it would look something like:

history = model.fit(X_train, y_train, batch_size=50)

Passing unsplit data to the fit method and adding the validation_split argument works as well, but refer to the Keras documentation on the validation_data and validation_split arguments to make sure the data is split the way you expect.

There is a related question here: https://datascience.stackexchange.com/questions/38955/how-does-the-validation-split-parameter-of-keras-fit-function-work

Keras documentation: https://keras.rstudio.com/reference/fit.html

Solution 3:

I have read on the internet that fitting the data into model should look like this:

That means you need to fit features and labels. You already split them into x_train & y_train. So your fit should look like this:

history = model.fit(x_train, y_train, validation_split = 0.2, epochs = 10, batch_size=50)

So confusion starts when I need to evaluate the model:

score = model.evaluate(x_test, y_test, batch_size=50) --> Is this correct?

That's correct: you evaluate the model using the test features and their corresponding labels. Furthermore, if you only want the model's predictions, for example, you can use:

y_hat = model.predict(x_test)

Then you can compare y_hat with y_test, e.g. by building a confusion matrix.
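A minimal sketch of the confusion-matrix step, using scikit-learn's confusion_matrix. The label arrays are made up for illustration and assume y_hat has already been converted to class labels (raw model.predict output is typically probabilities).

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up true labels and predicted labels (assumption for illustration).
y_test = np.array([0, 1, 1, 0, 1])
y_hat = np.array([0, 1, 0, 0, 1])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_hat)
print(cm)
```

Here the off-diagonal entry in the second row counts the one positive example that was predicted as negative.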
