In this article, we’ll look at overfitting and at some of the ways to keep your model from overfitting. Machine learning models have one overriding aim: to generalize well.
The usefulness of both the model and the program built around it depends strongly on how well the model generalizes. A model serves its purpose only if it generalizes well. Building on that idea, terms such as overfitting and underfitting describe flaws that can undermine the model’s performance.
Overfitting – Defining and Visualizing
After training for a certain number of epochs, the accuracy of our model on the validation data peaks and then either stagnates or starts to decrease, even while accuracy on the training data keeps improving.
Instead of learning generalized patterns from the training data, the model tries to fit the data itself. As a result, it learns fluctuations that are specific to the training data, including noise and outliers.
Hence for regression, instead of a smooth curve through the center of the data that minimizes the error like this:
We start getting a curve like this:
Similarly, for classification problems, overfitting occurs if we are not careful and train our model for too long in pursuit of a better result. Compare the output of an overfit classification model:
to a non-overfit one:
In other words, our model will overfit the training data if we train it too much.
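The regression case described above can be reproduced with a small sketch. Everything here is an assumption made for illustration: the synthetic sine data, the noise level, and the two polynomial degrees are not from a real dataset. The idea is only to compare training and validation error for a simple model and a much more flexible one.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # Hypothetical data: a smooth signal plus Gaussian noise.
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(3.0 * x) + rng.normal(0.0, 0.2, size=n)
    return x, y

def mse(coeffs, x, y):
    # Mean squared error of a polynomial (given by its coefficients) on (x, y).
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

x_train, y_train = make_data(15)    # small training set
x_val, y_val = make_data(200)       # unseen validation data

errors = {}
for degree in (3, 10):
    coeffs = np.polyfit(x_train, y_train, degree)
    errors[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_val, y_val))
    print(f"degree {degree}: train MSE {errors[degree][0]:.4f}, "
          f"validation MSE {errors[degree][1]:.4f}")
```

The higher-degree fit always has the lower training error, because it can bend to pass close to every training point. On data like this, its validation error is typically worse, which is the gap that signals overfitting.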
How to Avoid Overfitting in Machine Learning Models?
Although high accuracy on the training set can usually be achieved, what we really want is to build models that generalize well to a testing set (that is, data they have not seen before).
An overfit model bases its predictions on noise in the training data. It can do unusually well on its training data but very poorly on fresh, unseen data.
Therefore, it is important to learn how to handle overfitting.
1. Collect/Use more data
More data makes it easier for algorithms to detect the underlying signal and so reduce errors. As the user feeds more training data into the model, it can no longer fit every sample individually and is forced to generalize to achieve good results.
This approach can be costly, however, and users should also make sure that the additional data is relevant and clean.
2. Data augmentation
We have covered data augmentation before. Check that article out for an amazing breakdown along with a real Kaggle dataset example.
Data augmentation makes each data sample appear slightly different every time the algorithm processes it. Each sample then looks unique to the model, which stops the model from memorizing the characteristics of individual samples.
Adding noise to the input and output data is another option. Adding noise to the input makes the model more robust without compromising the accuracy and privacy of the information, while adding noise to the output makes the data more varied. This must, however, be done in moderation so that the noise does not drown out the signal.
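As a rough illustration of these two ideas, here is a minimal NumPy sketch that randomly flips images and adds a small amount of input noise. The function name `augment`, the batch layout `(batch, height, width)`, the value range, and the noise level are all assumptions chosen for the example, not part of any particular library.

```python
import numpy as np

def augment(images, rng, noise_std=0.05):
    """Return a randomly flipped, slightly noisy copy of a batch of images.

    images: float array of shape (batch, height, width) with values in [0, 1].
    """
    out = images.copy()                      # leave the original batch untouched
    flip = rng.random(len(out)) < 0.5        # choose about half of the images
    out[flip] = out[flip, :, ::-1]           # mirror the chosen images left to right
    out += rng.normal(0.0, noise_std, size=out.shape)  # small Gaussian input noise
    return np.clip(out, 0.0, 1.0)            # keep pixel values in the valid range

rng = np.random.default_rng(0)
batch = rng.random((4, 8, 8))                # stand-in for real image data
first_pass = augment(batch, rng)
second_pass = augment(batch, rng)            # same batch, different variation
```

Calling `augment` once per epoch means the model never sees exactly the same pixels twice, while the labels stay the same. Keeping `noise_std` small reflects the point about moderation.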
3. Simplify the data/Remove features
Even though this method may lead to some loss of information, we can reduce the complexity of the data and of the model. Removing uninformative features, pruning, reducing the number of parameters in a neural network, and using dropout are some of the techniques that can be applied.
4. Ensemble Learning
To understand this method better, you can check out this article on ensemble learning.
“A group producing a single effect.” (Definition of ensemble, Merriam-Webster dictionary)
Ensemble learning (EL) is a machine learning technique that works by combining the predictions of two or more different models.
The most common ensembling strategies are boosting and bagging.
Boosting: builds a more complex overall model out of simple base models. It trains a large number of weak learners in sequence, such that each learner learns from the errors of the learner before it in the series.
Bagging: is the other main ensemble method and works in the opposite direction. It trains a large number of strong learners in parallel, each on a random resample of the training data, and then combines their predictions to smooth them out.
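To make bagging more concrete, here is a minimal sketch that uses polynomial regressors as the base learners. The helper names `fit_bagged_polynomials` and `predict_bagged`, the synthetic data, and the choice of 25 models of degree 6 are assumptions for illustration only. Boosting is not shown here.

```python
import numpy as np

def fit_bagged_polynomials(x, y, rng, n_models=25, degree=6):
    # Train each model on its own bootstrap sample (drawn with replacement).
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, len(x), size=len(x))
        models.append(np.polyfit(x[idx], y[idx], degree))
    return models

def predict_bagged(models, x_new):
    # Combine the models by averaging their individual predictions.
    predictions = np.stack([np.polyval(coeffs, x_new) for coeffs in models])
    return predictions.mean(axis=0)

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=40)                  # hypothetical inputs
y = np.sin(3.0 * x) + rng.normal(0.0, 0.2, size=40)  # noisy targets

models = fit_bagged_polynomials(x, y, rng)
x_new = np.linspace(-0.9, 0.9, 5)
y_hat = predict_bagged(models, x_new)
```

Each individual polynomial is flexible enough to chase the noise in its own resample. Averaging many of them tends to cancel out those individual quirks, which is why bagging is usually paired with high-variance learners.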
5. Cross-validation
Cross-validation (CV) is a powerful technique to avoid overfitting.
In regular k-fold cross-validation, we partition the data into k subsets, referred to as folds. We then train the algorithm iteratively on k-1 folds while using the remaining fold as the test set (called the “holdout fold”), so that each fold serves as the holdout exactly once.
This lets us tune hyperparameters using only the original training set, and keep our test set as a truly unseen dataset for choosing our final model.
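The k-fold procedure can be sketched in a few lines of NumPy. The helpers `k_fold_splits` and `cross_val_mse` are hypothetical names written for this example, and the polynomial model and synthetic data are stand-ins for whatever model and dataset you are actually tuning.

```python
import numpy as np

def k_fold_splits(n_samples, k, rng):
    # Shuffle the sample indices once, then cut them into k folds.
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        holdout_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, holdout_idx

def cross_val_mse(x, y, degree, k=5, seed=0):
    # Average holdout error of a polynomial of the given degree over k folds.
    rng = np.random.default_rng(seed)
    scores = []
    for train_idx, holdout_idx in k_fold_splits(len(x), k, rng):
        coeffs = np.polyfit(x[train_idx], y[train_idx], degree)
        residuals = np.polyval(coeffs, x[holdout_idx]) - y[holdout_idx]
        scores.append(np.mean(residuals ** 2))
    return float(np.mean(scores))

data_rng = np.random.default_rng(1)
x = data_rng.uniform(-1.0, 1.0, size=50)                  # hypothetical training set
y = np.sin(3.0 * x) + data_rng.normal(0.0, 0.2, size=50)

# Tune the hyperparameter (here, the degree) without touching the test set.
cv_scores = {degree: cross_val_mse(x, y, degree) for degree in (1, 3, 9)}
best_degree = min(cv_scores, key=cv_scores.get)
```

Using the same `seed` for every candidate means all degrees are compared on identical folds, so differences in score come from the model and not from the split.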
6. Early stopping
This method is kind of intuitive. The problem we have is that our model trains too long, and overfits. What’s the solution?
Don’t train too long!
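We stop the training phase before the learner passes the point at which it starts to overfit, which in practice means the point where validation performance stops improving. Simple, right?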
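One common way to decide when to stop is a "patience" rule: stop once the validation loss has failed to improve for a set number of epochs in a row. The sketch below assumes that rule. The function name `early_stopping_epoch` and the list of losses are made up for illustration; in real training, each loss would be computed at the end of an epoch instead of being known in advance.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (best_epoch, best_loss, stopped_at) for a sequence of validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    stopped_at = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        stopped_at = epoch
        if loss < best_loss:
            # New best: remember it and reset the patience counter.
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # no improvement for `patience` epochs in a row
    return best_epoch, best_loss, stopped_at

# Hypothetical validation losses: improving at first, then getting worse.
losses = [1.0, 0.8, 0.6, 0.65, 0.7, 0.75, 0.5]
best_epoch, best_loss, stopped_at = early_stopping_epoch(losses, patience=2)
print(best_epoch, best_loss, stopped_at)  # prints: 2 0.6 4
```

In practice you would also save the model weights at `best_epoch` and restore them after stopping. Note the trade-off in this example: training halts at epoch 4, so the later dip to 0.5 is never seen. A larger `patience` would catch it at the cost of more training time.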
7. Regularization
Regularization is a whole class of related methods that force the model to simplify itself with as little loss of information as possible.
The types of regularization are:
L1: A type of regularization that penalizes weights in proportion to the sum of the absolute values of the weights.
L2: A type of regularization that penalizes weights in proportion to the sum of the squares of the weights.
Dropout: This one acts as a layer and is for neural networks. At every iteration, it randomly selects certain nodes and removes them along with all of their incoming and outgoing connections, as seen below.
Each iteration therefore uses a different set of nodes, which results in a different set of outputs. In machine learning, dropout can also be thought of as an ensemble technique.
Because they capture more randomness, ensemble models typically do better than a single model. Likewise, a network with dropout often works better than the same network without it.
The probability with which a node is dropped is the hyperparameter of the dropout layer.
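The three regularizers above can be written down compactly in NumPy. This is a sketch under assumptions: the function names and the strength parameter `lam` are invented for the example, and the dropout shown is the common "inverted" variant, which rescales the surviving activations so their expected value is unchanged. That rescaling is a detail not covered in the description above.

```python
import numpy as np

def l1_penalty(weights, lam):
    # L1: proportional to the sum of the absolute values of the weights.
    return lam * np.sum(np.abs(weights))

def l2_penalty(weights, lam):
    # L2: proportional to the sum of the squares of the weights.
    return lam * np.sum(weights ** 2)

def dropout(activations, drop_prob, rng, training=True):
    # Randomly zero out nodes during training; do nothing at inference time.
    if not training or drop_prob == 0.0:
        return activations
    keep_prob = 1.0 - drop_prob
    mask = rng.random(activations.shape) < keep_prob   # True for nodes that are kept
    return activations * mask / keep_prob              # inverted-dropout rescaling

weights = np.array([1.0, -2.0, 3.0])
print(l1_penalty(weights, lam=0.1))   # 0.1 * (1 + 2 + 3), approximately 0.6
print(l2_penalty(weights, lam=0.1))   # 0.1 * (1 + 4 + 9), approximately 1.4

rng = np.random.default_rng(0)
hidden = np.ones(1000)                # stand-in for a layer's activations
dropped = dropout(hidden, drop_prob=0.5, rng=rng)
```

A penalty is added to the training loss, so larger weights cost more and the optimizer is pushed toward simpler solutions. With `drop_prob=0.5`, roughly half of the entries of `dropped` are zero and the rest are scaled up to 2.0.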
Well, that turned out pretty lengthy. Hope you understood. There are hundreds of other great articles upcoming, so be sure to bookmark the website to keep updated.