
Visual Walkthrough: Overfitting in Machine Learning and How to Prevent It



1. What is Overfitting?

Definition:
Overfitting occurs when a machine learning model learns not only the underlying patterns in the training data but also the noise and anomalies. As a result, it performs well on the training set but poorly on unseen data.

Visual Explanation:

Model Complexity |   Model Fit
-----------------|-----------------------------
Too Simple       |   Underfit (high bias)
Too Complex      |   Overfit (high variance)
Just Right       |   Good Fit (low bias, low variance)

[Figure: Overfitting graph]
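
To make the table above concrete, here is a minimal sketch (not part of the original walkthrough; the sample data, noise level, and polynomial degrees are illustrative assumptions). Training error keeps falling as the polynomial degree grows, while test error rises once the model starts fitting noise.

Code Example (scikit-learn):

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Noisy samples from a smooth underlying function
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 1, 60)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.2, size=60)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Degree 1 tends to underfit, degree 4 fits well, degree 15 overfits
for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree={degree}: train MSE={train_err:.3f}, test MSE={test_err:.3f}")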


2. Symptoms of Overfitting

  • High accuracy on training data
  • Poor accuracy on validation/test data
  • Large gap between training and validation performance (a quick check is sketched below)
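
A quick way to quantify that gap is sketched below (not part of the original article; it assumes an already fitted classifier named model and held-out splits X_train, y_train, X_val, y_val).

Code Example (scikit-learn):

from sklearn.metrics import accuracy_score

# Assumes a fitted classifier `model` and a held-out validation split
train_acc = accuracy_score(y_train, model.predict(X_train))
val_acc = accuracy_score(y_val, model.predict(X_val))
print(f"train={train_acc:.2f}  val={val_acc:.2f}  gap={train_acc - val_acc:.2f}")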

3. Industry Best Practices to Prevent Overfitting

A. Cross-Validation

What:
Split the data into k folds; train on k-1 folds and validate on the held-out fold, rotating so that every fold serves as the validation set once.

How:

  • Use k-fold cross-validation (commonly k=5 or 10)
  • Average performance across folds

Code Example (scikit-learn):

from sklearn.model_selection import cross_val_score

# 5-fold cross-validation; model, X, y are defined elsewhere
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())  # average accuracy across folds

B. Regularization

What:
Add a penalty for large weights to the loss function to discourage complexity.

Types:

  • L1 (Lasso): Encourages sparsity (many zero weights)
  • L2 (Ridge): Shrinks all weights (no zeros, but smaller values)

Code Example (scikit-learn):

from sklearn.linear_model import Ridge

# alpha sets the strength of the L2 penalty (larger values shrink weights more)
model = Ridge(alpha=1.0)

C. Early Stopping

What:
Monitor validation performance during training. Stop when it starts to degrade.

How:

  • Track validation loss after each epoch
  • Stop training when validation loss stops improving for several consecutive epochs (the patience)

Code Example (Keras):

from keras.callbacks import EarlyStopping

# Halt training if validation loss has not improved for 5 consecutive epochs
early_stop = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
model.fit(X_train, y_train, validation_data=(X_val, y_val), callbacks=[early_stop])

D. Pruning (for Tree Models)

What:
Limit the depth or number of leaves in decision trees/ensembles.

How:

  • Set the max_depth, min_samples_leaf, or max_leaf_nodes parameters

Code Example (scikit-learn):

from sklearn.tree import DecisionTreeClassifier

# Cap tree depth and require at least 10 samples per leaf to limit complexity
model = DecisionTreeClassifier(max_depth=5, min_samples_leaf=10)

E. Dropout (for Neural Networks)

What:
Randomly drop neurons during training to prevent co-adaptation.

How:

  • Add Dropout layers with a drop probability (e.g., 0.5)

Code Example (Keras):

from keras.layers import Dropout

# Randomly zero out 50% of the previous layer's outputs on each training step
model.add(Dropout(0.5))

F. Data Augmentation

What:
Expand the training dataset with transformed copies (for images: flips, rotations, crops).

How:

  • Use libraries (e.g., ImageDataGenerator in Keras)

Code Example (Keras):

from keras.preprocessing.image import ImageDataGenerator

# Produce randomly rotated (up to 20 degrees) and horizontally flipped copies
datagen = ImageDataGenerator(rotation_range=20, horizontal_flip=True)

G. Reduce Model Complexity

What:
Choose simpler models or reduce the number of parameters/features.

How:

  • Feature selection
  • Principal Component Analysis (PCA)
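
A minimal sketch of both approaches (not in the original; the feature count and variance threshold are illustrative, and X, y are assumed to be defined elsewhere):

Code Example (scikit-learn):

from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA

# Keep only the 10 features most associated with the target
X_selected = SelectKBest(f_classif, k=10).fit_transform(X, y)

# Or project onto the principal components that explain 95% of the variance
X_reduced = PCA(n_components=0.95, svd_solver='full').fit_transform(X)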

4. Advanced Tips

  • Ensemble Methods: Combine predictions from multiple models to reduce variance.
  • Hyperparameter Tuning: Use grid search or Bayesian optimization to find optimal settings (a minimal sketch follows this list).
  • Monitor Learning Curves: Plot training vs. validation loss to detect overfitting early.
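
As one example, a minimal grid-search sketch (not from the original; the estimator and alpha grid are illustrative assumptions, with X and y defined elsewhere):

Code Example (scikit-learn):

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge

# Cross-validated search over a small grid of regularization strengths
grid = GridSearchCV(Ridge(), param_grid={'alpha': [0.1, 1.0, 10.0]}, cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)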

5. Summary Table

Method            | Use Case            | Key Parameter/Tool
------------------|---------------------|--------------------------------
Cross-Validation  | All models          | cv= in scikit-learn
Regularization    | Linear/NN models    | alpha, Dropout
Early Stopping    | Iterative training  | EarlyStopping callback
Pruning           | Trees/Ensembles     | max_depth, min_samples_leaf
Data Augmentation | Image/Text data     | ImageDataGenerator
Reduce Complexity | All models          | Feature selection/PCA

6. Visual Checklist

  • Use cross-validation
  • Apply regularization
  • Monitor validation performance
  • Limit model complexity
  • Augment your dataset
  • Stop training early if needed


By following these best practices, you can greatly reduce the risk of overfitting and build robust, generalizable machine learning models.