Addressing Data Scarcity In Your AI Journey

By Nilesh Raghuvanshi

All models are wrong but some are useful — Background Photo by Jan Antonin Kolar on Unsplash

We know that Machine Learning (ML) and especially Deep Learning (DL) require BIG data to work well. While companies like Amazon and Google have access to almost unlimited data, most others do not have such access.

When you don’t have access to BIG data, building a Machine Learning model that generalizes well is difficult. Complex ML models based on Neural Networks tend to memorize small training data without learning the correct underlying patterns, which leads to over-fitting. Over-fitting means that the model performs well during the training phase but poorly out in the real world.
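To make the gap between training and real-world performance concrete, here is a minimal sketch. It swaps the neural network for a 1-nearest-neighbour classifier, which memorizes its training set outright, and it runs on hypothetical synthetic data with 30% label noise. The function names, data sizes, and noise level are all assumptions chosen for illustration.

```python
import random

def make_data(n, rng, noise=0.3):
    """Points in [0, 1]; the true label is 1 when x > 0.5, flipped with probability `noise`."""
    data = []
    for _ in range(n):
        x = rng.random()
        y = int(x > 0.5)
        if rng.random() < noise:
            y = 1 - y  # noisy label: there is no pattern here worth learning
        data.append((x, y))
    return data

def predict_1nn(train, x):
    """Return the label of the closest training point (pure memorization)."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]

def accuracy(train, data):
    """Fraction of points in `data` whose label the memorizing model gets right."""
    hits = sum(predict_1nn(train, x) == y for x, y in data)
    return hits / len(data)

rng = random.Random(0)
train = make_data(20, rng)   # small training set (assumed size)
test = make_data(200, rng)   # stands in for "the real world"

train_acc = accuracy(train, train)  # each training point finds itself as nearest neighbour
test_acc = accuracy(train, test)    # unseen points expose the memorized noise
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

The training accuracy is perfect because every training point is its own nearest neighbour, while the score on unseen points is typically much lower, which is the over-fitting signature described above.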

To address various data scarcity situations, I have provided a detailed overview here. The article covers a wide range of data situations, from having no data or small data to having costly, rare, or imbalanced data, and then surveys various techniques to address these issues. Below is a summary of the ideas that the article covers:

If you have no data, check whether an available open-source dataset could help solve the problem. If the data is difficult to obtain because of privacy or confidentiality concerns, you can use methods like Incremental Learning or Federated Learning in combination with secure, privacy-preserving techniques such as Secure Aggregation, Multiparty Computation, Differential Privacy, and Homomorphic Encryption.
If you have small data to work with, you can augment it by adding variations to build a big enough dataset. You may also use a technique like Transfer Learning, where we transfer knowledge from one domain to another. Finally, using synthetic data may also aid in this situation.
For rare data, such as fault data from factory machines, a Few-Shot Learning method could help. It derives inspiration from our ability to generalize from very few experiences.
If you have the data to solve your problem but labeling it is an expensive proposition, you can employ Self-Supervised or Semi-Supervised Learning techniques. Tools and frameworks based on Active Learning and Weak Supervision also help speed up data labeling with substantial cost savings.
If you have data with an unequal distribution of classes (i.e., imbalanced data), it can lead to undesirable bias. To solve this, you can use various sampling and weighting techniques while training your model.
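The weighting and sampling ideas for imbalanced data can be sketched in a few lines. This is an illustration under stated assumptions, not a definitive recipe: the helper names and the toy "normal" and "fault" labels are hypothetical, the weighting uses the common inverse-frequency heuristic `n_samples / (n_classes * class_count)`, and the sampling step is plain random oversampling of the minority class.

```python
import random
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class examples at random until every class matches the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for cls, cnt in counts.items():
        pool = [s for s, y in zip(samples, labels) if y == cls]
        extra = [rng.choice(pool) for _ in range(target - cnt)]
        out_samples.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_samples, out_labels

# Hypothetical toy data: 8 "normal" readings and 2 "fault" readings.
labels = ["normal"] * 8 + ["fault"] * 2
samples = list(range(10))

weights = balanced_class_weights(labels)
resampled_samples, resampled_labels = random_oversample(samples, labels)

print(weights)                    # rarer classes receive larger weights
print(Counter(resampled_labels))  # class counts after oversampling
```

In practice you would use one approach or the other: pass the weights to a loss function that accepts per-class weights, or train on the resampled data. Oversampling should be applied only to the training split, so that duplicated examples do not leak into evaluation.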

The good news is, if you find yourself in any of these data situations, there are techniques to help your ML model generalize better. As mentioned earlier, I covered many more of these techniques with intuitive examples in detail here.
