Training Models on Imbalanced Text Data
Text Classification, Oversampling, Text Generation, RNN
We just launched our liveProject platform — where you can sign up for a structured project and get real-world experience.
In this liveProject, you’ll take on the role of a data scientist working for an online movie streaming service. Your bosses want a machine learning model that can analyze written customer reviews of your movies, but you discover that the data is biased towards negative reviews. Training a model on this imbalanced data would hurt its accuracy, and so your challenge is to create a balanced dataset for your model to learn from. You’ll start by simulating your company’s data by deliberately introducing imbalance to an IMDB (Internet Movie Database) review dataset. You’ll experiment with two different methods for balancing this dataset: using sampling techniques, and generating a new synthetic corpus with deep learning text generation. You’ll build and train a simple machine learning model on each dataset to compare the effectiveness of each approach.
Learn more about liveProject here: https://liveproject.manning.com/