The Computer Vision Pipeline, Part 4: feature extraction
From Deep Learning for Vision Systems by Mohamed Elgendy
In this part, we will take a look at feature extraction — a core component of the computer vision pipeline.
In fact, the entire deep learning model is built around the idea of extracting useful features that clearly define the objects in the image. We’re going to spend a little more time here because it’s important that you understand what a feature is, what a vector of features is, and why we extract features.
A feature in machine learning is an individual measurable property or characteristic of a phenomenon being observed. Features are the input that you feed to your machine learning model to output a prediction or classification. Suppose you want to predict the price of a house. Your input features (properties) might include square_foot, number_of_rooms, bathrooms, and so on, and the model outputs the predicted price based on the values of those features. Selecting good features that clearly distinguish your objects increases the predictive power of machine learning algorithms.
What is a feature in computer vision?
In computer vision, a feature is a measurable piece of data in your image that is unique to a specific object. It may be a distinct color or a specific shape such as a line, an edge, or an image segment. A good feature is one that helps distinguish objects from one another. For example, if I give you a feature like a wheel and ask you to guess whether the object is a motorcycle or a dog, what would your guess be? A motorcycle. Correct! In this case, the wheel is a strong feature that clearly distinguishes between motorcycles and dogs. But if I give you the same feature (a wheel) and ask you to guess whether the object is a bicycle or a motorcycle, the feature isn’t strong enough to distinguish between the two. We then need to look for more features, like a mirror, a license plate, or maybe a pedal, that collectively describe the object.
In machine learning projects, we want to transform the raw data (the image) into a feature vector that shows our learning algorithm the characteristics of the object.
In the image above, we feed the raw input image of a motorcycle to a feature extraction algorithm. Let’s treat the feature extraction algorithm as a black box for now; we’ll come back to it soon. For now, we need to know that the extraction algorithm produces a vector that contains a list of features. This is called a feature vector: a 1D array that makes a robust representation of the object.
It is important to call out that the image above reflects features extracted from just one motorcycle. A very important characteristic of a feature is repeatability: the feature should be able to detect motorcycles in general, not just this specific one. So, in real-world problems, the feature will not be an exact copy of a piece of the input image.
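To make the "black box" a little more concrete, here is a minimal sketch of one very simple feature extractor: a per-channel color histogram. It is only an illustration of how an image can become a 1D feature vector, not the extraction algorithm the pipeline actually uses. The function name `color_histogram_features`, the bin count, and the tiny 2x2 image are all assumptions made up for this example.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]   # (R, G, B), each value in 0-255
Image = List[List[Pixel]]      # rows of pixels


def color_histogram_features(image: Image, bins_per_channel: int = 4) -> List[float]:
    """Return a 1D feature vector: one normalized histogram per color channel."""
    bin_width = 256 // bins_per_channel
    counts = [0] * (3 * bins_per_channel)
    n_pixels = 0
    for row in image:
        for pixel in row:
            n_pixels += 1
            for channel, value in enumerate(pixel):
                bin_index = min(value // bin_width, bins_per_channel - 1)
                counts[channel * bins_per_channel + bin_index] += 1
    # Normalize so the vector does not depend on the image size.
    return [count / n_pixels for count in counts]


# Hypothetical 2x2 image: two red-ish pixels, one black pixel, one white pixel.
toy_image: Image = [
    [(250, 10, 10), (240, 20, 5)],
    [(0, 0, 0), (255, 255, 255)],
]

feature_vector = color_histogram_features(toy_image)
print(len(feature_vector))   # 12 values: 4 bins for each of R, G, B
print(feature_vector[:4])    # red-channel bins: [0.25, 0.0, 0.0, 0.75]
```

Whatever the image size, the output here is always a flat list of 12 numbers, which is the property a classifier needs from a feature vector.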
If we take the wheel feature for example, the feature will not look exactly like the wheel on just one motorcycle. Instead, it looks like a circular shape with some patterns that identify wheels in all images in the training dataset. When the feature extractor sees thousands of images of motorcycles, it recognizes patterns that define wheels in general regardless of where they appear in the image and what type of motorcycle it is.
What makes a good (useful) feature?
Machine learning models are only as good as the features you provide. That means coming up with good features is an important job in building ML models. But what makes a good feature? And how can you tell?
Let’s discuss this with an example. Suppose we want to build a classifier to tell the difference between two breeds of dogs, Greyhound and Labrador. Let’s take two features and evaluate them: 1) the dogs’ height and 2) their eye color.
Let’s begin with height. How useful do you think this feature is? Well, on average, Greyhounds tend to be a couple of inches taller than Labradors, but not always. A lot of variation exists in the world. Let’s evaluate this feature across different values in both breeds’ populations. We can visualize the height distribution for a toy example in the histogram below:
From the histogram above, we can see that if the dog’s height is twenty inches or less, there’s more than an 80% probability that this dog is a Labrador. On the other side of the histogram, if we look at dogs that are taller than thirty inches, we can be pretty confident that the dog is a Greyhound. Now, what about the data in the middle of the histogram (heights from twenty to thirty inches)? We can see that the probability of each type of dog is pretty close. The thought process in this case is as follows:
if height <= 20: return higher probability to Labrador
if height >= 30: return higher probability to Greyhound
if 20 < height < 30: look for other features to classify the object
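The thought process above can be written as a small runnable function. This is only a sketch of the toy rule: the 20 and 30 inch thresholds come from the example histogram, while the function name `classify_by_height` and the returned strings are made up for illustration.

```python
def classify_by_height(height_in: float) -> str:
    """Toy rule based on the example histogram; thresholds are in inches."""
    if height_in <= 20:
        return "probably Labrador"
    if height_in >= 30:
        return "probably Greyhound"
    # Between 20 and 30 inches the two breeds overlap too much to decide.
    return "uncertain: look at other features"


for height in (18, 25, 32):
    print(height, "->", classify_by_height(height))
```

The middle branch is the interesting one: it is where a single feature runs out of information and a second feature has to help.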
The “height” of the dog in this case is a useful feature because it adds information that helps distinguish between both dog types. We can keep it, but it doesn’t distinguish between Greyhounds and Labradors in all cases, which is fine. In ML projects, there’s usually no one feature that can classify all objects on its own. This is why with machine learning we almost always need multiple features, where each feature captures a different type of information. If only one feature does the job, we can write if-else statements instead of bothering with training a classifier.
Similar to what we did earlier with color conversion (color vs. grayscale), to figure out which features you should use for a specific problem, do a thought experiment. Pretend you are the classifier. If you want to differentiate between Greyhounds and Labradors, what information would you need to know? You might ask about the hair length, the body size, the color, and so on.
Let’s look at another quick example, this time of a non-useful feature, to drive this idea home: eye color. For this toy example, imagine that we have only two eye colors, blue and brown. Here’s what a histogram might look like for this example:
It’s clear that for most values, the distribution is about 50/50 for both types. Practically, this feature tells us nothing because it doesn’t correlate with the type of dog. Hence, it doesn’t distinguish between Greyhounds and Labradors.
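One rough way to check a feature the way the two histograms do is to compute, for each feature value, what share of the dogs with that value belong to one breed. The sketch below does this on made-up toy data (the counts are invented for illustration and are not measurements). The helper name `labrador_share_by_value` is an assumption as well.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Tuple


def labrador_share_by_value(samples: List[Tuple[Hashable, str]]) -> Dict[Hashable, float]:
    """For each feature value, return the fraction of dogs that are Labradors."""
    totals = defaultdict(int)
    labradors = defaultdict(int)
    for value, breed in samples:
        totals[value] += 1
        if breed == "Labrador":
            labradors[value] += 1
    return {value: labradors[value] / totals[value] for value in totals}


# Made-up toy data as (feature value, breed) pairs, for illustration only.
eye_color = (
    [("blue", "Labrador")] * 1 + [("blue", "Greyhound")] * 1
    + [("brown", "Labrador")] * 2 + [("brown", "Greyhound")] * 2
)
height_bucket = (
    [("<=20in", "Labrador")] * 5 + [("<=20in", "Greyhound")] * 1
    + [(">=30in", "Greyhound")] * 3
)

eye_shares = labrador_share_by_value(eye_color)
height_shares = labrador_share_by_value(height_bucket)
print(eye_shares)     # every value sits at 0.5: the feature tells us nothing
print(height_shares)  # values sit far from 0.5: the feature is informative
```

Shares close to 0.5 for every value correspond to the flat eye-color histogram, while shares near 0 or 1 correspond to the ends of the height histogram.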
In general, a good feature is:
- Easily tracked and compared
- Consistent across different scales, lighting conditions, and viewing angles
- Still visible in noisy images or when only part of an object is visible
Extracting features (handcrafted vs. automatic extraction)
This is a large topic in machine learning that could fill an entire book. It’s typically described in the context of a topic called feature engineering. In this section we’re only concerned with extracting features from images, so I’m going to touch on the idea quickly.
Traditional machine learning uses hand-crafted features
In traditional machine learning problems, we spend a good amount of time on manual feature selection and engineering. In this process we rely on our domain knowledge (or partner with domain experts) to create features that make machine learning algorithms work better. We then feed the produced features to a classifier like a Support Vector Machine (SVM) or AdaBoost to predict the output. Some of the handcrafted feature sets are:
- Histogram of Oriented Gradients (HOG)
- Haar Cascades
- Scale-Invariant Feature Transform (SIFT)
- Speeded-Up Robust Features (SURF)
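To give a feel for what "handcrafted" means, here is a heavily simplified sketch of the idea behind HOG: measure the gradient direction at each pixel and build a histogram of those directions, weighted by gradient strength. This is not the full HOG algorithm, which also splits the image into cells and normalizes over blocks. The function name `orientation_histogram`, the 4-bin setting, and the two 5x5 test images are assumptions made up for this example.

```python
import math
from typing import List

GrayImage = List[List[float]]   # rows of grayscale intensities


def orientation_histogram(image: GrayImage, n_bins: int = 4) -> List[float]:
    """Histogram of unsigned gradient orientations (0-180 degrees),
    weighted by gradient magnitude and normalized to sum to 1."""
    height, width = len(image), len(image[0])
    hist = [0.0] * n_bins
    bin_width = 180.0 / n_bins
    # Skip the border so the central differences stay inside the image.
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = image[y][x + 1] - image[y][x - 1]
            gy = image[y + 1][x] - image[y - 1][x]
            magnitude = math.hypot(gx, gy)
            if magnitude == 0:
                continue  # flat region: no edge information here
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(angle // bin_width) % n_bins] += magnitude
    total = sum(hist)
    return [h / total for h in hist] if total else hist


# Hypothetical 5x5 images: dark-to-bright step across columns, then across rows.
vertical_edge = [[0, 0, 1, 1, 1] for _ in range(5)]
horizontal_edge = [[0] * 5, [0] * 5, [1] * 5, [1] * 5, [1] * 5]

v_hist = orientation_histogram(vertical_edge)
h_hist = orientation_histogram(horizontal_edge)
print(v_hist)  # [1.0, 0.0, 0.0, 0.0]: all gradient energy in the first bin
print(h_hist)  # the energy lands in a different bin for the rotated edge
```

Every choice here (central differences, unsigned angles, the number of bins) was made by a person, which is exactly the manual engineering effort this section describes.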
Deep learning automatically extracts features
In deep learning, we don’t need to manually extract features from the image. The network automatically extracts features and learns their importance to the output by applying weights to its connections. You feed the raw image to the network and, as it passes through the network layers, the network identifies patterns within the image to create features. Neural networks can be thought of as feature extractors plus classifiers that are end-to-end trainable, as opposed to traditional ML models that use hand-crafted features.
How do neural networks distinguish useful features from the non-useful features?
You might get the impression that neural networks only understand the useful features, but that’s not entirely true. Neural networks scoop up all the features available and give them random weights. During the training process, the network adjusts these weights to reflect their importance and how they should impact the output prediction. The patterns with the highest appearance frequency will have higher weights and in turn are considered more useful features, whereas features with the lowest weights will have very little impact on the output. This learning process is discussed in detail in the next chapter.
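The weight adjustment described above can be watched on a very small scale. The sketch below trains a single logistic unit (a stand-in for one neuron, not a full neural network) with plain gradient descent on made-up dog data. The heights, the eye-color encoding, the learning rate, and all variable names are assumptions for illustration. Every height appears once with each eye color, so eye color carries no information about the breed.

```python
import math
import random


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


# Made-up heights in inches; the two breeds overlap in the middle.
labrador_heights = [18, 20, 22, 24, 27]
greyhound_heights = [23, 26, 28, 30, 32]

# Each sample: (scaled height, eye color as -1 or +1, label); label 1 = Greyhound.
samples = []
for eye in (-1.0, 1.0):
    samples += [((h - 25) / 5, eye, 0) for h in labrador_heights]
    samples += [((h - 25) / 5, eye, 1) for h in greyhound_heights]

random.seed(0)
w_height = random.uniform(-0.5, 0.5)  # both weights start out random
w_eye = random.uniform(-0.5, 0.5)
bias = 0.0
learning_rate = 0.5

for _ in range(2000):
    grad_h = grad_e = grad_b = 0.0
    for x_h, x_e, label in samples:
        error = sigmoid(w_height * x_h + w_eye * x_e + bias) - label
        grad_h += error * x_h
        grad_e += error * x_e
        grad_b += error
    n = len(samples)
    w_height -= learning_rate * grad_h / n
    w_eye -= learning_rate * grad_e / n
    bias -= learning_rate * grad_b / n

print(f"height weight: {w_height:.3f}")
print(f"eye-color weight: {w_eye:.3f}")
```

After training, the height weight ends up large while the eye-color weight shrinks toward zero, even though both started as random numbers of similar size.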
Why use features?
The input image has too much extra information that isn’t necessary for classification. Therefore, the first step after preprocessing the image is to simplify it by extracting the important information and throwing away the non-essential information. By extracting important colors or image segments, we can transform complex and large image data into smaller sets of features. This makes the task of classifying images based on their features simpler and faster.
Consider the example below. Suppose we’re given a dataset of 10,000 images of motorcycles, each 1,000 pixels wide by 1,000 pixels high. Some images have solid backgrounds and others have busy backgrounds full of unnecessary data. When these thousands of images are fed to the feature extraction algorithm, we lose all the unnecessary data that isn’t important for identifying motorcycles and keep only a consolidated list of useful features, which can then be fed directly to the classifier. This process is a lot simpler than having the classifier look at the raw dataset of 10,000 images to learn the properties of motorcycles.
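A quick back-of-the-envelope calculation shows how much data is at stake in this example. The image count and dimensions come from the text. The three color channels and the 128-value feature vector are assumptions picked only to make the arithmetic concrete.

```python
num_images = 10_000
width = height = 1_000   # pixels, from the example above
channels = 3             # assumption: RGB images
feature_length = 128     # hypothetical feature vector size

raw_values_per_image = width * height * channels   # 3,000,000
raw_values_total = num_images * raw_values_per_image
feature_values_total = num_images * feature_length
reduction_factor = raw_values_per_image / feature_length

print(f"raw values in dataset:     {raw_values_total:,}")
print(f"feature values in dataset: {feature_values_total:,}")
print(f"reduction per image:       {reduction_factor:,.1f}x")
```

Under these assumptions the classifier sees about 1.3 million numbers instead of 30 billion, which is why feeding features rather than raw pixels makes the task simpler and faster.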