From Grokking Deep Learning for Computer Vision by Mohamed Elgendy
Check out part 1 for an intro to the computer vision pipeline.
In computer vision applications, we deal with image or video data. Let’s talk about grayscale and color images.
Images as functions
Images can be represented as a function of two variables, x and y, which define a two-dimensional area. Digital images are made of a grid of pixels. The pixel is the raw building block of an image. Every image consists of a set of pixels whose values represent the intensity of light at a given place in the image. Let’s take a look at the motorcycle example again after applying the pixel grid to it.
The image above has a size of 32 x 16. This means that the image is 32 pixels wide and 16 pixels tall. The x axis runs from 0 to 31 and the y axis from 0 to 15. Overall, the image has 32×16 = 512 pixels. In this grayscale image, each pixel contains a value that represents the intensity of light at that specific pixel. The pixel values vary from 0 to 255. Because the pixel value represents the intensity of light, the value 0 represents dark pixels (black), 255 is bright (white), and the values in between represent shades of gray.
You can see that the image coordinate system is similar to the Cartesian coordinate system: images are two-dimensional and lie on the x-y plane. The origin (0, 0) is at the top left of the image, so y increases downward. To represent a specific pixel we use the following notation: “F” is the function and “x, y” is the location of the pixel in x and y coordinates. For example, the pixel located at x = 9 and y = 14 is white, which is represented by the following function: F(9, 14) = 255. Similarly, the pixel (27, 7) that lies on the front wheel of the motorcycle is black, represented as F(27, 7) = 0.
Grayscale => F(x, y) gives the intensity at position (x, y)
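To make the function view concrete, here is a minimal sketch in plain Python. The tiny 4 x 3 image and its pixel values are made up for illustration (this is not the motorcycle image), and the name F simply mirrors the notation above. In practice you would probably use an array library, but nested lists are enough to show the idea.

```python
# A made-up 4 x 3 grayscale image (4 pixels wide, 3 pixels tall).
# Each inner list is one row of pixels; values range from 0 (black) to 255 (white).
image = [
    [0,   50,  100, 150],  # row y = 0 (top of the image)
    [25,  75,  125, 175],  # row y = 1
    [255, 200, 30,  0],    # row y = 2 (bottom of the image)
]

def F(x, y):
    """Return the intensity at column x and row y, with the origin at the top left."""
    return image[y][x]

width, height = len(image[0]), len(image)

print(width, height)  # 4 3
print(F(0, 2))        # 255, a white pixel
print(F(3, 2))        # 0, a black pixel
```

Notice that the row index (y) comes first when we index the nested list, even though we write F(x, y).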
That was for grayscale images. How about color images?
In color images, instead of representing the value of the pixel by one number, the value is represented by three numbers, one for the intensity of each color in the pixel. In the RGB system, for example, the value of the pixel is represented by three numbers: the intensity of red, the intensity of green, and the intensity of blue. Other color systems for images exist, like HSV and LAB. All follow the same concept when representing the pixel value. More on color images on the next page.
Here’s the function representing color images in the RGB system:
Color image in RGB => F(x, y) = [red(x, y), green(x, y), blue(x, y)]
Thinking of an image as a function is useful in image processing. We can treat an image as a function F(x, y) and operate on it mathematically to transform it into a new image function G(x, y).
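The same idea can be sketched for color. The 2 x 2 image below is made up for illustration, and F_rgb is a hypothetical helper name, not something from the book.

```python
# A made-up 2 x 2 RGB image. Each pixel holds three intensities: (red, green, blue).
rgb_image = [
    [(255, 0, 0), (0, 255, 0)],      # row y = 0: a red pixel, then a green pixel
    [(0, 0, 255), (255, 255, 255)],  # row y = 1: a blue pixel, then a white pixel
]

def F_rgb(x, y):
    """Return [red, green, blue] for the pixel at column x and row y."""
    return list(rgb_image[y][x])

red, green, blue = F_rgb(1, 0)
print(red, green, blue)  # 0 255 0, pure green
```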
Let’s take a look at the following image transformation examples:
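As a rough sketch of what such a transformation can look like in code, the example below applies two simple per-pixel operations: a negative, G(x, y) = 255 - F(x, y), and a brightening step that adds a constant and caps the result at 255. These two operations are chosen here for illustration and are not necessarily the ones the book has in mind; the pixel values and function names are also made up.

```python
# A made-up 3 x 2 grayscale image that plays the role of F.
F_pixels = [
    [0,  100, 200],
    [50, 150, 250],
]

def transform(image, pixel_fn):
    """Build a new image G by applying pixel_fn to every pixel of the input image."""
    return [[pixel_fn(value) for value in row] for row in image]

def invert(value):
    """Negative: dark pixels become bright and bright pixels become dark."""
    return 255 - value

def brighten(value, amount=20):
    """Add a constant to the intensity, capped at the maximum value of 255."""
    return min(255, value + amount)

G_negative = transform(F_pixels, invert)
G_brighter = transform(F_pixels, brighten)

print(G_negative)  # [[255, 155, 55], [205, 105, 5]]
print(G_brighter)  # [[20, 120, 220], [70, 170, 255]]
```

The original image is left untouched: each transformation returns a new image G.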
How do computers see images?
When we look at an image, we see objects, landscape, colors, etc., but this isn’t the case with computers. Consider the picture below: your human brain can look at it and immediately know that it’s a picture of a motorcycle. To a computer, the image looks like a 2D matrix of pixel values, which represent intensities across the color spectrum.
Without context, it’s a massive pile of data.
You can see that the image above has a size of 24 x 24. The size gives the width and height of the image: 24 pixels horizontally and 24 vertically. That means there are a total of 24×24 = 576 pixels. If the image has a size of 700 x 500, then the dimensionality of the matrix is (700, 500). Each element of the matrix represents the intensity of brightness in that pixel: 0 represents black and 255 represents white.
In grayscale images, each pixel represents the intensity of only one color, whereas in the standard RGB system, color images have three channels (red, green, and blue). Color images are represented by three matrices: one represents the intensity of red in each pixel, one the intensity of green, and one the intensity of blue.
As you can see in the image above, the color image is composed of three channels: red, green, and blue. Now the question is, how do computers see this image? Again, the answer is that they see a matrix, but unlike grayscale images, where we had only one channel, here we have three matrices stacked on top of each other, which is why it’s a 3D matrix. The dimensionality of a 700 x 700 color image is (700, 700, 3). Let’s say the first matrix represents the red channel; then each element of that matrix represents the intensity of red in that pixel, and likewise for green and blue. Each pixel in a color image has three numbers (0 to 255) associated with it. These numbers represent the intensity of red, green, and blue in that particular pixel.
If we take the pixel (0,0) as an example, we’ll see that it represents the top-left pixel of the image of green grass. When we view this pixel in the color image, it looks like this:
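Here is a small sketch of the size arithmetic, using a blank image built from nested lists. The helper name blank_image is made up for this example. One caveat worth keeping in mind: many array libraries report the number of rows (the height) first, so check the convention of the tool you use.

```python
def blank_image(width, height):
    """Return an all-black grayscale image stored as a list of rows."""
    return [[0] * width for _ in range(height)]

small = blank_image(24, 24)
total_pixels = sum(len(row) for row in small)
print(total_pixels)  # 576

large = blank_image(700, 500)     # 700 pixels wide, 500 pixels tall
rows, cols = len(large), len(large[0])
print(cols, rows)                 # 700 500 (width, height)
print(rows * cols)                # 350000 pixels in total
```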
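The stacking of three matrices can be sketched as follows. The 4 x 2 image and its pixel values are made up for illustration; the point is only to show how one color image splits into a red, a green, and a blue matrix, and how a pixel is recovered from the three channels.

```python
width, height = 4, 2

# A made-up color image: every pixel is a (red, green, blue) tuple.
pixels = [[(10 * x, 100 + y, 200) for x in range(width)] for y in range(height)]

# Split the image into three matrices, one per channel.
red   = [[p[0] for p in row] for row in pixels]
green = [[p[1] for p in row] for row in pixels]
blue  = [[p[2] for p in row] for row in pixels]
channels = [red, green, blue]

# Rows, columns, and depth: the depth of 3 is the number of channels.
shape = (len(pixels), len(pixels[0]), len(channels))
print(shape)  # (2, 4, 3)

# Reading the same position from each channel gives back the original pixel.
x, y = 3, 1
rebuilt = tuple(channel[y][x] for channel in channels)
print(rebuilt)  # (30, 101, 200)
```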
Take a look at the example below of some shades of the color green and their RGB values:
To recap: computers see an image as matrices. Grayscale images have one channel (gray), so we can represent them as a 2D matrix, where each element represents the intensity of brightness in that particular pixel. Remember, 0 means black and 255 means white. Color images have three channels, RGB (red, green, blue), so we can represent them as a 3D matrix with a depth of three.
We’ve also seen how images can be treated as functions of space. This concept allows us to operate on images mathematically and change or extract information from them. Treating images as functions is the basis of many image processing techniques, like converting color to grayscale or scaling an image. Each of these steps applies a mathematical operation that transforms an image pixel by pixel.
Grayscale: F(x, y) gives the intensity at position (x, y)
Color image: F(x, y) = [red(x, y), green(x, y), blue(x, y)]
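As an example of one of these techniques, here is a minimal sketch of converting color to grayscale pixel by pixel. It uses the weights 0.299, 0.587, and 0.114 for red, green, and blue, which is one common choice (the ITU-R BT.601 luma weights); the text above does not prescribe a formula, so treat this as an assumption. The 2 x 2 image and the function names are made up.

```python
# A made-up 2 x 2 RGB image.
rgb_image = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (255, 255, 255)],
]

def to_gray(pixel):
    """Map one (red, green, blue) pixel to a single intensity from 0 to 255."""
    r, g, b = pixel
    return round(0.299 * r + 0.587 * g + 0.114 * b)

def rgb_to_grayscale(image):
    """Turn a color image (three numbers per pixel) into a grayscale one (one number per pixel)."""
    return [[to_gray(pixel) for pixel in row] for row in image]

gray = rgb_to_grayscale(rgb_image)
print(gray)  # [[76, 150], [29, 255]]
```

With these weights, pure green comes out brighter than pure red or pure blue, and white stays at 255.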
About the author:
Mohamed Elgendy is the head of engineering at Synapse Technology, a leading AI company that builds proprietary computer vision applications to detect threats at security checkpoints worldwide. Previously, Mohamed was an engineering manager at Amazon, where he developed and taught the deep learning for computer vision course at Amazon’s Machine Learning University. He also built and managed Amazon’s computer vision think tank, among many other noteworthy machine learning accomplishments. Mohamed regularly speaks at many AI conferences like Amazon’s DevCon, O’Reilly’s AI conference and Google’s I/O.
Originally published at https://freecontent.manning.com.