From Deep Learning with PyTorch by Eli Stevens and Luca Antiga
This article explores the use and capabilities of GANs, using a fictional (and inadvisable) example of aspiring art forgers.
Let’s suppose, for a moment, that we are career criminals who want to move into selling forgeries of “lost” paintings by famous artists. We’re criminals, though, not painters, and as we paint our fake Rembrandts and Picassos it quickly becomes apparent that they are amateur imitations rather than the real deal. Even if we spend a bunch of time practicing until we get a canvas that we can’t tell is fake, trying to pass it off at the local art auction house is going to get us kicked out instantly. Even worse, being told “this is clearly fake; get out,” doesn’t help us improve! We’d have to randomly try a bunch of things, gauge which ones took slightly longer to recognize as forgeries, and emphasize those traits on our future attempts, which seems like it’d take far too long.
Instead, we need to find an art historian of questionable moral standing to inspect our work and tell us exactly what it was that tipped them off that the painting wasn’t legit. With that feedback, we can improve our output in clear, directed ways, until our sketchy scholar can no longer tell them from the real thing.
Soon, we’ll have our “Botticelli” in the Louvre, and their Benjamins in our pockets. We’ll be rich!
Although this scenario is a bit farcical, the underlying approach and technology are sound, and will likely have a profound impact on the perceived veracity of digital data in the years to come. The entire concept of “photographic evidence” is likely to become entirely suspect, given how easy it will be to automate the production of convincing, yet fake images and video. The only key ingredient is data. Let’s see how this process works.
The GAN game
In the context of deep learning, what we’ve described is known as “the GAN game,” where two networks, one acting as the painter and one as the art historian, compete to outsmart each other at creating and detecting forgeries. GAN stands for Generative Adversarial Network, where generative means that something is being created (in this case, fake masterpieces), adversarial means that the two networks are competing to outsmart the other and, well, network is pretty obvious. These networks are one of the most original outcomes of recent deep learning research.
Remember that our overarching goal is to produce synthetic examples of a class of images that can’t be recognized as fake. When mixed in with legitimate examples, a skilled examiner would have trouble determining which ones are real, and which are our forgeries.
The generator network takes the role of the painter in our scene above, tasked with producing realistic-looking images starting from an arbitrary input. The discriminator network is the amoral art inspector, which has to tell whether a given image was fabricated by the generator or belongs to a set of real images. This two-network design is atypical among deep learning architectures, but when used to implement a GAN game it can lead to incredible results.
In Figure 1 there is a rough picture of what’s going on. The end-goal for the generator is to fool the discriminator into mixing up real and fake images. The end-goal for the discriminator is to find out when it’s being tricked, but it also helps inform the generator about the identifiable mistakes that the generated images have. At the start, the generator produces confused, three-eyed monsters that look nothing like a Rembrandt portrait. The discriminator is easily able to distinguish the muddled messes from the real paintings. As training progresses, information flows back from the discriminator, and the generator uses that to improve. By the end of training, the generator is able to produce convincing fakes, and the discriminator is no longer able to tell which is which.
Note that “Discriminator wins” or “Generator wins” shouldn’t be taken literally, as there’s no explicit tournament between the two. Both networks are trained based on the outcome of the other network, which drives the optimization of the parameters of each network.
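To make the alternating updates concrete, here is a minimal sketch of one round of the GAN game in PyTorch. It is not the book’s code: the tiny fully connected networks, the 2-D “data,” the binary cross-entropy loss, and names such as train_step and latent_dim are all assumptions chosen to keep the example small.

```python
import torch
from torch import nn

torch.manual_seed(0)

latent_dim, data_dim, batch_size = 8, 2, 16  # illustrative sizes (assumptions)

# Hypothetical toy networks; image GANs would use convolutional models instead.
generator = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, data_dim))
discriminator = nn.Sequential(nn.Linear(data_dim, 32), nn.ReLU(), nn.Linear(32, 1))

opt_g = torch.optim.Adam(generator.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()  # the discriminator outputs raw logits


def train_step(real_batch):
    n = real_batch.shape[0]
    real_labels = torch.ones(n, 1)
    fake_labels = torch.zeros(n, 1)

    # 1) Discriminator update: learn to separate real samples from fakes.
    #    detach() keeps this step from changing the generator.
    fake_batch = generator(torch.randn(n, latent_dim)).detach()
    d_loss = (bce(discriminator(real_batch), real_labels)
              + bce(discriminator(fake_batch), fake_labels))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # 2) Generator update: gradients flow back through the discriminator,
    #    telling the generator how to make its fakes look more "real".
    g_loss = bce(discriminator(generator(torch.randn(n, latent_dim))), real_labels)
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()

    return d_loss.item(), g_loss.item()


# Stand-in for a batch of real data (here, points scattered around (2, 2)).
real_batch = torch.randn(batch_size, data_dim) * 0.5 + 2.0
d_loss, g_loss = train_step(real_batch)
print(f"discriminator loss: {d_loss:.3f}, generator loss: {g_loss:.3f}")
```

The second step is where the “questionable art historian” earns their fee: the generator’s loss is computed from the discriminator’s verdict, so the gradient carries directed feedback rather than a bare “this is fake.”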
This technique has produced generators that create realistic images from noise and a conditioning signal, such as an attribute (e.g. for faces: young, female, glasses on) or another image. A well-trained generator learns a plausible model for generating images that look realistic even when examined by humans.
An interesting evolution of this concept is CycleGAN. A CycleGAN can turn images of one domain into images of another domain (and back), without the need for explicitly providing matching pairs in the training set.
In Figure 2 we have a CycleGAN workflow for the task of turning a photo of a horse into a zebra, and vice versa. Note that there are two separate generator networks, as well as two distinct discriminators.
As the figure shows, the first generator learns to produce an image conforming to a target distribution (zebras, in this case) starting from an image belonging to a different distribution (horses), well enough that the discriminator can’t tell whether the image produced from a horse photo is a genuine picture of a zebra. At the same time, and here’s where the Cycle prefix in the name comes in, the resulting fake zebra is sent through a different generator going the other way, zebra to horse in our case, to be judged by another discriminator on the other side. Creating such a cycle stabilizes the training process considerably, which addresses one of the original issues with GANs.
The fun part is that we don’t need pairs of matched horse/zebra as ground truths (good luck getting them to match poses!). It’s enough to start from a collection of unrelated horse images and zebra photos for the generators to learn their task, going beyond a purely supervised setting. The implications of this model go even further than this: the generator learns how to selectively change the appearance of objects in the scene without supervision on what is what. No signal indicates that manes are manes and legs are legs, but they get translated to something that lines up with the anatomy of the other animal.
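The round trip can be sketched in a few lines. In the CycleGAN formulation, the image that comes back from the second generator is also compared with the original through a reconstruction (cycle-consistency) term; that detail is not spelled out in this article, so treat it as outside knowledge. The single-convolution “generators,” the L1 loss choice, and the function name below are illustrative assumptions, and the adversarial losses from the two discriminators are omitted.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical stand-ins for the two generators; real ones are deep networks.
horse_to_zebra = nn.Conv2d(3, 3, kernel_size=3, padding=1)
zebra_to_horse = nn.Conv2d(3, 3, kernel_size=3, padding=1)

l1 = nn.L1Loss()


def cycle_consistency_loss(horses, zebras):
    # horse -> fake zebra -> reconstructed horse should land near the original
    reconstructed_horses = zebra_to_horse(horse_to_zebra(horses))
    # zebra -> fake horse -> reconstructed zebra, the other way around
    reconstructed_zebras = horse_to_zebra(zebra_to_horse(zebras))
    return l1(reconstructed_horses, horses) + l1(reconstructed_zebras, zebras)


# Two unrelated batches: no horse image is paired with any zebra image.
horses = torch.rand(2, 3, 32, 32)
zebras = torch.rand(2, 3, 32, 32)

loss = cycle_consistency_loss(horses, zebras)
loss.backward()  # gradients reach both generators
print(f"cycle-consistency loss: {loss.item():.3f}")
```

Note that the loss never compares a horse with a zebra directly, which is why unpaired collections are enough.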
A network that turns horses into zebras
We can play with this model right now. The CycleGAN network has been trained on a dataset of (unrelated) horse images and zebra images extracted from the ImageNet dataset. The network learns to take an image of one or more horses and turn them all into zebras, leaving the rest of the image as unmodified as possible. Although humankind hasn’t held its breath over the last few thousand years for a tool that turns horses into zebras, this task showcases the ability of these architectures to model complex real-world processes with distant supervision. Although they have limits, there are hints that in the near future we won’t be able to tell real from fake in a live video feed, which opens a can of worms that we’ll duly close right now — it’s a bit beyond the scope of this article.
Playing with a pre-trained CycleGAN gives us the opportunity to take a step closer and look at how a network, a generator in this case, is implemented. Let’s do it right away: this is what a possible generator architecture for the horse-to-zebra task looks like. In our case it’s our old friend ResNet. We’ll define a ResNetGenerator class off-screen. The code is in the first cell of the 3_cyclegan.ipynb file, but the implementation isn’t relevant right now, and it’s too complex to follow until we’ve gotten a lot more PyTorch experience. Right now, we’re focused on what it can do, rather than how it does it. Let’s instantiate the class with default parameters:
Listing 2. 2. code/p1ch2/3_cyclegan.ipynb
netG = ResNetGenerator()
The netG model has now been created, but it contains random weights. We mentioned earlier that we’d run a generator model which had been pre-trained on the horse2zebra dataset. The weights of the model are saved in a pth file, which is nothing but a pickle file of the tensor parameters of the model. We can load those into our ResNetGenerator using the load_state_dict method of the model:
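model_path = '../data/p1ch2/horse2zebra_0.4.0.pth'
model_data = torch.load(model_path)  # dictionary mapping parameter names to tensors
netG.load_state_dict(model_data)     # copy the saved weights into the model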
At this point, netG has acquired all the knowledge it achieved during training.
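For readers who want to see the save-and-load round trip without the pre-trained file, here is a small sketch. The two-layer model built by make_model is a hypothetical stand-in for ResNetGenerator, and an in-memory buffer replaces the pth file on disk.

```python
import io

import torch
from torch import nn

torch.manual_seed(0)


def make_model():
    # Hypothetical tiny model standing in for ResNetGenerator.
    return nn.Sequential(
        nn.Conv2d(3, 4, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.Conv2d(4, 3, kernel_size=3, padding=1),
    )


# Pretend this model has been trained, then save only its parameters.
trained = make_model()
buffer = io.BytesIO()  # in-memory stand-in for a .pth file
torch.save(trained.state_dict(), buffer)
buffer.seek(0)

# A freshly built model has the same structure but different random weights.
fresh = make_model()
state = torch.load(buffer)
print(list(state.keys()))  # parameter names, one entry per weight or bias tensor
fresh.load_state_dict(state)

# After loading, both models compute the same function.
x = torch.rand(1, 3, 8, 8)
with torch.no_grad():
    same = torch.allclose(trained(x), fresh(x))
print(same)
```

The class definition supplies the structure and the state dictionary supplies the numbers, which is why the model has to be instantiated before its weights can be loaded.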
Let’s put the network in eval mode, as we did for ResNet101:
netG.eval()
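As a brief aside on why the mode matters: some layers, such as dropout and batch normalization, behave differently during training and evaluation. The sketch below uses a standalone dropout layer purely for illustration; whether ResNetGenerator contains such a layer isn’t shown in this article, so that part is an assumption.

```python
import torch
from torch import nn

torch.manual_seed(0)

layer = nn.Dropout(p=0.5)
x = torch.ones(1, 1000)

layer.train()          # training mode: random elements are zeroed out,
train_out = layer(x)   # and the survivors are scaled by 1 / (1 - p)

layer.eval()           # evaluation mode: dropout becomes a no-op
eval_out = layer(x)

print("zeros in train mode:", int((train_out == 0).sum()))
print("zeros in eval mode: ", int((eval_out == 0).sum()))
```

Calling eval() on a whole model switches every submodule at once, so inference results don’t depend on training-time randomness.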
Printing out the model as we did earlier, we can appreciate that it’s pretty condensed for what it does. It takes an image, recognizes one or more horses in it by looking at pixels, and individually modifies the values of those pixels so that what comes out looks like a credible zebra. We won’t recognize anything zebra-like in the printout (or in the source code, for that matter): this is because there’s nothing zebra-like in there. The network is a scaffold, and the juice is in the weights.
We’re ready to load some random image of a horse and see what our generator produces. First of all, we need to import Image from PIL and transforms from torchvision:
from PIL import Image
from torchvision import transforms
Then we define a few input transformations to make sure data enters the network with the right shape and size:
preprocess = transforms.Compose([transforms.Resize(256),
                                 transforms.ToTensor()])
Let’s open a horse file:
img = Image.open("../data/p1ch2/horse.jpg")
Oh, there’s a dude on the horse. Not for long, judging by the picture. Anyhow, let’s pass it through preprocessing and turn it into a properly shaped variable:
img_t = preprocess(img)
batch_t = torch.unsqueeze(img_t, 0)
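For the curious, the unsqueeze call only adds a leading batch dimension, since the model expects a batch of images rather than a single one. The sketch below uses a random tensor as a stand-in for the preprocessed image; the 256 by 316 size is an illustrative assumption.

```python
import torch

# Stand-in for the preprocessed image: 3 color channels, height 256, width 316.
img_t = torch.rand(3, 256, 316)

# Insert a new dimension at position 0, giving a batch that holds one image.
batch_t = torch.unsqueeze(img_t, 0)

print(img_t.shape)    # torch.Size([3, 256, 316])
print(batch_t.shape)  # torch.Size([1, 3, 256, 316])

# The image data itself is unchanged; it is simply element 0 of the batch.
print(torch.equal(batch_t[0], img_t))  # True
```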
We shouldn’t worry about the details right now. The important thing is that we follow from a distance. At this point, batch_t can be sent to our model:
batch_out = netG(batch_t)
batch_out is now the output of the generator, which we can convert back to an image:
out_t = (batch_out.data.squeeze() + 1.0) / 2.0
out_img = transforms.ToPILImage()(out_t)
<PIL.Image.Image image mode=RGB size=316x256 at 0x23B24634F98>
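The arithmetic in the conversion step is worth a quick look. Adding 1 and dividing by 2 maps values from the range [-1, 1] to [0, 1], which suggests (as an assumption, since the generator’s code isn’t shown here) that the network’s last layer squashes its output into [-1, 1], for example with a tanh. The sketch below checks the mapping on a random stand-in tensor rather than on real generator output.

```python
import torch

torch.manual_seed(0)

# Stand-in for the generator output: one tiny 3-channel "image" in [-1, 1].
batch_out = torch.tanh(torch.randn(1, 3, 4, 6))

# Drop the batch dimension, then rescale from [-1, 1] to [0, 1].
out_t = (batch_out.squeeze(0) + 1.0) / 2.0
print(out_t.shape)
print(float(out_t.min()) >= 0.0, float(out_t.max()) <= 1.0)

# Image formats usually store 8-bit values, so [0, 1] becomes 0..255.
pixels = (out_t * 255).round().to(torch.uint8)
print(pixels.dtype)
```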
Oh, man. Who rides a zebra that way? The resulting image isn’t perfect, but consider that it’s somewhat unusual for the network to find someone (sort of) riding on top. It bears repeating that the learning process hasn’t passed through direct supervision, where humans have delineated tens of thousands of horses, or manually photoshopped out thousands of zebra stripes. The generator has learned to produce an image that fools the discriminator into thinking that it’s a zebra and that there’s nothing fishy with the image (clearly the discriminator has never been to a rodeo).
Many other fun generators have been developed using adversarial training or other approaches. Some of them are capable of creating credible human faces of nonexistent individuals; others can translate sketches into real-looking pictures of imaginary landscapes. Generative models are also being explored for producing real-sounding audio, credible text, or enjoyable music. It’s likely that these models will be the basis of future tools that support the creative process.
On a serious note, it’s hard to overstate the implications of this kind of work. Tools like the one we downloaded are only going to get higher quality and more ubiquitous. Face-swapping technology in particular has gotten considerable media attention. Searching for “deep fakes” will turn up a plethora of example content (though we must note that there’s a non-trivial amount of “not safe for work” content labeled as such; as with everything on the internet, click carefully).
That’s all for now.