Content Understanding is Key to the Next Generation of Compression

WaveOne, Inc.

September 2018

    

“He has brown curly hair, large hazel eyes, and a sharp jaw line.” In just a few words, you have given your friend a rough idea of what your cousin's face looks like. This is an example of extreme compression: you communicate only the key features, and a lot of information is presumed to be true. Your friend would be worried about you if you had said “His left eye is hazel, and so is his right eye, and they are on opposite sides of his nose.”

While current codecs focus on minimizing pixel-level differences, we believe that the next generation of compression algorithms should be able to intelligently fill in the missing information the way we humans are able to. The next wave of compression technology will revolve around understanding the content, and making progress on the core computer vision problem of image understanding will lead to breakthroughs in compression as well.

Extreme Compression of Faces

While we are still years away from building a system that understands the content of images the way humans do, we can still explore the problem in a limited setting. In particular, we develop a codec for frontal faces at very aggressive compression factors and study the mistakes it makes. We share a few examples below from identities that the codec has not seen during training.

In this example, we compress a face to about 0.011 bits per pixel (in other words, roughly 90 pixels are coded by a single bit!) and show comparisons with popular image codecs:
 

Extreme 2000:1 Compression of 512x512 Faces

*Some codecs, even at their lowest size setting, cannot produce images that small. In such cases we shrink the image until the codec can compress it to the target size, and then enlarge the result with bicubic upsampling.

The byte sizes listed in this blog are the total number of bytes saved in the compressed file, including all metadata needed for reconstruction.
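As a rough sanity check on these figures, the relationship between file size, bits per pixel, and compression ratio can be sketched as follows. This is a minimal illustration, not part of the codec: it assumes the uncompressed image uses 24 bits per pixel (8-bit RGB), which is an assumption, and the function names are hypothetical.

```python
def bits_per_pixel(num_bytes: float, width: int, height: int) -> float:
    """Bits spent per pixel for a compressed file of num_bytes bytes."""
    return num_bytes * 8 / (width * height)


def compression_ratio(bpp: float, raw_bits_per_pixel: int = 24) -> float:
    """Ratio of raw size to compressed size.

    raw_bits_per_pixel=24 assumes an 8-bit RGB source (an assumption).
    """
    return raw_bits_per_pixel / bpp


def bytes_for_bpp(bpp: float, width: int, height: int) -> float:
    """Total compressed size in bytes for a given bits-per-pixel budget."""
    return bpp * width * height / 8


if __name__ == "__main__":
    bpp = 0.011  # the rate quoted for the 512x512 face example
    print(f"pixels per bit: {1 / bpp:.0f}")                   # 91
    print(f"compression ratio: {compression_ratio(bpp):.0f}:1")  # 2182:1
    print(f"file size: {bytes_for_bpp(bpp, 512, 512):.0f} bytes")  # 360 bytes
```

Under that assumption, 0.011 bits per pixel on a 512x512 image corresponds to a file of roughly 360 bytes, in line with the "2000:1" label on the figures.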
 

Detailed comparison between the original and WaveOne reconstruction

While our result is noticeably different from the original, the result is neither blurry nor pixelated, and the mistakes it makes are semantic in nature. In this example, even though each strand of hair differs, the hair still looks like hair. Our engine invents specular reflections in the eyes that are consistent. It also automatically chooses to use all bandwidth on the face, at the expense of the background.
 

The WaveOne reconstruction appears as a different person because our training objective was to generate a realistic face. This might be useful if the identity is less important (for example, for faces in the background). If, instead, preserving the identity is more important (for example, for face recognition), then preserving facial attributes can be included as part of the training objective. One could even train a codec to explicitly make faces less similar, for privacy reasons.

 

Extreme 2000:1 Compression of 512x512 Faces

In the above example, the codec changes the color of the eyes. Notice, however, that both eyes have the same color.

 

Extreme 2000:1 Compression of 512x512 Faces

And above, our engine makes up its mind that it should make up make-up! Seriously, we are not making this up.


Of course, very unusual face images, such as those with overlaid text or a face veil, can result in spectacular failures, such as turning text into teeth (or nostrils?):

 

Failure Modes of Face Compression

More examples

You can see more examples of compressed faces from our test set, as well as performance at even more aggressive compression rates here. To make sure there is no overfitting on identity, no person appears in both the training set and the test set.

Face Super-Resolution

If you have watched CSI, you may be familiar with the zoom-and-enhance feature. For fun, we tried to do this for real by training a custom upsampling module on aligned frontal faces at an aggressive 16x or even extreme 64x magnification factor. The 64x magnification does not always create pleasing results, but in many cases it is able to do reasonably well:

Aggressive Magnification of Faces by a Factor of 16

Extreme Magnification of Faces by a Factor of 64

 

We take the original image (left column) and downsample it to 32x32 or 8x8 pixels (the middle column). Our face magnification engine takes this tiny image as the only input and synthesizes a high-resolution version (the right column). While there are significant differences in identity, facial expression, eye color and gaze, the mistakes our super-resolution method makes are semantic and the image is self-consistent (for example, both eyes have the same shape and look in the same direction). In practice, of course, one would never use such a high magnification factor.
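To give a sense of how little information the tiny input carries, here is a minimal sketch of the downsampling step. The downsampling filter is not specified above, so block averaging is used purely as an assumption, on a grayscale image stored as a list of rows. The function names are hypothetical.

```python
def downsample(image, factor):
    """Average non-overlapping factor x factor blocks of a grayscale image.

    image is a list of equal-length rows of numbers. Block averaging is an
    assumed filter, chosen only for illustration.
    """
    height, width = len(image), len(image[0])
    if height % factor or width % factor:
        raise ValueError("image dimensions must be divisible by factor")
    out = []
    for top in range(0, height, factor):
        row = []
        for left in range(0, width, factor):
            block_sum = sum(
                image[y][x]
                for y in range(top, top + factor)
                for x in range(left, left + factor)
            )
            row.append(block_sum / (factor * factor))
        out.append(row)
    return out


def pixel_reduction(factor):
    """How many original pixels each downsampled pixel stands for."""
    return factor * factor


if __name__ == "__main__":
    # A 512-pixel side shrinks to 32 at 16x and to 8 at 64x.
    for factor in (16, 64):
        print(factor, 512 // factor, pixel_reduction(factor))
    # prints "16 32 256" then "64 8 4096"
```

At 64x magnification per side, each input pixel stands for 4096 output pixels, so almost all of the detail in the result has to be synthesized rather than recovered.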

Conclusion

We are excited by the early signs that our technology can understand image content in this limited setting. While traditional codecs blur the image, our codec produces visually pleasing outputs because the mistakes it makes are semantic in nature.

So what is the catch? This compression performance in such extreme settings comes with some limitations, the most important of which is that the model is specialized for aligned frontal faces. And, in fact, it does not work well when we apply it to general images. In some cases it will turn fur into human hair or produce abstract art:

Face Compression Applied to Non-Facial Images

Compression and super-resolution of faces are interesting, domain-constrained experiments that have recently been explored by other researchers. Some of the more recent approaches are [Yu Chen et al. 2017], [Ustinova and Lempitsky, 2017], [Tschannen et al. 2018], and [Santurkar et al. 2017]. Let us know if we are missing your work.

Join us!

Our mission at WaveOne is to reinvent the way digital media is represented, leading to smaller storage, better transmission, and higher-quality reconstruction. We believe that the ideal representation will be based on understanding the content of images the way we humans do.

In this blog post we have explored representations that, when pushed to the limit, fail in ways that reveal some structural understanding. They only work in the limited domain of frontal faces. Designing an intelligent, content-aware codec that works for any image remains an exciting and largely unsolved problem. Do you want to join us in our journey to solve this challenge? Let us know. We are hiring!