Real-Time Adaptive Image Compression

WaveOne, Inc.

May 2017

Image compression is an important step towards our long-term goals at WaveOne. We are excited to share some early results of our research which will appear at ICML 2017. You can find the paper on arXiv.

As of today, WaveOne image compression outperforms all commercial codecs and research approaches known to us on standard datasets where comparison is available. Furthermore, with access to a GPU our codec runs orders of magnitude faster than other recent ML-based solutions: for example, we typically encode or decode the Kodak dataset at over 100 images per second.

Here is how different image codecs compress an example 480×480 image to a file size of 2.3kB (~0.08 bits per pixel):
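As a quick sanity check on that figure, here is the bits-per-pixel arithmetic (assuming 1 kB = 1000 bytes):

```python
# Sanity check: 2.3 kB spread over a 480x480 image is about 0.08 bits per pixel.
file_bytes = 2.3 * 1000          # 2.3 kB, assuming 1 kB = 1000 bytes
num_pixels = 480 * 480
bpp = file_bytes * 8 / num_pixels
print(f"{bpp:.3f} bits per pixel")
```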


Now is the time to rethink compression

Even though over 70% of internet traffic today is digital media, the way images and video are represented and transmitted has not evolved much in the past 20 years (apart from Pied Piper's Middle-Out algorithm). It has been challenging for existing commercial compression algorithms to adapt to the growing demand and the changing landscape of transmission settings and requirements. The available codecs are "one-size-fits-all": they are hard-coded, and cannot be customized to particular use cases beyond high-level hyperparameter tuning.

At the same time, in the last few years, deep learning has revolutionized a variety of fields, such as machine translation, object recognition/detection, speech recognition, and photo-realistic image generation. Even though the world of compression seems a natural domain for deep learning, it has not yet fully benefited from these advancements, for two main reasons:

  1. Our deep learning primitives, in their raw forms, are not well-suited for constructing compact representations. Vanilla autoencoder architectures cannot compete with the carefully engineered and highly tuned traditional codecs.
  2. It has been particularly challenging to develop a deep learning approach which can run in real-time in environments constrained by computation power, memory footprint and battery life.

Fortunately, the ubiquity of deep learning has been catalyzing the development of hardware architectures for neural network acceleration. In the years ahead, we foresee dramatic improvements in speed, power consumption and widespread availability of neural network hardware, and this will enable the proliferation of codecs based on deep learning. 

In the paper, we present progress on both performance and computational feasibility of ML-based image compression.


WaveOne compression performance

Here are some more detailed results from our paper (click to enlarge):


Performance on the Kodak PhotoCD dataset measured in the RGB domain (dataset and colorspace chosen to enable comparison to other approaches; see the paper for more extensive results). The plot on the left presents average reconstruction quality as a function of the number of bits per pixel fixed for each image. The plot on the right shows average compressed file sizes relative to ours for different target MS-SSIM values.


Real-time performance is critically important to us, and we carefully designed and optimized our architecture to make things fast. Here's the raw encode/decode performance:


Average times to encode and decode images from the RAISE-1k 512×768 dataset using the WaveOne codec on an NVIDIA GTX 980 Ti GPU with batch size 1.

We are slightly faster than JPEG (libjpeg) and significantly faster than JPEG 2000, WebP and BPG. However, our codec runs on a GPU while the traditional codecs do not, so we do not show that comparison directly. We also do not have access to the full code of recent ML approaches to compare against.


Domain-adaptive compression

Here is an interesting tidbit not discussed in the paper. Since our compression algorithm is learned rather than hard-coded, we can easily train codecs custom-tailored to specific domains. This enables capturing particular structure that the traditional one-size-fits-all codecs would not be able to characterize.

To illustrate this, we created a custom codec specifically to compress aerial views. To do this, we only replaced our training dataset: we preserved exactly the same architecture and training procedure we followed for our generic image codec. This aerial codec was able to achieve another 14% boost in compression over our generic codec on our aerial view test set.

Crops from reconstructions by JPEG, JPEG 2000, and the WaveOne Aerial codec for a target file size of 6kB (or equivalently 0.11 bits per pixel). WebP is not shown, as it was not able to generate files at such a small bitrate. The WaveOne codec was custom-trained on aerial views to capture their characteristic structure.


Brief technical overview

Background

Traditional image compression

Compression, in general, is almost synonymous with pattern recognition. If we can identify structure in our input, we can eliminate this redundancy to represent it more succinctly.

Image codecs can often be broken down into 3 different modules:

  1. Transformation: map the input to an alternative representation space more suitable for further processing.
  2. Quantization: remove information that is deemed less important.
  3. Coding: leverage low entropy content towards a more compact representation.
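The three-module structure above can be illustrated with a toy sketch. Here a single-level Haar transform stands in for the transformation, uniform scalar quantization for the quantization step, and an empirical-entropy estimate for the coding step; these stand-ins are illustrative assumptions, not the components of any real codec:

```python
import numpy as np

def transform(block):
    """Toy 'transformation' stage: a single-level 2D Haar transform,
    standing in for the engineered transforms real codecs use."""
    h = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
    a = np.kron(np.eye(block.shape[0] // 2), h)  # pairwise Haar on rows/cols
    return a @ block @ a.T

def quantize(coeffs, step=8.0):
    """Toy 'quantization' stage: uniform scalar quantization."""
    return np.round(coeffs / step).astype(int)

def entropy_bits(symbols):
    """Toy 'coding' stage: estimate the coded size via the empirical entropy
    of the quantized symbols."""
    _, counts = np.unique(symbols, return_counts=True)
    p = counts / counts.sum()
    return symbols.size * -(p * np.log2(p)).sum()

rng = np.random.default_rng(0)
block = rng.normal(scale=32.0, size=(8, 8))      # toy 8x8 image block
q = quantize(transform(block))
print(f"~{entropy_bits(q):.1f} bits for {block.size} pixels")
```

Coarser quantization (a larger `step`) concentrates more symbols at zero, lowering the entropy estimate, i.e. the coded size, at the cost of reconstruction quality.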

In traditional codecs, all these components are hard-coded: they are not automatically adapted (apart from high-level hyperparameters, possibly), and are thus heavily engineered to fit together. Even though these pipelines have been assembled very carefully, there still remains significant room for improvement of compression efficiency.

For example, the transformation is designed in an effort to target the distribution of inputs, but is ultimately fixed in place. As such, in practice it does not capture the input statistics perfectly, nor can it easily be refined to match more specialized structure. In addition, hard-coded approaches tend to compartmentalize the loss of information within the quantization step. This implies that the transformation module must often be chosen to be bijective, which limits the ability to reduce redundancy prior to coding. Moreover, the encode-decode pipeline cannot be optimized for a particular metric beyond manual tweaking: even if we had the perfect metric for image quality assessment (which is an interesting question in and of itself!), traditional approaches could not directly optimize their reconstructions for it.


ML enters the picture

In approaches based on machine learning, structure is automatically discovered, rather than manually engineered. 

At a high level, one natural approach to implement the encoder-decoder image compression pipeline is to use an autoencoder to map the target through a bitrate bottleneck, and train the model to minimize a loss function penalizing the discrepancy between the input and its reconstruction. This requires designing a feature extractor and synthesizer for the encoder and decoder, selecting an appropriate objective, and possibly introducing a coding scheme to further compress the fixed-size representation to attain variable-length codes. Many of the existing ML-based approaches to image compression (including ours) follow this general strategy.
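The forward pass of this autoencoder-through-a-bottleneck idea can be sketched in a few lines of numpy. The linear encoder/decoder, the latent size, and the rounding-based quantizer below are all illustrative assumptions (and the weights are untrained); they stand in for the learned nonlinear networks such approaches actually use:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: 64-dim "images", 16-dim latent bottleneck.
W_enc = rng.normal(scale=0.1, size=(16, 64))   # encoder weights (untrained)
W_dec = rng.normal(scale=0.1, size=(64, 16))   # decoder weights (untrained)

def compress(x, levels=15):
    """Encode, then quantize the latent to a small set of integer levels.
    Quantization is where the bitrate bottleneck is enforced."""
    z = W_enc @ x
    return np.clip(np.round(z * levels), -levels, levels).astype(int)

def decompress(code, levels=15):
    return W_dec @ (code / levels)

x = rng.normal(size=64)          # toy input "image"
code = compress(x)               # discrete code: this is what gets stored/sent
x_hat = decompress(code)
# Training would minimize a reconstruction loss over many inputs, e.g.:
loss = np.mean((x - x_hat) ** 2)
```

Because rounding has zero gradient almost everywhere, real implementations train through the quantizer with tricks such as straight-through gradient estimates or additive noise.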

Our approach

Here we dive into a bit more detail about our model. We only provide a brief overview here, and invite you to have a look at the paper if you'd like to learn more.

 

The overall architecture of our model. The feature extractor discovers structure and reduces redundancy via the pyramidal decomposition and interscale alignment modules. The lossless coding scheme further compresses the quantized tensor via bitplane decomposition and adaptive arithmetic coding. The adaptive codelength regularization modulates the expected codelength to a prescribed target bitrate. Distortions between the target and its reconstruction are penalized by the reconstruction loss. The discriminator loss encourages visually pleasing reconstructions by penalizing discrepancies between their distributions and the targets’.


Feature extraction

Images feature a number of different types of structure: across input channels, within individual scales, and across scales. We design our feature extraction architecture to recognize these. It consists of two major components:

  1. Pyramidal decomposition. This module analyzes the input across a number of different scales via learned transformations. It is loosely inspired by the use of wavelets for multiresolution analysis.
  2. Interscale alignment. This module learns to exploit structure shared across the different scales.


Code computation and regularization

The output of the feature extraction module is a single monolithic tensor. We proceed to encode it into a binary code, and regularize the corresponding codelength. This amounts to a number of steps:

  1. Quantization. We truncate the binary expansion of the tensor to a given precision.
  2. Bitplane decomposition. We losslessly transform the tensor into a binary counterpart suitable for encoding.
  3. Adaptive arithmetic coding. The output of the bitplane decomposition features significant structure (due to the adaptive codelength regularization below). This module is trained to leverage this low entropy to compress the binary tensor further into a variable-length binary sequence.
  4. Adaptive codelength regularization. This unit modulates the distribution of the quantized representation to achieve a target expected bit count across inputs. It does this by encouraging the presence of structure exactly where the coding scheme is able to exploit it for better compression.
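Steps 1 and 2 above are concrete enough to sketch in numpy. The bit depth and tensor shape below are illustrative assumptions, not the values used in the codec:

```python
import numpy as np

def quantize_tensor(t, bits=6):
    """Step 1 (quantization): truncate each value's binary expansion by
    mapping it onto a 2^bits integer grid over [0, 1)."""
    return np.clip((t * (1 << bits)).astype(int), 0, (1 << bits) - 1)

def bitplane_decompose(q, bits=6):
    """Step 2 (bitplane decomposition): split the quantized tensor into
    `bits` binary tensors, most significant plane first."""
    return np.stack([(q >> b) & 1 for b in reversed(range(bits))])

rng = np.random.default_rng(0)
t = rng.random((4, 4))                  # toy feature tensor with values in [0, 1)
planes = bitplane_decompose(quantize_tensor(t))
# planes[0] is the most significant bitplane. Step 3's adaptive arithmetic
# coder would then compress these binary tensors, exploiting the structure
# that step 4's codelength regularization encourages.
```

The decomposition is lossless: summing the planes back with their bit weights recovers the quantized tensor exactly.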


Multiscale Adversarial Training

One of the most exciting innovations in machine learning in the last few years is the idea of Generative Adversarial Networks (GANs). The idea is to construct a generator network whose goal is to synthesize outputs according to a particular target distribution, and a discriminator network whose goal is to distinguish between examples sampled from the ground truth distribution, and ones produced by the generator.
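The two competing objectives can be written down numerically. Below is a minimal sketch of the classic GAN losses given discriminator scores (probabilities of "real"); the toy score values are made up for illustration, and this is the standard formulation rather than our multiscale variant:

```python
import numpy as np

def discriminator_loss(d_real, d_fake):
    """Classic GAN discriminator objective: binary cross-entropy pushing
    real samples toward score 1 and generated samples toward score 0."""
    return -np.mean(np.log(d_real)) - np.mean(np.log(1.0 - d_fake))

def generator_loss(d_fake):
    """Non-saturating generator objective: fool the discriminator into
    scoring generated samples as real."""
    return -np.mean(np.log(d_fake))

# Toy discriminator outputs:
d_real = np.array([0.9, 0.8])   # real images, correctly scored high
d_fake = np.array([0.2, 0.1])   # generated images, scored low
print(discriminator_loss(d_real, d_fake), generator_loss(d_fake))
```

Training alternates between the two: the discriminator minimizes its loss while the generator minimizes its own, driving the generated distribution toward the real one.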

We find the adversarial training framework to be particularly relevant to the compression world. In traditional codecs, distortions often take the form of blurriness, pixelation, and so on. These artifacts are unappealing, and they become increasingly noticeable as the bitrate is lowered. We propose a multiscale adversarial training model to encourage reconstructions to match the statistics of their ground truth counterparts, resulting in sharp and visually pleasing results even at very low bitrates.

In our compression approach, we take the generator as the encoder-decoder pipeline, to which we append a discriminator. We specialize the classical GAN to the task of compression in several ways, such as introducing multiscale discrimination as well as joint analysis of the target and reconstruction (rather than labeling them independently).

Compression is only the first step

We see image and video compression as just a stepping stone in our mission. We are working on building the next generation of transmission of digital media. We are developing technology that has broad applications in multiple verticals, such as social media sharing, streaming, VR, autonomous driving, satellite communication, storage, drones, medical imaging, and surveillance.

We are a small and focused team looking for superstars to join us on this journey. We are looking for machine learners as well as high-performance engineers. Please reach out!

Related work

Ballé, Johannes, Laparra, Valero, and Simoncelli, Eero P. End-to-end optimized image compression. In International Conference on Learning Representations, 2017.

Johnston, Nick, Vincent, Damien, Minnen, David, Covell, Michele, Singh, Saurabh, Chinen, Troy, Hwang, Sung Jin, Shor, Joel, and Toderici, George. Improved lossy image compression with priming and spatially adaptive bit rates for recurrent networks. arXiv preprint, 2017.

Rabbani, Majid and Joshi, Rajan. An overview of the JPEG 2000 still image compression standard. In Signal processing: Image communication, 2002.

Theis, Lucas, Shi, Wenzhe, Cunningham, Andrew, and Huszar, Ferenc. Lossy image compression with compressive autoencoders. In International Conference on Learning Representations, 2017.

Toderici, George, O’Malley, Sean M, Hwang, Sung Jin, Vincent, Damien, Minnen, David, Baluja, Shumeet, Covell, Michele, and Sukthankar, Rahul. Variable rate image compression with recurrent neural networks. In International Conference on Learning Representations, 2016.

Toderici, George, Vincent, Damien, Johnston, Nick, Hwang, Sung Jin, Minnen, David, Shor, Joel, and Covell, Michele. Full resolution image compression with recurrent neural networks. arXiv preprint, 2016.

Wallace, Gregory K. The JPEG still picture compression standard. In IEEE transactions on consumer electronics, 1992.