Learned Video Compression
Video compression is a critically important problem: video content consumes more than 75% of all internet traffic, and video consumption on mobile doubles every 18 months. Improvements to compression lead to higher quality content, less buffering, and lower bandwidth and storage costs.
At WaveOne, we have been building a new algorithm for video compression, learned end-to-end using deep learning. We are excited to report some of our early results in an important mode of video compression used for real-time communication: the low-latency mode (simply meaning that the encoder has no access to information from future frames when encoding a frame).
To the best of our knowledge, ours is the first machine learning method to outperform all commercial video codecs in this setting. For instance, on typical SD videos, our codec achieves on average a 30-40% reduction in file size at the same quality compared to HEVC/H.265.
Here is an example frame from a popular HD 1080p video for compression benchmarking:
We compress this video to a rate of 0.06 bits per pixel using H.264, H.265, and our codec. Below we zoom in on the highlighted areas to visually compare the different codecs:
Video compression in a nutshell
The key idea behind video compression is that consecutive frames in a video are similar, so it is advantageous to compress the difference between each frame and the previous one. This is done in two steps:
Motion compensation: many differences between consecutive frames are due to objects moving (or the camera panning). So, for each pixel, it is useful to encode its motion vector: that is, where it was located in the previous frame relative to its current position. Given these motion vectors, we don’t need to encode the color of each pixel — we can simply copy it from the right location in the previous frame, obtaining what is known as the motion-compensated frame.
Residual compression: the difference between the motion-compensated frame and the original, called the residual, is often compressed using plain image compression techniques.
Thus, in existing codecs, the compressed representation of a frame consists of the motion vectors and the residual. The decoder applies the motion vectors to the previously decoded frame and adds the residual to reconstruct the current frame. The process is repeated for each frame. Since pixels of an object tend to move together in the same direction, the motion vectors can be represented compactly by grouping nearby pixels and specifying the motion for each group.
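As a rough sketch of the decoder side of the two steps above (plain numpy with nearest-neighbor motion and hypothetical per-pixel integer motion vectors, rather than a real codec's block structure and sub-pixel precision):

```python
import numpy as np

def motion_compensate(prev_frame, motion_vectors):
    """Build the motion-compensated frame: each output pixel copies the
    color from its displaced location in the previous frame
    (nearest-neighbor, clipped at the borders).
    motion_vectors[y, x] = (dy, dx) points back into prev_frame."""
    h, w = prev_frame.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_y = np.clip(ys + motion_vectors[..., 0], 0, h - 1)
    src_x = np.clip(xs + motion_vectors[..., 1], 0, w - 1)
    return prev_frame[src_y, src_x]

def decode_frame(prev_frame, motion_vectors, residual):
    # Reconstruction = motion-compensated prediction + decoded residual.
    return motion_compensate(prev_frame, motion_vectors) + residual
```

A real decoder repeats this frame after frame, each reconstruction becoming the `prev_frame` for the next step.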
This is a very high-level description of some of the core concepts. The modern codecs that we compare against include hundreds of algorithms and options that can be tuned.
Advantages of our approach
Here we give a high-level intuition of what makes our method work better than the traditional codecs. For a more in-depth technical description, please see our paper.
Our model is inspired by traditional video codecs — but generalizes concepts such as motion compensation and residual compression, and uses neural networks for all components. Our model is trained end-to-end on the task of video compression.
Prediction beyond translation. Traditional codecs are very efficient at predicting the next frame when there is simple constant movement, such as a car passing by. However, many common spatio-temporal patterns are not easy to describe as simple movement of pixels, such as an animal turning its head or a person walking. Deep learning methods for frame prediction can fit such patterns from data and do a better job of predicting them.
Powerful motion vector representation. To compress motion vectors, traditional codecs partition the frame into a hierarchy of blocks and specify the same motion vector for all pixels in a block, quantized to a particular precision (such as 1/4 of a pixel). Areas of uniform motion are represented by large blocks, whereas complex motion is represented by subdividing into smaller blocks, as shown on the left side below. While this representation compresses well, it cannot capture complex and fine motion effectively.
Our model has the flexibility of distributing the bandwidth so that areas that matter more have sophisticated motion boundaries and very precise motion vectors, whereas areas of less importance are represented very efficiently.
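To make the traditional block hierarchy concrete, here is a toy quadtree partitioner in the spirit of block-based codecs (the variance threshold, minimum block size, and recursion are illustrative choices, not any real standard's partitioning rules): a block is split while the motion vectors inside it are too diverse to be served by a single shared vector.

```python
import numpy as np

def quadtree_motion(mv_field, y, x, size, max_var=0.5, min_size=4):
    """Recursively split a block while its motion vectors are too diverse,
    mimicking how block-based codecs assign one vector per block.
    Returns a list of (y, x, size, mean_vector) leaves."""
    block = mv_field[y:y + size, x:x + size]
    if size <= min_size or block.reshape(-1, 2).var(axis=0).max() <= max_var:
        # Uniform enough (or smallest allowed size): one vector for the block.
        return [(y, x, size, block.reshape(-1, 2).mean(axis=0))]
    half = size // 2
    leaves = []
    for dy in (0, half):
        for dx in (0, half):
            leaves += quadtree_motion(mv_field, y + dy, x + dx, half,
                                      max_var, min_size)
    return leaves
```

A uniformly moving region collapses into one large leaf, while a patch of complex motion keeps subdividing down to the minimum block size — exactly the behavior that compresses well but cannot follow fine motion boundaries.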
Propagation of a learned state. Traditional codecs use the previous frame and motion vectors as prior knowledge to help encode the next frame. For example, since motion vectors don’t change much from frame to frame, it is common to encode only their difference. More sophisticated algorithms allow using multiple previously encoded frames, so certain regions can be copied from one frame and others from another. However, representing prior knowledge only in the form of color and motion is very limited. Other useful information includes changes in lighting, 3D structure, texture features, or parts of an object that are currently occluded. Our model can propagate arbitrary information that the algorithm learns is important, and it is free to retain any information from the distant past at an arbitrary level of precision.
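A minimal numpy sketch of the idea (all dimensions are placeholders and the random weights merely stand in for trained networks): the encoder, the decoder, and the state updater all condition on a freely learned state vector that is carried from frame to frame, rather than on pixels and motion alone.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration only.
FRAME_DIM, STATE_DIM, CODE_DIM = 64, 32, 8

# Random matrices stand in for networks trained end-to-end.
W_enc = rng.normal(size=(FRAME_DIM + STATE_DIM, CODE_DIM)) * 0.1
W_dec = rng.normal(size=(CODE_DIM + STATE_DIM, FRAME_DIM)) * 0.1
W_state = rng.normal(size=(STATE_DIM + FRAME_DIM, STATE_DIM)) * 0.1

def step(frame, state):
    """One codec step: encode the frame given the propagated state,
    decode it, then update the state from the reconstruction."""
    code = np.tanh(np.concatenate([frame, state]) @ W_enc)    # sent over the wire
    recon = np.concatenate([code, state]) @ W_dec             # decoder side
    state = np.tanh(np.concatenate([state, recon]) @ W_state) # carried forward
    return code, recon, state

state = np.zeros(STATE_DIM)
for _ in range(3):
    frame = rng.normal(size=FRAME_DIM)
    code, recon, state = step(frame, state)
```

The state is never transmitted — both encoder and decoder evolve it identically from the decoded frames, so it can accumulate whatever history the model finds useful.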
Joint compression of motion and residual. Spending more bandwidth on the motion representation brings the motion-compensated image closer to the original, so less bandwidth is needed for the residual, and vice versa. The optimal tradeoff between motion and residual is important and differs for each frame. Traditional codecs encode the two representations separately, which makes them hard to balance. Instead, we use a single information bottleneck and give our network fine control over the optimal tradeoff for each frame.
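A toy numerical illustration of the tradeoff (both curves are invented for illustration, not measured from any codec): giving more bits to motion shrinks the residual, but leaves fewer bits to actually code that residual, and the best split sits somewhere in between.

```python
def total_distortion(motion_bits, total_bits):
    """Toy model: better motion -> smaller residual energy, but the
    remaining bit budget for coding the residual also shrinks."""
    residual_bits = total_bits - motion_bits
    residual_energy = 1.0 / (1.0 + motion_bits)             # invented curve
    return residual_energy * 2.0 ** (-residual_bits / 8.0)  # exponential R-D decay

budget = 64
best = min(range(budget + 1), key=lambda m: total_distortion(m, budget))
```

Here the minimum lies strictly inside the range: neither spending nothing on motion nor spending everything on it is optimal, and the optimum shifts as the curves change from frame to frame — which is why a per-frame learned balance helps.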
Multi-flow representation. Consider a video of a train moving behind the branches of a tree. Such a scene is highly inefficient to represent with traditional codecs that use a single layer of motion vectors, as there are small occlusion patterns that break the flow. Our model can represent multiple flows. For example, it can choose one simple flow for horizontal movement of the train, another simple flow for the leaves, and a mask that defines which flow to use for each pixel.
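A sketch of the composition step for two flows (nearest-neighbor warping, hypothetical shapes; a trained model would also output a residual and soft mask values): each flow warps the previous frame independently, and a per-pixel mask selects between them.

```python
import numpy as np

def warp(frame, flow):
    """Nearest-neighbor backward warp: each output pixel copies the
    color at its flow-displaced location in `frame`."""
    h, w = frame.shape
    ys, xs = np.mgrid[0:h, 0:w]
    return frame[np.clip(ys + flow[..., 0], 0, h - 1),
                 np.clip(xs + flow[..., 1], 0, w - 1)]

def compose_flows(prev, flow_a, flow_b, mask):
    """Blend two warps with a per-pixel mask: mask=1 picks flow_a
    (e.g. the train), mask=0 picks flow_b (e.g. the leaves)."""
    return mask * warp(prev, flow_a) + (1 - mask) * warp(prev, flow_b)
```

Each individual flow stays simple and cheap to represent; the mask carries the occlusion pattern that would otherwise break a single-flow representation.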
Spatial rate control. In video compression it is very important to have a mechanism that decides which parts of the frame matter more and deserve more bandwidth. If we spend less bandwidth in a given area, not only will its quality be lower, but the error will also accumulate in subsequent frames and become even harder to recover from. While traditional codecs do have spatial rate control, it is more challenging for ML models based on auto-encoders, since different network architectures work better at different bitrates. We propose a method for ML-based spatial rate control that allows a single architecture to achieve the same performance one can get by tuning a separate architecture for each bitrate.
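As a simple illustration of the allocation idea (the importance map here is hand-made; in a learned codec it would itself be produced by the network), a per-block bit budget proportional to importance might look like:

```python
import numpy as np

def allocate_bits(importance, total_bits):
    """Spread a frame's bit budget across blocks in proportion to an
    importance map: important blocks get many bits, unimportant ones
    are coded coarsely."""
    weights = importance / importance.sum()
    return np.round(weights * total_bits).astype(int)
```

The hard part, which this sketch omits, is making a single auto-encoder degrade gracefully across the resulting range of per-block bitrates.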
For a detailed description of our algorithms, data and performance results, please refer to our paper.
Below is a quick summary of some of our results. We show our rate-distortion curves on the low-latency video compression task compared to other modern mainstream codecs. We test on popular video compression benchmarking datasets: the Xiph HD library of 22 1080p videos, and an SD dataset of 34 VGA videos from the Consumer Digital Video Library (CDVL).
We tested against H.264 and H.265 with the medium and slower presets, VP9, as well as HEVC HM 16.0. We disabled B-frames and used the -ssim option so that the baselines perform better on the SSIM metric. HEVC HM is the reference implementation of the H.265 standard, with all the bells and whistles included; we use its encoder_lowdelay_P_main.cfg profile for low latency. HM takes between 3 and 26 seconds to encode a single 640x480 frame. Our model takes, on average, 0.5 seconds to encode and 0.1 seconds to decode a 640x480 frame on an Nvidia Volta V100 GPU. This is not real-time yet, but we expect to optimize speed significantly in future work.
Please see our paper for more comprehensive results and a detailed description of our evaluation procedure.
Our vision of the future of video compression
Video compression is a very compute-intensive task. Hardware acceleration is critical to ensure that video can be played in real-time and in a battery-friendly way. Today every phone and many edge devices have a hardware chip implementing a video standard, such as H.264. Since the video compression algorithms are hard-coded into the chip, it is critical that the set of algorithms be commonly accepted by the community into a video coding standard. Changing the coding algorithms is very difficult as it requires agreement by a large number of entities, and it requires getting new hardware into every phone. For example, the timeline from H.264 to H.265 was about 10 years!
We at WaveOne believe that the world is in a local minimum. It is appealing to stick to standards, as they provide an open ecosystem and hardware acceleration to a huge number of devices. At the same time, they constrain us to using algorithms whose fundamentals have not changed significantly for 20 years.
We are convinced that future video codecs will be based on neural networks. Due to the popularity of neural networks, more and more companies are starting to provide hardware acceleration for them. There are architectures for deep learning acceleration in the latest iPhones and Android devices, in modern surveillance and security cameras, drones, autonomous cars, head-mounted displays and so on. Since our technology is based on neural networks, we can leverage the existing hardware acceleration, which opens the doors to efficient and battery-friendly implementations. Future video compression standards will be much simpler and allow for more flexibility, as the entire algorithm will be represented with a neural network, specified with a standard format, such as ONNX, and simply sent to the device.
Streaming the codec itself is a powerful concept. Imagine a world where your codec can improve after you have purchased your phone, taking into account the latest machine learning innovations. Imagine your home security camera adapting itself to the particular viewpoint and learning to ignore the cat while paying more attention to the potential intruder. Imagine streaming the Godfather movie by first sending the network parameters trained just for that movie, followed by the movie bits themselves. Imagine instructing a street camera to “focus on license plates” or even “focus on red cars” because of an AMBER Alert. Imagine autonomous vehicles capturing multi-sensor input in a much more compact representation and using fewer bits on shrubbery and more on pedestrians.
Today’s compression algorithms and quality metrics treat every pixel equally, and this is highly suboptimal. We are developing compression networks that understand the content they are compressing and can make more intelligent decisions — for example, to spend more bandwidth on the faces and less bandwidth on shrubbery. The optimal compression also depends on the intent. If the goal is for a human to enjoy a movie, we would like to increase the quality in areas in the video where we anticipate the human to look. If the goal is to do unconstrained face recognition, then faces in the background should be compressed at higher quality. In short, we believe future compression technology will be content-aware, task-aware, and recipient-aware.
Understanding the content of digital media with machine learning is becoming increasingly important, as it enables search, organization, filtering out objectionable content, deciding which content is appropriate to whom, and so on. Today, machine learning analysis is completely decoupled from the data transmission pipeline — which is highly suboptimal. We are developing technology to enable machine learning directly on the WaveOne compressed representation, which will enable faster, cheaper and more accurate machine learning.
This is just the beginning
In this work we show that it is possible to outperform the video compression standards by a wide margin using a machine learning-based approach. However, our solution currently focuses only on the low-latency mode, and it does not yet run in real time, nor does it outperform across the entire rate-distortion range. This may be just the first wave, but there are many more to come. We believe there is a bright future for intelligent media representation and we are excited to be part of it.
We are hiring!
Join us in building the next generation of compression technology!