Learned Video Compression
Video compression is a critically important problem: video content accounts for more than 75% of all internet traffic, and video consumption on mobile doubles every 18 months. Improvements to compression lead to higher-quality content, less buffering, and lower bandwidth and storage costs.
At WaveOne, we have been thinking about the future of video transmission. We have been building a new algorithm for video compression, learned end-to-end using deep learning. We are excited to report some of our early results in an important mode of video compression used for real-time communication: the low-latency mode (simply meaning that the encoder has no access to information from future frames when encoding a frame).
To the best of our knowledge, ours is the first machine learning method to outperform all commercial video codecs in this setting. For instance, on typical SD videos, our codec achieves on average a 30% to 40% reduction in file size over HEVC/H.265 at the same quality.
Here is an example frame from a popular HD 1080p video for compression benchmarking:
We compress this video to a rate of 0.06 bits per pixel using H.264, H.265, and our codec. Below we zoom in on the highlighted areas to visually compare the different codecs:
Video compression in a nutshell
The key idea behind video compression is that consecutive frames in video are similar, and so it is advantageous to compress the difference between each frame and the previous one. This is done in two steps:
Motion compensation: many differences between consecutive frames are due to objects moving (or the camera panning). So it is useful to encode, for each pixel, its motion vector: that is, where the pixel was located in the previous frame relative to its current position. Given these motion vectors, we don’t need to encode the color of each pixel; we can simply copy it from the right location in the previous frame, obtaining what is known as the motion-compensated frame.
Residual compression: the difference between the motion-compensated frame and the original, called the residual, is often compressed using plain image compression techniques.
Thus, in existing codecs, the compressed representation of a frame consists of the motion vectors and the residual. The decoder applies the motion vectors to the previously decoded frame and adds the residual to reconstruct the current frame. The process is repeated for each frame. Since pixels of an object tend to move together in the same direction, the motion vectors can be represented compactly by grouping nearby pixels and specifying the motion for each group.
This is a very high-level description of some of the core concepts. The modern codecs that we compare against include hundreds of algorithms and options that can be tuned.
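The decoding step described above (apply motion vectors to the previous decoded frame, then add the residual) can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation of any real codec: the function names `motion_compensate` and `decode_frame`, the fixed block size, the integer-only motion vectors, and the border clamping are all assumptions made for the example.

```python
import numpy as np

def motion_compensate(prev_frame, block_mvs, block_size):
    """Build the motion-compensated frame by copying blocks from prev_frame.

    block_mvs[by, bx] = (dy, dx) is the integer offset from a block in the
    current frame to its source location in the previous frame.
    Assumes the frame height and width are multiples of block_size.
    """
    height, width = prev_frame.shape
    predicted = np.empty_like(prev_frame)
    for by in range(height // block_size):
        for bx in range(width // block_size):
            dy, dx = block_mvs[by, bx]
            y0, x0 = by * block_size, bx * block_size
            # Clamp the source block so it stays inside the previous frame
            # (an assumption of this sketch; real codecs pad the borders).
            sy = int(np.clip(y0 + dy, 0, height - block_size))
            sx = int(np.clip(x0 + dx, 0, width - block_size))
            predicted[y0:y0 + block_size, x0:x0 + block_size] = \
                prev_frame[sy:sy + block_size, sx:sx + block_size]
    return predicted

def decode_frame(prev_decoded, block_mvs, residual, block_size):
    """Reconstruct the current frame: motion-compensated frame plus residual."""
    return motion_compensate(prev_decoded, block_mvs, block_size) + residual

# Toy example: the content of an 8x8 frame shifts right by one pixel.
prev = np.arange(64, dtype=np.int32).reshape(8, 8)
cur = np.roll(prev, shift=1, axis=1)

# One motion vector per 4x4 block: the source lies one pixel to the left.
mvs = np.zeros((2, 2, 2), dtype=np.int32)
mvs[..., 1] = -1

# Encoder side: the residual is whatever motion compensation fails to predict.
residual = cur - motion_compensate(prev, mvs, 4)

# Decoder side: previous frame + motion vectors + residual give the frame back.
decoded = decode_frame(prev, mvs, residual, 4)
```

In this toy case the residual is stored losslessly, so `decoded` equals `cur` exactly. The point of the exercise is that the residual has fewer non-zero entries than the plain frame difference `cur - prev`, which is what makes it cheaper to compress.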
Advantages of our approach
Here we give a high-level intuition of what makes our method work better than the traditional codecs. For a more in-depth technical description, please see our paper.
Our model is inspired by traditional video codecs, but it generalizes concepts such as motion compensation and residual compression, and uses neural networks for all components. Our model is trained end-to-end on the task of video compression.
Prediction beyond translation. Traditional codecs are very efficient at predicting the next frame when there is a simple constant movement, such as a car passing by. However, there are many common spatio-temporal patterns that are not easy to describe with simple movement of pixels, such as an animal turning its head or a person walking. Deep learning methods for frame prediction are able to fit such patterns from the data and do a better job of predicting them.
Powerful motion vector representation. To compress motion vectors, traditional codecs partition the frame into a hierarchy of blocks and specify the same motion vector for all pixels in a block, quantized to a particular precision (such as 1/4 of a pixel). Areas of uniform motion are represented by large blocks, whereas complex motion is represented by subdividing into smaller blocks, as shown on the left side below. While this representation compresses well, it cannot capture complex and fine motion effectively.
Our model has the flexibility of distributing the bandwidth so that areas that matter more have sophisticated motion boundaries and very precise motion vectors, whereas areas of less importance are represented very efficiently.
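The hierarchical block representation used by traditional codecs can be illustrated with a small quadtree sketch. This is a simplification under stated assumptions: the helper names `quantize_mv` and `quadtree_blocks` are hypothetical, and the split rule (subdivide while the motion inside a block is not nearly uniform) stands in for the rate-distortion search a real encoder would run.

```python
import numpy as np

def quantize_mv(mv, step=0.25):
    """Round a motion vector to the nearest multiple of `step` pixels."""
    return np.round(np.asarray(mv, dtype=np.float64) / step) * step

def quadtree_blocks(flow, y, x, size, min_size=4, tol=0.125):
    """Recursively split a square region until its motion is nearly uniform.

    flow has shape (H, W, 2) and holds one (dy, dx) vector per pixel.
    Returns a list of (y, x, size, quantized_mv) tuples, one per leaf block.
    """
    region = flow[y:y + size, x:x + size].reshape(-1, 2)
    mean_mv = region.mean(axis=0)
    spread = np.abs(region - mean_mv).max()
    if spread <= tol or size <= min_size:
        # One quantized vector represents every pixel in this block.
        return [(y, x, size, quantize_mv(mean_mv))]
    half = size // 2
    blocks = []
    for oy in (0, half):
        for ox in (0, half):
            blocks += quadtree_blocks(flow, y + oy, x + ox, half, min_size, tol)
    return blocks

# Toy motion field: mostly static, one 8x8 region moving uniformly,
# and one small 4x4 patch with its own motion.
flow = np.zeros((16, 16, 2))
flow[:8, :8] = (0.0, 1.3)
flow[12:16, 12:16] = (2.0, 0.0)

blocks = quadtree_blocks(flow, 0, 0, 16)
```

Uniform regions end up as single large blocks, while the quadrant containing the small moving patch is subdivided further. The quarter-pixel rounding also shows the precision limit mentioned above: the true motion of 1.3 pixels is stored as 1.25.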
Propagation of a learned state. Traditional codecs use the previous frame and motion vectors as “prior knowledge” to help encode the next frame. For example, since motion vectors don’t change too much from frame to frame, it is common to encode their difference. More sophisticated algorithms allow for using multiple previously encoded frames, so we can copy one region from one frame and another region from a different frame. However, representing prior knowledge only in the form of color and motion is very limited. Other useful information includes changes in lighting, 3D structure, texture features, or parts of an object that are currently occluded. Our model allows us to propagate arbitrary information that the algorithm learns is important. It also has the freedom to retain any information from the distant past at an arbitrary level of precision.
Joint compression of motion and residual. Spending more bandwidth on motion gives a better motion representation, so the motion-compensated image is closer to the original and the residual needs less bandwidth, and vice versa. The optimal tradeoff between motion and residual is important and is different for each frame. Traditional codecs encode these two representations separately, which makes it hard to balance them. Instead, we use a single information bottleneck and allow our network fine control over the optimal tradeoff for each frame.
Multi-flow representation. Consider a video of a train moving behind the branches of a tree. Such a scene is highly inefficient to represent with traditional codecs that use a single layer of motion vectors, as there are small occlusion patterns that break the flow. Our model can represent multiple flows. For example, it can choose one simple flow for horizontal movement of the train, another simple flow for the leaves, and a mask that defines which flow to use for each pixel.
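The train-behind-branches example can be mimicked with a toy NumPy sketch of two flows and a per-pixel mask. Everything here is an assumption made for illustration: the helpers `warp` and `multi_flow_predict` are hypothetical, each flow is a single global integer translation rather than a dense learned field, and the mask is given by hand rather than learned.

```python
import numpy as np

def warp(frame, dy, dx):
    """Shift frame content by (dy, dx) pixels; np.roll wraps at the borders."""
    return np.roll(frame, shift=(dy, dx), axis=(0, 1))

def multi_flow_predict(prev_frame, flows, mask):
    """Per pixel, keep the prediction made by the flow that `mask` selects.

    flows is a list of (dy, dx) translations; mask holds integer flow indices.
    """
    candidates = np.stack([warp(prev_frame, dy, dx) for dy, dx in flows])
    return np.take_along_axis(candidates, mask[None], axis=0)[0]

# Background "train" texture that moves right by 2 pixels per frame.
train = np.arange(48).reshape(4, 12) + 100
# Foreground "branches": two static one-pixel-wide columns with value 0.
branch = np.zeros((4, 12), dtype=bool)
branch[:, [3, 7]] = True

prev = np.where(branch, 0, train)
cur = np.where(branch, 0, np.roll(train, 2, axis=1))

# Flow 0 follows the train, flow 1 keeps the static branches in place.
flows = [(0, 2), (0, 0)]
mask = branch.astype(np.int64)

single = warp(prev, 0, 2)                         # one flow for every pixel
multi = multi_flow_predict(prev, flows, mask)     # two flows plus a mask

single_errors = int(np.count_nonzero(single != cur))
multi_errors = int(np.count_nonzero(multi != cur))
```

With a single flow, the branches are dragged along with the train, so both their old and new positions are mispredicted. With two flows and a mask, the only remaining errors are the train pixels that were hidden behind a branch in the previous frame, which no amount of copying can recover and which the residual has to supply.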
Spatial rate control. In video compression it is very important to have a mechanism that decides which parts of the frame are more important and deserve more bandwidth. If we spend less bandwidth in a given area, not only will the quality be lower, but the error will also accumulate in subsequent frames and become even harder to recover from. While traditional codecs do have spatial rate control, it is more challenging to achieve in ML models based on auto-encoders, since different network architectures work better for different bitrates. We propose a method for ML-based spatial rate control that allows a single architecture to achieve the same performance that one can get by tuning separate architectures for each bitrate.
For a detailed description of our algorithms, data and performance results, please refer to our paper.
Below is a quick summary of some of our results. We show our rate-distortion curves on the low-latency video compression task compared to other modern mainstream codecs. We test on popular video compression benchmarking datasets: the Xiph HD library of 22 1080p videos, and an SD dataset of 34 VGA videos from the Consumer Digital Video Library (CDVL).
We tested against H.264 and H.265 with medium and slower presets, VP9, as well as HEVC HM 16.0. We disabled B-frames and used the -ssim option to tune the baselines for the SSIM metric. HEVC HM is the reference implementation of the H.265 standard, with all the bells and whistles included. We use its encoder_lowdelay_P_main.cfg profile for low latency. HM takes between 3 and 26 seconds to encode a single 640x480 frame. Our model takes, on average, 0.5 seconds to encode and 0.1 seconds to decode a 640x480 frame on an Nvidia Volta V100 GPU. This speed isn’t real-time just yet, but we plan to optimize it significantly in future work.
Please see our paper for more comprehensive results and a detailed description of our evaluation procedure.