Our vision of the future of video compression

Video compression is a very compute-intensive task. Hardware acceleration is critical to ensure that video can be played in real time and in a battery-friendly way. Today every phone and many edge devices carry a hardware chip implementing a video standard such as H.264. Since the compression algorithms are hard-coded into the chip, the set of algorithms must be commonly accepted by the community as a video coding standard. Changing the coding algorithms is very difficult: it requires agreement among a large number of entities, and it requires getting new hardware into every phone. For example, the transition from H.264 to H.265 took about 10 years!

We at WaveOne believe that the world is in a local minimum. It is appealing to stick to standards, as they provide an open ecosystem and hardware acceleration to a huge number of devices. At the same time, they constrain us to using algorithms whose fundamentals have not changed significantly for 20 years.

We are convinced that future video codecs will be based on neural networks. As neural networks grow in popularity, more and more companies are providing hardware acceleration for them: deep learning accelerators now ship in the latest iPhones and Android devices, in modern surveillance and security cameras, drones, autonomous cars, head-mounted displays, and so on. Since our technology is based on neural networks, we can leverage this existing hardware acceleration, which opens the door to efficient and battery-friendly implementations. Future video compression standards will be much simpler and more flexible, as the entire algorithm will be represented as a neural network, specified in a standard format such as ONNX, and simply sent to the device.
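To make the idea concrete, here is a minimal sketch of packaging a learned decoder for a device using PyTorch's ONNX export. The DecoderNet architecture below is a hypothetical stand-in for illustration, not our actual network:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a learned video decoder; a real
# architecture would be far more elaborate.
class DecoderNet(nn.Module):
    def __init__(self, latent_channels=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(latent_channels, 32, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.ConvTranspose2d(32, 3, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, latents):
        return self.net(latents)

decoder = DecoderNet()
dummy_latents = torch.randn(1, 64, 32, 32)  # example compressed representation

# Serialize the decoder to a standard format that any device with an
# ONNX-compatible runtime and accelerator can execute.
torch.onnx.export(decoder, dummy_latents, "decoder.onnx",
                  input_names=["latents"], output_names=["frame"])
```

In this world, upgrading the codec amounts to sending a new file.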

Streaming the codec itself is a powerful concept. Imagine a world where your codec can improve after you have purchased your phone, taking into account the latest machine learning innovations. Imagine your home security camera adapting itself to its particular viewpoint and learning to ignore the cat while paying more attention to the potential intruder. Imagine streaming The Godfather by first sending the network parameters trained just for that movie, followed by the movie bits themselves. Imagine instructing a street camera to “focus on license plates” or even “focus on red cars” because of an AMBER Alert. Imagine autonomous vehicles capturing multi-sensor input in a much more compact representation and using fewer bits on shrubbery and more on pedestrians.

Today’s compression algorithms and quality metrics treat every pixel equally, and this is highly suboptimal. We are developing compression networks that understand the content they are compressing and can make more intelligent decisions: for example, spending more bandwidth on faces and less on shrubbery. The optimal compression also depends on the intent. If the goal is for a human to enjoy a movie, we would like to increase the quality in the areas of the video where we anticipate the viewer will look. If the goal is unconstrained face recognition, then faces in the background should be compressed at higher quality. In short, we believe future compression technology will be content-aware, task-aware, and recipient-aware.
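One minimal sketch of this idea, assuming a PyTorch training setup and an importance map supplied by some detector (a face detector, say), is a spatially weighted reconstruction loss. The function below is illustrative, not our production loss:

```python
import torch

def weighted_reconstruction_loss(original, reconstructed, importance):
    """Per-pixel MSE weighted by an importance map.

    original, reconstructed: (N, 3, H, W) frames in [0, 1]
    importance: (N, 1, H, W) weights, e.g. high on detected faces,
                low on background texture such as shrubbery.
    """
    per_pixel = (original - reconstructed) ** 2        # (N, 3, H, W)
    per_pixel = per_pixel.mean(dim=1, keepdim=True)    # average over channels
    # Normalize so the total weight is comparable to unweighted MSE.
    weights = importance / (importance.mean() + 1e-8)
    return (weights * per_pixel).mean()
```

Trained jointly with a rate term, a network under such a loss learns to allocate bits according to the map, and swapping in a different map (license plates, pedestrians) changes the task without changing the architecture.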

Understanding the content of digital media with machine learning is becoming increasingly important, as it enables search, organization, filtering out objectionable content, deciding which content is appropriate for whom, and so on. Today, machine learning analysis is completely decoupled from the data transmission pipeline, which is highly suboptimal. We are developing technology to enable machine learning directly on the WaveOne compressed representation, making machine learning faster, cheaper, and more accurate.
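As a schematic illustration (a sketch under assumed tensor shapes, not our actual pipeline), a recognition model can consume the codec's latent tensor directly, skipping the decode-to-pixels step entirely:

```python
import torch
import torch.nn as nn

# Hypothetical classifier that operates on the codec's latent tensor
# rather than on decoded pixels, avoiding the expensive decode step.
class LatentClassifier(nn.Module):
    def __init__(self, latent_channels=64, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(latent_channels, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # global pooling over the latent grid
        )
        self.head = nn.Linear(128, num_classes)

    def forward(self, latents):  # latents: (N, 64, H/16, W/16)
        x = self.features(latents).flatten(1)
        return self.head(x)
```

Because the latent grid is spatially much smaller than the decoded frame, inference touches far fewer activations than a pixel-space model would, which is where the speed and cost advantages come from.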