75 percent of mobile traffic today is video. But with 5G, half of all video sent over this next generation network will never be seen by human eyes. Nokia’s Ville-Veikko Mattila says the next industrial revolution is underway, and video will play a huge role in our future even if only the machines know it. 

Plus: solving the “cocktail party effect” for your next corporate conference call.

Below is a transcript of this conversation. Some parts have been edited for clarity.

Michael Hainsworth: The kids are watching TikTok. Their parents are binge-watching Netflix. And content creators are live streaming from wherever the action is. But increasingly, the action will be behind the scenes. As Industry 4.0 eclipses consumer content, fleets of vehicles will stream every moment of a journey to help improve the driving experience, assembly lines will keep a watchful eye out for defects, and smart city technologies will keep us safe from everything from crime to the next novel coronavirus. But getting there requires new technologies: the technologies Ville-Veikko Mattila is developing today as the Head of Research and Standardization at Nokia. We began our conversation by talking about how, today, 75 percent of mobile traffic is video.

Ville-Veikko Mattila: Yeah, so video definitely seems to be everywhere today. We all are using our mobile phones to consume video anywhere, anytime. So the consumption patterns are really kind of evolving a lot. And yes, so three-quarters of the mobile traffic is video today. And then if you think about the IP traffic on the internet, more than 80 percent of that traffic is video. So it’s really dominating.

MH: And for those of us who aren’t geeks, we need to understand that you can’t just pump video out like that, willy nilly. You need to have an Encoder-Decoder, a codec technology to compress that video so that it can be consumed on the other end at a reasonable pace.

Live from downtown Los Angeles. It’s the 72nd Emmy Awards. Please welcome your host, Jimmy Kimmel.

MH: I had no idea Nokia has won four Emmy Awards, none of them for soap operas or late night dramas, but for technology that’s been incorporated into 2 billion devices around the world.

Hello and welcome to the pandemies. Wow. It’s great to finally see people again. Thank you for risking everything to be here. Thank me for risking everything to be here. You know what they say? You can’t have a virus without a host.

VVM: Video compression is definitely something very important. To give you a kind of reference, take 4K video (ultra high definition video), which we can all enjoy through our favorite video-on-demand streaming services. The raw data for 4K video today is about six gigabits per second. And if you think about home connections today, personally I have a 100 megabit connection. So it would be totally impossible to receive raw 4K video at home. And that’s why we need a lot of compression in order to be able to distribute video to people, to consumers, to homes and also then for industry applications, or virtual reality video or even augmented reality applications in the future. Those will definitely require even higher bandwidth than 4K television or streaming today.
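
The figures above can be checked with a little arithmetic. The sketch below is only an illustration: the conversation gives the roughly six gigabit per second raw rate and the 100 megabit home connection, while the frame size, the 60 frames per second, and the 12 bits per pixel (8-bit video with 4:2:0 chroma subsampling) are assumptions chosen because they land close to that figure.

```python
def raw_bitrate_bps(width, height, fps, bits_per_pixel):
    """Uncompressed video bitrate in bits per second."""
    return width * height * fps * bits_per_pixel

# Assumed format (not stated in the conversation): 3840x2160, 60 fps,
# 12 bits per pixel, i.e. 8-bit samples with 4:2:0 chroma subsampling.
raw_bps = raw_bitrate_bps(3840, 2160, 60, 12)

# The 100 megabit home connection mentioned in the conversation.
link_bps = 100_000_000

# Minimum compression ratio needed just to fit the raw stream into the link.
min_ratio = raw_bps / link_bps

print(f"raw 4K: {raw_bps / 1e9:.2f} Gbit/s")      # raw 4K: 5.97 Gbit/s
print(f"needed compression: {min_ratio:.1f}:1")   # needed compression: 59.7:1
```

Under these assumptions the stream has to shrink by a factor of about 60 simply to saturate the link, and a practical service would need considerably more headroom than that.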

MH: You say that the future of video is built on five pillars that all tie into each other: interactivity, cloudification, machine-to-machine communications, intelligence and immersion. Let’s start with immersion. Immersion needs ultra low latency, so we’re not pulled out of that moment. We’re still immersed in whatever it is we’re consuming. I can imagine that requires an even newer codec than the ones that you’ve worked on with MPEG.

VVM: Immersion is, of course, very important because immersion [is creating] a kind of feeling of presence, that feeling of being present in the moment. That when you are watching a movie it becomes a kind of immersive experience, and that can be achieved, for example, by ultra high definition or high resolution, or by high dynamic range for deeper colors. So it’s very much linked to the picture quality, the video quality. But of course it’s not the only aspect that then drives immersion. 

You also mentioned interactivity, and being able to interact with the content also drives immersion. We can take, for example, virtual reality, where we have this kind of rotational freedom. We can move our head to experience this large screen display of our content. And that makes it interactive. Of course, then think about the future experiences and these emerging media experiences like augmented reality. There really is that challenge: can we really capture reality in three dimensions and distribute such volumetric representations to the user, which then will enable this kind of free viewpoint experience? So not only rotating your head, but you can even walk around the content, or you can move into the content. And of course, this three dimensional aspect is very important because our reality is three dimensional. And if you want to augment that with some information, some data, of course that also needs to be three dimensional content, in order to match it well.

MH: So you’re talking AR and VR. I was just thinking when it comes to immersion and low latency in technologies, just gaming in general, Google Stadia and things like that, but to your point, we’re entering a whole new world of immersion with devices that will soon be no heavier than a pair of eyeglasses. So augmenting the world around us is going to require not only a low latency environment, but a new codec to take advantage of that.

VVM: Yeah, if we think about codecs, the conventional codecs for 2D video, like for example 4K video today, these codecs will also stay fundamental for these new emerging content experiences like augmented reality, where I mentioned this volumetric video, so the video becoming three dimensional. So still, these conventional codecs are there at the bottom. They are the fundamental enablers even when we consider these new content experiences. But of course, we then need to build quite a lot on top of them in order to enable this new content type like the three dimensional content.

MH: So this new video codec you’re working on, VVC H.266. That seems to help overcome a lot of barriers to the use of the technology, but what are the barriers to AR/VR adoption if that’s not bandwidth?

VVM: If you think about virtual reality and augmented reality, one challenge today is, of course, content creation. If you think about this volumetric experience, the example I gave, today creating such content may perhaps only happen in studios, where you have multiple cameras in place, and perhaps also this kind of green screen background. So being easily able to segment, for example, the object from the background and capturing the object from all angles, all around the object, in order to create 3D content. But then the key question is, okay, how then can consumers create such content? This is one of the current challenges. We had to beat a similar challenge when it came to omnidirectional video, or VR video, in the past. And of course, even at Nokia, we’ve been pioneering this VR camera for professional use in the past. But today, there are really multiple options for consumers to have virtual reality cameras and capture everything around them.

MH: I can imagine we’re already at the edge of that. My new smartphone has LiDAR built into it which gives it the ability to recognize the world in a 3D space.

VVM: Yeah. I think that some of the new phones do have this new sensor, this LiDAR, which is a kind of laser scanning sensor which can then help you to map your environment, the geometry of your environment. And that is exactly about this 3D I mentioned. That can actually help consumers create 3D content in the future. I’m sure that multimodal sensors will be needed to create such content. So this LiDAR sensor, plus of course then the camera sensors themselves.

MH: But interactivity requires the cloudification of video since the smartphones, as powerful as they are still need the heavy lifting to be done in the cloud.

VVM: That’s true because if you think about these new emerging content experiences, like virtual reality and augmented reality, the content itself is becoming much more complex. Because we may have, for example, multiple cameras in place to record the content, and then syncing the content as one video. So you may have multiple video tracks. And in addition to, let’s say, texture, we may also be interested in capturing depth information, which then also relates to this LiDAR scanning that we talked about: being able to estimate the geometry of the object, the 3D shape of the object. We need multiple tracks of all of this information. So textures, depth, and therefore the content is definitely becoming much more complex.

And of course, if you think about today’s modern smartphones, they are very, very powerful devices. But we then may also need the cloud, because the media processing is also becoming more complex, more demanding because the content is becoming more complex. In the end, we may also need system equipment, which is great because then you don’t necessarily need to have the latest and most expensive smartphone here in your hand, but the cloud can basically give you the experience on any connected device.

MH: How close does that cloud need to be? We’ve got the cloud, which we generally recognize now: you know, my phone will upload all my photos to the cloud in case my device is lost or stolen. That’s just a general data center somewhere, anywhere; it doesn’t matter where.

VVM: Yeah.

MH: Then you’ve got the edge cloud, which is much closer to where I physically am at any given point in time. But then you’ve got the near edge cloud which is right there close by for specific types of technologies. Where do we need to be in that cloud world when we’re talking about this interactivity?

VVM: It then depends really on the application. Take cloud gaming, as an example, which is a very fascinating new way to play games. The cloud itself renders all the views of the game, and then those views are encoded as a video and streamed to connected devices as video. So in a way, while you are playing the game you are watching a video, a highly interactive video. And when it’s about the gameplay, the frame rates actually need to be very high, perhaps something like 60 frames per second, at a minimum. And for that, we definitely need these low-latency operations. That requires that the network is very close to you, and the computing power of the network then also needs to be very close to you. And this is one of the great promises of 5G: the cell sizes are, of course, getting smaller, and the computation power is very close to you. That then enables this low-latency communication, and the response time is getting very short.
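
To see why 60 frames per second pushes the compute close to the player, it helps to turn the frame rate into a time budget. This is a minimal sketch: only the 60 frames per second figure comes from the conversation, and the function name and the 30 frames per second comparison are illustrative.

```python
def frame_budget_ms(fps):
    """Time between consecutive frames, in milliseconds."""
    return 1000.0 / fps

budget_60 = frame_budget_ms(60)  # the minimum rate mentioned in the conversation
budget_30 = frame_budget_ms(30)  # shown only for comparison

print(f"60 fps: {budget_60:.1f} ms per frame")  # 60 fps: 16.7 ms per frame
print(f"30 fps: {budget_30:.1f} ms per frame")  # 30 fps: 33.3 ms per frame
```

At 60 frames per second a new frame is due roughly every 17 milliseconds, so every millisecond of network round trip is a noticeable share of that interval.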

MH: We’ve been talking about video that we consume as humans, but I’m fascinated by your prediction that half of global video traffic will be something that humans won’t ever see. This will be about the Internet of Things, Industry 4.0, and machines are going to be consuming half of what we generate.

VVM: Exactly, so it will be quite different because if you are going to talk about video compression, today we optimize the video for humans and that’s for perceptual optimization. So really, optimizing the image and video quality to be higher resolution or have deeper colors and these kinds of things. 

But then when it comes to machines, it’s a totally different story. It’s about training machines to follow an object moving into view, or to recognize an object. And these things are quite different from perceptual optimization. All of this relates to IoT, which is really increasing. We are talking about applications like smart city surveillance or autonomous driving or industrial automation. In these cases, they are distributed applications. So we have IoT sensors, IoT devices, and then the question is how we encode the video from these devices in the most efficient manner so that we can then perform the computer vision and the media analytics, the video analytics, in the cloud. That may be at the edge cloud, if the application is time-critical, or deeper in a kind of centralized cloud, if it doesn’t require this kind of very low latency operation. But for these kinds of distributed applications, we really need to have efficient means to deliver the video over the communication network for computer vision analysis.

MH: Intel’s Chief Technology Officer told me the autonomous car will consume six terabytes of data every single day. First of all, does that jibe with your view?

VVM: That sounds very exciting. And yes, this is about the future of autonomous driving. You need multiple cameras in order to really sense and analyze your environment. And of course, there’s also time critical operations for road safety and things like that. And it’s then about not only communicating the videos to the cloud but it’s also the communication between the vehicles so that they cooperate.

MH: But also in addition to the vehicle-to-vehicle communication, just the idea that you’re streaming that amount of data to an edge cloud in a vehicle that could be doing 100 kilometers an hour. It’s gonna be hitting cell site after cell site after cell site. You need ultra low latency to be able to pass from one cell to the other at 100 kilometers an hour. And I can imagine even faster if we’re talking about autonomous trains and planes and things of that nature too. So the codec comes into play again.

VVM: Yes. And then of course, 5G networks too. The 3GPP is standardizing 5G in phases. Release 15 is exactly about enhanced mobile broadband, so the high bandwidth. But then there are Releases 16 and 17, which are taking place this year and next year. Those are really about Industry 4.0, where the key thing is low latency, this one millisecond promise of latency, and high reliability of the network. So we are talking about these six nines, which means that 99.9999% of the time, the network and the machines should operate correctly. So it’s a very strict requirement indeed.
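
To put the six nines figure in perspective, the availability percentage can be converted into the downtime it allows. This is a rough sketch: the 99.9999% figure comes from the conversation, while the helper name and the choice of a 365-day year are assumptions made for illustration.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumes a 365-day year

def allowed_downtime_seconds(nines, period_seconds=SECONDS_PER_YEAR):
    """Maximum downtime permitted by an availability of `nines` nines."""
    unavailability = 10.0 ** -nines  # six nines -> 0.000001
    return unavailability * period_seconds

six_nines = allowed_downtime_seconds(6)
three_nines = allowed_downtime_seconds(3)  # 99.9%, shown for comparison

print(f"six nines:   {six_nines:.1f} s per year")           # six nines:   31.5 s per year
print(f"three nines: {three_nines / 3600:.2f} h per year")  # three nines: 8.76 h per year
```

Six nines leaves roughly half a minute of failure per year, compared with almost nine hours at three nines.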

MH: So let’s then tie all these pillars together with intelligence, because I know you’re working on a codec for machines that not only compresses the video, but embeds additional data into it. So for example, today, you know that data is location, it’s date metadata. It may be color metadata, but in the future algorithms will extract visual features like faces, assembly line defects, even before the video is received on the other end?

VVM: Yeah, exactly. The number of devices is skyrocketing. And also, more and more video is then deployed on these connections. If we then think about intelligence, how we apply artificial intelligence to analyze those videos, that relates to deep neural networks. And if you think about deep neural networks, their size actually can be very large. Basically meaning that there are a lot of weights in those networks that then take a lot of storage, which then sets very strict requirements for computation capacity and memory, and therefore limitations on which devices we can actually run these artificial intelligence models. And that’s why we need compression. So compressing these neural networks, so that we can also run them on resource-constrained devices like surveillance cameras, for example. And these methods to compress neural networks are quite exciting. So we are talking about methods of identifying any weights or any nodes that we can remove to make the overall model simpler but still maintain its performance. And these are very exciting technologies. Running these models on resource-constrained devices is one important aspect here, but then the second impressive aspect is how we can deliver these models over the network to the IoT devices, for example, to millions of vehicles in order to, let’s say, update those models on those vehicles. Basically meaning updating their intelligence over the air. And again, here, compression is needed in order to efficiently deliver these models over the air to the IoT devices.
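
The idea of removing weights while keeping performance can be illustrated with magnitude pruning, one common and simple approach. The conversation does not say which methods Nokia uses, so this is an assumed stand-in, and the function name, the tiny example layer, and the 50% keep fraction are all hypothetical.

```python
def prune_by_magnitude(weights, keep_fraction):
    """Zero out the smallest-magnitude weights in a 2D weight matrix.

    Keeps roughly `keep_fraction` of the weights (ties at the threshold
    are all kept). This is magnitude pruning, used here only as an
    illustrative stand-in for the methods mentioned in the conversation.
    """
    magnitudes = sorted((abs(w) for row in weights for w in row), reverse=True)
    keep_count = max(1, int(len(magnitudes) * keep_fraction))
    threshold = magnitudes[keep_count - 1]
    return [[w if abs(w) >= threshold else 0.0 for w in row] for row in weights]

# A tiny, made-up layer with six weights.
layer = [[0.9, -0.02, 0.4],
         [0.01, -0.7, 0.05]]

pruned = prune_by_magnitude(layer, keep_fraction=0.5)
zeros = sum(1 for row in pruned for w in row if w == 0.0)

print(pruned)  # [[0.9, 0.0, 0.4], [0.0, -0.7, 0.0]]
print(zeros)   # 3
```

The zeroed entries can be stored sparsely, which shrinks the model for both on-device memory and over-the-air delivery. In practice a pruned model is usually re-evaluated, and often fine-tuned, to confirm that accuracy has held up.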

MH: And so with that, we have all these technologies coming together that rely upon each other. AI powered machine vision is limited if we don’t have codecs for compression, if we don’t have 5G to get the information from point A to point B. And if we don’t have that edge cloud technology, it all must work in concert.

VVM: Exactly, so that is one important thing. Standardization is about collaborating with other companies. We need to agree on the technical specifications of the technology so that we can reach this interoperability between devices and the various services. But often we need multiple standards. We need to develop multiple standards in order to enable solutions. We and our standardization partners have received several Emmy Awards in relation to our standardization work for video codecs. But this year, the latest Emmy actually came, again, to standardization, not to video coding, but recognizing our work on media formats. And I was personally very happy about that Emmy Award, because it recognizes the need to have multiple standards in place in order to enable solutions, for example, for video streaming. Compression alone is not enough. We also need the transport for video to be transmitted to your home. But in between compression and transport, we also need the media format, an agreed format that then carries the compressed video through transport to your video service.

MH: And we haven’t even gotten into one other aspect of your day job now that involves these new codecs for video. That actually is not about video, but it is about immersion. 5G audio could be an entire separate podcast conversation because you’re working on solving the cocktail party effect.

VVM: Exactly. Yes. So we are working on a standard for immersive voice and audio services, and this is happening in 3GPP. The 3rd Generation Partnership Project, that standards-defining organization, is really working on technologies and standards for communication purposes. This new voice and audio codec is then really bringing a new element to mobile communication into 5G, which is spatial audio. We are perhaps more familiar with spatial audio when it comes to home entertainment or our kind of movie experiences, movie theaters.

MH: Oh right. My 5.1 surround sound system.

VVM: Exactly, exactly. That’s a good example of spatial audio. But now we are bringing the spatial audio out from entertainment to communication applications where the requirements are quite different. And again, because it’s about communication, it’s about conversational services. Low latency is critical again, so that we can have a conversation. So, this kind of interactive conversation. I was very happy about this new standardization and the capability to bring these totally new experiences to us, so that, for example, if we have a teleconference meeting with our colleagues today, we hear all the sounds inside our heads. But then if we have the spatial audio, we can actually place the participants around us, around our head where it becomes a bit easier to detect and understand who is talking, and that’s quite a nice, new experience.

MH: So our brains in the real world at a cocktail party could be in a room with 100 people, all talking simultaneously, but we have the ability to home in on just one individual who’s talking to us. But on my Zoom calls, of course, if more than one person talks, it’s absolute chaos. We can’t hear and understand what anybody’s saying, because everyone’s talking at once. You’re saying that we could have that video conference call where multiple people are talking and we could still tune into one conversation and drop into another conversation over there, all at the same time.

VVM: Yeah, so this cocktail party effect is kind of very unique. I know that it’s what we as humans can do. We can really kind of isolate someone’s voice even though everybody is speaking in the room. And really this spatial audio is exactly about that, not mixing all the participants and their voices inside your head, but placing them around you, when it becomes much easier to kind of know who is talking and if multiple people are talking at the same time, it’s easier to focus on one.

MH: As much as this is fascinating to me quite frankly, I’m looking forward to a post-COVID world where I can actually be at a real cocktail party again. It would be great to have you at that party as well. Thank you so much for your time and insight.

VVM: It’s my pleasure.