Our next speaker is from Vidyo. Vidyo is one of the early companies that dealt with video conferencing and with video technologies, and he is here to explain how these technologies can assist us as we take what we have today in the browser towards mobile, and what they can do to improve the quality there. The floor is yours.

Thank you, Sahi, and thank you to the organizers for having us here. My talk today is about achieving the best video quality on mobile. Vidyo has been building scalable architectures for over 10 years now, and we've been using scalable video coding to achieve this on lossy networks, which are especially common in mobile environments.

When people think about a video quality solution, thinking about the WebRTC codec itself and the transport is really not enough; you've got to think of the whole thing holistically. The platform, the kind of video routing you're going to have in the background, the video codec you're going to choose, which has a huge impact on how your stream is constructed. You have to think about your audio solution, device management, and rendering. All those things have a big part to play in the actual quality of the delivered stream, in rendering as well as in transmission over the network.

So with mobile, when we were building our video service, we had to solve all these problems on our users' behalf. These are the same problems that WebRTC experiences, and the heart of the issue is that mobile is the worst-case scenario for most of your endpoints. Mobile has fluctuating bandwidth. It has packet loss and jitter. It has varying device capabilities: you have Retina devices, you have devices with very low CPU performance, and you have to make decisions about the tradeoff between bandwidth and battery life. If you've ever played with some of the other video solutions, you'll notice that the device might get too hot, and it doesn't provide a great experience if your device only lasts for 20 minutes in the video conference.

Apart from all of that, you also have to deliver very low latency throughout your entire network. Real-time communication is a special challenge compared to any other type of media because of the low latency requirement. You have to deliver a packet end-to-end within a 250 millisecond timeframe. That's a huge burden, and it requires very specific choices about how you construct your stream and how you build your video solution.

So to explain why we need scalability, let me give a quick primer on video coding. What is a video stream? A video stream is essentially a series of video frames sent one after the other. In a very simplistic way, you can think of it as transferring a series of JPEGs 30 times a second. While that would deliver great quality, you certainly wouldn't be able to do it in the bandwidth that you have. To make this process fit into the bitstream, we have to create a dependency structure. This dependency structure starts off with a key frame that you encode, and then you send differentials. To make it simple, we break this down into five frames. In this example, we start off with key frame one. That's a very big frame to transmit over a wireless network. Once that's sent, we can start sending little differentials piece by piece over to the other endpoint. But this creates a huge problem: with a basic single-layer encoding, missing any of those frames in the middle will break the stream.
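To make that fragility concrete, here's a minimal sketch. The Frame shape and the decodableFrames helper are illustrative inventions, not a real codec API; they just model the linear dependency chain described above.

```typescript
// Single-layer coding: one key frame, then delta frames that each
// depend on the frame immediately before them.
interface Frame {
  id: number;
  isKeyFrame: boolean;
  dependsOn: number | null; // id of the frame this delta was computed against
}

const stream: Frame[] = [
  { id: 1, isKeyFrame: true,  dependsOn: null }, // the big key frame
  { id: 2, isKeyFrame: false, dependsOn: 1 },    // small differentials...
  { id: 3, isKeyFrame: false, dependsOn: 2 },
  { id: 4, isKeyFrame: false, dependsOn: 3 },
  { id: 5, isKeyFrame: false, dependsOn: 4 },
];

// A frame decodes only if it arrived AND its dependency decoded.
function decodableFrames(received: Set<number>): number[] {
  const decoded = new Set<number>();
  for (const f of stream) {
    const depOk = f.dependsOn === null || decoded.has(f.dependsOn);
    if (received.has(f.id) && depOk) decoded.add(f.id);
  }
  return [...decoded];
}

// Lose frame 3 and everything after it is undecodable: prints [1, 2].
console.log(decodableFrames(new Set([1, 2, 4, 5])));
```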
Once the stream is broken, you have to retransmit the original frame or ask for the missing frames again, which, again, adds latency; and if you have to retransmit that original I-frame, it costs a lot of bandwidth, because that frame is very large.

So to solve some of these quality issues, you also have to think about what kind of routing architecture you have in the back end. A traditional routing architecture uses single-layer coding like you're used to, and it's an MCU. An MCU is very simple in the way it functions. Each endpoint sends its video stream to the best of its capabilities; the server decodes everybody's streams, composites them, and re-encodes the result back to each endpoint. While that is simple to do, the interdependency between frames that I described earlier creates a lot of issues when it comes to bandwidth, bitrate, and packet loss, especially on a mobile device. In addition, all this transcoding induces delay. Having to decode a frame and re-encode it adds a lot of latency to the network, which is really bad for real-time communications. Not to mention that all this decoding and re-encoding takes a lot of CPU, so it's something you wouldn't be able to scale to a large conference.

A much better architecture is a simulcast architecture. This is great because now you send two separate streams. They both still have this interdependency, but now the router can pick which stream to forward to each endpoint. This is much better for error resiliency, and it also allows you to create custom layouts on the endpoint: since your router picks which stream to send to each endpoint, you can choose how you want to composite them and lay them out. This is a very common way of doing video conferencing right now, and it does not require any server-side decoding. The problem is that you pay a penalty: you're encoding the stream twice, which creates about a 50% overhead over a single layer. Having to send two streams also creates a synchronization problem, because the two encodings run side by side, and when the router makes the switch between them, it has to choose the right moment to do it.

A better approach is to use a scalable router. This is something that Vidyo has been building for over 10 years, and it's something we are trying to make sure VP9 has in its future, to create the best coding and the best routing available. With a scalable router, you only send one stream, but the interdependency I showed you earlier gets broken up in such a way that you can choose frame rate and resolution just by picking packets out of that single stream. A scalable router delivers the benefits of both an MCU, a single stream, and a simulcast, custom forwarding, while adding error concealment, which you can now do because of how the interdependencies are structured, and it utilizes less bandwidth.
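For reference, this is roughly how a browser endpoint offers the simulcast flavor described a moment ago using the standard WebRTC API; the rid labels and bitrate numbers are illustrative assumptions, not values from the talk.

```typescript
// Offer two independently encoded resolutions of one camera track, so
// the router (SFU) can pick which encoding to forward to each receiver.
async function startSimulcast(): Promise<RTCPeerConnection> {
  const pc = new RTCPeerConnection();
  const media = await navigator.mediaDevices.getUserMedia({ video: true });

  pc.addTransceiver(media.getVideoTracks()[0], {
    direction: 'sendonly',
    sendEncodings: [
      { rid: 'hi', maxBitrate: 1_500_000 },                          // full resolution
      { rid: 'lo', maxBitrate: 300_000, scaleResolutionDownBy: 4 },  // quarter resolution
    ],
  });
  return pc;
}
```

Note that both encodings consume uplink bandwidth and CPU at the same time, which is exactly the roughly 50% overhead mentioned above.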
So how do we build that scalable stream? We start with a series of video frames. As I described earlier, it's a set of frames; we'll make it simple and call them frame one through frame five. In single-stream coding, that interdependency is linear: you have frame one, frame two, and each frame depends on the previous one. If we were to create a temporally scalable stream, what we would do is break this dependency.

A way to do that is to pick a base set of frames, let's call them T0, and make them depend only on each other. Once frame one and frame five depend only on each other, those T0 frames alone give us a 7.5 frame per second stream, which isn't bad. But we can now create a dependency in a different layer, T1, that will give us a 15 frame per second stream. As you can see, the big difference here, and the big innovation, is that the T1 frame only depends on the T0 frame. If you were to lose it during transmission on your mobile device, on your Wi-Fi, or anywhere else, you can still continue decoding, and you still get your 7.5 frame per second stream. All you're going to see is a bit of a glitch. Expanding to a third layer, we can add two more packets. Now the dependency tree grows, and you have an extra layer, T2, that gives you 30 frames per second.

The way the router adapts to this is that as it receives these packets from the endpoint, it can decide which packets to drop on the fly. So in this example, you can transition from a 30 frame per second stream to a 7.5 one just by choosing to drop the T2 and T1 packets. It's a very simple thing to do: it does not require decoding, and all you have to do is look at the packet headers. This allows you to shape the stream to the bitrate you have available on your mobile endpoints.

This also adds a huge layer of error resiliency. You can now lose a T2 frame without impacting the decoder and without having to retransmit any of the important frames. And the resilience degrades gracefully as you lose packets over the network: losing a T1 frame drops you to a 7.5 frame per second stream, but losing a T0 frame will require a retransmission. However, if you look at the new layout, the retransmission only affects a quarter of the packets. So we don't have to retransmit a lot, we only have to retransmit a little, and there are lots of methods to do that. FEC, forward error correction, and ARQ, automatic repeat request, are two of the ways we can recover those packets without losing any data. And if you think about the mobile side, having to retransmit that I-frame to restart the decoding comes at the worst possible time: right when you're in the middle of packet loss. Most hardware encoders output a single-layer stream, so they either produce a periodic I-frame, which is very big and very fat and has to be squeezed into a small pipe, or you have this interdependency problem where losing any of the frames requires you to request the I-frame again.
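Returning to the router-side shaping for a second, here's a small sketch. The VideoPacket shape and shapeStream function are hypothetical names, but the logic, filtering on a temporal-layer id read from the packet header without touching the payload, is the mechanism just described.

```typescript
// T0 = 7.5 fps base layer, T1 doubles it to 15 fps, T2 doubles it to 30 fps.
interface VideoPacket {
  frameId: number;
  temporalLayer: 0 | 1 | 2; // carried in the packet header, no decoding needed
}

// Forward only the layers the receiver can afford right now.
function shapeStream(packets: VideoPacket[], maxLayer: 0 | 1 | 2): VideoPacket[] {
  return packets.filter(p => p.temporalLayer <= maxLayer);
}

// The five-frame example from the slides: T0 at frames 1 and 5,
// T1 at frame 3, T2 at frames 2 and 4.
const incoming: VideoPacket[] = [
  { frameId: 1, temporalLayer: 0 },
  { frameId: 2, temporalLayer: 2 },
  { frameId: 3, temporalLayer: 1 },
  { frameId: 4, temporalLayer: 2 },
  { frameId: 5, temporalLayer: 0 },
];

// Dropping T1 and T2 turns the 30 fps stream into the 7.5 fps base layer:
// only frames 1 and 5 are forwarded.
console.log(shapeStream(incoming, 0));
```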
Now, building on top of that, a scalable stream can also have a spatial component. Just like we broke the dependency between the temporal packets earlier, we can create a dependency in space as well. We can create two layers, an SD and an HD layer, where S1, the HD layer, derives from S0, the SD layer. In fact, we use the fact that those frames are very similar to each other to create a differential in space as well. So as you saw earlier, a video stream involves differentials from one frame to the next, but that is also true when you scale a frame from being small to being large. We can use that differential to save ourselves a lot of bandwidth when we transmit with SVC. Expanding this to the full picture of interdependency, we now have both spatial and temporal scalability.

This is a specially encoded stream that carries all the information that allows the intermediary router to pick out any frame rate or resolution it chooses. For example, you can send one stream up to the server, and the server can decide that it wants to prioritize frame rate: by using the highlighted packets, it can deliver 30 frames per second in SD. But if I'd like resolution instead, it can create a packet stream that delivers 720p at 15 frames per second, and all of this can happen dynamically and on the fly.

Now, how does this solve most of our video challenges? Well, the bitstream itself becomes shapeable. So when the network fluctuates on your mobile device, as you walk away and your signal strength drops, the stream can be shaped dynamically without any extra retransmissions; there's not much cost there. Loss and jitter can be concealed, and there's a lot less to retransmit. As for device capabilities, this lets you prioritize how you want to view the video. If you're on a Retina device, you might want a high quality image. If you're on a small device with low battery, you want to slow down the frame rate and make sure you don't receive more than you need. None of this affects the originating source, so if you have a lot of people in the same conference, you do not have to fall back to the lowest common denominator; you send the best thing you can up to the server and let the server decide what's best to deliver to each endpoint.

You can also trade off battery life against bitrate. As your battery depletes, you might want to say, hey, don't send me the higher spatial layers, or don't send me any more temporal layers. You conserve battery and still have a conference, just at a lower frame rate. And, of course, this gives you very low latency, because the server no longer transcodes any of the media; all you're dealing with is the penalty of a hop. This low delay also helps with retransmissions: since you're retransmitting fewer packets, if the video ever has to freeze, it freezes only for the very short time it takes to retransmit those packets. That significantly improves the delay compared to waiting for an entire frame to arrive before you can continue decoding.

So even though we're adding a few more bits to do the scalable coding, we gain a lot. We see about 20% fewer bits than simulcast due to our adaptive spatial scalability, and we can tolerate up to 20% packet loss with no significant impact thanks to error concealment. If you think about a mobile device on a mobile network, which is the worst-case scenario for you, being able to easily tolerate 20% packet loss is a great thing for your video quality.
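To illustrate those per-endpoint decisions, here is a hypothetical selection policy a router might apply; the Preference type, bandwidth threshold, and layer indices are assumptions for the sketch, not values from the talk.

```typescript
// Pick a target (spatial, temporal) layer pair per receiver from the
// full S0/S1 x T0/T1/T2 grid the sender uploaded once.
type Preference = 'resolution' | 'framerate' | 'battery';

function pickLayers(pref: Preference, bandwidthKbps: number) {
  if (pref === 'battery') return { spatial: 0, temporal: 0 };  // SD @ 7.5 fps, cheapest to receive
  if (bandwidthKbps < 500) return { spatial: 0, temporal: 2 }; // SD @ 30 fps fits a thin pipe
  return pref === 'resolution'
    ? { spatial: 1, temporal: 1 }  // 720p @ 15 fps
    : { spatial: 0, temporal: 2 }; // SD @ 30 fps
}

console.log(pickLayers('resolution', 2000)); // { spatial: 1, temporal: 1 }
```

The point is that each receiver gets its own answer while the sender keeps uploading a single stream.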
Now, this kind of technology is something we really want to have in WebRTC, and we're working with Google to make sure that VP9 has all the scalable extensions. Some of this is already in VP8. If you look at VP8 right now, it has temporal scalability, which is actually great; it's the first thing I showed you on the slides, where you can pick a temporal stream out of that VP8 stream. Going forward, it's increasingly becoming the de facto standard to have scalability built into all of our future codecs, both in VP9 and in the Alliance for Open Media's AV1, where temporal scalability is mandatory and spatial scalability is under review.

The error resilience I described earlier is also available in VP8, H.264 SVC, and future codecs like H.265 and VP9. In addition to that, we at Vidyo love working with WebRTC. This is something that I think Google is doing a great job on, and for us the future is VP9. All we need is these scalable extensions, which we feel will benefit everyone. We're contributing code to both WebRTC and WebM to make sure all those spatial extensions are there for everybody to use. Chrome now has full VP9 scalability in the decoder, and the encoder can be configured through URL flags. Going forward, we're working with the WebRTC editors to provide ways to configure the encoder preferences themselves for different use cases.
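For what it's worth, the configuration surface the talk asks for is essentially what the later WebRTC-SVC extension exposes as scalabilityMode; this sketch assumes a browser and TypeScript DOM typings recent enough to know that field.

```typescript
// Ask VP9 for 2 spatial x 3 temporal layers ('L2T3'), matching the
// S0/S1 x T0/T1/T2 structure from the earlier slides. The router can
// then shape the single uploaded stream per receiver, no transcoding.
async function sendScalableVp9(): Promise<RTCPeerConnection> {
  const pc = new RTCPeerConnection();
  const media = await navigator.mediaDevices.getUserMedia({ video: true });

  pc.addTransceiver(media.getVideoTracks()[0], {
    direction: 'sendonly',
    sendEncodings: [{ scalabilityMode: 'L2T3' }],
  });
  return pc;
}
```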