Welcome to the talk on embedded deep learning super resolution in GStreamer using the ONNX inference runtime. I'm Marcus Edel, a machine learning engineer at Collabora, and for the past couple of years I've been working on using machine learning to improve video quality, to deliver better quality video content. Today Aaron Boxer and I will talk about super resolution: super resolution for embedded devices, and how we can integrate super resolution into an existing GStreamer workflow using ONNX.

Now, super resolution is actually a fairly old technique. There are papers from as far back as the 80s describing how, if you have multiple low resolution frames with sub-pixel offsets, you can combine them to create better looking, higher resolution frames. I'm not going to talk about that at all today, because what I'll really be covering is super resolution, that is upscaling, using machine learning. The reason is that over the past few years, as machine learning has really taken off, the techniques for building super resolution algorithms have become completely dominated by machine learning, to the point where the two are essentially synonymous, and there aren't really any non-machine-learning super resolution techniques you need to know about today.

So in this talk, here is what Aaron and I will cover. I will start with an overview of why scaling, specifically upscaling, matters to us so much. Then I'll give an overview of super resolution, present a couple of key papers, and talk about the super resolution method we developed. After that we will talk about how super resolution can be integrated into an existing workflow using ONNX and GStreamer, and at the end we will give you a demo so you can try it out yourself as well.

Okay, let's dive in. Why do we need to scale? There are really two main reasons. The first is that we scale down and then back up as a low pass filter, in order to get better compression efficiency, better use of the available bit rate. And then, more and more, we need to scale up simply because we don't have the content at the target resolution. If you have a really nice library of 1080p content and somebody comes to you and says, "hey, I want all of this in 4K right now", you can't really go back in time and reshoot that content. You have to scale if you want to offer it in 4K or even 8K.

So let's talk a little bit more about scaling down and then up as a low pass filter, because this is one of the key concepts behind the convex hull. Looking at this image, on one axis I have the resolution of the video, on the other axis I have the bit rate, and then we have a large number of encodes of a fairly low-complexity piece of content, showing the resulting VMAF score, the quality of that video, at each point. Highlighted in red you can see the resolution which maximizes quality at each bit rate along this curve, and inherent in this picture is the notion that we are going to need to do a lot of scaling. In fact, if we want to deliver a great encode at a low bit rate, we are not going to deliver our 1080p encode for this piece of content. So scaling lets us make better use of our bit rate, and it's one of the key insights underpinning the notion of the convex hull.
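To make that per-bitrate selection a little more concrete, here is a minimal sketch, assuming you already have a table of (resolution, bitrate, VMAF) measurements for a ladder of test encodes. The numbers and names below are purely illustrative stand-ins, not the measurements from our experiments.

```python
from collections import defaultdict

# Hypothetical measurements: (resolution, bitrate in kbps, VMAF score) per encode.
encodes = [
    ("540p", 1000, 78.2), ("720p", 1000, 75.4), ("1080p", 1000, 68.9),
    ("540p", 3000, 84.1), ("720p", 3000, 88.6), ("1080p", 3000, 86.0),
    ("540p", 6000, 85.0), ("720p", 6000, 92.3), ("1080p", 6000, 94.7),
]

# For each bitrate, keep the resolution that maximizes quality; the winners
# traced across bitrates approximate the convex hull of the encode ladder.
best = defaultdict(lambda: ("", -1.0))
for res, kbps, vmaf in encodes:
    if vmaf > best[kbps][1]:
        best[kbps] = (res, vmaf)

for kbps in sorted(best):
    res, vmaf = best[kbps]
    print(f"{kbps} kbps -> encode at {res} (VMAF {vmaf})")
```

The point of the sketch is simply that at low bit rates a lower resolution encode, scaled back up at playback, can beat the native-resolution encode.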
So how do we scale? What methods do we use? It's basically a solved problem, right? You can just use bicubic interpolation, or maybe some guy on Stack Overflow is going to tell you to do this. Let's test it out. I'm going to take a specific piece of content at 1080p, which Netflix made available for free, and for this test I'm going to scale down from 1080p using a few different algorithms that you might have seen before, and then scale back up using bicubic. I'm not going to walk through the process at all; I'm just going to show you the data, which essentially confirms the usual intuition: nearest neighbour is down there at the bottom, don't use that; bilinear is way worse than bicubic; and Lanczos wins by a small margin. Then, going the other way, scaling up, I'm going to pick the best downscaling algorithm, Lanczos, scale back up using three different methods, and here is the data for that. In this case you can see that, again, Lanczos wins by a small margin, bicubic is up there, nearest neighbour does something strange, and bilinear is worse. The point here is that when we are scaling, the choice of method can have a huge impact on the quality of our encode. So you really need to pay attention to scaling.

With that in mind, I hope you are all interested in scaling, because now I'm going to talk about the future of scaling, which is super resolution. Let's start by defining super resolution. Super resolution is a pretty straightforward task: we have a low resolution input image and we want to produce a high resolution version of that image. That's the whole idea. Now, what's the relationship between the two images? We assume that every pixel in the low resolution image was generated from the surrounding pixels in the high resolution image, and that the two are connected by a blur kernel k and a subsampling factor s: the kernel k runs over the high resolution image and produces, from a neighbourhood of pixels, a single pixel in the low resolution image. This is obviously an ill-posed problem even when we know the kernel; even with full knowledge of it, the problem is highly ill-posed, because we want to produce more pixels than we have in our hands.

It turns out that the best current methods for image upscaling use deep learning, such as EDSR and RCAN. Both use very complex and sophisticated networks, but both methods assume a lot of knowledge about this kernel k. Not only do they assume knowledge about it, they also assume it is constant across all the images in the world. When this assumption does not hold, for images that are not part of the training set, images in the wild, images we take with our smartphones or download from the internet, those methods perform poorly. This phenomenon gave rise to what is now called, in the field of machine learning, blind super resolution. Blind super resolution explicitly assumes that we do not know the kernel; it has no information about the kernel itself. In this presentation we tackle exactly this question mark, and we also show that, based on that, we are able to design a much smaller network that, once optimized, can run on a resource-constrained device.
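To pin down the setup we keep referring to, the relationship between the two images is the standard degradation model from the blind super resolution literature (this is the usual notation, with k the blur kernel and s the scale factor, rather than anything specific to our slides):

```latex
% Degradation model: the low resolution image is the high resolution image
% convolved with the blur kernel k, then subsampled by the scale factor s.
I_{LR} = \left( I_{HR} \ast k \right) \downarrow_{s}
```

Super resolution tries to invert this and recover I_HR from I_LR; blind super resolution additionally has to estimate k from the input image alone.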
So, in summary, in this talk we will show that, given only the input image, we can estimate the image-specific super resolution kernel, which not only greatly improves the quality of the upscaled image but also allows us to design a smaller, more optimized model. And we do it in a completely unsupervised way, using zero examples other than the test image itself, so with no external data. When we combine these two, we actually achieve state-of-the-art super resolution results. Maybe a more important contribution is that we achieve a large step forward for super resolution in the wild, one that runs not only on desktop PCs but on resource-constrained devices as well.

But before we jump in, let's take a look at what the majority of deep learning methods do these days. Mainly, those methods take a large data set of high resolution images and downscale them by blurring and subsampling, so that they now have pairs of low resolution and high resolution images. With those pairs they can take any neural network and train it to undo this downscaling process, and thus to produce a super resolution image. At test time they can apply the trained network to any new image and produce a super resolution image from a low resolution input. It's a standard supervised framework, and very generally speaking there are three families of methods.

The first family does exactly this type of training. It implicitly assumes a single kernel, because, as you can see here, the downscaling was done with one specific kernel k, so what it is actually training the network to do is to undo that particular downscaling; by that, it assumes the kernel is the same for all the images in the world. A second family of super resolution methods tries to be agnostic to the kernel, meaning it tries to produce a super resolution image regardless of the kernel, without any information about it. Instead of downscaling with one specific kernel k, these methods downscale with a number of kernels: they take a large number of images and a large number of kernels, downsample, and by that, hopefully, at test time they are able to upscale an image without any knowledge of the kernel, simply because the model was trained with multiple kernels rather than one. The third family assumes it receives the kernel along with the image, which of course is a very strong assumption, because when I take a photo with my smartphone I don't get the kernel with the image. That's where we come into the picture: the model we implemented provides those families of super resolution models with that image-specific kernel.
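As a rough illustration of how those first two families typically build their training data, here is a small sketch that turns high resolution images into (LR, HR) pairs using either one fixed Gaussian blur or a randomly drawn one per image. It only shows the idea; the kernel widths, crop sizes and scale factor are made-up values, not what any particular paper or our own pipeline uses.

```python
import numpy as np
from scipy.ndimage import gaussian_filter


def degrade(hr, sigma, scale=2):
    """Blur an HxWx3 image with a Gaussian of width sigma, then subsample by `scale`."""
    blurred = gaussian_filter(hr, sigma=(sigma, sigma, 0))  # blur spatial dims only
    return blurred[::scale, ::scale, :]


def make_pairs(hr_images, fixed_kernel=True, scale=2, rng=np.random.default_rng(0)):
    pairs = []
    for hr in hr_images:
        # Family 1: the same kernel for every image in the training set.
        # Family 2: a different random kernel per image, hoping to generalize at test time.
        sigma = 1.2 if fixed_kernel else rng.uniform(0.4, 3.0)
        pairs.append((degrade(hr, sigma, scale), hr))
    return pairs


# Toy usage with random arrays standing in for real high resolution crops.
hr_images = [np.random.rand(64, 64, 3).astype(np.float32) for _ in range(4)]
train_set = make_pairs(hr_images, fixed_kernel=False)
print(train_set[0][0].shape, "->", train_set[0][1].shape)  # (32, 32, 3) -> (64, 64, 3)
```

Whatever network is trained on such pairs learns to undo exactly the degradations it saw, which is why the choice of kernel(s) matters so much at test time.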
But before we jump in, let's first look at some examples from these families. When the assumption about the fixed kernel holds, it turns out that deep neural networks actually produce very good images. Here you can see the low resolution image that was downsampled with that specific kernel; this is just simple interpolation, while this is state-of-the-art super resolution, and when we flicker between the two, note how details are enhanced and the image is cleaner and much closer to the ground truth image. But when we take them just one step sideways, outside of their comfort zone, where this assumption no longer holds and there is a different, unknown kernel, well, this is still simple bicubic interpolation and this is state-of-the-art super resolution. Just to recap: this is state of the art assuming a single kernel across all the images in the world, which is obviously a poor assumption, and when we flicker between the two it's hard to tell any difference.

If we then go ahead and compare them to the third family of methods, where the method assumes someone gave it that image-specific kernel, this is what they do. And if we flicker between the two, you can see that state-of-the-art super resolution does not perform as well as the super resolution method that assumes the kernel is given. So the main takeaway is that the kernel is more important than the method. Not only in terms of image quality but also performance, because if we know the kernel, if the kernel is given, we have the chance to design a much simpler and smaller model as well.

So let's continue with the question: what is a super resolution kernel, and more importantly, how do we estimate it? If we go back to the problem definition, we have the input image and we want to estimate the unknown high resolution image. They are related, as I said before, by a blur with the kernel and subsampling, and that kernel is the super resolution kernel. This is what we are trying to estimate, and it is also the kernel that the third family of methods assumes it is given. What we use to estimate the kernel was defined by Sefi Bell-Kligler, Assaf Shocher and Michal Irani as an internal GAN, and the idea of the internal GAN is shown here. The internal GAN gets the input image as the only input it sees; it has no other images. The generator aims to downscale this image and fool the discriminator, as in every GAN. So after it downscales the image, we take crops from the generated downscaled image as fake crops, and crops from the real input image as real crops. The discriminator now tries to distinguish between the two and figure out which is a fake crop and which is a real crop. But it doesn't do it the way a normal GAN would, which is usually to output a single number for how likely it is that a given crop is real or fake; instead, it outputs a map of pixels, where each pixel represents how likely it is that the corresponding patch comes from a real or a fake crop. If the discriminator can no longer distinguish fake crops from the generator from real crops from the input image, it means the patch similarity is maximized, and if that is the case, it means the generator is imitating exactly the super resolution kernel. And this is what we are trying to do.

However, estimating a proper downsampler from a single input image is complicated: especially in the presence of noise or other artifacts, we often fail to estimate good degradation parameters, and a wrong degradation severely reduces the effectiveness of the upscaler and therefore the super resolution performance. With the true downsampler one can determine the upsampler more accurately, and on the other hand, with the true upsampler one can correctly estimate the downsampler. In other words, the upsampler and downsampler are inverses of each other, and improving one can also improve the other. This relationship motivated us to train both the upsampler and the downsampler simultaneously in a single pipeline. So, unlike KernelGAN, the upsampler and downsampler are trained simultaneously and improve each other using the cycle consistency loss shown here on the slide.
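To make that training setup more concrete, here is a highly simplified, PyTorch-style sketch of one optimization step with a downscaling generator G, an upsampler U, and a patch discriminator D that outputs a per-pixel real/fake map. The module definitions, the crop size, and the equal loss weighting are stand-ins chosen for illustration; this is the general shape of an internal GAN with a cycle consistency term, not our actual training code.

```python
import torch
import torch.nn.functional as F


def random_crop(x, size):
    # Take a random square crop from a (1, C, H, W) tensor.
    _, _, h, w = x.shape
    top = torch.randint(0, h - size + 1, (1,)).item()
    left = torch.randint(0, w - size + 1, (1,)).item()
    return x[:, :, top:top + size, left:left + size]


def train_step(img, G, U, D, opt_g, opt_d, crop=32):
    """One step of jointly training the downscaler G and upsampler U on a single image.

    img:   the input (low resolution) image, shape (1, 3, H, W).
    G:     learns the image-specific downscaling, i.e. implicitly the SR kernel.
    U:     the upsampler we actually care about at the end.
    D:     patch discriminator ending in a sigmoid; outputs a per-pixel real/fake map.
    opt_g: optimizer over both G's and U's parameters; opt_d: optimizer over D's.
    """
    fake_full = G(img)                       # generated downscaled image
    real = random_crop(img, crop)            # patches from the real input image
    fake = random_crop(fake_full, crop)      # patches from the generated image

    # Discriminator: tell real patches from generated ones.
    d_real, d_fake = D(real), D(fake.detach())
    d_loss = F.binary_cross_entropy(d_real, torch.ones_like(d_real)) + \
             F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator + upsampler: fool the discriminator and stay cycle-consistent.
    g_adv = D(fake)
    adv_loss = F.binary_cross_entropy(g_adv, torch.ones_like(g_adv))
    cycle_loss = F.l1_loss(U(fake_full), img)   # upsample(downscale(x)) should give back x
    opt_g.zero_grad()
    (adv_loss + cycle_loss).backward()
    opt_g.step()
```

The key design point is the cycle term: the adversarial loss pushes G toward the image's true downscaling statistics, while the cycle consistency loss ties U to G, so improving either one improves the other.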
Now we have this network that can estimate our super resolution kernel as well as use that kernel to upscale an image, but what we haven't shown yet is how we end up with a really small deep learning network compared with other methods. Well, traditionally, deploying efficient deep learning splits into model architecture design and model compression, namely pruning and quantization. Model pruning involves removing parameters that don't impact the network accuracy, and quantization involves replacing data types with reduced-width data types, for example replacing 32-bit floating point with 8-bit integers, where the values can often be encoded to preserve more information than a simple conversion would. Existing work has shown that such a sequential pipeline can significantly reduce the cost of existing models; nevertheless, careful hyperparameter tuning is required to obtain optimal performance. To that end we use a method called APQ, proposed by Tianzhe Wang and his team, which is a joint design method enabling end-to-end search of the model architecture, pruning and quantization policy at light cost. The core idea of APQ is to use a quantization-aware accuracy predictor to accelerate the search process: the predictor takes the model architecture and the quantization scheme as input and can quickly predict the accuracy, instead of fine-tuning the pruned and quantized network to measure it. So to get a small deep learning model that can run on a resource-constrained device, we mainly relied on semi-automated methods, namely APQ, and paired with some manual tuning we were able to greatly reduce the model size we started with.

Now if we take our optimized model design we can see some interesting results, and with that I hand it over to Aaron.

Thanks Marcus. Now I'm going to talk about the qualitative and quantitative results we got when we applied these different methods to real-world images. Starting with the qualitative results, we'll begin with nearest neighbour interpolation on this image, and you can see how pixelated the result is; it's very poor quality. Next is bicubic interpolation: the image quality is a little better, but it's still quite blurry, especially around the fine details and the edges. The next slide is the first of the deep learning super resolution approaches; you can see a dramatic improvement in image quality with the VDSR approach. Then there's EDSR, and just pay specific attention to the detail boxes on the right to see the differences in performance between these methods. That's EDSR, then there's RCAN, and finally our own approach to super resolution.

Now we will talk about the quantitative results. This is a table comparing the different approaches to super resolution on a couple of different data sets at different upscaling factors, using two measures commonly used for lossy compression: PSNR and SSIM. Our results are on the far right in red, and you can see that we generally match or exceed the performance of the other upscaling methods. Now, part of this project was to run these methods on an edge device, which is extra challenging because an edge device is typically low on memory, power and compute capability.
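As an aside on shrinking models for such constrained devices: our compression used the APQ-style joint search described above, but a much simpler, generic option is plain post-training quantization of an already-exported ONNX model using ONNX Runtime's stock tooling. A minimal sketch, with placeholder file names, not a step from our pipeline:

```python
from onnxruntime.quantization import quantize_dynamic, QuantType

# Post-training dynamic quantization: weights are stored as int8 while
# activations stay in float. "sr_model.onnx" / "sr_model_int8.onnx" are
# placeholder names for the exported and quantized models.
quantize_dynamic("sr_model.onnx", "sr_model_int8.onnx", weight_type=QuantType.QInt8)
```

For a convolution-heavy super resolution network you would more likely use static quantization with a calibration set, but the trade-off is the same: give up some numeric precision for a smaller, faster model on the edge.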
So for this part of the project we chose the VIM3, whose Amlogic SoC has an NPU, a neural processing unit, on board, and that is what we ran our models on. You can see that our model is significantly smaller than EDSR, for example, and this is particularly important for edge devices, which are memory constrained; it translates into lower power usage and also contributes to a higher frame rate. At an upscaling factor of two we are over three times faster than EDSR, and at a higher upscaling factor of four we are about twice as fast as the EDSR approach. So these are some nice results.

Next I'm going to talk about how we enabled the application of these new methods to real-world multimedia workflows by integrating our model into GStreamer. Now, GStreamer is everyone's favourite multimedia framework. It is pipeline based, so it makes it quite easy to create custom, bespoke workflows for video and audio processing. One other nice thing about GStreamer is its very broad hardware support: besides the CPU and discrete GPUs, it supports a host of smaller, lower-power devices on the edge.

For the inference part of our approach we chose the ONNX runtime and file format. Currently there is quite a lot of fragmentation among the different AI toolkits for training a model: there is TensorFlow, PyTorch, Caffe, mlpack, the Microsoft Cognitive Toolkit, and a host of other toolkits you can use to train your model. The nice thing about ONNX is that it provides a single runtime you can target, with various backends that you can configure at runtime. So as long as you can convert the model to the ONNX file format, and the operations are supported in ONNX, you can use a single runtime to target various types of hardware. And from previous work that Marcus and I have done, ONNX is already integrated into GStreamer: our previous project was to create an object detector element called ONNX Object Detector, which is upstream currently and will be released in the upcoming 1.20 release of GStreamer. Here is an example of a GStreamer pipeline performing object detection on images, and you can see bounding boxes and labels displayed in the frame.

So, given that ONNX was already there for us, we created a new element in a merge request, called ONNX Super Resolution. It is quite similar to the object detector, and the parameters you pass in are the model you are using, the upscaling factor, and which backend you want to use for inference; currently CPU and CUDA are the two backends supported in GStreamer. This is the merge request, so feel free to check it out and try it for yourself.

And now we're going to show you this new element running in a GStreamer pipeline to perform upscaling. First off, here is a video of the input clip: a 480p clip of a running race. And now here is a GStreamer pipeline upscaling that clip to 1080p; look carefully to judge the quality of the result. That looks quite good.

That's all for today. We both want to thank you very much for taking the time to listen to our talk. We hope you enjoyed it. If you have any questions, please stick around for the Q&A. You can also reach out to us at our site, collabora.com. We have a number of blog posts covering our super resolution work; you can check those out and leave some comments if you like, and we're also hiring. So thanks again, and we hope you enjoy the rest of the conference.