Hi, everyone. I'm going to talk to you about video ML at scale, which basically means video ML for people who want to make money. Thank you for the intro. As for who I am: I'm now working at Snap, and previously I worked on ML at Mux. A bit of forewarning for this presentation: when I submitted the talk, I was still working at Mux, which is a video startup in San Francisco, and I lost access to some of the material I was going to present when I went to Snapchat this summer. So I'm not going to talk as much about the production pipeline as the description suggests, but if you're interested in that, feel free to grab me after the talk. So why video ML at all? It seems like everywhere we turn it's AI this, AI that. Granted, we're at a data conference, so some of that is our own doing. But where do we really want to use ML in video? To answer that, let's look at how video is being delivered today. And when I say video, by the way, I mean OTT video: not broadcast TV, but streaming video online. The way that works today is basically this: you have some sort of content source that you ingest into a cloud or on-prem transcoder; that output gets passed to a CDN origin server, then to edge servers, and finally out to a wide assortment of user devices as ABR manifests, so that you reduce latency and maintain a good quality of experience for all your users. And hopefully you also have some data analytics feeding back in, although you're probably not using those analytics to change very much, because there are so many manually configured pieces along this chain; you're really just using them for monitoring. So where does ML fall into this? It actually falls into basically every step. Starting at the very beginning, on source creation, we can use ML for highlight generation, scene detection, video categorization, and deepfake detection, which has become an increasing concern in the last month.
Facebook just released a new data set for building deepfake detectors. So just at the point at which the video is ingested, we can already build a bunch of machine learning models that add value to the viewing experience. On the transcoding side, we're looking at things like ML-based per-title or per-shot encoding, and no-reference video quality assessment, which is a very, very difficult thing to do. The vast majority of content providers, if they're even tracking video quality, which is a maybe in their monitoring, are usually using a full-reference video quality metric, in which you compare a source frame with a distorted frame, pixel by pixel, to try to figure out what the video quality is. But perceptual quality doesn't actually work that way for humans watching video, because of things like motion blur, or even stranger cases where we add in distortion, like film grain, and we actually like it. Even though that's distortion, it reads as better quality rather than worse quality. That kind of thing is really, really hard to detect in a deterministic way without machine learning. Further along, you can do smarter caching and storage. On the playback side, there are a lot of interesting algorithms, like ML-based ABR algorithms: MIT published a paper a few years ago called Pensieve, in which they were able to drastically reduce rebuffering at the same bit rates. There's also video super-resolution. That image on the bottom right shows three zoomed-in versions of a frame: the one on the left is bicubic upscaling, which is the default for the majority of players out there; video super-resolution is the one in the middle; and the ground truth is the one on the right. You can see there are huge, huge gains to be made by using things like super-resolution as part of video players.
And then as part of that, your analytics also evolve to be more than just monitoring: they become optimization for the ML within your video stack, and you enter a positive feedback cycle of continuous improvement. Sounds great, right? We get higher-quality video, fewer buffering events, drastically reduced costs, and overall a richer viewing experience, in which OTT is no longer trying to reach parity with broadcast TV but actually exceeds it. But there are still real challenges in being able to do this, because naturally something that great would have equally great challenges. The first is the complexity of video machine learning. All the use cases I just described are very different ML implementations, and you're dealing with complexity not only in the spatial dimension but temporally as well: your input and your output can involve multiple frames. That drastically increases the complexity, especially when you consider that a lot of image models with convolutional layers act as a funnel and shrink layer by layer, so the dimensions rapidly get cut down. But something like super-resolution is an upscaling algorithm: the dimensions sometimes shrink in the middle, in a sort of hourglass shape, but the end result maps a low-resolution frame to a high-resolution frame, so you're going from low to high dimensions. Your complexity is many times greater than something like Inception. And because there are so many different ways to sample frames to account for temporal complexity, you also end up with more and more complex model architectures. As the complexity increases, you encounter the next challenge, which is latency. Video has really, really strict latency requirements.
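To make that dimensionality point concrete, here's a back-of-the-envelope sketch. The specific resolutions are my own illustrative choices, not figures from the talk, but they show why a classifier's funnel shape is so much cheaper than super-resolution's expansion:

```python
# Back-of-the-envelope tensor sizes (illustrative numbers, not from the talk).

def elements(*shape):
    """Number of scalar values in a tensor of the given shape."""
    n = 1
    for d in shape:
        n *= d
    return n

# A classifier funnels a downsized frame into a tiny softmax vector.
clf_in = elements(224, 224, 3)      # 224x224 RGB input
clf_out = elements(1000)            # 1000-class softmax output

# 4x super-resolution expands a full frame instead of shrinking it.
sr_in = elements(480, 270, 3)       # low-res input frame
sr_out = elements(1920, 1080, 3)    # high-res 1080p output frame

print(clf_in, clf_out)   # 150528 1000 -> output ~150x smaller than input
print(sr_in, sr_out)     # 388800 6220800 -> output 16x larger than input
```

The classifier throws information away at every layer; super-resolution has to synthesize sixteen times more values than it reads, which is why its per-frame cost is in a different league.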
When you're looking at 24 FPS video, that's about 42 milliseconds per frame; for 30 FPS, it's 33 milliseconds. A default model, the kind you get if you download Inception off a model zoo, takes about 100 to 500 milliseconds per frame, depending on the CPU you're running on. That is nowhere close to fast enough to run on every single frame of a video, and that's with a downsized frame as input and a softmax layer, a really low-dimensional output, as the result. Another problem is training data. Let's say you've managed to solve these really, really difficult latency and complexity challenges. How do you get the training data for your model? There's a lot of video out there, but even finding the right subset of it for your product can be really difficult. If your product is UGC content, then downloading Netflix's archive of videos might not train your model to give the results you want to see, because UGC content just behaves differently from premium content. So what are some solutions? First, obviously, optimize your model, but really try to understand the tradeoff between accuracy and complexity. When you look at research papers, they benchmark accuracy, and you're supposed to assume the highest number is the best. But when it comes to production, at-scale deployment of ML models, you want the Pareto-curve tradeoff, the most bang for your buck; "just good enough" is basically what you're looking for. Here, we look at model size versus accuracy for Inception. The quantized version of Inception v3 does take a slight hit in accuracy, maybe half a percent, but it's about a fourth of the size of the non-quantized model. So applying quantization, using the TensorFlow graph transform tool, is really useful for this.
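The frame-budget arithmetic above is worth writing out, because the gap is stark. A minimal sketch (the 100–500 ms inference figure is the talk's ballpark for a stock classifier on CPU):

```python
# Per-frame latency budgets at common frame rates, versus the talk's
# ballpark of 100-500 ms/frame for an off-the-shelf classifier on CPU.

def frame_budget_ms(fps):
    """Milliseconds available per frame at a given frame rate."""
    return 1000.0 / fps

for fps in (24, 30, 60):
    print(f"{fps} fps -> {frame_budget_ms(fps):.1f} ms/frame")

# Even the optimistic end of stock-model inference blows the budget:
inference_ms = 100
print(inference_ms > frame_budget_ms(24))  # True: 2.4x over budget at 24 fps
```

So even before considering the heavier video architectures, a vanilla image model is at least 2–12x too slow to run per frame, which is what motivates the optimization, infrastructure, and sampling strategies that follow.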
But going through this step of evaluating the tradeoff between accuracy and complexity is very, very necessary before you do anything else. Over the past few years a lot of great TensorFlow tools have appeared. I've mostly worked with TensorFlow, which is why this is all TensorFlow-focused. There are great tools out there for converting your models into lighter-weight versions, like TF Lite. Once you've optimized your model, look at your infrastructure. Are you installing TensorFlow from prebuilt binaries, or compiling it from source with the CPU optimizations enabled? The latter makes a huge difference, and it's not that much more technically difficult to do. Also take advantage of TensorFlow's serving infrastructure and deploy models as microservices. TF Serving is an easy, integrated gRPC service that you can basically containerize and deploy; it does things like model versioning for you, and it's part of the TFX framework, which does a lot of the heavy lifting. Two years ago I probably would not have recommended TF Serving, but so much work has been done on it, and continues to be done, that the whole TFX and Kubeflow ecosystem is evolving really nicely for large-scale deployments. GPUs are the obvious choice for improving latency, but of course they have the downside of much higher cost, and they're only available on the server side. Something else you can do is look at sampling, but sampling really depends on your use case. For things like video categorization or detection, you might be able to sample at 1 to 2 frames per second, and that's totally fine. For reference-free quality metrics, you might want to sample at a higher rate than that, because quality can change at a greater frequency. And something like super-resolution needs to run on every single frame.
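The sampling idea above can be sketched as a small helper that picks which frame indices to score given the video's frame rate and a per-use-case sampling rate. The function names and the exact stride policy are my own illustration, not any particular library's API:

```python
# Sketch: pick which frames to run inference on, given the video frame
# rate and a use-case-dependent sampling rate (illustrative, not an API).

def sampled_indices(total_frames, video_fps, sample_fps):
    """Indices of frames to score when sampling at sample_fps."""
    if sample_fps >= video_fps:        # e.g. super-resolution: every frame
        return list(range(total_frames))
    stride = round(video_fps / sample_fps)
    return list(range(0, total_frames, stride))

# One second of 30 fps video under three different use cases:
print(len(sampled_indices(30, 30, 1)))    # categorization at 1 fps -> 1 frame
print(len(sampled_indices(30, 30, 5)))    # quality metric at 5 fps -> 5 frames
print(len(sampled_indices(30, 30, 30)))   # super-resolution -> all 30 frames
```

Sampling at 1 fps instead of every frame is a 30x reduction in inference load at 30 fps, which is why it's often the cheapest lever to pull when the use case tolerates it.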
And so your use case really dictates how aggressively you can sample to lower your latency requirements. For training data, data set generation by bootstrapping is widely used for video. Say you want to create a video categorization model. Instead of trying to find a data set that doesn't quite fit, or trying to curate a million-video data set from your user base, you can hand-label a thousand or a few thousand videos, train a supervised model on that, use that model to create machine labels on a much larger corpus, and then train a weakly supervised model from that larger corpus. Facebook does this; it's a really effective way of generating large training corpora. There are also a lot of published data sets available, but many of them are very focused and niche, so you can sample across publicly available data sets as well. Lastly, utilize the TFX framework. It has a lot of built-in tools for model validation and for generating the right statistics, and it works in either Airflow or Kubeflow runtimes, so it's very easy to deploy and integrate with your current system. The model versioning and deployment piece is critical here, because you likely have a fairly high fan-out of user groups, and it's fairly unlikely that a single model will work for every user group. Having a robust framework around model validation is also key to quick iteration on your models. All that said, none of these solutions fully solve the challenges I mentioned, because ML deployments are really difficult, and it will take a long time before we commonly see them at scale in production. The modern OTT video workflow itself took many, many years to reach that stage.
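The hand-label-then-machine-label bootstrap described above can be sketched as a toy data-flow. The "model" here is just a nearest-centroid classifier on a single made-up feature, purely to show the three steps; nothing about the feature or the classifier comes from the talk:

```python
# Toy sketch of the weak-supervision bootstrap: hand-labeled seed set ->
# simple supervised model -> machine labels on a larger unlabeled corpus.
# The nearest-centroid "model" and the feature are illustrative only.

def train_centroids(labeled):
    """labeled: list of (feature, label). Returns per-label mean feature."""
    sums, counts = {}, {}
    for x, y in labeled:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def machine_label(centroids, xs):
    """Assign each unlabeled feature to the label of its nearest centroid."""
    return [(x, min(centroids, key=lambda y: abs(x - centroids[y]))) for x in xs]

# Step 1: small hand-labeled seed set (feature could be, say, motion energy).
seed = [(0.1, "static"), (0.2, "static"), (0.8, "action"), (0.9, "action")]
model = train_centroids(seed)

# Step 2: machine-label a much larger unlabeled corpus.
corpus = machine_label(model, [0.05, 0.15, 0.7, 0.95])
print(corpus)

# Step 3 (not shown): train the weakly supervised model on `corpus`.
```

In practice step 2 runs over millions of videos, and the noise in the machine labels is what makes step 3 "weakly" rather than fully supervised.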
But if we start now to work toward this better future with ML in our video stacks, I look forward to a future in which viewing a video on my laptop is a far superior viewing experience to viewing it on traditional broadcast TV. No offense to any traditional broadcast TV people here. Thank you. Any questions? You have nine seconds. Eight, seven, six, five. All right, cool. Thank you.