Hello everyone. Can all of you hear me? Okay. Great. Thank you for coming to my talk. It is called "Lighting the black hole of the internet using AI," and as you might have guessed by now, or let me just dispel the suspense, it is about videos. Videos are considered the black hole of the internet today. We will discuss why they are considered a black hole, and also how to mitigate that black hole.

The majority of internet traffic today is actually video, and it is growing; it was predicted that 80% of internet traffic would be video by 2022. It is still the majority, more than 50%. And all of you have seen embedded videos on websites: on your company website, on your personal website, on tutorial websites, you always embed videos like this, right? Now, this image is from AWS webinars, and this is Google's tool that shows what the search engine actually scans when it crawls a page. If you look here, there are at least 20 videos embedded on this page, but what Google is able to see, on the right side, is nothing. It just sees a web page, probably with some text around the videos. So nearly 20 hours of content is embedded on the page, and Google or any other search engine is not able to look at it, because it is multimedia content, and they typically don't process multimedia content unless there is a need, and from their point of view there is literally no need here. But from your side, there are 20 hours of content which should be indexed for you to move higher in the SEO ranking. That's one problem. What people usually do to mitigate this problem is put the tags in themselves, saying, "hey Google, there is a video here and these are the tags," but this is a very ineffective way of doing things.

The second thing is engagement. All of you probably come from enterprises, and some come from colleges as students, and we see different kinds of videos in our day-to-day life; these are the kinds of videos that you see. As a user, when was the last time you watched a one-hour video continuously, without any break, or even the full hour from start to end? You never do. So all the informational content, everything apart from movies and sports, has a very low engagement rate as well as a very low completion rate. Typically all these MOOC videos, training videos, all-hands recordings, and meeting recordings only get watched for a fraction of their length. And how do you watch them? You randomly click at different points in time and try to see whether the video has something interesting for you, and it's just a matter of chance whether you find it or not. It's really a random guess. So definitely we can do better, right? And that's what I'm going to demonstrate to you: how we can do it better.

First I will give you a very quick overview of our technology, and then, because this is a data science conference, I will go deeper into one module of how we do it. So first there is an overview, and then a deeper dive. What we do is generate AI-generated video indices. For any given video, we generate three indices: a table of contents, a phrase cloud, and an in-video search, which is like a Ctrl-F on the video. Now, how do we do that? The table of contents captures the topic switches.
You take, let's say, Andrew Ng's one-hour video and pass it through our pipeline. What you get is every point where there is a topic switch, along with how long that topic is discussed, and this table of contents is overlaid on the video itself. So the next time you land on a video, you can quickly check whether you are interested in these topics, and the two minutes you would have spent randomly clicking, you can instead spend on the portion of the content you are most interested in.

The second thing we do is a phrase cloud, which summarizes the key concepts in a video, and the phrases are domain ranked. Given a video, we automatically figure out what domain the video is about: is it bioinformatics, is it machine learning, is it medical and health, is it education, what kind of video is it? Based on that, all the keywords, or rather all the phrases, are automatically ranked according to relevance. Again, you can click on any one of them and move directly to the point where the phrase is mentioned; if there are multiple occurrences, you will see multiple yellow markers there.

The third thing is in-video search. On some platforms you will already find a searchable transcript; in our case, we make the table of contents, the phrase cloud, and the transcript all searchable, so you can search for a given topic and find the occurrences.

Once you have a video repository, say your organization has produced a thousand videos (in Oracle's case, for instance, they have produced around 50,000 videos over time), all of these videos can be made deep-searchable. Right now, all that YouTube or any other video search engine does is index metadata, which includes the title, the description, and the tags if the user provides them. In our case, each video has these three extra attributes, the table of contents, the phrase cloud, and the transcript, which describe the video far better than a title or description could, so you can go much deeper. For instance, you can search for "neural network," and even if neural networks were discussed only around the 20th minute of a video, you will still find it, and that can happen across thousands of videos. I'll show you some use cases of that down the line.

The last thing is that you can track deep engagement. Typically, people start watching at the beginning of a video, so there is a big spike there, and then the watch curve is a reverse exponential. But when you have these indices, the table of contents or the phrase cloud, you also see intermediate jumps, because people go directly to topics, which you don't see in a video without a table of contents. If you have a YouTube channel, you can open it right now and you will see just a reverse exponential curve for every video. Apart from these deep analytics, we also collect a few more, which are not shown here, but this is what we do. So this is a brief overview of the VideoKen technology.

Behind all of this, we have a deep video indexing pipeline where we do all sorts of things: visual processing, ASR (automatic speech recognition), speaker diarization, which Vikram talked about, and then we also determine the domains and so on.
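To make the deep-search idea concrete, here is a minimal sketch of how the three indices of each video could be folded into one repository-wide searchable structure. The segment format, field names, and example entries are my own assumptions for illustration, not VideoKen's actual implementation.

```python
from collections import defaultdict

# Hypothetical per-video index entries: each entry is a table-of-contents
# topic, a phrase-cloud phrase, or a transcript snippet, together with the
# timestamp (in seconds) where it occurs in the video.
video_index = {
    "andrew-ng-lecture-03": [
        ("toc",        "Neural network basics",            1210),
        ("phrase",     "backpropagation",                  1565),
        ("transcript", "so the neural network learns ...", 1230),
    ],
    # ... one list per video in the repository
}

def build_inverted_index(video_index):
    """Map each lowercase token to the (video_id, source, timestamp) tuples
    where it appears, so a search can jump straight to, say, the 20th
    minute of the right video in a repository of thousands."""
    inverted = defaultdict(list)
    for video_id, entries in video_index.items():
        for source, text, ts in entries:
            for token in text.lower().split():
                inverted[token].append((video_id, source, ts))
    return inverted

def search(inverted, query):
    """Return hits that contain every token of the query."""
    hits_per_token = [set(inverted.get(t, [])) for t in query.lower().split()]
    return sorted(set.intersection(*hits_per_token)) if hits_per_token else []

index = build_inverted_index(video_index)
print(search(index, "neural network"))
# [('andrew-ng-lecture-03', 'toc', 1210), ('andrew-ng-lecture-03', 'transcript', 1230)]
```

A real system would stem and weight the tokens and keep the index in a search engine rather than a dict, but the shape of the hits (video, source index, timestamp) is what makes the deep navigation described above possible.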
But in this particular case, I'm going to look at only a very simple, somewhat deterministic problem: given a visually rich video, say a machine learning video with some slides, a code walkthrough, and a demo, how do you create an index for it? Let's take that as the problem. So these are the only modules I'm going to discuss: the input video comes in, and a table of contents is generated for that slide-based video. Let's look at the different modules and the flow.

What are the different challenges? It looks like a very simple problem: given slide-based videos, why don't you just run OCR and take the text with the largest font as the topic? Looks very simple, right? But there are inherent challenges. The way our product works is that you throw in any YouTube video and it should generate a table of contents, and we have seen different kinds of challenges in doing that. For instance, slides get built up using animations, so the frames for one slide end up spread out over time. Then you have different camera angles: think of an Apple keynote or an Oracle keynote, where the angle jumps around, sometimes the slide fills this much of the frame and sometimes that much, and there is literally no correlation between them, so you need to figure that out. There is different visual content: you can pick a title from a slide, but what if there is a demo where somebody is showing something in a browser, or a code walkthrough? There you can't pick anything visually, so you need to rely more on auditory features. Within these, there are different slide formats and the presence of other objects, like in this case the camera capturing me as well as the slides, and there can be occlusion, for instance. So there are different kinds of challenges that we have to overcome.

First, we designed a custom visual frame extractor. The main purpose of this module is, given a video, to detect key frames, and by key frames I don't mean anchor frames in the video-coding sense, but rather the frames where the slide is complete. In my case too, the first title might appear, then the second title, then the third; the objective of this module is to get the frame where the slide is full, as well as the actual timings of the content, for instance at what time in the video this particular word or phrase appeared. So we get this: we build a graph, and then we have our own clustering algorithms to detect key frames, which gives very few frames. In a typical 60-minute video you will have at the minimum 1,800 frames; using this algorithm we minimize that and arrive at something between 50 and 100 frames. Then, as these examples show, for each word we first detect whether it is text or not, and then at what time that text appears in the video. For instance, a video might start like this, then different objects come in, and you only get the full slide after, say, two minutes. So we are really interested in getting that full frame, and then we infer the timing of each word, that is, when it came up on the slide.
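The actual extractor builds a graph over frames and clusters it, which I can't reproduce here, so below is only a rough sketch of the general idea, under my own simplifying assumption that a slide is "complete" in the last sampled frame before a large visual change. The sampling rate and threshold are made up.

```python
import cv2
import numpy as np

def extract_key_frames(video_path, sample_fps=1.0, diff_threshold=0.15):
    """Sample roughly one frame per second, group visually similar consecutive
    frames, and keep the *last* frame of each group, on the assumption that a
    slide built up with animations is complete just before the next big change."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    step = max(int(round(native_fps / sample_fps)), 1)

    key_frames = []                      # list of (timestamp_seconds, frame)
    prev_small, last_frame, last_ts = None, None, 0.0
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            ts = idx / native_fps
            small = cv2.cvtColor(cv2.resize(frame, (64, 36)), cv2.COLOR_BGR2GRAY)
            small = small.astype(np.float32) / 255.0
            if prev_small is not None:
                change = float(np.mean(np.abs(small - prev_small)))
                if change > diff_threshold:
                    # big visual break: the previous sample was the "full" slide
                    key_frames.append((last_ts, last_frame))
            prev_small, last_frame, last_ts = small, frame, ts
        idx += 1
    if last_frame is not None:
        key_frames.append((last_ts, last_frame))   # final slide of the video
    cap.release()
    return key_frames
```

One way the per-word timings could then be recovered, again as an assumption rather than the method described in the talk, is to run OCR on each key frame and on the sampled frames before it, recording the earliest sample in which each word becomes visible.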
The second thing is that in a wide shot, such as at product conferences, there are different objects being captured by the camera. So how do you localize where the screen actually is? Because if you apply OCR to those kinds of images directly, you will pick up a lot of irrelevant text. That's where we built a slide segmentation module, based on a fully convolutional network model, which tries to detect where the screen is within a very wide frame.

Once you detect the screen, you need to understand whether it's really a slide, a demo, or a code walkthrough. If you open any tutorial or informational video on YouTube, you will see a combination of those things. So these different kinds of frames are categorized using a transfer-learned module, which says whether a frame is a slide, handwritten content, a code walkthrough, or a demo. All of this becomes very handy indexed information. For instance, if you have a machine learning repository, you can just say: give me a linear regression video with a code demo in it. With this kind of information, you can do that. And within a single video, you can jump directly to a demo or a code walkthrough without going through all the slides and animations. So all of this becomes a very handy index once you have it.

Once all of this processing is done, we have algorithms to check coherency, and we correct OCR mistakes. OCR makes a lot of errors when it reads from images; it can misread characters. So we have domain-specific models: we ask, if this is a machine learning video, what is the probability that this word is actually what the OCR says it is, and we correct it based on that. Typically, the slide design remains the same: if the title is here, it is going to stay there. That's an assumption we can make in some cases; it doesn't hold for all. But there are different heuristics as well as decision-making steps we have put in, which take all this information from the OCR and try to build a topic list.

So in summary, what VideoKen does is provide a quick summary and outline for end users. Again, what I've described is only for slide-based videos, but there is a similar pipeline for non-slide-based videos, where more natural language processing and auditory processing come into play, for instance detecting when someone says, "hey, let's move to the next topic." Here, though, the discussion was mostly focused on slide-based videos. It gives quick navigation to the desired point of interest, the topic you can navigate to directly. It also allows better discoverability by search engines. There is a standard called JSON-LD: whenever you embed a video, VideoKen automatically lets Google know that there is a video object, along with the key terms or key indices for that video. So when Google comes scanning your website, all of this gets fed into the head of the web page, and Google can see it much better. And then you have deep search and recommendation built on top of this deep indexing of videos, which is helpful.

Just for this community, I can mention that some of the top AI conferences this year have been using VideoKen. For instance, NeurIPS used it last year and is going to use it this year as well, and so have ICLR, ICML, and KDD. Some of the algorithms we have been using were, in fact, proposed at those conferences earlier.
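To make the JSON-LD point concrete, here is roughly what a schema.org VideoObject with clip-level key moments can look like when injected into the page head. The schema.org type and property names are real; the helper function, URLs, and values below are hypothetical.

```python
import json

def video_jsonld(title, description, embed_url, thumbnail_url, upload_date, toc):
    """Build a schema.org VideoObject with one Clip per table-of-contents topic,
    ready to be placed inside a <script type="application/ld+json"> tag.
    `toc` is a list of (topic, start_seconds, end_seconds) tuples."""
    return json.dumps({
        "@context": "https://schema.org",
        "@type": "VideoObject",
        "name": title,
        "description": description,
        "embedUrl": embed_url,
        "thumbnailUrl": thumbnail_url,
        "uploadDate": upload_date,
        "hasPart": [
            {
                "@type": "Clip",
                "name": topic,
                "startOffset": start,
                "endOffset": end,
                # a URL that seeks the player to this topic (hypothetical format)
                "url": f"{embed_url}?t={start}",
            }
            for topic, start, end in toc
        ],
    }, indent=2)

print(video_jsonld(
    "Intro to Neural Networks",
    "Lecture covering perceptrons, backpropagation and regularization.",
    "https://example.com/videos/nn-intro",
    "https://example.com/videos/nn-intro/thumb.jpg",
    "2019-08-01",
    [("Perceptrons", 0, 900), ("Backpropagation", 900, 2100)],
))
```

This is how a crawler that never downloads the video can still "see" the topics inside it.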
So it's a matter of a little pride for us. And for all these conferences, let's say this is the ICML page, where nearly 400 hours of content is streamed and every video is around one or two hours long, if you have to search for a paper or a topic, say you are interested in reinforcement learning, you just search for reinforcement learning and you get the papers that talk about reinforcement learning. All the topics on the left side are automatically inferred, because we do deep video indexing, so you automatically know what topics have been discussed. Apart from the conferences, of course, we have a lot of corporate customers who have been using it for a variety of use cases, from learning videos to indexing all-hands calls and Zoom calls, all the informational videos that you see around in a company.

And you can try it yourself: go to videoken.com, pick a favorite YouTube video, and just throw it in. You will get an indexed version with a table of contents and a phrase cloud, you can embed that, and you can see how it works. The way it works, and this is more of a product positioning, is that we don't modify any underlying video behavior. We work with more than 10 video players, be it YouTube, Vimeo, and so on; we work on top of them. So we don't really stream videos, but we make them more viewable, engaging, and all of that. Yeah, thank you. Thank you for your time.

Could you tell us something about the architecture behind this? How are you doing it, and how are you calculating accuracy, that is, whether whatever you are summarizing for a video is going to be correct?

Yes, we do use different deep learning architectures for different modules. For instance, take the visual classifier that says whether a frame is a slide, a demo, or a code walkthrough. It has been trained on ImageNet plus nearly 100,000 of our own domain-specific images. We have taken the Inception V3 model and transfer-learned it: we retrained the last few layers and learned that model. I think there are three or four different places where we use deep learning models: the one I just described, and also the slide segmentation, which is like an object segmentation problem where you have to detect whether there is screen content inside a frame. For that we use an FCN, which is also transfer-learned, with some of our domain-specific images. The actual parameters and so on, I'm happy to talk to you about after this talk.

A few years ago, Cisco had asked us to look at, for example, Webex videos and so forth, where there are multiple people talking, and we were not very successful, to be honest. Maybe we were successful with single-speaker videos, but with multiple speakers, one of the problems we had was speaker identification. Is that something you've been able to address?

We do speaker diarization; there is a separate module for that, for non-slide videos where multiple people are talking. Speaker diarization doesn't really provide production-level accuracy right now in general, but it does a very good job with two to three speakers. When there are many speakers and a lot of overlap, it doesn't do a great job, but if there is minimal overlap, it does very well.
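Going back to the visual classifier question for a moment, here is a minimal transfer-learning sketch along the lines described: an ImageNet-pretrained Inception V3 with a new classification head over the four frame types. The class names, head layers, and training setup are my own assumptions, not the speaker's actual configuration.

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import InceptionV3

CLASSES = ["slide", "handwritten", "code_walkthrough", "demo"]  # assumed labels

# ImageNet-pretrained backbone without its original classification head
base = InceptionV3(weights="imagenet", include_top=False, input_shape=(299, 299, 3))
base.trainable = False   # freeze pretrained features, train only the new head

x = layers.GlobalAveragePooling2D()(base.output)
x = layers.Dropout(0.3)(x)
outputs = layers.Dense(len(CLASSES), activation="softmax")(x)
model = models.Model(base.input, outputs)

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# train_ds / val_ds would be labeled key-frame datasets, on the order of the
# ~100,000 domain-specific images mentioned in the talk:
# model.fit(train_ds, validation_data=val_ds, epochs=5)
```

Unfreezing and fine-tuning a few of the top Inception blocks at a low learning rate once the head has converged is the usual next step, which seems consistent with retraining "the last few layers" as described.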
For instance, one kind of video we handle is interview-style videos, where there is no visual content but two people are talking. The natural table of contents there is that one person asks a question and the other replies. The interviewer is effectively steering the whole discussion, nudging the other person, so the questions become important topics for us. We use speaker diarization for that, but for many speakers, say six or seven, it doesn't do a good job.

I'm sure it will not work on our parliament, but it will probably work in regular meetings.

In the case of Webex or Zoom, they also have access to the microphone channels, so sometimes that information comes in handy as well. Right now Zoom, for instance, can tell when a given speaker spoke. Yeah, yeah, yeah.

Yeah, a very quick question here. Yeah, hi. I think my question is an extension of what you were talking about. The use case of videos which are generated out of slides, or which are instructional in nature, is particularly clear from your presentation. But there are other use cases, which could also be instructional videos, where a lot of the video frames come from very real-life situations, with a more natural kind of conversation, which could even have noise. We'd like to know whether you have tested on those kinds of grabs.

So you mean that people are talking, and that's the video? OK, so we can talk offline.

I actually have a real, commercial use case where these are instructional videos, e-learning kind of stuff, but a lot of the grabs there are from real-life hospitals and things like that.

OK, OK. I think typically, for these kinds of things, two components come in handy: given a video, we do speech-to-text, and we do speaker diarization based on acoustic features and all those things. If the speech-to-text is inherently bad, or the diarization is bad, then currently our algorithms have limitations on that kind of content. There are certain scenarios where we have been able to improve it, for instance if you fix the set of speakers. But I'm happy to talk.

And these need not be perfect solutions out of the box; there could be some level of post-work done on top.

So whatever I have shown you, we have been perfecting for at least a few years now, but there are a lot of modules which are still, here and there, 50-50, because each one of them is a very hard problem. Speaker diarization in itself, if you don't have the microphone input and there is overlapping speech, is hard. And if the speech-to-text is bad, say because of a very heavy accent, then no matter what you do after that, the output is also going to be bad. But the customer doesn't care, right? Given the video, they need a good table of contents and good indexing, basically.

There is almost too much to talk about here. Sure. Sure. Happy to talk about that, yeah? Sure. You can just ask.

So my question is that these videos are mostly restricted to educational or informative videos, or keynotes, so to say. Do you also do deep content indexing for entertainment videos?

So part of the pitch is that engagement is a big issue with these kinds of informational videos.
Engagement is usually not an issue with movies, but deep indexing and recommendation is. For instance: who was the actor there? Can you automatically recognize where there was maximum engagement, like a good dialogue, and where there was a scene change, and things like that? There is a lot of that. Part of it Amazon Prime Video, or even Netflix, does today, and they are more focused on those kinds of things. Some of the techniques may overlap, but we are not really focused on those. Thank you.

Let's take the question offline, OK? Yeah, I'm here, so happy to talk to you if there is anything.