Thanks, Sebastian. We've had a lot of diffusion talks here; I've seen at least four talks on diffusion and there are more coming, so it's pretty much the rage. Unless you're living under a rock, you'll know about Stable Diffusion and generative text-to-image art. This image is also AI-generated; this one is from Midjourney, which is my favorite right now. DALL·E sucks. And I don't really like cats, but I still generated cats. There's a lion on the moon, an astronaut walking on Mars, all the cool stuff I wanted to generate. But somewhere in there, there are problems with the legs and the hands. That's a core problem with diffusion models: they're not really good at hands or legs, and sometimes hair too. They're really not perfect.

As you've seen over the past year, say from April last year when DALL·E 2 came out, everything boomed. Everybody wanted to get into this: Midjourney, Stability AI with Stable Diffusion, Runway ML. A lot of people wanted to jump on the bandwagon, and then to go beyond images, to making stories and films. I don't know if you've been on Twitter, but there are movies generated with Runway ML, or by other folks who do nice video editing. It's not purely Stable Diffusion or any one diffusion model; there's a lot of post-processing involved. I want to show you where we are and how far we are from the goal of generating stories.

When I was a kid, I had trouble understanding things visually. I couldn't make sense of how the math worked, how eigenvalues and eigenvectors work, or whatever; it was hard for me. This is the Stable Diffusion web UI: you just type in some text and it gives you some image, some cat. So it gives the cat; it's a cute cat, can't complain. But what do we actually want, what do I actually want? To be able to generate visual, animated videos, mainly for kids, or for understanding any kind of content, in the style of, say, Khan Academy, or 3Blue1Brown videos if you've seen those. I want to give it a script, a YouTube script or anything generated by GPT, and generate a visual representation of it. But right now the cat is just sad; nothing is really happening yet. I want it to interpret what's happening and generate a contextualized sequence of images, which would ultimately, eventually, become a video.

So this is where we're at. It's mainly a lot of researchers doing things separately; they each have their GitHub repo and some people contribute, but not at a large scale. I've not seen a single repo that's heavily maintained or contributed to; I saw one where the last contribution was four months old, and not an active contribution. And I feel that if you want to get to story generation, anything with story generation,
we have to actively contribute to the repositories that are already there, be it Stable Diffusion or anything else out there like ControlNet and the many other models related to diffusion.

Cool. So I'm going to show you a video of what Runway ML Gen-2 is doing, and this is the closest we are right now: a lion walking in a rainstorm and all that. We're getting there, but we are really, really far away, because it can handle just one sentence, not a big query. If you give it a bigger one, there won't be a story; if you give it another sentence, it performs poorly. This is one of the examples. So what I want to do is take stories, blend them with different styles of teaching, or maybe make the model learn from YouTube videos and do the same thing. I want Khan Academy-style teaching from whatever ChatGPT-generated script you give it.

So that's the goal. Now I want to touch on the history of how this came about, because a lot of people don't know how diffusion came into place. There were GANs, and there was a lot happening even at Google; they were trying to build this story-generating AI long, long ago. This is kind of a flowchart. There are energy-based models, which you don't need to worry about; there are GANs; there are VAEs, variational autoencoders. DALL·E came out in 2021, and DALL·E 2 is very, very different from DALL·E 1. There's GLIDE, there's Imagen, everybody's on there. But even before all that, a lot was happening. This was, I would say, kind of the first big text-to-image paper; I think it was 2012 or so. You can see it was just the starting phase and it's not perfect: the beaks are not really good, the eyes are not there, there are so many problems. But this is where we started.

With DALL·E, I just wanted to show how inaccurate it is with hands and faces; you can see the faces going off. But now we can actually do prompt engineering: "painted by William-Adolphe", "full-length character design with baggy jeans", all that. This is possible now. But the ultimate goal is text-to-video, not text-to-image, and these models perform poorly when it comes to text-to-video. Let me first show you what's happening currently.

This one is funny enough: it's actually a sketch of a serial killer. It was given to ControlNet, which basically takes your sketches and converts them into nice-looking images, together with Protogen, which produces these really good-looking, model-like images. And this guy is Stelfie, the time traveler: he goes back in time and takes photos with Indiana Jones characters, with Nelson Mandela, with many other people. But he's not doing it directly through Stable Diffusion; there's no straight path. There's a lot of Photoshop, a lot of inpainting that he does. It takes him hours to make a single image, so let alone thinking about videos, right?

Now, GANs. Many people will have heard of the website This Person Does Not Exist. This was one of the earliest examples of GANs being near-perfect at faces, and with the newer versions you could basically never tell whether a face is real or fake. That's the concept behind GANs: you have some real-world images, and a generator tries to produce an image as close to the real world as possible. Then there's a discriminator, who's like a police officer trying to catch the thief generating the fakes, and they go back and forth at each other until the discriminator is no longer able to tell whether an image is fake or real.
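To make that police-and-thief picture concrete, a minimal toy GAN training step in PyTorch could look like the sketch below. This is just my own illustration, not anything from the talk; the tiny fully connected networks and the flattened 28x28 images are assumptions.

```python
import torch
import torch.nn as nn

# Toy generator ("the thief") and discriminator ("the police") for
# flattened 28x28 images; real GANs use convolutional networks.
G = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(real):  # real: (batch, 784) tensor of real images
    b = real.size(0)
    fake = G(torch.randn(b, 64))  # generate fakes from random noise

    # Discriminator update: label real images 1 and fakes 0.
    loss_d = bce(D(real), torch.ones(b, 1)) + bce(D(fake.detach()), torch.zeros(b, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator update: try to fool the discriminator into saying 1.
    loss_g = bce(D(fake), torch.ones(b, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```

Alternating these two updates is the back-and-forth: you keep going until the discriminator can no longer separate real from fake.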
This was another paper that did much better than the previous one, and it used the MS COCO dataset, if anybody has heard of that; it was one of the earliest well-known datasets, available before CLIP and everything else came about.

Then StoryGAN. This was one of the highlight papers; I think it was from Google. They tried the first variant of generating stories with animated characters, and this is also from COCO, if I'm not wrong. So here, "Pororo and Crong are fishing together; there's a bucket and a fishing rod." They try to get the main objects into the picture, and then they try to identify what else is there. "Pororo and Crong are fishing; Crong is looking at the bucket," so he's looking at the bucket. Here, "fishing together; Crong and Pororo are fishing, and Pororo has a fish with a fishing rod," and you can see a fishing rod. That was something really interesting, and this paper is actually what got me interested in generating stories and videos.

Then came VQGAN+CLIP, which many people on Discord and Reddit were trying, to get as close as possible to DALL·E, because that was the benchmark. This particular model is a variant of a GAN, a vector-quantized GAN combined with CLIP; I'll talk about CLIP in just a second. It came very, very close to what we wanted with images (I'll come to videos later). So yeah, this is just the images, and it came out nicely: "sketch of a 3D printer by Leonardo da Vinci."

Then we got DALL·E and everybody lost their minds. Here is "a stack of three cubes, a red cube on the top," and it's able to generate all of this very nicely. Maybe not here or here, but it basically changed the way we looked at generative text-to-image.

We were also able to generate these morphing, animated videos, and they've been all over Twitter. I'm very active on Twitter, always searching for what's new, and this is morphing: blueberry spaghetti into strawberry spaghetti. There's code for it, very small, not many lines; you can test this morphing stuff yourself.

There's outpainting, where you can see multiple images stitched together to form one beautiful piece of art. So many features of diffusion. There's inpainting: here there's a person, say, and they have a mask, and the diffusion process generates a smile inside it, or maybe angry, maybe sad. That's the process there.

There's panorama rendering; I'm showing you all the cool stuff here. They've generated a big city folded inside itself, and you can just do this. And I think just today, about six hours ago when I was on Twitter, they made text-to-3D models where you can basically zoom in and out of the scene. I don't know who it was, but I'm going to find out.

Cool. And it's not just images. Meta made the baby sloth with an orange knitted hat trying to figure out a laptop, and it was very, very detailed, down to the reflection in the eyes. I really loved it.
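By the way, if you want to play with the inpainting idea from a minute ago, a rough sketch with the diffusers library looks like this. It's my own example, not from the talk; the checkpoint name and the file names are placeholders.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

# Hypothetical inputs: a photo, and a mask whose white pixels mark
# the region the model should repaint.
image = Image.open("person.png").convert("RGB").resize((512, 512))
mask = Image.open("mouth_mask.png").convert("RGB").resize((512, 512))

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

# Only the masked region is regenerated, conditioned on the prompt;
# swap in "an angry face" or "a sad face" for the other moods.
result = pipe(prompt="a smiling face", image=image, mask_image=mask).images[0]
result.save("smiling.png")
```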
But this is all old stuff already. Then, obviously, Google was not going to be left behind, so they made this drone flythrough of a restaurant on a dystopian planet, right? Imagen tried to do better than DALL·E at generating images with text in them, for example if you wanted memes; I'm going to show you that as well. And maybe something in the Christopher Nolan style, an Oppenheimer-style shot. Is the background music okay? No sound? Cool. But yeah, pretty good. This is a start, right, for when you make your own movies with diffusion.

So how does it all work? We're going to start with CLIP. CLIP is just an embedding of text and images: you have a bunch of images, you have a bunch of text, you combine them together, and you get a nice table of images with text descriptions. So you have some images and you try to work out what is in them; on its own, a model cannot understand what is in an image. So there's a text encoder and an image encoder, and you basically match their outputs. And why is CLIP powerful? Because it can be combined with diffusion.

In diffusion there are two processes; I think Michael also talked about this previously. There's forward diffusion, where you're adding noise to an image: here's the dog, and you add noise gradually, gradually, gradually. And there's reverse diffusion, where you remove the noise gradually, going the opposite way, but conditioned on whatever is in the prompt, through something like CLIP. You want to get closer to the prompt, so you use CLIP together with this, you get nice-looking images, and going backwards the model learns a representation for generating an image. Some people have even figured out how to brighten photos taken in the dark based on this process, because going backward and denoising can recover nice images; but it's not really fast, so it doesn't work that well.

And this is how you can easily build your own diffusion model. It's very easy. It's not using any pretrained model, just your own images in your diffusion setup. There's some Gaussian noise, so it's Gaussian diffusion, and there's a U-Net, an encoder-decoder network. There's a learning rate, all the usual machine learning stuff; you have a batch size of 32. All of this depends on how powerful your system is, how much prompt engineering you do, and what you really want, but these are the hyperparameters you should tune, and it's easier to train at this scale. You'll get some sample images after a lot of training. I've been told, and found from experimentation, that the number of timesteps is mainly around 200; it gives really nice images around that value. That's also why it sometimes takes a huge amount of time to train all of this.

And you can do it in Keras if you're a Keras lover. I'm not; I'm a PyTorch lover. But you can import Keras, take the StableDiffusion model, call its text_to_image method with a prompt, and then generate and show the result; you get some nice art.
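That Keras path looks roughly like this with KerasCV (just a sketch; the prompt is made up):

```python
import keras_cv
from matplotlib import pyplot as plt

# Pretrained Stable Diffusion wrapped as a single Keras model.
model = keras_cv.models.StableDiffusion(img_width=512, img_height=512)

# One call goes from prompt to a batch of generated images.
images = model.text_to_image(
    "a cute cat astronaut walking on Mars, digital art", batch_size=1
)
plt.imshow(images[0])
plt.axis("off")
plt.show()
```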
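And going back to CLIP for a second: to make the text-encoder/image-encoder matching concrete, here's a tiny sketch using the Hugging Face transformers implementation (my own example; cat.jpg is a placeholder file):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("cat.jpg")  # placeholder image
texts = ["a photo of a cat", "a photo of a dog"]

# Encode the image and both captions, then score image-text similarity.
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # shape (1, 2)
print(logits.softmax(dim=-1))  # most probability should land on the cat caption
```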
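The forward "add noise gradually" step is also only a few lines. Here's a DDPM-style sketch in PyTorch, assuming the standard linear beta schedule (again my own illustration, not a specific model from the talk):

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)               # linear noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)  # signal kept up to step t

def q_sample(x0, t, noise=None):
    """Forward diffusion: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps."""
    if noise is None:
        noise = torch.randn_like(x0)  # the Gaussian noise being added
    a = alphas_cumprod[t].sqrt().view(-1, 1, 1, 1)
    s = (1.0 - alphas_cumprod[t]).sqrt().view(-1, 1, 1, 1)
    return a * x0 + s * noise

# The reverse process is what gets trained: a U-Net learns to predict `noise`
# from x_t (optionally conditioned on a prompt embedding), and sampling then
# subtracts that prediction step by step to denoise back to an image.
```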
And nowadays there are one-click Colab notebooks, so you can just go to one, run that one cell, and you have the Gradio Stable Diffusion web UI ready for you to work with.

Cool. So, going from one sentence to multiple sentences: you have a ferris wheel, and a lake next to the ferris wheel, and buildings next to the lake. This conjunction is what's happening here, and it helps create more dynamic images, which you'd want. You have to phrase it this way, because it's not like ChatGPT, where you can explain and say "don't do this, don't do that." This is one of the papers the lab came up with.

Can everyone recognize what's in this image? It's a robot building structures in Minecraft, part of the research I'm working on, getting agents to make structures. And this is Europe, I think. Not nice at all; you can see the text is very wobbly. I don't like this. Yeah, that's why DALL·E sucks. Sorry if anybody from OpenAI is here; I hope not.

And what about memes? They're not really good at memes. Or are they? This Donald Trump one is pretty nice, but in this one you can see there are problems with the hands again. The problem with hands is the dataset: the dataset these models are trained on doesn't have specific, large, close-up images of hands. All the model sees are patterns, edges, and blobs here and there, so it tries to get close. That's why hands are a problem: it doesn't understand what makes a hand, it only understands fingers. And yeah, the text is pretty bad, but it does try to generate a comic.

Also, in the process of generating images from prompts, DALL·E seemed to invent its own language: when you put certain gibberish back into the prompt, it would give you these images. So it has its own secret language. You type this, wow, whatever, and you get this back, which is very close to the fish images, or whatever that is.

What Parti did is train on a different subset with different parameters (you know, all these large models with their billions of parameters), but specifically focused on understanding how to generate text properly; they have the "Welcome Friends!" example. Cool.

Then, ControlNet for advertisements. This is actually from a sketch by a friend of mine, and I generated this Coca-Cola advertisement using ControlNet, which is also a diffusion model. What I wanted to highlight is that from this one sketch it generates so many variants of the advertisement that I could actually sell them to Coca-Cola.

And this is on my own images. DreamBooth is a model that has been around for some time, and people are monetizing it pretty heavily; there are a lot of startups doing this. This is basically me, but from 10 images; the online websites take around 15 images or so. It took a long time to train this: it's a custom pretrained model, DreamBooth with ControlNet in Stable Diffusion.

And there's RoomGPT: that's the original room, and there's the generated room. This is also ControlNet, and I think ControlNet is taking over all the creative aspects that were missing in plain Stable Diffusion. The generated room is pretty good now.
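Going back to the sketch-to-advertisement trick for a second: a rough sketch of that ControlNet workflow with the diffusers library looks like this. It's not my exact setup; the checkpoints, file name, and prompt here are placeholders.

```python
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# A ControlNet trained on scribbles steers Stable Diffusion to follow the sketch.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-scribble", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

sketch = Image.open("bottle_sketch.png")  # placeholder hand-drawn sketch
image = pipe(
    "a glossy soda advertisement poster, studio lighting", image=sketch
).images[0]
image.save("ad_variant.png")  # rerun for more variants of the same sketch
```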
So, where are we right now? I want to highlight it again: we are here. We are still here, but this is the closest we are right now; let me play it again. We want to go beyond this, not just single two-second, low-resolution videos. We want one- or two-minute videos with context in the images, context in the videos, so that it's coherent and you can understand what's happening, like a movie. And what we want is this, right? We really want this: any random, fully visualized story. So we are here, but we want to be there. That's pretty much where I want to get to.

All right, thanks. That was pretty much it. I do have some extras, but I was saving them for, I don't know... Are there questions?

Thank you very much, great talk. If anyone has questions, please come to the microphone and stand in line. We have five minutes for questions.

Thanks for the great talk, it looks very cool. You were talking about longer videos, maybe longer movies. One field that I haven't really heard about in this talk is, for example, Netflix and these providers. Do you think there's a future for these streaming providers with AI-generated content?

Yeah, I think so, for everybody: at least for content creators, mainly Instagram, YouTube, and TikTok creators, and I think Netflix and the like are going to be among the first on this. Obviously Netflix's machine learning engineers are not behind; they know all of this already, and they are also trying to solve it, but obviously it's not open source from their side, right? I think video is a hard thing to do, but it will be coming; everybody is on it. Everybody wants AI-generated movies; it's been in so many books, so much fiction.

Yeah, maybe at some point even generated on the fly: I've had a bad day at work, I want a cheerful movie.

Yeah, especially for mental health therapy. There will be specialized, personalized content based on the way you feel, and it'll make videos that make you feel good.

Yeah, thanks.

So I had two questions, but I'll save one for the break. Maybe one step easier than generating movies: imagine you have a starting image and you want to create, almost like the comics you mentioned before, a sequence from that one image, so that together the images tell the stages of a story, like the script you showed, for example. Would that be easier or harder, and what tools would you use for it?

Okay, so one problem with that. You're talking about images that are contextualized together, right? With a consistent style? So, GANs have been trying this, and I think there's already a story-diffusion paper; they're trying the same thing, but with an LSTM, which is another neural network, to try to remember what the context is and understand "okay, this is what I want," and then maybe stitch it together with CLIP: "this is what happened previously, now I need to generate something from this part." But it's pretty hard to generate the same content again, because these are random processes a lot of the time. You want to get close to the previous image, but it's pretty hard to do that.

Would it be easier if you had a highly specialized model, already trained on a consistent style?
Yeah, I think that would be good, but I have not seen any so far. If you've seen one, we can chat about it. Thank you.

So there's a 30-second AI-generated beer commercial on YouTube. One, have you seen it, and two, do you have any idea how it was made?

I have not seen that, but we can chat about it; I'll watch the video. Actually, I think I've seen it somewhere, I just don't remember it. Was it Heineken? Okay, cool.

Hello, thank you for the talk. I wanted to ask a question similar to the previous one. I saw some people who were able to fine-tune a model and then get some consistency in the images, right? Because when you're trying to create a story, it's a problem if you create a dog and in the next picture it's a different dog; it just doesn't work. So I wanted to ask: is it possible as of now, or is it a thing of the future, to be able to generate the same dog twice?

So what they're doing, I think, is this: with a lot of images, they first generate the images, and then with ControlNet they basically sketch out the story, in a way, and then try to make a video out of it. But it's not very good. I don't know if I have the example here, but I can show you; that one is also not that good. It's mainly frogs, and the frogs are doing something, but it doesn't really show them jumping or doing anything. The output will be consistent, but with images it's a little different; it's not really consistent, I would say. Have you seen...

Sorry, for any further questions please use the microphone, because this is recorded and we are running out of time. Nilesh will be around, so please give him one more round of applause for the great talk.