Thank you, and welcome to the talk. Thank you for coming here almost at the end of the day. You've probably heard a lot already, and maybe you're a bit tired, like me. But no worries, I'll try not to overcomplicate stuff. I will tell you a story most of the time, and it will be a story of Robot Holmes.

But first a little bit about me: who am I? I'm Johannes Kolbe, that's my full name. I'm a data scientist at Celebrate Company. We are a German company, and we help people to celebrate, you could say, because people can order customizable print products: stuff like wedding cards, birth cards for your baby, custom calendars, the stuff you need around Christmas, when you cover your entire need for presents for your parents or grandparents by ordering some calendars with nice pictures.

We're a small data science team, actually, and my focus is on computer vision. I do have some expertise in natural language processing as well, but I'm completely self-taught in that area, so I would say: I know a little bit, but don't ask me for too many details.

I got my Master of Science in computer science at the Technical University of Berlin, and my focus was on cognitive systems. Since last year I'm a Hugging Face fellow. Maybe the company Hugging Face was already mentioned in the last talk, and I guess people from the machine learning space might know it, because it's quite big by now. That's also why I'm very much sure you know this emoji, right? That's the mascot, the hugging face, of course. They've got this fellowship thing where they reward people that are active in the community. I like to lead a computer vision study group every few months on the Discord channel you can see there, and you can also find past study groups down there. I like to present papers, computer vision papers, and I like to give my presentations a little twist. So we had a Super Mario presentation, a Pokémon presentation, a summer presentation, a medieval presentation; the last one was futuristic, with an evil AI overlord. So you see, today is similar, just that it's not one paper I want to talk about. I want to introduce you to the whole area of vision language models. If you've never heard about vision language models, that's perfect: you will get your first steps into it. If you've heard about them and know some stuff, that's also good, because you've probably never seen it all wrapped up in a story like today, and you might get some of the inside jokes.

So, let's jump into the story.

It was a typical MLington day: cold, grey, rainy. Luckily the sun was setting and taking its light off the misery of the streets. Robot Holmes entered his office and found the chief of police already waiting for him. "There's been a murder," the chief said. "There always is," Holmes replied. "It's old Sam. He has been found dead in his devastated workshop." Old Sam, one of the people over in Vision Language Village. A reputable citizen. He'd been there for ages and basically started the whole part of town with some other guys. There were rumors of him being some sort of leader in the district's community. But now he's dead, and nobody knows why. "There's something going on on the other side of the river, Holmes," the chief proclaimed. "Almost every other week an established citizen is murdered. We have a murder series on our hands, and no clues."
"I want you to find out what's going on over there." Holmes sighed heavily, put on his hat, and stepped out into the dark city streets of MLington.

OK, so much for the setup. Now you might wonder: what the hell is MLington? Probably you've never heard of MLington, so I will give you a little tour. This is MLington. Oh, and I forgot to say: you will see a lot of AI-generated art here. Basically every image on every slide is full of AI-generated art. I used Stable Diffusion (the SDXL model, for those interested) and spent a few hours prompting. This one, for example, is actually inspired by an old London map, but it's a bit different here.

So this is MLington, the city of MLington, and like every good city it has a big river in the middle, like in Prague as well. And it has some districts. At the heart of it there is the City of Science, which is not related to the Parisian City of Science, if you know that one. Here the City of Science is where the scientists live. They've got their laboratories, they build cathedrals for their research fields, sometimes ivory towers pop up and fall down again, and if they love their research fields very much, or if they're just under pressure to publish something, models get born. And these models then move out of the City of Science.

To the east of it is Vision Spires. That's where the computer vision models live; they move there. It's quite an old part of town, computer vision has been there for ages, basically, so it's well established. They even have a big tower over there, which is quite handy for computer vision: you know, looking down on stuff, bird's-eye perspective, spotting graffiti, all those crazy things.

Now, if you're a language model or some other natural language processing thingy, you move west of the City of Science, because that is where Language Shire is. That's where all the natural language processing models live. It's basically as old as Vision Spires, I would say, and from the days of symbolic NLP all the way to the modern large language models, that's where they all live.

If you pay attention, there is one more district I will talk about today, where basically the whole story happens, on the other side of the river: Vision Language Village, down there. That's a fairly new part of town. I mean, there has always been a small settlement, there have always been some models living there, and you can see it's actually connected to Vision Spires and to Language Shire through the bridges. So it's got connections to both of them, while those two are well separated from each other by the City of Science. Vision Language Village is connected to both. That's where the vision language models live, and recently there's been quite a lot going on in Vision Language Village; the quality of the services offered in the village has gotten way better recently. Now, not everyone might know what kind of services are offered there.
So let's just take a little tour (after you've taken the picture). What are some of the services offered there? One that is pretty well known is image captioning; I will go a bit more into detail on the next slides. Another one is visual grounding, which is actually a whole family of tasks. Then there is image-text retrieval, which might not be that well known, but it's pretty handy. Visual question answering, one of my favorites, actually. And one that's become pretty well known recently is text-to-image generation: stuff like Stable Diffusion, diffusion models in general, Midjourney, whatever you want.

So let's have a look. Image captioning is pretty self-explanatory: you've got an image, you give it to a model, and the model gives you some text, like "an illustration of a city street at night". Image captioning can be pretty handy, especially when you want barrier-free websites, so you can automatically create captions and everything. Pretty basic in itself.

The next one: image-text retrieval. Here we have a collection of images, for example these four images, which are all different, and then we have a text, a prompt; again we just take the one from the last slide, "an illustration of a city street at night". The task of the model is to take these inputs, the collection of images and the text, and give us the best-fitting image, which is this one. There is basically a twin task to that, called text-image retrieval, which is the other way around: we have a lot of texts and one image, and the model has to find out which text fits the image. It can be pretty handy when you have a big collection of images and you want to search it for some stuff.

I quickly have to take a sip now. The next task is visual question answering, as I said one of my favorites, because you can have a lot of fun with it. Let's take this picture; it's really similar to the one before, only now you see there's something in the middle. And we ask: "What is in the middle of the street?" It will tell us "a police box", which is right. So visual question answering is about asking the model something about the image. The good thing is that now that we're in the large language model era, we can even chat with it further, and it remembers stuff. So I went on and asked: could it also be a TARDIS? I don't know how many people here know Doctor Who and what a TARDIS is. I already gave this presentation in London in a similar way, and I asked this question there, and there were really few people who knew what a TARDIS is, which surprised me. So for those who don't know: a TARDIS is a spaceship that is disguised as a police box. Of course, it's bigger on the inside; it's Time Lord technology. It's from the TV series Doctor Who. It looks just the same from the outside, so it could be a TARDIS, right? So I asked it: "Could it be a TARDIS?" "Yes, it could be a TARDIS." And then I asked: "Is it a TARDIS?" And it said: "No."
Actually, I went on and asked: OK, why isn't it a TARDIS? And it said: "Because it is in the middle of the street," which apparently makes sense to it. So you can just have a lot of fun with it. This particular model was not really chatty, but yeah.

Next task: visual grounding. You basically have a prompt again, for example "a police box", you give it that text plus an image, and then the task is to detect: object detection, basically. Detect the police boxes. You can also say "the police box and the street lamp", and then it detects all that stuff you see there. There are some different tasks within visual grounding, but in the end it's all about detecting objects in the image from a prompt.

OK, so let's go back to our actual story. Robot Holmes starts investigating. Of course, first he goes to Vision Language Village. There he goes to the workshop of old Sam to snoop around and see if he can find any evidence, and he's lucky. (Oh, that is the end. Wait a second, what happened now? OK, no, you haven't seen anything. OK, OK, we go back. Here we are.) He finds a piece of red cloth. That might not look like much, but there is one citizen of Vision Language Village who is notoriously known for her taste in red clothing, and her name is CLIP.

CLIP is one of the citizens that had the most impact in recent years. She was brought to life by OpenAI in 2021, and she changed the whole village, basically. So what does she actually do? What does she offer, what's her service? We have CLIP, we have an image, we can give her some texts, and then she will tell us which one is the most probable fit for the image. So it is text-image retrieval, as we learned before.

And why is she so good at it? What makes her stand out from all the others? Well, one thing is: she knows a lot of people. She has a lot of connections, and all these people helped her to collect data. So she collected a lot of data in the form of text-image pairs, more data than anyone before; in the end it was about 400 million text-image pairs. And as you know, in machine learning, if you scale data up to a reasonably high point like 400 million, you can probably get good results, and that's what happened.

So she took those pairs, images and texts, and now she had to process them somehow. For the texts she went to Language Shire, where she learned how to turn them into other representations: embeddings, we call them in machine learning, which are just representations of the texts. And she took the images to Vision Spires, where the images were also turned into representations. Then she took both of them, and now she could actually compare them. Before, they were text and images, and it's hard to compare text and images; from a data point of view they are really different. But now, because both are in representation form, basically lists of numbers, they have the same dimensions and the same shape. So she can take both of them, align them, and say: along the diagonal, those are the matching pairs, and the others don't match. We call this image-text contrastive learning, or ITC for short. And 400 million samples is a lot; that's what helped CLIP become so successful. Actually, the most amazing thing is the zero-shot capability: she can even tell you about stuff that was not in the data she learned from. She's really good at generalizing.
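To make the ITC idea a bit more concrete, here is a minimal sketch of a CLIP-style contrastive loss in PyTorch. This is my own simplified illustration, not CLIP's actual training code: it assumes you already have a batch of matched image and text embeddings, and it treats the diagonal of the similarity matrix as the matching pairs.

```python
import torch
import torch.nn.functional as F

def clip_style_itc_loss(image_emb, text_emb, temperature=0.07):
    """Image-text contrastive (ITC) loss over a batch of N matched pairs."""
    # Normalize so the dot product becomes cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (N, N) similarity matrix: entry [i, j] compares image i with text j.
    logits = image_emb @ text_emb.T / temperature

    # The matching pairs sit on the diagonal.
    targets = torch.arange(image_emb.size(0))

    # Pull matching pairs together and push the rest apart, in both directions.
    loss_image_to_text = F.cross_entropy(logits, targets)
    loss_text_to_image = F.cross_entropy(logits.T, targets)
    return (loss_image_to_text + loss_text_to_image) / 2
```

The zero-shot trick is then just this loss's inference-time cousin: embed an image, embed a handful of candidate texts, and softmax the similarities.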
OK, now Holmes thinks about how he can interrogate her. Well, he just starts by showing her the red cloth, and he has some prompts for her. The first ones are "a piece of red cloth", "a piece from a murderer's clothing", and "a piece of cloth an innocent citizen lost". And what she says is that most likely it's "a piece of red cloth". OK, that was too easy, and it doesn't get Holmes anywhere. So he says: let's cut the first one and keep just the other two. Now it's "a piece of cloth an innocent citizen lost"; still not really helpful for Holmes, because he really thinks CLIP is involved in this. So he does some prompt engineering (you might know this term from large language models; you can do it here as well) and he just changes the text in the middle to "a piece of cloth from a murderer's clothing", and suddenly it's past 50% that it is from a murderer's clothing. That's really helpful for Holmes. You could say: now I've got her, she's basically saying it herself. But, you know, it's only about a 1% difference, so he needs some more evidence.

So he thinks about what else he can do. While investigating CLIP, there was one other citizen who looked quite suspicious. He came by CLIP's workshop a lot of times to ask stuff, and his name is OWL-ViT, over there, brought to life in 2022 by Google, actually. What OWL-ViT does is exactly this: give it a prompt, a text, plus an image, and you get your visual grounding, or in OWL-ViT's case it's called open-vocabulary object detection, which is just some sort of visual grounding in the end. And similar to CLIP, it's actually good at detecting stuff it has never seen before; these zero-shot capabilities are what really make it stand out.

So, a short look at the training of OWL-ViT. How does it work? How is it so good at detecting stuff? The first step is just CLIP: if you take a pre-trained CLIP model, you've got the first step covered. That's the whole image-text contrastive learning. The second step is a bit more complicated. We take a prompt, for example these four texts, and then we get our image and split it up into patches; here it's four parts of the image, four patches. Then we give both of these to CLIP. But you might see the difference between the CLIP on the right side and the CLIP on the left side. On the right side I actually would have to cut off her head, because that's what we say in machine learning, we cut off the head of a model, but I couldn't do it because I got a bit emotionally attached to CLIP, so I just removed her mask instead. With the mask removed (the head cut off), we can get these representations out of CLIP. So we take those representations you've seen before, and we feed them to OWL-ViT. And the secret of OWL-ViT is that it actually has two hats: not just one hat, but two. The upper one is called the classification head and the lower one the regression head. We give the image representations to both of these heads; the text representations only go into the classification head. The classification head will then tell us, for each patch, what is most probably there, that's what it predicts. And the second one, the regression head, is in charge of giving us the bounding boxes around whatever the classification head points to.
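As a rough sketch of that two-head idea (my own simplification, not Google's actual OWL-ViT code), you could picture something like this: per-patch tokens from the de-masked CLIP image encoder go into both heads; the classification head scores each patch token against the text embeddings, and the regression head predicts a box for it.

```python
import torch
import torch.nn as nn

class OwlStyleHeads(nn.Module):
    """Toy version of OWL-ViT's two heads on top of CLIP-like features."""

    def __init__(self, dim: int = 512):
        super().__init__()
        # Classification head: project patch tokens into the text embedding space.
        self.class_proj = nn.Linear(dim, dim)
        # Regression head: predict one box (cx, cy, w, h) per patch token.
        self.box_head = nn.Sequential(
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 4)
        )

    def forward(self, patch_tokens: torch.Tensor, text_emb: torch.Tensor):
        # patch_tokens: (num_patches, dim), text_emb: (num_texts, dim)
        class_logits = self.class_proj(patch_tokens) @ text_emb.T  # (patches, texts)
        boxes = self.box_head(patch_tokens).sigmoid()              # coords in [0, 1]
        return class_logits, boxes

heads = OwlStyleHeads()
logits, boxes = heads(torch.randn(4, 512), torch.randn(3, 512))  # 4 patches, 3 prompts
```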
And that's basically all the training. So now we're in the interrogation part again. Holmes wants to get some information out of OWL-ViT. His prompts are "murderer", "lamp", and "lamp post", and he gives OWL-ViT a picture of CLIP. And what comes out is this: lamp and lamp post are detected, but no murderer. Well, we take a step back and say: OK, maybe you can detect "a woman". Yep, works well: a woman. OK, maybe you can detect "a figure in a red dress". Yep, works well too. OK, what about "a murderer in a red dress"? Yeah, that works as well. So you can see that with vision language models, if you're good at prompt engineering, you can often get exactly the output you want just by rephrasing stuff a bit, and then you can frame it just as you like. That's a bit of the danger of these models, and the same goes for anything you can prompt-engineer, basically. It's about the questions you ask, in the end.

Well, for Holmes it's great. He says: oh yeah, now I'm really sure CLIP did it, CLIP is the murderer. But there's one thing he still wonders about: what was the motive? Why did she kill old Sam? He goes down to the river, sits there, and thinks about it, when suddenly a young man walks by, almost a boy, and he loses a red marble. A red, glowing marble, actually. Holmes calls out for him, and he turns around, but his gaze misses Holmes in an uncanny way, and Holmes realizes that the boy is blind. Holmes tries to talk to him, and he can understand, but he can't talk. So he's blind and mute, but not deaf: he can hear pretty well, but he can't talk and can't see. Which is strange for Vision Language Village, where you want to process text and images, a blind and mute person. But whatever; Holmes gives him the marble and they part ways.

Still, he's a bit suspicious, so Holmes starts investigating and finds out that the boy is called BLIP-2, or, as I like to call him, Q. Q actually has two hearts in his chest, a vision Transformer heart and a text Transformer heart, and these two hearts are connected within him. Why is that? Because his ancestors come from Language Shire and from Vision Spires. He's got a lot of family in the other parts of town, and family is really important to him; family is everything to him. And they can actually help him do his tasks in Vision Language Village.

How that works, I will show you here. We've got Q in the middle, and we take an image again and split it into patches, four patches again, and we give that to one of the vision Transformers from Vision Spires. It's really just a fully pre-trained vision Transformer; it's called ViT-B/16, pretty catchy name. We give the image to this vision Transformer, and it gives us representations again, and it can hand those representations directly to Q's heart, because it's family and close to his heart. That's how it works on the vision side, basically: his family are his eyes here. And he can hear pretty well, as I said, so we can just take a text and give it to a text Transformer. Now, his biggest secret really is this: I like to call them his memory marbles. You could also call them learnable tokens, but memory marbles is just a nicer name. These are what really help him process the information he gets from the text and from the image, because they pass through him, attending to the image representations, and in the end they come out again a bit changed, learned, basically; that's how he learns stuff about images. And these memory marbles, or learnable tokens, you can then pass on, for example, to a large language model, and the large language model basically has the knowledge from BLIP, from Q, and can then give us an output. That's also why I call him Q, by the way: the whole thing in the middle is called a Q-Former.
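Here is a toy sketch of that memory-marble idea, just to make it concrete. It's my own heavily simplified illustration, not the real Q-Former (which is a full BERT-style Transformer with cross-attention layers and text inputs): a set of learnable query tokens cross-attends to frozen image features and comes out as a fixed-size summary that a language model can consume.

```python
import torch
import torch.nn as nn

class ToyQFormer(nn.Module):
    """Minimal illustration of BLIP-2's learnable queries ('memory marbles')."""

    def __init__(self, dim: int = 768, num_queries: int = 32):
        super().__init__()
        # The "memory marbles": learned vectors, shared across all images.
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, image_feats: torch.Tensor) -> torch.Tensor:
        # image_feats: (batch, num_patches, dim) from a frozen vision Transformer.
        q = self.queries.unsqueeze(0).expand(image_feats.size(0), -1, -1)
        # The marbles soak up image information via cross-attention...
        out, _ = self.cross_attn(q, image_feats, image_feats)
        # ...and come out "a bit changed": a fixed number of tokens that can be
        # projected into a large language model's input space.
        return out

qformer = ToyQFormer()
marbles = qformer(torch.randn(1, 4, 768))  # 4 image patches in, 32 tokens out
```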
So now Holmes went to his workshop, met him there with some family members, and started the interrogation. He said: "Here's an image of old Sam. What is shown in the image?" "A robot." "Is he an innocent citizen?" "Yes." "Could he be responsible for murders?" "No." "Why not?" "Because he's a robot." "Can robots not commit murders, or not be responsible for them?" "No." "So he could potentially commit a murder?" "Yes." "Could he also be a potential mob leader?" "Yes." That's actually the real output I got, by the way, including these dots; I didn't insert them, that's just how it talked. "Would you be shocked if I told you he's dead?" "Yes." "Do you think he was a mob leader?" "Yes."

OK, so now Holmes has all the pieces in place, and he knows: old Sam was a mob leader, so the whole Vision Language Village thing is one big mob, and CLIP just wanted to kill the leader and become the top herself. And just as he realizes this, he hears a scream. He runs into the streets and sees a red dress on the ground, and he knows it's CLIP, and he is too late. She's dead, a hole in her heart, her mask ripped off. He can't do anything for her.

With his main suspect dead, it began to dawn on Holmes that there might not be any victims at all. Everyone in these parts is striving for the top, and they all have their own ways to get there. One day they are the murderer; the next, they are dead. Who killed old Sam? Well, Holmes is pretty sure it was CLIP, but now she's dead herself. The chief won't be happy, because the chief wants arrests and not deaths, but things are moving fast, maybe too fast, in this city. The sun is rising; time for some sleep. The end.

OK, that's the end of the story. Now for the second part: I've told you about all these models, and you might wonder: how can I use them? How do I try them out? That's where Hugging Face comes in. You might know the transformers package from Hugging Face; it just crossed 100,000 GitHub stars sometime in the past week, and it's pretty good, because you can find all the models I talked about in there and use them in just a few lines. For example CLIP; this is just the example from the documentation. (Wait, how much time do I have left?) You really just need these lines to run CLIP: you import CLIPProcessor and CLIPModel from transformers, load the model, get the processor and some image, then define the input texts you want to prompt against, and in these few lines you can already run it.
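For reference, the snippet on the slide is essentially the CLIP example from the transformers documentation; it looks roughly like this (the COCO image URL and the cat/dog texts are the documentation's stock example):

```python
import requests
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Any image works; this is the stock example image.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# The texts you want to prompt against.
texts = ["a photo of a cat", "a photo of a dog"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# Image-text similarity scores, softmaxed into one probability per text.
probs = outputs.logits_per_image.softmax(dim=1)
print(dict(zip(texts, probs[0].tolist())))
```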
If you don't want to code it yourself, here are some demos I really like. The first one shows this zero-shot thing really well: it's a Marvel heroes classification thing. The whole page is on Hugging Face; it's called Hugging Face Spaces, and when you register and have an account you can just go to Spaces and create your demos for free, at least if you only want CPU-powered ones. You can also get a GPU, where you have to pay a bit, but you can always stop it, so it's really pay-as-you-go. Here you just need a CPU. The interface is Gradio; it's all built with Gradio. You can just click on an example, and you get the image, and it's pretty sure that it's Black Panther, which is right as far as I know.

So how is that actually done? If you want to know the code, that's also easy: you can just go up here, because you can find the whole code in this app.py. It was not created by me, by the way; it's on my account now just to keep it maintained for the presentation, but I duplicated it, and you can see up here that I actually duplicated the whole demo from someone else. So let's go to the original author, and you see: that's all the code that's needed for this demo, something like 22 lines, and you get your CLIP demo. And you can see that it's really just loading the original OpenAI model; there's no training on Marvel superheroes or whatever, it can just do it out of the box, basically, which is pretty impressive, I think.

Then there's also a Pictionary one, which is quite fun. It's basically a game where you get a sentence that you have to draw, "a drawing of a cat with a face", and then you can just start drawing here. Without a mouse... something that maybe looks like a cat, and then it will guess. What does it say? "A drawing" of... something... maybe "a face"... come on, you can guess it. Yeah, it's quite fun; you can try it yourself, maybe you're better than me at drawing a cat with a face.

For the other ones, OWL-ViT for example, it's the same thing again: you import the processor and the model, build the inputs, get the outputs, and then, because it's this object detection thing, you also have to do some post-processing to get the boxes in the right sizes. You can have a look at it yourself in my slides. And the demos are pretty interesting, I think. In this OWL-ViT demo you have the image, then some text queries in here, and then the detections over here. It's pretty good, actually. I can just as well change one of the queries to something like "helmet" (wait a second), and now it has detected the helmet down here. And what I didn't tell you about OWL-ViT is that it can also take an image as an input, so it's also a one-shot learner: when you have your source image, like these cats here, and you want to detect a remote, you just give it a remote like this, and then it can detect remotes, which is pretty handy. And you see, it's not the same remote as here, and it's still able to detect it, so it's pretty good at that.

What's left is BLIP, or BLIP-2 actually; there's also BLIP-1, but BLIP-2 is a bit better. And there's also a new InstructBLIP, which is actually even better; I put a link here for InstructBLIP as well. It's a bit more chatty than the BLIP-2 thing, because as you saw, BLIP-2 just says "yes, no, yes, no", while InstructBLIP is more like "oh yeah, no, it's because of this or because of that". You can also try it here. BLIP actually needs a GPU; that's why I didn't host it myself, because I would have to pay money for it, so I just didn't. But you can just play with the demo: take an image, say "generate caption", for example, and it will generate a caption, "the Merlion fountain at Marina Bay in Singapore". You can also ask it a question, like here, and generate an answer, and then you can go on chatting with it, next thing, next thing, next thing.
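Again for reference, the OWL-ViT lines I mentioned look roughly like this in transformers, essentially the documentation example, with the extra post-processing step that rescales the boxes to the original image size:

```python
import requests
import torch
from PIL import Image
from transformers import OwlViTForObjectDetection, OwlViTProcessor

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
texts = [["a photo of a cat", "a photo of a remote"]]  # one list of queries per image

inputs = processor(text=texts, images=image, return_tensors="pt")
outputs = model(**inputs)

# Post-processing: rescale the predicted boxes to the original image size.
target_sizes = torch.tensor([image.size[::-1]])
results = processor.post_process_object_detection(
    outputs=outputs, target_sizes=target_sizes, threshold=0.1
)
for score, label, box in zip(
    results[0]["scores"], results[0]["labels"], results[0]["boxes"]
):
    print(f"{texts[0][label]}: {score:.2f} at {box.tolist()}")
```

And a BLIP-2 caption-plus-question sketch, again close to the documentation example; the half-precision GPU loading is what makes the 2.7B OPT variant fit on a single card, which is also why the hosted demo needs a GPU:

```python
import requests
import torch
from PIL import Image
from transformers import Blip2ForConditionalGeneration, Blip2Processor

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
).to("cuda")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Captioning: pass only the image, no text prompt.
inputs = processor(images=image, return_tensors="pt").to("cuda", torch.float16)
caption = processor.decode(model.generate(**inputs)[0], skip_special_tokens=True)

# Visual question answering uses a "Question: ... Answer:" prompt format.
prompt = "Question: how many cats are there? Answer:"
inputs = processor(images=image, text=prompt, return_tensors="pt")
inputs = inputs.to("cuda", torch.float16)
answer = processor.decode(model.generate(**inputs)[0], skip_special_tokens=True)
print(caption, "|", answer)
```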
So that's all stuff you can try yourself at home, or when you have time around here. I hope you learned a thing or two and had fun. That's it for me. Thank you.

Do we have any questions? We do not have any questions on the Discord either, but you can later reach out to the speaker on the Discord, in the EuroPython channel, if you're there. Thank you again.

I wanted to ask you about... so, you showed that it's possible to massage the data: you somehow change the prompt and you get the answer that you want. Is there any kind of practice that is used to guard against this? Because in physics, for example, when you do experimental physics, you want to do a blind experiment, or a double-blind one: you decide all the prompts you want to give beforehand, and once you have given the prompts, you cannot massage them anymore, because otherwise you can always get the answer that you want. So is there anything in the research that is interesting to follow up on, or any ideas?

Yeah, I think maybe the whole instruction-tuning thing goes a bit in that direction: you have higher-quality data, in a way, and it's better controlled. Because a lot of the datasets these models are trained on are just scraped from the web or whatever, and they're not that controlled; a bit messy, noisy. That's where I think this sensitivity to small changes emerges from. And I think with InstructBLIP, for example, it's already way better and not as easily manipulated. So yeah, I think maybe this instruction-tuning thing can help, even though I also don't know to what extent; probably there are still ways you can manipulate it. I mean, everyone probably knows the ChatGPT jailbreaks, so there's always some way to manipulate these models, and that's the biggest weakness, I think, of this whole thing. There's a lot of hype around large language models, but there are also obvious downsides to them.

We don't want to miss any questions; is there someone? All right, then. Thank you for being so kind as to give us such a delightful talk.