Hey everybody. I'm going to start right on time so that we have as much time for questions as possible at the end. Thanks for coming to our presentation on Drupal media image captions using AI. To start off, we'll do a brief introduction. I'm Laura and this is Boben. We are both senior software developers at MyPlanet, a digital studio that specializes in smarter interfaces. We have a wide variety of clients and use a lot of different technologies, including Drupal, AI implementations, and custom JavaScript applications, just to name a few. MyPlanet is located in Toronto, Canada, which is where I'm from, and Boben lives in Serbia. One really cool thing about this talk is that it's the first time Boben and I have met in person after working together for almost two years. That's one of the really cool things about DrupalCon: it brings together developers from all over, including the ones you work with.

Okay, on to the presentation. I'm going to talk a little bit about the where, the what, and the why associated with this topic, and then Boben is going to give a live demo. So first, the burning question. When you think about AI-suggested image captions for Drupal media, one of the first questions you might ask is: why? What is the use case for this? Oops, wait a minute, sorry, having a technical moment; it's not advancing my notes to the next slide. All right, I'll just wing it.

So why use AI captions in a CMS? Basically, it saves a lot of time. I know that as a developer, when I'm building an image feature and I have to enter an image over and over again, I'm just going to type "test" in the alt field. I think a lot of us take shortcuts like that, which is why we so often end up with bad alt text, or no alternative text at all. Another use case is a large archive of images. If it takes real effort to write one image caption, a large archive is going to take you forever to caption by hand. So having a tool that inserts image captions you can review later is great. And we would strongly recommend, in both cases, whether you're entering one image or processing a giant archive, that you go back and review the captions, because some percentage of the time you will get unexpected results.

Another reason to use it is as a prompt for content authors to enter a proper caption. In the same way that Drupal's alt field is required by default, and that in itself is a prompt to enter something, a button that says "generate image caption" is a prompt to enter a full sentence, a good caption, rather than just a single word. And another possibility is as an educational tool that shows what types of captions work best for screen reader users. I'll talk about that a little more later, when I get to the Microsoft Ability Initiative, but keep it in mind for now as a possible use case.

Now, a bit of background about the model. We decided to focus this demo on the integration of the model with Drupal, and Boben is going to talk a little later about how creating a model like this is within reach of any of us.
The model we're using is implemented with TensorFlow, through an API provided by IBM. So the model was already created and pre-trained on a data set. It's based on a model called Show and Tell, developed by Google research scientists. They started from an earlier model they had been using to translate from one language to another, an encoder-decoder model, and decided to try adapting it to "translate" images into sentences. So they replaced the encoder part, which encoded sentences, with a CNN that encodes images. The encoder now encodes images, and the decoder decodes them into sentences.

Here's a little more information about the encoder-decoder model. The encoder is a deep convolutional neural network called Inception v3. It was developed by Google and is pre-trained on a very large data set; the Inception v3 models are currently considered state of the art for encoding images. The decoder is an RNN, a recurrent neural network, specifically a long short-term memory (LSTM) network. These are good at producing sequences, and in this case the network produces one word at a time, so the sequence it produces is the sentence.

What we're looking at here is a deep convolutional neural network. The role of the DCNN is to reduce images into a form that is easier and less expensive to process. This is a visualization of convolution: there is a filter, the smaller object there, also called a kernel, that scans across the image and reduces its dimensions while preserving the edges and the most important features, then passes the result along to the next layer. A deep convolutional network may have many layers, which is actually what the term "deep" refers to: layer upon layer, each successive one extracting more abstract features from the image and reducing it further.

Now, the recurrent neural network. This image is a representation provided in the Show and Tell documentation. What we see on the left is the DCNN; even though you can't see it clearly right now, the little chunks represent the different convolutional layers. It outputs an image vector, and that gets handed to the decoder, the RNN. At each step, the RNN generates one word of the sentence. Then they use a technique called beam search that, at each step, keeps the top three candidate sentences produced so far. They did some experimentation and found that the right beam size, meaning the number of sentences kept at each iteration, is three. And you'll see during the demo that what the model gives back to us at the end is in fact three sentences.
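To make the decoding step a little more concrete, here is the idea, roughly as the Show and Tell paper describes it. The model is trained to maximize the probability of the correct caption given the image, and that probability factors into one term per word:

\[
\log p(S \mid I) \;=\; \sum_{t=1}^{N} \log p\big(S_t \mid I, S_1, \ldots, S_{t-1}\big)
\]

where \(I\) is the image and \(S_1, \ldots, S_N\) are the words of the sentence. At inference time, searching for the exactly best sentence is intractable, so beam search keeps only the \(k\) best partial sentences at each step (here \(k = 3\)), extends each of them by one word, and again keeps the best \(k\) of the results.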
The data set is really important: different data sets will give you drastically different results. The data set we're using is a generic one. It's very large and meant to cover most use cases. If you had a very specific use case, for example a model that could identify specific types of clothing, you would use a data set designed specifically for that. It could look at an image of a person and say they're wearing a double-breasted suit and a winter tie, or whatever.

Next, the Microsoft Ability Initiative. The slide says "accessibility" because I always get that wrong, but it's the Ability Initiative, and we've got links to all of these things in this slideshow, which we're going to publish if you're interested. The Microsoft Ability Initiative is working to produce a data set designed specifically to yield good captions for screen reader users: looking at what types of information tend to be missing from AI-generated captions. They're also exploring how an AI caption can be more than just text; it can be multi-dimensional, it could include multimedia. So there are a lot of different things AI could potentially offer visually impaired users, which is pretty interesting. And I think it's going to provide great information for anyone who wants to create their own data set on how to write captions that work well for screen reader users.

Why not use AI captions? This is just a caveat: when you're using them, be careful, because sometimes you are going to see unexpected results. Microsoft published a tool called CaptionBot. I don't know if anyone remembers CaptionBot, but they put up this AI captioning tool and said: Internet, come try out our AI captioning tool. And the Internet responded by uploading all kinds of political images and other things the AI probably wouldn't have any idea about, and got very funny results back. For example, this one is a picture of the first lunar landing, and the caption says: "I'm not really confident, but I think it's a man standing on top of a dirt field," which is kind of funny. It also illustrates one of the challenges with AI captioning: a caption ideally should communicate the intent of the author, and while AI captions are improving, they're not quite there yet. One improvement being explored is having the AI look not just at the image but also at the surrounding text of the article, to get a better idea of the context. But that's not quite there yet either.

So, in conclusion: use responsibly, implemented as an assistive tool rather than as a default. Review the captions; make sure they're good. If you are publishing captions that potentially have not been reviewed, make it known to visually impaired users that there is a percentage of error. Studies have shown that visually impaired users tend to trust the captions, but that if you add information saying, for example, there's a 20% chance this caption is incorrect, they are more likely to be appropriately skeptical of it, which is a good thing. And prompt content authors to review the captions as well.

I'm about to hand it over to Boben, but I see there is one question. Should we answer that now? Absolutely. Okay. I was wondering if you have any... Can you step up to the mic? Sorry. I was wondering if you have any tips on how to communicate this to the visually impaired user. Would you include it in the alt text, or have a general message at the top of the page? You know, I think it could be communicated in different ways.
One proposal I had is that you could include that information in the alt text, and then once the captions have been reviewed, remove it from the alt text; that would be a good way to do it. On the other hand, if users are about to look at an archive of images that has not been reviewed, you could put that information at the start. So, yeah. All right. Over to Boben, live demo time.

Thank you, Laura. Hi, everyone. My name is Boben. You'll have to forgive me, first of all: this is my first time presenting at DrupalCon, so I've set a timer to keep myself from talking too much about such a complex topic as using AI, especially in Drupal projects. Laura gave a very good intro into how and why we might want AI models that suggest captions for the range of different images on our websites. And at MyPlanet, a lot of the sites we've worked on in the past had exactly that need. Instead of bugging content moderators to enter different alt tags for different images: we had sites with a fairly large number of images, and it was important that each of them had an alt tag for visually impaired users. As AI technology progressed, we explored the various technologies on offer. As Laura already mentioned, the IBM model is just one of many technologies you can use to train a machine to detect something in an image. For the purposes of this demo we used IBM because, I wouldn't say it's the simplest one, but it serves well as a proof of concept for how you could go ahead and train your own models to recognize various kinds of images. I'll come back to this point at the end, once we've shown the live demo and gone a bit deeper into how we implemented it.

Before we really dive into the code, a little bit of history on the demo. Before we started implementing this solution, we needed to ask ourselves two questions. First, how to do this in Drupal. Second, what to use as the server that tells us what is shown in the image. On the Drupal end, we decided to simply use the regular module-and-library approach. The library is used to initiate API calls towards whatever endpoint will give us the results. The module is there to give you administration over the API usage; one example is the definition of the base URL used to ping the model for results. Again, this is an important point that I will stress at the end, and you'll see why it's needed. So we had the library implemented on the Drupal end, and we decided to go with the IBM model to get results for image captions.

Let me show you what we did. This is, of course, a Drupal 8 site, and since the Field API has been largely upgraded in Drupal 8, it was extremely easy for us to implement enhancements on top of the existing fields. In our case, we took the image field, which is part of core, and enhanced it so that you get additional configuration options when adding the field to any of your content types, and we also extended the field widget functionality that determines how the field is presented on the node edit form. So we extended only two base classes: the field type, for the configuration options, and the field widget, for the display.
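As a rough illustration of the field type half of that, a minimal sketch of extending the core image field type with one extra configuration option might look like the following. The module, class, and setting names here are made up for the example; the actual module is linked at the end of the slides.

```php
<?php

namespace Drupal\ai_image_caption\Plugin\Field\FieldType;

use Drupal\Core\Form\FormStateInterface;
use Drupal\image\Plugin\Field\FieldType\ImageItem;

/**
 * Hypothetical field type extending the core image field.
 *
 * @FieldType(
 *   id = "ai_caption_image",
 *   label = @Translation("Image (AI captions)"),
 *   default_widget = "ai_caption_image_widget",
 *   default_formatter = "image"
 * )
 */
class AiCaptionImageItem extends ImageItem {

  public static function defaultFieldSettings() {
    // Add our flag on top of the normal image field settings.
    return ['auto_populate_alt' => TRUE] + parent::defaultFieldSettings();
  }

  public function fieldSettingsForm(array $form, FormStateInterface $form_state) {
    $element = parent::fieldSettingsForm($form, $form_state);
    // The checkbox shown on the field configuration page.
    $element['auto_populate_alt'] = [
      '#type' => 'checkbox',
      '#title' => $this->t('Automatically copy the selected AI caption into the alt field'),
      '#default_value' => $this->getSetting('auto_populate_alt'),
    ];
    return $element;
  }

}
```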
Of course, you could also extend the module to include field formatter options and show the image any way you want, but since the alt tag is part of the natural rendering of the image field, there was really no need for us to do anything extra there.

How does it work? I know you're all excited to see this. Oops. At least it's not a white screen of death. Here we are; trying not to touch anything else. We have the image field. You can select any image; I've numbered them, so let's select this image, and it gets uploaded. What you get is additional options besides the alt tag, which is available by default. We added this fieldset here with a single button. If we hit it, it does all the stuff in the background that we'll discuss soon enough, and as a result it gives you this additional select field.

Now, no model works in 100% of cases. And even when it gives you a good caption, it doesn't give you just one result; it gives you at least a couple, depending on the model. In this case we're using a select element because we have a number of different options. As you can see, and I hope everyone can see, I have an image of a mountain and a cloud. All three results showing here pretty much describe what we are trying to set in the alt tag. It's not very literary, it's not something a poet might write, but it's still a very good representation of what the image is about. You'll also notice that the sentence doesn't start with a capital letter; that's something you, as developers, can fix with additional formatting. But the important part is at the end: that long decimal number represents, in the model's opinion, the probability that what it's offering you is a correct description of the image. It's a fairly small number because, like I said, it's not an exact match; it's what the model infers from its own, presumably large, database of images.

This probability is what the IBM MAX Image Caption Generator returns. When you pick an option, it automatically populates the alt tag field, and that behaviour is optional: we added a configuration option in the field settings to turn it on or off. If you disable it, then, let's see, let's add an additional image. Again, let's generate the alt tag. And again, let me quickly show: this one is an image of a train, and the description we get from the IBM model is pretty good, pretty accurate. Again we get the probability showing, and if we want, we can just delete it. The good thing is that the alt field itself stays editable regardless of what you select here, so you can simply delete the text and enter whatever you want instead. That's one additional, optional step for content admins, in case they want to reformat or change the suggestion that came back from IBM.

That's pretty much it. It's a small demo, but a lot of things happen in the back end. Let me just check my time. I still have some. Good. So I'll quickly switch to the code itself. How does this work? There are a number of steps between the moment we load the page and the moment we get the select box with the available options.

Starting with the two classes we mentioned: the field type simply adds the checkbox on the field configuration page, and the field widget shows the button. We used the regular Form API in Drupal 8 to add the additional elements: the button that triggers everything else, and the select box for the results. The widget isn't a complete solution on its own, because it also depends on some JavaScript work we did, since everything that involves executing the API call itself is done through AJAX.
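Again as a rough sketch rather than the module's exact code, and reusing the hypothetical names from the earlier field type example, the widget half could look something like this. The real module wires the button up with its own small jQuery behavior, which is what performs the AJAX call.

```php
<?php

namespace Drupal\ai_image_caption\Plugin\Field\FieldWidget;

use Drupal\Core\Field\FieldItemListInterface;
use Drupal\Core\Form\FormStateInterface;
use Drupal\image\Plugin\Field\FieldWidget\ImageWidget;

/**
 * Hypothetical widget: adds a "Generate alt tag" button next to the
 * normal image widget elements.
 *
 * @FieldWidget(
 *   id = "ai_caption_image_widget",
 *   label = @Translation("Image widget with AI caption suggestions"),
 *   field_types = {"ai_caption_image"}
 * )
 */
class AiCaptionImageWidget extends ImageWidget {

  public function formElement(FieldItemListInterface $items, $delta, array $element, array &$form, FormStateInterface $form_state) {
    $element = parent::formElement($items, $delta, $element, $form, $form_state);

    // Simplified: the real code would add this in the element's #process
    // step, so the fieldset only appears once a file has been uploaded.
    $element['ai_caption'] = [
      '#type' => 'fieldset',
      '#title' => $this->t('AI caption suggestions'),
    ];
    $element['ai_caption']['generate'] = [
      '#type' => 'button',
      '#value' => $this->t('Generate alt tag'),
      // A small jQuery behavior attached to this button reads the uploaded
      // file ID and calls the module's controller via AJAX; the library
      // name below is an assumption for the example.
      '#attributes' => ['class' => ['ai-caption-generate']],
      '#attached' => ['library' => ['ai_image_caption/generate']],
    ];
    return $element;
  }

}
```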
By the way, the last slide contains all the links, including the link to the repository where our custom module and library are available, and we of course encourage you to clone them, expand them, and enhance them in any way you want.

So, what's next? You get the field widget displaying the button, you upload the image, the button appears, and you click Generate alt tag. What happens then? There's a small piece of JavaScript we added, again part of the custom module itself. I won't zoom in right now because it's dense code, but it's regular jQuery with a little plain JavaScript: it takes the uploaded image's ID and executes an AJAX call. It's relatively straightforward. The point of all this is that it pings a Drupal controller we created. You have two choices here: you can either ping the IBM model directly from the client, or you can have a Drupal backend call the IBM system and parse the data any way you want before returning it. We chose the latter method because it gives us greater control: the Drupal backend pings the IBM model, gets the returned data, and then we can parse it, capitalize the first letter, remove the probability or add it, instead of doing everything on the client side. Whichever method you choose is up to you, but this is what we currently did.
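For illustration, here's a minimal sketch of what such a controller might look like. The route, class, and config names are hypothetical, and the response shape is roughly what the MAX Image Caption Generator's REST endpoint returned at the time of the talk (a list of predictions, each with a caption and a probability); check the repository and IBM's documentation for the exact details.

```php
<?php

namespace Drupal\ai_image_caption\Controller;

use Drupal\Core\Controller\ControllerBase;
use Drupal\file\Entity\File;
use Symfony\Component\HttpFoundation\JsonResponse;

/**
 * Hypothetical controller: receives a file ID from the widget's AJAX call,
 * forwards the image to the caption model, returns cleaned-up suggestions.
 */
class CaptionController extends ControllerBase {

  public function generate($fid) {
    $file = File::load($fid);
    if (!$file) {
      return new JsonResponse(['error' => 'Unknown file'], 404);
    }

    // The base URL comes from the module's configuration form.
    $base_url = $this->config('ai_image_caption.settings')->get('api_base_url');

    // The MAX model expects a multipart POST with the raw image data.
    $response = \Drupal::httpClient()->post($base_url . '/model/predict', [
      'multipart' => [
        [
          'name' => 'image',
          'contents' => fopen($file->getFileUri(), 'r'),
          'filename' => $file->getFilename(),
        ],
      ],
    ]);

    $data = json_decode((string) $response->getBody(), TRUE);

    // Server-side parsing: capitalize the first letter, keep the
    // probability so the widget can display (or hide) it.
    $suggestions = [];
    foreach ($data['predictions'] ?? [] as $prediction) {
      $suggestions[] = [
        'caption' => ucfirst($prediction['caption']),
        'probability' => $prediction['probability'],
      ];
    }

    return new JsonResponse($suggestions);
  }

}
```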
So now we are pinging, in our case, the Drupal backend. The controller we created takes the image ID, collects all the necessary data, and pings the IBM MAX Image Caption Generator; it's a very long title, so I'll just say IBM from now on. What happens there is the most interesting part, and it's the part Laura was talking about. The image is processed: it goes through the encoder, the convolutional neural network, first of all, and is broken down into many different details. In general terms, the image is split into whatever pieces the system can process individually. So we had a mountain with a sky above it, maybe a little river in it; the convolutional neural network tries to pull out all of those elements. It splits the image into smaller pieces and runs them against its extremely large database. I think the IBM model was trained on around 500,000 images. They're not hugely diverse; they're mostly what I'd call nature images, which is why, in our case, we got pretty good results and descriptions. Once it goes through the encoder, the decoder tries to put all those independent pieces together, and as a result you get a couple of different sentences. The combination of all the different features that came out of the encoder can give a lot of different results.

That's why you also get a couple of different options, each with a different probability: the larger the probability, the better the description should be. Of course, the most important thing in all of this is the data set itself. Like I said, IBM currently has about 500,000 images, and I think they're continuing to work on this, but since this model is a proof of concept for them, they have a whole slew of other models trained for different purposes. In any case, back to our code: the decoder finalizes everything, assembles the different sentences, and produces JSON output, which our Drupal controller receives. In Drupal, like I said, we parse it, do whatever we want with it, and output JSON again; the good thing is you can use the JSON:API module, which will be shipping in core soon, and I think it's already available. That gets handed back to the JavaScript, and as a result, what you get is... let me switch. Where's the browser? Okay, bear with me while I'm tweaking. What we get as a result is a select box with the options you can use. It's pretty sensitive. But there they are, a number of different options.

There's one more demo I'd like to show you, based on what I said earlier about why it's important that models are trained well. The IBM model is mostly about nature, so it's not really good at recognizing anything else. Don't just take my word for it, but machines won't be ready for at least another ten years to detect everything you throw at them; that takes a lot of storage and a lot of data. So I found an image that's going to be interesting for the system to try to describe. Here we have a close-up of a monkey's face. Let's get the available options. What does the model see? A large elephant with trees in the background, that's the first one; and the third option is my favorite, it's completely off point. It doesn't mean the system doesn't work; it's simply a matter of how well you train it.

Initially, when we started this proof-of-concept project, we asked ourselves: do we really need to create and train our own model for our special purposes, or can we just use IBM's and hope that any image we throw at it will work fine? Well, for a proof of concept at least, the answer was pretty simple. This matters for you because Drupal is heading, and already is, heavily invested in media. You're building media sites with a lot of images: not only images in articles, but also hero banners and static images that perhaps don't carry crucial information for sighted users, yet for visually impaired users they may well carry information. So it's important to provide as much information as you can, not only for the users who can see, but also for visually impaired users. The point is: if you have a website about recipes, about food, it would be far preferable to create your own model and train it with a whole set of images related to food and recipes, so that something like this works to your advantage. You would create it once, constantly feed it new images, and the model would work for you; you'd see the benefits in the long run, of course. That's also one of the ways the technology will progress, because a model can build on its parent, and different learning models can be joined together.
So instead of just seeing this as an elephant, the same model could get data from another model and then describe the image properly. I expect the next ten years will be pretty interesting, because the models will grow in size, driven not only by Google, IBM, Microsoft, and Facebook, but by the rest of the community, including Drupal. So that's about it. Thank you. And of course, we have time for questions. Yeah, please take the mic.

You demonstrated generating captions one by one. Is it also possible to generate captions for a large archive of image fields on an existing website? Absolutely, and this goes back to my first point about how we decided to structure the project: you have a library that's simply there to initiate API calls. The biggest limitation is that only one API call per image can be made to the IBM endpoint. If you create your own model and write the backend so that it accepts multiple images at the same time, that's fine, but the IBM one only allows one image per call. As for the implementation, since we're talking about the image field in this demo: if it's a multi-value field, you'll have one generate button per uploaded image, so you click twice to get alt tags for two images. That's about it. This is a proof of concept, and everything Drupal gives you, you can use; you can extend it any way you want. But yes, it is absolutely possible; a rough sketch of what that could look like follows.
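As a sketch of that bulk workflow, reusing the same hypothetical service names as before, something like this could run from a one-off script or drush command: one request per image (the IBM endpoint's limit), with the results logged for editorial review rather than saved blindly.

```php
<?php

// Hypothetical bulk script: suggest alt text for every image file on the
// site. The 'ai_image_caption.generator' service and its generate() method
// are assumptions based on the demo module, not a published API.

$file_storage = \Drupal::entityTypeManager()->getStorage('file');
$captioner = \Drupal::service('ai_image_caption.generator');

// Find every managed file with an image MIME type.
$fids = \Drupal::entityQuery('file')
  ->condition('filemime', 'image/%', 'LIKE')
  ->accessCheck(FALSE)
  ->execute();

foreach ($file_storage->loadMultiple($fids) as $file) {
  // One API request per image; take the highest-probability suggestion.
  $suggestions = $captioner->generate($file->getFileUri());
  if (!empty($suggestions)) {
    // Log for later editorial review rather than publishing unreviewed text.
    \Drupal::logger('ai_image_caption')->info('@uri: @caption', [
      '@uri' => $file->getFileUri(),
      '@caption' => $suggestions[0]['caption'],
    ]);
  }
}
```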
Okay, a quick question about IPTC data. Most large-scale image banks and agencies, such as the Associated Press, convey information using IPTC data, which is fairly limited; it has no support for linguistics or anything. Have you tried using IPTC data for the machine learning calibration? Laura can take this one, since she's closer to that side of things, but I'm guessing the answer is no, right? Yeah, no, we haven't, as far as I know. Again, this is a proof of concept, so it would be interesting to explore. And here's an idea for you: if you work at a company, you could pitch this to one of your existing or future clients.

What costs are associated with calling the IBM service to get the caption, and what are the approximate costs of getting alternative image sets to train on? Half a million images plus descriptions isn't something we can really build up by ourselves, so we'd have to get it somewhere. Do you want to take this? Yeah, so I know that some data sets are publicly available. The one used to train this Show and Tell model, the COCO data set, is public, and I know there are other large ones that are too. The Ability Initiative is going to make theirs public, as far as I know. As for the ones that charge, I'm not sure what the cost is, but data sets are probably going to be monetized more and more, so it will be interesting to see what they cost. Okay, and the IBM service, is it currently free? The one in this demo, yes, for evaluation purposes, probably. I think it may only allow a certain number of calls in a given amount of time. The way they offer it: if you want to use it for commercial purposes, it's never free; for non-commercial use, I think up to about 100 API calls per day is free. But with IBM it's always "contact us for more information," so based on your needs, you should always contact them directly and spell out your use case.

It's also worth mentioning that Coursera has a free course on model training with Amazon AWS. As you already know, AWS charges by usage, so if you have something sitting there doing nothing, you're not charged anything; and they have an extremely valuable course on machine learning, especially as it relates to their services. I think one of the things you'll find there is how they price this. In any case, this is emerging technology, so it's definitely not free, especially if you consider that half a million quality images, say 2000 by 2000 resolution, take up a lot of space. And we're not talking only about storage; we're talking about processing power, because the network doesn't pass over an image just once. It passes over it several times, because of the way the convolutional network and the RNN work: they have multiple layers, so the image isn't sliced once but recombined and sliced in different ways. It's an extremely time-consuming process, and we still don't have unlimited processing power. Yeah, training a model with many, many images and captions can take a couple of weeks.

Are the captions always delivered in English? Any plans for translations, for multilingual sites? Yeah, so as with everything, PHP, Python, and the rest, everything they do, they do in English first. This proof of concept, the IBM one, is only available in English. There are other languages supported by other services, and I'm guessing they are not free at all. Don't quote me on this, but a language like Dutch might well be available for the same kind of service; it's just charged for.

One last question, or remark: say we use this system and I correct the alt text, because, well, it's a gorilla and not an elephant. Is that information being sent back to IBM to train the system, or are there other providers who do that? You remember the whole thing with Amazon Alexa listening to you even when you haven't said anything to it? It's the same idea here. We're living in the age of information, so I'm guessing anything you supply it with, it will use to train. It's not doing anything bad with it; it's using it to train the model. And I think that if you train your own model, you could definitely set it up, if you were using a system like this within a CMS, so that it sends the corrections back, taking whatever the users enter and adding that information for that image. All right. Well, thank you, everyone, for coming.