Hello, folks. My name is Diptharup. I'm a product designer at Flipkart, and I'll be talking about how we're using image-based generative AI models to enable large-scale visualization of products.

So here's the problem. Worldwide, and not just in India, users are hesitant to buy large electronics or furniture online, primarily because they're not able to visualize the product in their own space. How can we solve that? That is the core problem we were looking at. Augmented reality was able to solve this to some extent, but the solution is not scalable. The first problem is that a lot of Android devices do not support AR. Secondly, handheld mobile devices are not as stable as head-mounted devices, so aligning objects is pretty difficult for users. And most importantly, for augmented reality to work you need 3D models of all the products to place in the scene, and making those 3D models is in itself a very costly affair. So the solution AR was providing was not scalable. These images you see here are from Flipkart's user research team, who went out to find the reasons why users are hesitant to buy these products online.

The goal of the experience we were building with generative AI was to provide a seamless, frictionless way for users to visualize a given product in their space, and we wanted it to be scalable. In terms of target audiences, since Flipkart is an e-commerce company, we primarily wanted to target pin code changers, people who have multiple pin codes, and also users who browse a lot of home categories: furniture, televisions, and so on.

In terms of methodology, we were using a capability of image-based generative AI models called in-painting. What is in-painting? It refers to the model's ability to fill in a missing piece of an image. The picture you see here was generated in DALL·E from a prompt like "a red bicycle on a hill, oil paint, big brush strokes, in the style of Claude Monet". Now I erase a certain section of that image and put in another prompt saying "a red house in a similar style", and DALL·E paints in a house that matches the surroundings. That is in-painting, and the solution we were looking at is based on the same principle.

I'll briefly touch upon the experience here. Every product had a trigger. For this particular television, the trigger was the "Try the AI room design" widget; clicking on it takes you to a home page, or say an onboarding page, with a multimedia widget, a video explaining to the user what the feature is about. Then we have a flow to capture an image, and then we have centralized tabs. The centralized tabs contain templates for users to play with, and whatever progress they make with the images they've uploaded gets saved there.
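To make the in-painting idea concrete, here is a minimal sketch of what a mask-and-prompt call can look like with the open-source diffusers library and a Stable Diffusion in-painting checkpoint. The checkpoint ID, file names, and prompt are illustrative assumptions, not the exact pipeline described in the talk.

```python
# Minimal in-painting sketch (illustrative, not Flipkart's production pipeline).
# Requires: pip install diffusers transformers torch pillow
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

# Assumed checkpoint; any SD in-painting checkpoint with the same interface works.
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting",
    torch_dtype=torch.float16,
).to("cuda")

# "room.png" is the source photo; "mask.png" is white where the model should repaint.
image = Image.open("room.png").convert("RGB").resize((512, 512))
mask = Image.open("mask.png").convert("RGB").resize((512, 512))

result = pipe(
    prompt="a red house in a similar style",  # the second prompt from the example
    image=image,
    mask_image=mask,
    num_inference_steps=30,
).images[0]

result.save("inpainted.png")
```

The key point the sketch shows is that the model only regenerates the white region of the mask, keeping the rest of the photo intact, which is exactly the behaviour the visualization experience relies on.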
In terms of capturing the room, we wanted the user to click a picture of their room, and when they do that, it's extremely important to call out the privacy disclaimer. As soon as you're using information the users are giving the platform, it's important to tell them whether we'll be storing these images for future reference or using them to enhance their experience. You should have these ethical considerations in place, especially if you're using AI.

Then, when the users open the camera, we give them a 1:1 camera viewport rather than the entire screen. The primary reason is that for this pilot we were using Stable Diffusion, which is again an image-based generative model, and its default output size is 512 × 512 pixels, a 1:1 aspect ratio. We wanted the input to match the output so that there's very little distortion; as soon as the input aspect ratio differs from the output, it creates distortion. To avoid that, we used the 1:1 camera viewport, and we also show help text guiding the user on how to click the best possible picture.

Then comes room selection. As soon as the user clicks an image, we ask them to associate that picture with a room. Once an image is mapped to a room, the back end can load up only the verticals that room is associated with. If you've just clicked a picture and said it's a living room, the back end knows the furniture or electronics that could appear in a living room: a TV unit, a television, or a sofa. The load time, or the computing time, reduces immensely in that case. So that's again a tech consideration.

And then we mask. Remember that in the methodology slide we showed the in-painting capability, where a certain section of an image is erased. It's a similar use case here, except instead of letting the user erase freely, we give them a physical mask; this blue object here is the physical mask. We ask them to place it over the area they want to replace. As opposed to freehand erasing, a physical mask gives us control over the aspect ratio. For instance, say you click a picture of a TV unit that doesn't have a television on it and you want to visualize how a television would look there. If you give the user the freedom to erase, they'll end up erasing the entire wall behind it, and that creates a problem: a 42-inch television becomes a large theater-sized screen. We didn't want that, so we restricted it with the physical mask.

The other thing we were doing is optimizing the model calls by resolution. Rather than rendering the complete image, we render only the masked portion, which helps immensely reduce the compute time.
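As a rough illustration of those preparation steps, here is a hedged sketch, with assumed names, sizes, and coordinates, of how a photo could be fitted to the model's 1:1 input, how a physical-mask rectangle could be rasterized into the white-on-black mask image that in-painting expects, and how a room choice could narrow the candidate verticals.

```python
# Illustrative pre-processing sketch; names, sizes, and the room mapping are assumptions.
from PIL import Image, ImageDraw, ImageOps

MODEL_SIZE = 512  # Stable Diffusion's default 1:1 input resolution

# Room -> candidate verticals, so only relevant catalogues need to be loaded.
ROOM_VERTICALS = {
    "living_room": ["television", "tv_unit", "sofa"],
    "bedroom": ["bed", "wardrobe", "bedside_table"],
}

def prepare_inputs(photo_path: str, mask_box: tuple[int, int, int, int]):
    """Fit the user's photo to a 512x512 square and rasterize the physical-mask
    rectangle (left, top, right, bottom in square coordinates) into the
    white-on-black mask image that the in-painting call expects."""
    photo = Image.open(photo_path).convert("RGB")
    square = ImageOps.fit(photo, (MODEL_SIZE, MODEL_SIZE))  # center-crop to 1:1

    mask = Image.new("L", (MODEL_SIZE, MODEL_SIZE), 0)      # black = keep
    ImageDraw.Draw(mask).rectangle(mask_box, fill=255)      # white = repaint
    return square, mask

# Example: the user marked a TV-sized rectangle on the wall of a living-room photo.
image, mask = prepare_inputs("room.png", (120, 100, 400, 260))
verticals = ROOM_VERTICALS["living_room"]
```

Because the mask is a fixed rectangle rather than a freehand erase, its bounding box also gives a natural crop region, so the back end can render just that portion instead of the whole image.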
Because each time you make an API call, it takes five to six seconds to render that particular section, we were again trying to optimize by resolution. Then comes scaling the selection: we gave the user a slider to scale the mask. And now, if you watch, when the user clicks the image and taps "Place TV with AI magic", the in-painting capability kicks in and replaces that section with the television. See, that television is on Flipkart. So users are now able to visualize that particular section.

Once the user goes through the masking flow, they come to this piece, let's call it a dashboard; it's not essentially a dashboard, but a section. Through this page, you get to visualize more products, and if you want, you can edit the product out again. This is the anatomy for that: again, the 1:1 image area I mentioned earlier, a status bar, actions, and an input and feedback panel. That translates into the features we want the user to have here.

Now we have a provision for choices. How many of you here have used tools like Midjourney or Stable Diffusion? A lot of you. As soon as you hit a prompt, you see four variants of the same image appear; your output has four variants. Similarly, these image-generating models can generate multiple outputs, and keeping that in mind, we gave the users a horizontally scrolling carousel of the possible variants that can come out of a particular mask. Ideally this shouldn't be needed; only a single image should come out. But the technology is pretty new, and it can happen that the picture you've clicked has some distortion, whether from lighting or from the perspective or angle, and the models might not perceive it correctly on the first go. That training needs to happen constantly. So we had a provision for choices, and choices are nothing but the variants.

We also let users give feedback on these choices. When these models are being trained on data sets, we don't know which option the user will like more, so we gave users an option to upvote or downvote a particular choice, and that helps us train the models better.

Then we let the user visualize before and after. Before would look like this, and after would look something like this. It was a very immersive action: as soon as the user hits the before-and-after visualization tab, we give prominence to that view and dim the other sections. We also have an edit-mask flow there, so if users feel they haven't masked that area well, they can always edit it again. Then we have view details and visualize more products; since this is a lightning talk, I might not be able to complete it in time, so I won't get into the details of this. Again, while the user is saving a room, it's very important to give a disclaimer about how you would be using the data.
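To illustrate the "choices" and feedback steps, here is a hedged sketch of generating several variants from one mask in a single call and recording an upvote or downvote. It reuses the pipe, image, and mask objects from the earlier sketches, and the feedback record format is purely an assumption for illustration.

```python
# Sketch of the "choices" step: several variants from one mask, plus a simple
# upvote/downvote record. Reuses `pipe`, `image`, and `mask` from the earlier
# sketches; the prompt and feedback format are illustrative assumptions.
variants = pipe(
    prompt="a 42-inch television mounted on the wall",
    image=image,
    mask_image=mask,
    num_images_per_prompt=4,   # one call, four candidate renders
    num_inference_steps=30,
).images

feedback_log = []

def record_feedback(variant_index: int, vote: str) -> None:
    """Store which variant the user upvoted or downvoted, so later fine-tuning
    or re-ranking can learn from real user preferences."""
    assert vote in {"up", "down"}
    feedback_log.append({"variant": variant_index, "vote": vote})

# e.g. the user liked the second variant and rejected the fourth
record_feedback(1, "up")
record_feedback(3, "down")
```

Asking for all variants in one call keeps the five-to-six-second API latency per render from multiplying, and the logged votes give the training side a simple preference signal per mask.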
It is a very important ethical consideration, again, that you tell the user what you will be doing with that particular data. When the room is saved, it goes into the saved rooms section of that initial onboarding page I spoke about, and users can come back later and modify it if they wish to. That wraps up the user experience part of it.

Now I'll talk about the generative AI landscape in general. Primarily, there are three areas. One is text and code, the other is image and video, and then we have audio. In text and code, you might know about ChatGPT, then there are the code-building tools, and there are also next-gen assistants for superior customer understanding. Microsoft, Google, OpenAI, Cohere, and Hugging Face are the leading companies in the text and code space. Then we have image and video; the entire experience I was talking about falls under the image and video section, which is led by Stable Diffusion, Midjourney, Runway, and DALL·E. And then there's the audio space, which is extremely important and popular at this point in time.

The arrow that you see here at the bottom shows the adoption and maturity at this point. Text and code has the highest adoption and maturity, image and video is medium, and audio is low. If I had presented this a month back, I would have rated audio as low on adoption and maturity, but at this given point in time, audio has also gained medium maturity and adoption. You might have seen memes where Cristiano Ronaldo is singing Bollywood songs, or where the voice of a deceased singer is being used on a new-age song. So audio in itself has gained a lot of popularity. I would have played something here, but I don't think we have an internet connection, so I'm not able to play it.

But yeah, that's about it. Thank you. You can reach out to me over LinkedIn; this is my email ID, and you can scan the code to hit me up on LinkedIn directly. Thank you so much. I don't think we have time for questions, but I'll be here; if you have anything to ask, just hit me up. Thank you.