What a fantastic three days; we're almost at the end of the last afternoon. So kudos to all of you for joining this session. I know you have travel plans to go back, et cetera, so thank you for being here. And thank you so much for giving us the opportunity to share our ideas with you. Today, I will be speaking with you about achieving real-time generative AI during gameplay for games running on Kubernetes. Just a quick introduction: I'm Ishan Sharma. I'm a senior product manager on the Google Kubernetes Engine (GKE) team within Google Cloud. I'm part of the Google Cloud for Games team, and what excites me most about my work is building a platform that powers some of the most successful global live-service games. In my life before joining Google, I had the opportunity to work at large and small companies in the fields of cloud, AI, IoT, nanotechnology, and microelectronics. When I'm not working with my colleagues to build awesome products, you will find me hiking, biking, exploring new cities, and teaching meditation. Over the next 30 minutes or so, let's go on a journey together exploring generative AI in games. We'll briefly discuss the latest trends in generative AI, how game backends are deployed using Kubernetes, how generative AI is really starting to be used in games, and how it can all be done with Kubernetes. Finally, I will present a demo that demonstrates these concepts. So just to break the ice here a little bit, a question for all of you, show of hands, of course: how many here are game developers, for hobby or for work? Okay, I see a few. Welcome. And how many of you do it for work, as professional game developers? You can leave your hands up just for a few minutes. Okay, so a few of you. Great. How many of you are avid gamers, playing games almost every day? Great, keep your hands up. How many of you play a game at least once a month, on console, PC, or mobile? Okay, good.
And how many of you have played a game at least once in your life? All right, almost all of you, great. So what this tells me is that I have a fantastic audience here, and half my job is done. Good. So switching to generative AI. We all know that with the launch of ChatGPT, Bard, et cetera, the term generative AI has become mainstream in all of our conversations, and not just in the tech community. Over the last few decades, artificial intelligence and machine learning have been steadily improving. It is noteworthy that just in the last decade, AI systems have become more capable and are now beating humans in perception tests, in domains such as handwriting, speech, and image recognition, reading comprehension, and of course language understanding. Now generative AI, or gen AI, as most of you would know, is a type of artificial intelligence that can produce new images, text, videos, and audio clips, and that is really part of this evolution. The generative capabilities of AI have seen a stark improvement over the last nine years. The chronology that I'm presenting here shows pictures generated by AI, and by the way, none of these people actually exist, except of course for the last one. We can see the evolution from the very pixelated black-and-white image from 2014 to a very realistic image just three years later, by 2017. By 2021, we started to see text-to-image generation; these are the prompt-based capabilities that a lot of you might have tried out already. And just last week, I used Stable Diffusion to generate a very realistic image of Albert Einstein in a space suit at the time of a solar eclipse, perhaps gathering the experimental evidence that he needed to prove general relativity. That is an impressive image; of course Albert Einstein was in space without a helmet, as we all know.
Now, the future of gen AI is even more incredible, with the latest from Google research teams showing quite accurate and realistic video clips generated from text prompts. On the left, you have a teddy bear running in New York City, and on the right, a glass bead falling into water with a huge splash, with a sunset in the background. Really realistic images. So now, let's take these generative AI capabilities and look at games. Generative AI capabilities, when integrated into games, will truly transform live-service games and give players a really novel experience. Within the next decade or so, we can expect generative AI in games to grow upwards of 20% annually. We are already seeing generative AI being used in game development today, and we expect that ultimately new game experiences, such as smart non-player characters (NPCs) and level generation, will start to become more prevalent. In the chart here, those are the purple bar and the next two bars at the bottom, and you can see how they're growing over time. Now, when you think of generative AI use cases in games, we can classify them into two categories: one is improving productivity during game development, and the other is improving player experiences during gameplay. So let's explore these. In the first category, game developers use generative AI to really accelerate time to launch, or time to market, by creating content and simplifying development. This includes development of game assets such as characters, props, audio, and video, plus code generation and AI-based game testing. In fact, code generation and image generation are already being used by game developers today. For these game development use cases, you can use off-the-shelf turnkey APIs like Google's Vertex AI, SageMaker, or ChatGPT, or you can run your own gen AI inference on top of Kubernetes.
In the second category, on the right, game developers use AI/ML and generative AI to adapt gameplay and empower players to generate game content in real time. These include smart non-player characters (NPCs), dynamic in-game content, gameplay that is customized to players, and finally the ultimate: user-generated content leading to endless worlds. All right, that's a little ways in the future. For gameplay experiences, we believe that running generative AI inference inside of Kubernetes, alongside your game servers, is really going to be the best solution as the industry develops this capability. Today, we will focus on the right side of this, which is the focus of the talk: real-time inference during gameplay. Now that we've looked at the use cases, let's also discuss typical user pain points for generative AI in games. This is based on user research that our team conducted with subject matter experts across the games industry. At the platform level, starting in the top row, cost and latency are crucial. For generative AI to be financially feasible in popular games with a large number of players or concurrent users, generative AI inference has to be cost-effective. Goes without saying, right? Obvious. Additionally, for smooth gameplay and a seamless player experience, there cannot be any lag, so low latency is also very important. Poor gameplay with a lot of lag can really hurt the success of these games. Also, raw performance and the ability to run state-of-the-art models without vendor lock-in will really drive platform decisions. The next set of pain points revolves around the maturity of AI models today. In games, we need coherent, relevant, and contextually appropriate inference, repeatably, over and over again. These models should not, of course, propagate biases or stereotypes.
We also need appropriate content moderation to align with the maturity ratings of games, to ensure a safe and inclusive experience for everyone. On the other hand, moving into the row at the very bottom, for engaging gameplay some games might want content that LLMs filter out today; they might want to show certain images that enhance the gameplay experience. Furthermore, there needs to be a balance between user-generated content and the game's structure, lore, or storyline. Procedural generation will still require human supervision in the near future as we continue to evolve AI models and LLMs and how we integrate them with gameplay. So as the game industry starts to integrate generative AI, games will continue to evolve; the whole industry, in fact, will continue to evolve. Similar to how business models previously evolved from boxed software games, which you could go to Best Buy or GameStop and get, to live-service games, we will continue to evolve into what some are calling living games. In such a model, the relationship cycle between the player and the developer expands to include the game itself, with all three aspects interacting to enrich the player experience. In these living games, game developers will need to implement AI responsibly and securely, safeguarding intellectual property while of course being vigilant and respecting the player. Before we dive deeper into creating the living games that I just spoke about by integrating generative AI, let's briefly discuss why Kubernetes is a great compute solution for games. Kubernetes solves the majority of the IT operations problems for games, as you're all familiar with: scheduling, auto-scaling, health checking, logging and monitoring, declarative paradigms, rollbacks, isolation, et cetera. However, Kubernetes on its own doesn't understand game servers. For game servers, we really need the ability to do a few things.
Start and shut down on demand; protect game servers that are running with players on them, because these allocated game servers cannot just be shut down, even for upgrades. That's a poor player experience, especially if you're about to win a game and your game server gets shut down for upgrades. Also, scale on demand based on location and number of players, rather than on CPU utilization, because for games, in-memory state is critical. Enter Agones. You might have heard talks about Agones previously. It's an open-source, batteries-included platform for scaling and orchestrating multiplayer dedicated game servers, and it can run anywhere Kubernetes can run. In 2017, in a partnership between Google Cloud and Ubisoft, we built Agones, which teaches Kubernetes how to run game servers. Running Agones on Kubernetes enables hosting, running, and scaling dedicated game servers. Since then, many contributors across the community, across Google, and at many other game studios have continued to build and enhance Agones, and to this day we're continuing with releases. Agones understands game sessions. It scales with player loads. It supports multiple network protocols, UDP and TCP, with multiple ports per node, and has tunable warm-buffer parameters. And of course, it is open source, which is extremely valuable. So this is what a high-level architecture of a live-service game looks like. Players start in a lobby on the left. There's a matchmaker that directs them to connect to a dedicated game server, where they can play in a shared environment, a shared experience with other players. That's a multiplayer session-based game. The game front end, custom matchmaking client, and the player profile service can all run on Kubernetes clusters. The player profile metadata can be written to a globally replicated database for access later, or for analytics such as leaderboards.
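To make the Agones model concrete, here is a minimal sketch of what declaring a fleet of dedicated game servers can look like. The names, replica count, port, and container image are hypothetical placeholders, not from any specific game:

```yaml
# Hypothetical Agones Fleet: maintains two ready-to-allocate dedicated
# game server pods; Agones replaces servers as sessions finish.
apiVersion: agones.dev/v1
kind: Fleet
metadata:
  name: demo-game
spec:
  replicas: 2
  template:
    spec:
      ports:
      - name: default
        portPolicy: Dynamic    # Agones assigns an open host port per server
        containerPort: 7654
        protocol: UDP
      template:
        spec:
          containers:
          - name: demo-game-server
            image: example.com/demo-game-server:1.0  # hypothetical image
```

Applying this with `kubectl apply` is all it takes for Agones to spin up, health-check, and replace game server pods, which is what "teaching Kubernetes how to run game servers" means in practice.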
The game servers also run on Kubernetes clusters and are orchestrated by Agones, which we just spoke about. A service mesh can be used for global deployments. Now we want to add generative AI inference servers alongside the game servers to create a whole new type of game. So how do we deploy them? How do we manage them? How do we connect them together? That's the next slide. There are a few different ways of integrating generative AI inference with game servers. One is a turnkey solution, such as Vertex AI, SageMaker, or the Stable Diffusion API, where the game servers running on Agones on Kubernetes can directly query the API. The second approach is more of a do-it-yourself approach using Kubernetes, and there are largely two options for that. In the first case, generative AI inference servers run on dedicated Kubernetes nodes. This allows multiple game servers orchestrated by Agones to query the inference APIs when needed. The inference servers can run on hardware such as GPUs or other high-performance CPUs, et cetera, to find the right balance between raw performance and cost. The other approach is to deploy generative AI inference as a sidecar to the Agones game server, within the same pod. This makes sense when a dedicated inference server is needed for each game server, and the underlying hardware, of course, is optimal for both the game server and the inference server. We've seen examples of this where game servers are running on GPUs, there is spare capacity available on the underlying hardware, and it makes sense to put inference servers there. Of course, a lot of the time, as you might be familiar, game servers typically run on CPUs, different types of performance CPUs based on the requirements of the game servers.
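As a rough sketch of the sidecar option, an Agones GameServer pod can carry a second inference container next to the game server, so the game server queries the model over localhost. All names and images here are hypothetical assumptions for illustration:

```yaml
# Hypothetical Agones GameServer with a generative AI inference sidecar.
apiVersion: agones.dev/v1
kind: GameServer
metadata:
  name: demo-game-with-inference
spec:
  container: game-server        # tells Agones which container is the game server
  ports:
  - name: default
    portPolicy: Dynamic
    containerPort: 7654
    protocol: UDP
  template:
    spec:
      containers:
      - name: game-server
        image: example.com/demo-game-server:1.0    # hypothetical image
      - name: inference
        image: example.com/inference-server:1.0    # hypothetical image
        resources:
          limits:
            nvidia.com/gpu: 1   # only on GPU node pools; omit for CPU inference
```

The trade-off described above is visible here: every allocated game server drags its own inference container along, which is why the sidecar pattern only pays off when each server genuinely needs a dedicated model or when spare node capacity would otherwise go unused.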
Beyond this, game developers can also choose to integrate the generative AI inference within the game binary itself and run it on Agones game server pods. Of course, this imposes strong performance requirements on the underlying hardware, given that you're running both the generative AI and the game server in the same binary. Now let's dive deeper into these options. There are advantages to using a turnkey solution, especially in game development use cases that are not real time; that was the first column in the chart a few slides ago, the game development use cases where you're doing asset generation, et cetera. In that case, there aren't many real-time requirements, so you can query some of the turnkey APIs and solutions offline. Turnkey solutions, of course, improve time to value, and for POCs of real-time gameplay use cases, turnkey solutions are great as well. And sometimes specific models are only available through turnkey APIs; they're not openly available in a form you can containerize. In those situations, you're tied to using an API. Now, looking at the DIY solution with Kubernetes for generative AI in games: in the market today, as you see every single day, an increasing number of openly available models are appearing, and they can be run in containers. What this also provides is cost optimization at scale. Kubernetes can be more cost-effective than pay-per-use APIs for high-usage scenarios, for example game launches, where you're going to see an influx of a lot of concurrent users within a short amount of time. There, the unit cost when you're paying per API call starts to go up, whereas paying for the platform makes sense for these high-usage scenarios.
Dedicated inference Kubernetes nodes are also, of course, easy to set up, and they can use the Kubernetes features that we're all familiar with, such as horizontal pod autoscaling, scheduling with taints and tolerations, et cetera. Now, generative AI sidecars, the other option I spoke about, might have a slight advantage in latency, but they are costly: they're one-to-one with game servers, and you might be spinning up a lot of game servers, which means you're also spinning up a lot of inference servers. But in cases where inference doesn't use much compute power, sidecars could be useful because you're using compute that might otherwise sit idle; that's where the one-to-one inference-to-game-server model might make sense, when you're able to bin-pack your pods. So we ran some initial tests with Stable Diffusion and with BLOOM; that's the table on the right. We tested two scenarios on Kubernetes, dedicated Kubernetes nodes and the sidecar scenario, which are the two rows. With Stable Diffusion, the latency is on the order of 1 to 1.3 seconds, and that's largely because of the latency of the inference itself. With BLOOM, which is a text-based model, the latency we saw with dedicated inference nodes was about 146 to 147 milliseconds, whereas with the sidecar model we saw 144 to 145 milliseconds. So a slight difference, which is expected. However, the key point to take away here is that inference latency today, as you can see with the Stable Diffusion model, really overpowers any difference between the Kubernetes deployment methods. So dedicated inference Kubernetes nodes provide the most versatility, ease of use, and flexibility compared to sidecars, because they can be used by multiple game servers at the same time, they can have dedicated optimized underlying hardware, et cetera.
In the future, as the inference latency of these models starts to go down, these differences will start to become more pronounced, and that's when we will really have to make decisions about which scenario to go with. So that's something to look forward to and something to prepare for as inference gets faster and faster and we get faster and better hardware. For game workloads running generative AI specifically, Kubernetes has several advantages. Portability to train and serve across clouds, preventing vendor lock-in; this is key for games customers who are trying to access global markets. The flexibility to choose the right framework for the job; there are a lot of different frameworks available, with more becoming available all the time, and Kubernetes works really well for that. The ability to fine-tune the performance and scale of the platform; we've all built skills in Kubernetes over the last few years, and we can use those skills for generative AI if we subscribe to the do-it-yourself model on Kubernetes. And of course, the advantage of paying for what you need, when you need it, with high utilization of compute resources, whether CPUs, GPUs, TPUs, et cetera, plus cost savings if you're able to use something like Spot instances. Most importantly, customers can run their generative AI inference alongside game servers that are orchestrated by Agones. Agones being open source, and able to run on Kubernetes, which is also open source, really gives you the best versatility and allows you to optimize the two together. This improves performance and latency, and of course it reduces management overhead and does not require you to retrain the skill set of your teams, et cetera. So with that, we'll jump into the next part, which you all have been waiting for, hopefully: the demo. Just to give you a bit of the scenario, this is a game that we have developed.
We partnered with Globant to develop this game. I'll show you a video of the game, and this was recorded in one shot. We have integrated generative AI in there. We will show you two use cases from the ones I highlighted earlier. The first is smart non-playable characters, where there's a dialogue with a robot. The robot is named Bodis; it's a yellow robot you will see in just a little bit. The second part, to show you image generation in real time: we created a bunch of different billboards throughout the game. Those billboards represent places where images can be shown; perhaps these are textures on cars, or your clothes, or buildings, et cetera. So they're really meant to represent images that can be used within the gameplay environment. That's our attempt at showing you what is possible today and going forward. Now, this game takes place in the future. I'll start it in just a little bit. There is a city where an alien robot in a spaceship has come to Earth. Unfortunately, the spaceship has crashed, and that's where we start off the game. It's a multiplayer, session-based game, so it is using Agones. My partner and I are in this game, and we are trying to help out and understand what's happening, talk with the robot, and of course eventually help it out. So that's the context of the game. Let's make sure it plays. So here, the players, me and my partner, are exploring what is happening. We both discussed, okay, maybe there's a robot here; we should speak with it. Now this is where we're actually using a generative AI model; I'll speak about that in a little bit. Just to narrate for folks, we typed in: hello, who are you? Bodis responds: I'm Bodis. What are you doing in there, Bodis? I'm looking for parts to repair my ship, is Bodis's response. The next question we ask: do you want me to help you look for parts?
Bodis says yes. How can I help find the parts? Bodis says: use the billboards to guide you to the parts you need. Now we type in: how will the billboards help me? The billboards will show you where to find the components you need. And then, being a nice person, we say: okay, I will look for the parts and see you then. And Bodis says: I will be waiting. So that's all in real time, during gameplay; it wasn't prepared beforehand. Now I'm off on the quest, trying to help Bodis, looking for these billboards, walking through the gameplay environment, walking behind a school bus. As I go through here, I see a billboard up front which tells me there's a box. Now, if you look very closely, the billboard changed. Those are actually prompts; the prompt is something like: create a box with neon highlights on it, and you can see the box is changing. So we're hitting the Stable Diffusion model running on Kubernetes in real time, where images are being generated based on that one prompt. Now I see an alien with a burger, as you saw on the right; of course, the prompt for that one is: alien with a burger. And here there's a bunch of kids playing. Wonder what that is? Not sure what it is. Let's wait for another image that might appear. And there's a school bus. Oh, there was a school bus just now; huh, it's right over there. Player two, why don't you go explore the school bus while I go figure out this alien eating hamburgers. Oh look, there's a hamburger joint here. And there is a box with a neon X. Great, I picked that up. School bus, did you figure anything out? Oh yeah, I got the other box. Cool, all right. We've got two boxes now. Let's see what else is happening. So there's some fire here. Oh, there's an exhaust; it looks like a spaceship, something like that. Interesting, there is fire in the, oh okay, so it's an ignition thing. All right. Oh, there's something here. Player two, can you please help me open this gate?
And player two goes in and opens the gate, I walk in, and oh, there's another box with a neon X on it. I'm gonna grab that. So now, between the two of us, we have grabbed two boxes. Now let's go and find where Bodis is. Okay, more billboards. There's a school bus; we haven't explored this. Oh look, there is a spaceship. All right, maybe I should head in this direction. Another spaceship, I'm getting closer. Maybe this is where the spaceships are. Oh, another spaceship. All of those are images generated in real time with Stable Diffusion. I see a spaceship, and oh, I see Bodis as well. Player two, please speak with Bodis. What do I need to do? Player two tells me: put the boxes in here; these are the components for the engine, for the spaceship. All right, ooh, Bodis got into the spaceship, great. Now the spaceship is powering up, it's ready to go. It powers up the engines and launches in three, two, one. And it's gone to space, and we have rescued Bodis, with real-time generative AI happening during gameplay. Thank you. Now, the more interesting part is: how did this actually work? What's underneath it? The chart on the left looks very familiar; it's like the chart I presented a few slides ago. On the left are the players: my friend Diego, and me. The two of us are playing, and we connect to the game servers; that's how a typical backend works for a game. We're both on game servers. We ran this on Google Cloud, using the us-central1 region for this demo. We used Google Kubernetes Engine for this deployment. So we have a GKE game server cluster, which is running a bunch of Agones game servers. Then we move to the inference cluster: here we decided not to use the sidecar model but to use dedicated nodes. We put these dedicated nodes in a different GKE cluster, powered by GPUs.
Now, within that, there is an API layer which routes the traffic to the various models, depending on which scenario we're looking at, whether it's the smart NPC or the billboards doing image generation. Then we have this middle layer, and I'll get back to it, but the middle layer is basically the logic for each of the use cases. For the NPC logic, there is a service running that handles text processing for the dialogue, pre-processing and post-processing. For example, if you have played around with ChatGPT or with Bard from Google, you would have noticed that if you provide context, the response it gives you makes more sense. So we did a bit of prompt engineering where we gave the model context, something along the lines of: respond as if you are an alien robot from another planet and your name is Bodis; your spaceship has crashed; you are looking for parts that are scattered throughout the city; you need help to find these parts, and you are going to guide players to them by using billboards; respond in short statements that are no longer than X words. That's the context given to the model, and then all of the questions go in and hit that model session, so the model has the context to respond appropriately. That's what goes in those logic areas. The image generation logic, of course, has prompts like the alien with a hamburger, the school bus, and the ships with an exhaust; those are the billboard prompts, the context we're providing. Going back to the NPC logic: with the context we have provided to the model, what the user inputs is passed through directly, as is, to the model. That's where the real-time generative AI comes into play.
And of course, we also built in some of the Vertex AI services so that we could use them during development; that's a turnkey solution on Google Cloud. We used it as well for LLM pre- and post-processing, going out to hit the Vertex API. So we tried to build both scenarios into this deployment: the NPC logic and the image generation logic all on GKE and Kubernetes, and then the models running on top. For text, we used a Llama 2 model running on GKE, which can run on either GPUs or CPUs depending on the performance you're looking for. For image generation, we used Stable Diffusion, which was on GPUs. I want to highlight one aspect of this. The middle layer here is incredibly interesting. We actually spent a lot of time on this middle layer, the NPC logic and the APIs, because that's where we really integrate the game servers with inference: we abstract away the calls, load-balance them, and send them out to the different inference servers. Going forward, that's where we're going to be spending a lot of our time, seeing how well that integrates with Agones and with Kubernetes; that's really, I think, where the important part is. And even with Stable Diffusion, for example, we were initially getting inference times that were really long, so we spent a bit of time reducing that to a couple of seconds, which is what you saw in real time; the video was not sped up, the inference really was coming back in a couple of seconds. As with anything, there's a huge team behind it. We would like to thank our partners at Globant and the various contributors across Google, especially the Cloud CE team, our AI inference and benchmarking team, our user research team, and our leadership for sponsoring this work. So as you explore integrating generative AI in your games, consider deploying your services on Kubernetes, be it matchmaking, game servers, or generative AI.
Hopefully we have tickled your imagination and gotten you excited about how all of this can be done with Kubernetes, using the same skills you've built up, on the same awesome platform we have, and how we can really change the future of games and entertainment. Here are some links. Please feel free to connect with us; we'd love to chat with you and learn how your journey with generative AI in games is going. Here are some key links, and please feel free to reach out to g2x@google.com. With that, thank you so much for coming. Have a safe trip back home, whether you've come from abroad or from a different city, and enjoy the rest of the conference. Thank you so much for being here. I would love for you to come up and chat with me; I would love to understand what you're working on and answer any questions. Thank you. I'll be up here for a few minutes.