Welcome to the session, and to such a big stage. This is my 12th QCon talk, and I still feel excited, and a bit nervous too. By the way, there are no fancy animations or photos like in my previous talks; I still hope you won't find it too boring. I'm Yuan Cheng from NVIDIA. A quick disclaimer: all the content and opinions are personal, not representative of my former or current employers, and the analysis makes some simplified assumptions and may contain mistakes.

OK, so as we all know, training large language models is expensive, but model inference and model serving are expensive too. To some extent, serving can even dominate the overall cost, for the following reasons. First, not everyone needs to train their model from scratch; they can fine-tune existing models. Second, model serving also needs to access a large number of model parameters while demanding very low latency. Finally, a lot of applications built on model serving, like ChatGPT, have to handle a very large volume of user requests.

So here is a back-of-the-envelope estimate based on publicly available data. First, from the number of users: ChatGPT has around 100 million weekly active users. If we assume every user makes five requests a week, that works out to roughly 71 million requests per day. If we also assume each request costs $0.01, the total daily operating cost is about $700,000. Another perspective we can take on the cost is server performance: estimate how many requests a single GPU server can handle, then how many such GPU servers we need, and finally, based on public GPU cloud pricing of about $2 per GPU-hour, the cost comes out to about $1 million per day. Again, those numbers have never been verified. I hope they are not far from the truth, at least in the same ballpark. But the key takeaway is that model serving and model inference are expensive.

Another thing I want to say is that performance is critical too.
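The back-of-the-envelope arithmetic above can be written out directly. All the inputs (user count, requests per user, per-request cost, GPU hourly price) are the rough, unverified figures quoted in the talk, not measured data:

```python
# Back-of-the-envelope daily serving cost, using the talk's rough public figures.
weekly_active_users = 100_000_000    # ~100M ChatGPT weekly active users (public estimate)
requests_per_user_per_week = 5       # assumed
cost_per_request = 0.01              # assumed, in USD

daily_requests = weekly_active_users * requests_per_user_per_week / 7
daily_cost = daily_requests * cost_per_request

# Server-side view: how many GPUs does the talk's ~$1M/day figure imply
# at the quoted public cloud price of ~$2 per GPU-hour?
gpu_hour_price = 2.00
target_daily_gpu_cost = 1_000_000
implied_gpus = target_daily_gpu_cost / (gpu_hour_price * 24)

print(f"{daily_requests:,.0f} requests/day")  # ~71 million requests/day
print(f"${daily_cost:,.0f}/day")              # ~$700K/day from the user side
print(f"{implied_gpus:,.0f} GPUs")            # ~20,800 GPUs implied on the server side
```

The two perspectives land within a factor of two of each other, which is all a ballpark estimate like this can claim.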
Compared to internet services like Google Search, where we are talking about latency in the range of hundreds of milliseconds, today's large language models and generative AI like GPT still take seconds, tens of seconds, even minutes. There is a lot of room for improvement.

If we look at the big picture and draw out the AI workloads on GPU clusters, it is basically a set of diverse workloads, from model training to serving to CI/CD to interactive visualization, running on heterogeneous networks. One thing I want to highlight here is how we can improve efficiency. NVIDIA already offers different GPU sharing technologies, from multi-process sharing to time-slicing to Multi-Instance GPU to virtualized GPU. That is something we should take advantage of to improve efficiency.

OK, so in terms of how we can improve model serving, I think there are the following aspects. We need more efficient workload scheduling, from advanced scheduling algorithms to resource sharing to job management. The machine learning and AI community is also working on improvements and optimization techniques, like smaller models, model compression, and techniques like continuous batching. And finally, we should continue to improve hardware performance.

To summarize and conclude my talk: model serving, or inference, is critically important. I want to say that, just like Google and other internet companies pioneered the distributed, large-scale systems for internet services and search, today we need scalable, efficient, fault-tolerant, and cloud-native solutions for the emerging AI workloads. Specifically, I think there are major opportunities in cloud-native resource management solutions.
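To make one of the optimization techniques mentioned above concrete: continuous batching lets new requests join the running batch at every decoding step, instead of waiting for the slowest request in a static batch to finish. A toy sketch of the scheduling idea (the queue, step counts, and batch size here are purely illustrative, not any particular serving framework's API):

```python
from collections import deque

def continuous_batching(requests, max_batch=4):
    """Toy scheduler: each request is (id, decode_steps_needed).
    At every step, finished requests leave and queued ones join,
    so batch slots are not held idle until the slowest request ends."""
    queue = deque(requests)
    running = {}    # request id -> remaining decode steps
    timeline = []   # batch composition at each step, for illustration
    while queue or running:
        # Admit new requests into freed slots (the "continuous" part).
        while queue and len(running) < max_batch:
            rid, steps = queue.popleft()
            running[rid] = steps
        timeline.append(sorted(running))
        # One decoding step for every request in the batch.
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]  # its slot frees up immediately
    return timeline

steps = continuous_batching([("a", 2), ("b", 5), ("c", 1), ("d", 3), ("e", 2)])
print(len(steps))  # 5 decoding steps in total
```

With static batching, the same five requests would take max(2, 5, 1, 3) = 5 steps for the first full batch plus 2 more for the straggler, 7 steps in total; here request "e" fills the slot freed by "c" at step two, so everything finishes in 5.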
There are nine talks from NVIDIA about how to improve AI workloads on GPU clusters, covering new devices, drivers, technologies, and GPU sharing. Tomorrow, the first session will be from my colleagues Kevin and Sandra, about GPU acceleration and GPU sharing. Please come to our sessions and learn more about accelerating AI workloads using NVIDIA hardware technologies. Thank you very much.