Hello, everyone. Welcome to the session. I hope you're all having a good time at the PyTorch mini-summit inside the larger OSS Summit. Today, we're going to talk about how AWS and Amazon are leveraging PyTorch to help build large-scale models. I'm Shubhakum Bhatkune, a principal GDM specialist at AWS. And I'm RJ, Rajit Joseph, a principal engineer with Amazon Search. In the later part of the session, I will walk through our journey with large language models, not only training them but also deploying them in production to delight all Amazon.com customers.

At AWS, our mission is to democratize machine learning so researchers and data scientists can go from the experimentation and research phase to production deployments of their models as quickly as possible. We like the variety that the PyTorch ecosystem offers, which enables machine learning developers to build, train, and deploy their deep learning solutions quickly. To that end, AWS and Meta have partnered closely and strategically to optimize PyTorch on AWS and to deepen its integration into our core services, such as our EC2 compute service, as well as our ML services, such as SageMaker. We are also working very closely with the PyTorch team to develop solutions that help ML engineers build out large-scale distributed training architectures using AWS services, and to enhance PyTorch in terms of performance, explainability, and inference. Specifically for inference, both companies are contributing to TorchServe, the model-serving framework native to PyTorch, which helps you scale production deployments of your PyTorch models quickly and easily. Another point is that we're also a board member of the PyTorch Foundation, and we're very excited about it because we believe the governance structure, the diverse leadership, and the technical investments the partners will make will really help AWS customers scale their models and get more value from them.

Now, if we look at the AI landscape, there's been a lot of news in the recent past around generative AI applications, and they have captured the interest of the general public like never before, because generative AI applications today are able to create new and original content that is very close to human-generated content. The foundation models, or large language models, powering these generative AI applications are pretty complex: if you consider model size as a metric of how sophisticated the models are, we are seeing these models approach almost a trillion parameters.

In terms of how developers are using or building these models, we've seen a couple of use cases. One is building foundation models, by which we mean brand-new models trained on massive amounts of data with a broad range of downstream use cases. For instance, consider Stable Diffusion, a foundation model built by Stability AI. It was trained on almost 300 terabytes of open-source images available on the internet, on a GPU cluster of more than 4,000 GPUs, and it took months to train. But it has a whole wide range of downstream applications.
A lot of developers are taking these foundation models and fine-tuning or retraining them for specific use cases using proprietary data. If you were to take Stable Diffusion and train it against, let's say, a set of medical image data for digital pathology, that becomes an example of fine-tuning. And a lot of developers are simply building end-user applications that call these foundation models through APIs, where it's just an inference run. For instance, Stable Diffusion is part of DreamStudio and Photoshop, and millions of users, some of whom may not have any technical background, are accessing this foundation model through those end-user applications.

Whatever the use case, whether you're training a foundation model, fine-tuning one, or just running inference, there are some core underlying challenges as you start thinking about architecting this or deploying it in production. Developing foundation models is resource-heavy. There is a human in the loop, and extensive ML expertise is required. Because of the scale of compute these applications need, cost to train becomes an increasingly large and increasingly prohibitive component. The time to train is also long: it takes a significant amount of resources as well as time to get the model to the accuracy you need. And when you think about inference, with millions of users accessing these models all at the same time, you need to think about how you architect for a real-time user experience, and how you manage cost when there's a sudden spike in demand.

AWS offers the broadest and deepest ML tech stack, and we try to meet customers wherever they are. If they are expert ML practitioners who want to build custom MLOps platforms and eke out the most performance from the infrastructure, we have a range of high-performance infrastructure services across compute, networking, and storage, as well as orchestration services such as our container-based services and our HPC services, ParallelCluster and AWS Batch, all of which can be used to architect your training and inference stacks. If customers want to remove some of that infrastructure management overhead, they can use SageMaker, which offers a completely managed, end-to-end MLOps platform for building, training, and deploying models. And with the new SageMaker JumpStart, we now have a model repository of foundation models: customers can pull models from that repo and then use SageMaker to do their model training and model deployment.

We touched on how compute, and the cost of compute, is becoming increasingly important. Customers always ask us how they can optimize for cost and what solutions we can provide so they can run these workloads cost-effectively. Our accelerated compute services support a range of accelerators. We have GPUs like the A100-based P4 instances, and the new H100-based P5 instances coming out later this year. But Amazon has also invested in custom silicon, with instances called Trainium and Inferentia, and we have invested heavily in providing another accelerated compute option that can significantly lower the cost of training and inference for your large language models and diffusion models.
Trainium offers up to 50% cost savings, and Inferentia, our inference-specific chip, offers up to 40% better price performance. Both Trainium and Inferentia have been architected to be well suited for training and running distributed inference on your large language models and diffusion models. We've also worked closely with the PyTorch team to ensure ML developers can get the maximum performance from the underlying accelerators. PyTorch eager mode is really good for explainability and debuggability because it creates the graphs dynamically. By combining it with XLA, available as PyTorch/XLA through the Neuron compiler, we combine the ease of use and debuggability of eager mode with XLA-level execution, so you can get the complete performance that the underlying Trainium accelerators can provide when you're using the Neuron SDK, the compiler for the Trainium and Inferentia instances.

We have also made it really easy for you to develop your generative AI applications with our latest announcement, Amazon Bedrock. Bedrock gives you a completely managed, serverless experience when you want to fine-tune foundation models or run inference on them. You make an API call to the models that Bedrock supports: today we have Anthropic's Claude, Stability AI's Stable Diffusion, and Jurassic from AI21 Labs. You can host your data on S3 and train these foundation models with your data, all within the same VPC, so you can also apply security services such as KMS encryption, PrivateLink, or IAM access controls. What you're able to do is have a customized model for your specific use case, using your proprietary data, while keeping it completely secure in your own environment. Bedrock also includes Amazon's own LLMs. Machine learning is a fundamental component of Amazon's business, and we have developed foundation models and large language models that we use in our own businesses; with Bedrock we are exposing those LLMs to customers so they can leverage them for building their own generative AI applications. And today RJ is going to go over Amazon Search's journey of using AWS and PyTorch to build, train, and deploy their LLMs.

Thanks, Shuba, for setting the stage for me and giving a clear picture of the AWS infrastructure that is enabling us to innovate for our Amazon customers. Just out of curiosity, how many of you are Amazon customers who have bought something on Amazon? That's good to hear, and we work hard to give you a good customer experience. I want to start from the foundation: these large language models need lots of infrastructure, as everybody has been saying, and I'll go through the journey of what we did over the last few years to bring you the benefit of large language models.

Before I go there, I want to give a quick introduction to my team's vision and mission. We are here to delight our customers by building foundational, universal, and semantic representations of the important Amazon entities, such as products, queries, and shopping sessions. As you're all aware, there are different kinds of shoppers on Amazon, different kinds of sessions, and different ways you interact with Amazon, so we want to build a large language model that can learn from all of it and give a great experience to our customers.
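To make the PyTorch/XLA point above a bit more concrete, here is a minimal sketch of a training step run through an XLA device, which is how Trainium is exposed when the Neuron SDK is installed on a Trn1 instance. The model, data, and hyperparameters are placeholders, not anything described in the talk.

```python
# Minimal sketch: a PyTorch training step executed through PyTorch/XLA.
# On a Trn1 instance with the Neuron SDK installed, the XLA device is backed
# by a NeuronCore; the model and data below are placeholders.
import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm

device = xm.xla_device()
model = nn.Linear(512, 10).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

for step in range(10):
    x = torch.randn(32, 512, device=device)
    y = torch.randint(0, 10, (32,), device=device)
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    xm.mark_step()  # cut the lazily recorded graph so XLA compiles and runs it
```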
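And since Bedrock is described above purely in terms of an API call against models hosted for you, here is a hedged sketch of what such a call looks like with the boto3 runtime client. The model ID and request fields are illustrative assumptions; the exact payload schema depends on the model provider.

```python
# Hedged sketch: calling a foundation model through Amazon Bedrock with boto3.
# The model ID and request fields below are illustrative assumptions.
import json
import boto3

bedrock = boto3.client("bedrock-runtime")

response = bedrock.invoke_model(
    modelId="anthropic.claude-v2",   # illustrative model ID
    contentType="application/json",
    accept="application/json",
    body=json.dumps({
        "prompt": "\n\nHuman: Summarize our return policy.\n\nAssistant:",
        "max_tokens_to_sample": 256,
    }),
)
print(json.loads(response["body"].read()))
```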
What we also learned through that exercise is that we need to build a large-scale deep learning system; that is what powers all of this experience. And the third pillar of our vision is: can we deploy these models across teams, across Amazon, not just to a few teams, so that every Amazon customer can benefit from them? That's our team's mission and vision, and that's what M5 is about.

When we started this team, we wanted to define what the large language model is and what the entities are. M5 stands for five "multi"s. First, multi-modal: for a shopping experience we need to train our model with images, unstructured text, tables, and videos, so our model is trained on all four kinds of artifacts. Second, Amazon operates in multiple countries, so we need to build a model that is multi-locale. Third, Amazon supports multiple primary and secondary languages, so our base model should also be multilingual. The fourth pillar is the interactions and relationships between customers, queries, sellers, and so on; we need to model these connections across all the entities, so it's multi-entity. And the fifth is multi-task: as you interact with Amazon, there are many surfaces you touch, whether it's ads, duplicate detection, or semantic matching, so we want a model that is multi-task. That's what M5 stands for.

When we started training the models, we also went through the M5 lifecycle. This is the practical journey of how these large language models get into production across teams, across Amazon. We pre-train a core model, what everybody calls a foundation model, on a large corpus of datasets; it covers a lot of datasets and it's pretty huge. This foundation model is where the scaling up happens, so that you can learn as much as possible and capture it in the parameters. We are now able to train models of up to 100 billion parameters and get them to converge. Once we train a core model, we pass it downstream: the next stage is fine-tuning this large model on multiple tasks, also in a pre-training style, and it runs within our team, because you need the infrastructure to train large models. Once we have trained the multi-task pre-trained model and we're happy with the performance numbers, we distill the model down to a smaller size, through quantization or other techniques, which I will cover later in my slides. This distilled model is serialized and shared with all our partner teams within Amazon. The partner teams take the distilled model and further fine-tune it on their own datasets, because those tasks are much smaller and the infrastructure requirements are much, much lower. That's the entire M5 lifecycle of how a large language model reaches Amazon customers, and how we try to delight you.

When we started the journey, the first thing we had to choose, as with anything, was the framework, because that's the day-zero decision you need to make. We built on DeepSpeed because, at the time we started, DeepSpeed supported ZeRO optimizations and could train large models with limited hardware. One thing we learned is that it was built on PyTorch Distributed, so we got a lot of the benefits of being on PyTorch Distributed, and the whole ecosystem came to us for free.
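As a rough illustration of the DeepSpeed choice described above, here is a hedged sketch of wiring a model through DeepSpeed's ZeRO on top of PyTorch Distributed. The config values are illustrative, not M5's actual settings, and in practice this is launched with the deepspeed launcher across many nodes.

```python
# Hedged sketch: initializing a model with DeepSpeed ZeRO on top of
# PyTorch Distributed. Config values are illustrative, not M5's settings.
import torch.nn as nn
import deepspeed

model = nn.TransformerEncoderLayer(d_model=1024, nhead=16)  # placeholder model

ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "bf16": {"enabled": True},            # bfloat16 training
    "zero_optimization": {"stage": 2},    # ZeRO partitions optimizer state and grads
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
}

# deepspeed.initialize sets up PyTorch Distributed under the hood and returns
# an engine that owns forward, backward, and the optimizer step.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

# Training loop shape (data omitted):
#   loss = engine(inputs)     # forward
#   engine.backward(loss)     # backward with ZeRO partitioning
#   engine.step()             # optimizer step
```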
In addition, on the training side we also invested a lot in bfloat16, because we found bfloat16 stabilizes large-model training, especially as parameter counts grow. We're also currently evaluating ZeRO support in native PyTorch to see how the performance numbers compare to DeepSpeed. That's the framework choice we had to make from day zero, and we're pretty happy about it.

As we scale up model training, what are the pillars of innovation? It's not just training the models. We need experimental velocity, because we train many models with different parameters, so we always invest in experimental velocity and evaluation. On the right side of the slide is the picture of an experiment for our developers. Each experiment is backed by a hypothesis from the ML developer: "these parameters, this kind of configuration, or this dataset I'm going to train on might make the model better." It all starts with a hypothesis, and our developers train multiple models, evaluate them against multiple tasks, and look at the results. So the first pillar is experimental velocity.

How do you get that experimental velocity? Through 100% reproducibility of your experiments, which includes not just the hardware but also the data and the hardware configuration, because these large models are pretty expensive and time-consuming to train. You need to invest in techniques that let you reproduce or restart an experiment without incurring additional cost. We don't want to be in the situation of "I forgot which dataset I trained on, I forgot which configuration I used, I need to restart my training," and then find, when you restart, that the hardware is no longer available because it has moved on.

The third pillar, which people normally don't discuss much, is the reliability of large-model training. Hardware failures and software dependency failures are quite common. These models take weeks or maybe months to train, spread across hundreds of machines. When you have a hardware failure, how do you tell that it's not a code bug but actually the hardware? We invested in mechanisms that reliably identify hardware failures and automatically restart from the checkpoint, so there is no human in the loop. This enabled us to further increase our experimental velocity, which is our first pillar.

The fourth pillar we invested in heavily is to stick to one framework and contribute back to open source. This is very important because it's a two-way bridge: you contribute back and build muscle within your team for the framework, and you get similar tools contributed back by others. In total, we are able to move faster through this adoption. We don't create private branches of the repository that never get released back to open source.

And the fifth pillar, which we invested in from the beginning, is to enable multiple kinds of specialized hardware and optimize for compute efficiency at scale. As you are aware, the biggest bottleneck right now is getting the compute you want; it's very hard to get it when you need it. So we have invested in training our models on AWS Trainium, P4d instances (A100), P3dn instances (V100), and Intel's DL1 hardware.
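The automatic restart-from-checkpoint mechanism above can be pictured with a very small sketch. The paths, file layout, and scheduler integration here are illustrative assumptions, not the actual M5 tooling.

```python
# Hedged sketch of restart-from-checkpoint with no human in the loop.
# Paths and layout are illustrative assumptions.
import os
import torch

CKPT_PATH = "/checkpoints/latest.pt"   # illustrative durable, shared location

def save_checkpoint(model, optimizer, step):
    torch.save(
        {"step": step,
         "model": model.state_dict(),
         "optimizer": optimizer.state_dict()},
        CKPT_PATH,
    )

def load_checkpoint(model, optimizer):
    """Resume from the latest checkpoint if one exists; otherwise start at step 0."""
    if not os.path.exists(CKPT_PATH):
        return 0
    state = torch.load(CKPT_PATH, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"] + 1

# A job scheduler or wrapper script relaunches the training process after a
# hardware failure is detected; on startup the job calls load_checkpoint and
# continues from wherever it left off.
```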
This is what enables us to scale our models and our experimental velocity, to innovate on these models and really get them into our customers' hands as soon as possible, because we truly believe these models can enhance our customers' experience.

Beyond model training, the other thing people need to look at is the data. Data is the fuel for large language models. When we were facing compute shortages, we needed to move our compute across AWS Regions, wherever we could get access to capacity, so you need to make sure the data can be streamed and follow the compute. For example, as shown on the right side of the slide, to make sure we can reproduce experiments, we are able to stream data out of petabytes of storage, 300-terabyte datasets, do the preprocessing on the fly on the CPU, and feed it directly into training. This is an example of multi-modal training, where you have to fetch your images, resize them, apply color augmentation, and push them into training. Rather than running offline jobs to do that, we stream it through the CPUs on the GPU machine. This gave us not just tight integration of our training with the data, but also reproducibility. We also store everything in immutable data storage that is equally accessible from any Region and any system, and we respect data lifecycle policies. What enables this is an integration with the PyTorch DataLoader and an optimized, C++-based AWS S3 client, which gives us at least 10x the performance of the boto3 S3 client. I think we are very close to releasing it; the DataLoader already has support for an S3 CRT-based client, but we are pushing further to share the journey of how this S3 client lets us scale, follow the compute, and scale our experimentation.

So that is the first part of the journey: training the model and getting the data ingested into the system. The second set of challenges starts when you actually want to delight the customers, and that's what we call ML inference. The first concern for machine learning inference is decoupling training from inference. With lots of teams working together, the training infrastructure, code base, hardware, and everything else become tightly coupled, and it takes much longer, in human time, to deploy these models to production, because you have to transition the models to other teams that don't share the same infrastructure. So we need to decouple training and inference, not just in code but in infrastructure too. Second, when we share our models, the most important thing every team worries about is price performance: latency, throughput, and cost. Because we serve customers across the world, the throughput requirements for our models are very high, but we can't afford pricey options, so we always look for the best price performance we can get. And the third pillar is that not all inference is on a critical path. You have to look at your inference workloads and decide whether they are bursty or provisioned, real-time or batch, and really profile them to determine the right hardware and software to use.

How did we decouple? We use the PyTorch model serialization framework to decouple the models.
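Before the serialization story continues, here is a hedged sketch of the stream-from-S3 pattern described above. It uses plain boto3 rather than the optimized C++ client mentioned in the talk, and the bucket, prefix, and preprocessing are illustrative placeholders.

```python
# Hedged sketch: streaming training data directly from S3 into a PyTorch
# DataLoader, with preprocessing on the fly on CPU. Uses plain boto3, not the
# optimized C++ S3 client from the talk; names below are illustrative.
import boto3
import torch
from torch.utils.data import DataLoader, IterableDataset

class S3StreamingDataset(IterableDataset):
    def __init__(self, bucket: str, prefix: str):
        self.bucket = bucket
        self.prefix = prefix

    def __iter__(self):
        # In a real pipeline, shard object keys across DataLoader workers via
        # torch.utils.data.get_worker_info(); omitted here for brevity.
        s3 = boto3.client("s3")
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=self.bucket, Prefix=self.prefix):
            for obj in page.get("Contents", []):
                body = s3.get_object(Bucket=self.bucket, Key=obj["Key"])["Body"].read()
                # Decode / resize / augment here; yielding raw bytes as a placeholder.
                yield torch.frombuffer(bytearray(body), dtype=torch.uint8)

loader = DataLoader(
    S3StreamingDataset("my-training-bucket", "images/"),  # illustrative names
    batch_size=None,   # samples are yielded one at a time in this sketch
    num_workers=0,     # raise this (with key sharding) to overlap I/O and compute
)
```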
On the left side of the slide is what you mostly see during research and development: torch eager mode. Once the model is of good quality, we serialize it, and we adopted TorchScript for that; our customers take the TorchScript module and either fine-tune it or host it for model inference. This separation lets us move faster, with very little coupling between training and inference infrastructure, and there are no tickets or bugs along the lines of "your code moved and now it's not working." TorchScript is really what enabled us to scale there.

Once we decouple the model, what do we do for inference optimization? This is another area where we invest a lot in order to get these models to every customer. We have three layers of model inference optimization. The first is the algorithmic and data-structure layer: pruning models to reduce compute, or quantizing from FP32 to formats such as INT8, FP16, or bfloat16 without affecting the quality of the model. Again, you need to run a lot of experiments to find the right configuration so you're not losing model quality. The second layer is what we call SDK and model format; some people call these ML compilers. For Inferentia we use Neuron, for NVIDIA GPUs we use TensorRT, and we use ONNX as well. And for the hardware layer, we use AWS Inferentia for some workloads, NVIDIA GPUs, and CPUs, all those different kinds of hardware, but always prioritize price performance for your inference workload: that is what lets you scale and get the model into customers' hands quickly.

So what did all of this enable? We are able to train multiple BERT-style 100-billion-parameter encoder models, and not only train them but actually converge them, so they can be fine-tuned and deployed to production. Our team runs close to 10,000 experiment jobs per month. This is where the power of AWS and PyTorch really shows: we can scale up and down on whatever hardware is available to us, and when we're not using it we can release it back so we don't have to pay for it. Similarly, when new hardware comes along, we don't have to spend cycles testing and validating it; it comes into our pool very easily. We are able to train our models and converge them on Trainium, P4d, P3dn, and DL1, which uses Intel's accelerators. Fourth, we are able to achieve less than 10 milliseconds of latency for 1-billion-parameter encoder models on both AWS Inferentia2 hardware and the available GPU instances. What this latency really enables is for our modelers to put even more power into the large language model and still deploy it to production easily.

All these results are good, but it's still day one, as we say. On the challenges, we are looking forward to collaboration across the industry; no one can innovate alone and increase the adoption of large language models in production. We are eagerly looking forward to PyTorch 2.0, and especially to ML compilers in DL frameworks, because they reduce the time it takes to get a good model into our customers' hands.
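To make the TorchScript handoff above concrete, here is a minimal sketch of the producer/consumer pattern: the training side scripts and saves the model, and a partner team loads the artifact with no dependency on the training code base. The model itself is a placeholder.

```python
# Minimal sketch of decoupling training and inference via TorchScript.
# The model is a placeholder, not an M5 model.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(768, 256)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.proj(x))

# Producing team: compile to TorchScript and serialize the artifact.
scripted = torch.jit.script(Encoder())
scripted.save("encoder.pt")

# Consuming team: load the artifact without importing the original code.
model = torch.jit.load("encoder.pt")
model.eval()
with torch.no_grad():
    embedding = model(torch.randn(1, 768))
```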
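And for the first optimization layer mentioned above, here is a hedged sketch of post-training dynamic quantization of the Linear layers to INT8. The model is a placeholder, and, as the talk notes, the quality impact of any such change is validated per task.

```python
# Hedged sketch of the algorithmic optimization layer: post-training dynamic
# quantization of Linear layers to INT8. The model is a placeholder.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(768, 768), nn.ReLU(), nn.Linear(768, 256)).eval()

quantized = torch.ao.quantization.quantize_dynamic(
    model,
    {nn.Linear},          # quantize only the Linear modules
    dtype=torch.qint8,    # INT8 weights with dynamic activation quantization
)

with torch.no_grad():
    out = quantized(torch.randn(1, 768))
```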
On the compute side, we are looking mostly at FP8, which I think is going to be very good for reducing cost for both training and inference, and at sparse computation: as large language models become bigger there will be a lot of sparsity in them, and we are trying to make sure we can take advantage of it to reduce cost further.

The results are good, but they always come from the partnerships we've had across the industry. We had help from Meta's PyTorch team in learning about some of the features that are coming; the DeepSpeed science team, who helped us with DeepSpeed and large-model training; Annapurna Labs, who helped us with training and inference; and the AWS ML frameworks team, who enabled us through this whole journey and gave us early feedback on whether we were going in the right direction. We also had a partnership with NVIDIA, who helped us optimize our training for the hardware we had. Those are the partnerships that really enabled us to scale. With that, I want to close with a thank-you note, and I hope you are having fun at this conference. Any questions?

Yes. So at inference time, when you move to multi-modal inference, the inference latency will go up; is that the question you're asking? Yes. For the pre-processing and post-processing, TorchScript has some limitations we could not solve completely, because there are branch conditions and similar constructs that make it harder to script. So at that time, in addition to the TorchScript model, we shipped our own code for the pre-processing and post-processing. That's why we want to invest further in the DL framework, so all of those pieces can transition more efficiently. We were mostly doing batch; I don't think we are there yet for streaming images, to be honest, but for text we can do non-batch, real-time inference. I think the rest of the question I can take offline. Thank you.