Hello, and welcome to our talk. I'm Aniket, a deep learning library engineer on the CUTLASS team, and joining me in this presentation is Vijay Thakkar, who also works on the CUTLASS team as a compute architect. Today we will be presenting CUTLASS, a CUDA C++ template library for accelerating deep learning and HPC computations. CUTLASS provides CUDA C++ template abstractions for implementing high-performance GEMM, or matrix multiplication, operations and related computations at all levels and scales within CUDA. CUTLASS has been in development since 2017, when the Volta architecture came out with its first generation of Tensor Cores. Since then, CUTLASS has been in continuous development and has supported each new GPU architecture, including the most recent NVIDIA GPU architecture, Hopper, which includes the fourth-generation Tensor Cores. CUTLASS is integrated into various NVIDIA libraries such as cuBLAS, cuTENSOR, cuSPARSE, and cuDNN.

In terms of functionality, CUTLASS provides GEMM kernels, which include grouped GEMM and implicit GEMM; implicit GEMM is used to implement convolution functions of various kinds, including 3D convolution. CUTLASS also includes implementations of BLAS level-3 operations. CUTLASS supports integer and floating-point data types of various precisions, and in addition to kernels, CUTLASS also has a profiler that enables users to measure the performance of existing kernels as well as add their own customized kernels to the profiler and measure their performance. So CUTLASS has been in development for over five years and currently supports six generations of NVIDIA GPU architectures. CUTLASS has been the go-to source for Tensor Core programming details and has over 10 million downloads. The CUTLASS GitHub community is quite strong, with many contributions from outside NVIDIA. At this moment CUTLASS is integrated into PyTorch, OneFlow, TVM, AITemplate, and various other GitHub projects.

So now we'll go over the first part of this presentation, about the fundamentals of Tensor Cores. This figure illustrates the tiled hierarchical structure implemented in CUTLASS. It closely resembles the CUDA memory hierarchy and execution model. Here you can see that tiled data from the two input matrices A and B is fetched from global memory into shared memory, and with the latest Hopper architecture the Tensor Cores, or math instructions, can fetch data directly from shared memory to compute the operation and produce an output. This output is then put back into global memory as part of the output C matrix. Tensor Cores perform a math operation of the form D = A × B + C. These math operations are warp-wide, meaning they are performed by sets of warps, where each warp consists of 32 threads. These operations are also asynchronous, and with the latest architecture Tensor Cores can be fed data from shared memory, which is most optimal, but they can also operate on data resident in registers. Tensor Core operations come in a wide variety of MMA instruction shapes. Here you can see that the MMA instruction shape varies by data type, and within each data type there are instructions of various sizes; these different instruction sizes can be used to fine-tune kernels for optimal performance. Looking at the evolution of the Tensor Cores, you can see that they provide tremendous throughput in terms of FLOPS, and that throughput has been improving with each new generation.
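As a quick orientation before going further, here is a minimal sketch of what calling a CUTLASS GEMM looks like through the classic 2.x-style device API. The choice of single precision, column-major operands, and the function name `run_sgemm` are my own for illustration; exact template defaults vary by release.

```cpp
#include "cutlass/gemm/device/gemm.h"

// Minimal sketch of the CUTLASS 2.x-style device API: a single-precision GEMM
// D = alpha * A * B + beta * C with all operands in column-major layout.
using Gemm = cutlass::gemm::device::Gemm<
    float, cutlass::layout::ColumnMajor,    // A
    float, cutlass::layout::ColumnMajor,    // B
    float, cutlass::layout::ColumnMajor>;   // C and D

cutlass::Status run_sgemm(int M, int N, int K,
                          float const* A, int lda,
                          float const* B, int ldb,
                          float*       C, int ldc,
                          float alpha, float beta) {
  Gemm gemm_op;
  // A single arguments struct describes the problem size, operand pointers and
  // leading dimensions, and the epilogue scalars; C is also used as the output D.
  return gemm_op({{M, N, K}, {A, lda}, {B, ldb}, {C, ldc}, {C, ldc}, {alpha, beta}});
}
```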
Now let's talk about the second important part of doing matrix multiplication operations, which is the data movement. Here we illustrate the asynchronous data movement between global and shared memory using the TMA multicast feature of the Hopper architecture. Here you see a thread block cluster, which consists of two SMs, or streaming multiprocessors. In this case one SM issues a cp.async.bulk operation, which is an asynchronous copy operation. Once this operation is issued, the data is copied from global memory into shared memory, and the asynchronous barriers are updated to let the SM know that the data movement has completed. With the multicast feature, the same cp.async.bulk operation can be used to copy data from global memory into the other SM and update the barriers in that SM as well. Now I would like to hand over the presentation to Vijay.

Thanks, Aniket. Okay, so as Aniket talked about, the Hopper architecture boasts a pretty impressive set of features and extremely high throughput, but an architecture is only as useful as it is programmable. So now let's talk about how CUTLASS comes in to help you write fast kernels for these architectures with ease. Before I dive deeper into it, though, let's peel back a little and think about it from a higher level. How do we think about linear algebra? When we talk about BLAS-3-like workloads, you can usually represent all of them as a pretty simple triple four-loop around the M, N, and K dimensions. But in order to get the performance benefits that Aniket talked about, you usually have to expose locality in both threads and data to extract as much performance as possible. So when we talk about the CUDA-specific programming model, you have your entire problem living in the GPU's global memory, which is then partitioned across the execution hierarchy of thread blocks. In order to extract even more locality, you tile the global problem with a thread-block-specific tile shape and fetch data into shared memory; the execution hierarchy level that corresponds to shared memory is the thread block itself. You further tile the shared memory and then exploit reuse through registers with the threads within the thread block. So you have these nested hierarchies of both execution units and the data that corresponds to them, and you need to keep mapping the correct thread-to-data ownership all the time, which turns this elegant triple four-loop into a game of index bookkeeping, which is quite painful.

Thinking about this deeply over the last five years, my colleagues from NVIDIA Research, namely Cris Cecka, and I have come up with a new layout language to describe this kind of coordinate and index bookkeeping, as well as manipulations over it, and we call it CuTe. CuTe is now a part of CUTLASS itself, and you can find it within the CUTLASS open-source repo. CuTe is composed of two main portions. The first component is a vocabulary type, the layout, which is a composition of a tensor shape and its stride. Each layout is fundamentally a shape and a stride put together. It is a compact representation that allows us to talk about all possible affine layouts, and its representational power is strong enough to represent any tensor that we care about within the linear algebra domain.
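To make the shape-and-stride idea concrete, here is a minimal sketch using CuTe's actual vocabulary functions (make_shape, make_stride, make_layout). The specific 4x8 column-major example and the function name are mine, not from the slides.

```cpp
#include <cute/tensor.hpp>
using namespace cute;

// A 4x8 column-major layout is the pair of shape (4,8) and stride (1,4).
// Calling the layout maps a logical (m,n) coordinate to a linear index.
int column_major_index(int m, int n) {
  auto layout = make_layout(make_shape(Int<4>{}, Int<8>{}),
                            make_stride(Int<1>{}, Int<4>{}));
  return layout(make_coord(m, n));   // e.g. (2,3) -> 2*1 + 3*4 = 14
}
```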
All layouts are natively hierarchical and multimodal, and they always maintain logical consistency, which means you never have to worry about the index bookkeeping; it is taken care of for you, and you can think about your algorithm from the perspective of the logical coordinates you actually care about. We also use the same layout concept to define thread layouts, which used to be implicit in CUTLASS 2.x and have now been made explicit, so you have ownership over your entire thread layout as well as your data layout. Sometimes GPU architectures also require you to specify special layout mappings called swizzle functions, and we support those too. What's even more interesting about CuTe, though, is that we have a formalized algebra of layouts, which allows us to manipulate and compose multiple different layouts together; that way we can build even more interesting layouts from the very simple primitive ones that the architecture provides. This is why we have chosen to embrace CuTe throughout the CUTLASS 3 codebase. It provides a single vocabulary type that subsumes every iterator-based layout we used to have in CUTLASS 2, condenses all of that complexity, and lets us worry only about the actual algorithmic description rather than the index bookkeeping. The formalized algebra, as I noted earlier, also makes these algorithms extremely simple to implement, because now we don't need to construct the final complicated layout directly; we can build it up from simple invariants. The algebra also allows us to inspect these layouts at compile time and provide safeguards against writing incorrect code. All of this massively simplifies the hardest part of parallel programming, which is making sure that you are mapping the correct threads to the correct values for your computation.

Here you can see a pictorial representation of what I just talked about. On the right-hand side of the slide are all of the named types that we used to have in CUTLASS 2 that have now been replaced by CuTe layouts in CUTLASS 3. On the left-hand side are some of the primitives of the layout algebra that we have implemented, which include functional composition of layouts, complements, left and right inverses of layouts, and then more complicated primitives built on top of those, including various kinds of products and divisions of layouts.

Fundamentally, though, a layout is just a function. It takes in a coordinate within the logical shape of your tensor and maps it onto a single index as its output. This is strong enough to represent the usual row-major and column-major layouts, but also the padded and mixed layouts that you can see here. What this also means is that a layout allows us to talk about tensors as simply views of an underlying storage. Every storage is a memory space, and we can reshape tensors and fold their modes using layouts and the algebra over them without having to actually copy any data. So in this case, what you're seeing is an example of how these functions map coordinates to offsets within a storage space: you take the inner product of the coordinate with the strides, and that is taken care of for you. And as I alluded to earlier, this kind of logical thinking in coordinate spaces also allows you to reshape and fold a tensor without actually having to copy it between memory spaces.
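As a hedged sketch of that reshape-without-copy idea (the buffer size, element values, and function name here are my own, chosen to mirror the 2x4 example on the next slide), the same storage can be viewed through two different layouts:

```cpp
#include <cute/tensor.hpp>
using namespace cute;

// The same 8-element buffer viewed first as a 2x4 and then as a 4x2
// column-major tensor; no data is copied, only the layout (the view) changes.
void reshape_as_views(float* data /* assumed to point at 8 elements */) {
  Tensor t24 = make_tensor(data, make_layout(make_shape(Int<2>{}, Int<4>{})));  // (2,4):(1,2)
  Tensor t42 = make_tensor(data, make_layout(make_shape(Int<4>{}, Int<2>{})));  // (4,2):(1,4)
  float x = t24(1, 2);   // same underlying element as t42(1, 1): linear offset 5
  (void)x; (void)t42;
}
```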
So in this case, on the bottom of the slide, I have a logical view of a tensor that is two by four, but I can change only the logical view into a four-by-two tensor without actually copying anything. Here are some more examples of the kinds of layouts we can represent. Over here are three examples, the column-major padded, pitch-linear, and column-major interleaved layouts, which we had defined as bespoke types in CUTLASS 2 and which have now been replaced by CuTe layouts. And here are some examples of swizzled layouts as well, which we can now express with that single vocabulary type. In summary, layouts have two components: the shape defines the logical coordinate space and the mappings between multiple different coordinate spaces, and the strides define how those coordinates get mapped onto the linear index.

Layouts are also natively hierarchical. So in this example, we're building an eight-by-eight Morton-coded layout from a simple two-by-two column-major layout. The base layout here is a simple column-major two-by-two layout, which is then repeatedly multiplied, via the layout product, with itself to produce the eight-by-eight layout. The full layout description is on the left-hand side, and you can see it over there. What we can also do is slice these layouts rather than indexing a particular element. The bottom-right side of the slide shows how you can take slices to obtain views within this tensor, so that you can index into multiple elements at the same time and partition the tensor across multiple possible workers. So you can imagine that if your thread (1,2) needs to slice out a certain portion of this Morton-coded layout, it would perform the slicing operation shown in the bottom right, obtaining the four elements that it needs to process.

So as I said earlier, CuTe has two main ingredients. The first is the set of vocabulary types included within CuTe, which are summarized on the slide. It's a fairly minimal API surface area: you have the shape and the stride, both of which are defined as hierarchical tuples where every element of the tuple can either be an integer type or a tuple itself. A layout is simply a composition of a shape and a stride, and it acts as a function that maps its shape domain onto its index codomain; a tensor is further a composition of a layout and a pointer that backs the storage, since tensors are just views that have a layout associated with them. More interesting is the functional side of CuTe, where we define operations on top of these types. For tuples we have the expected get operation, but we can also query a tuple's rank and its depth, all of which are formally defined, and for shapes and strides we have specific getters. We can also query the size, or the codomain size, which is called the cosize in CuTe, of a particular layout or tensor, and we can ask interesting questions such as whether two layouts or two tensors are congruent or compatible with each other. Again, I will defer to the source code for the exact mathematical definitions of these. Finally, the three most powerful primitives are composition of layouts, complements, and inverses, which we use to build a very robust microkernel infrastructure for GPUs.
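Here is a hedged sketch of a few of those queries and of functional composition; the function names follow CuTe's layout header, but the concrete 4x8 and 8x4 layouts are my own example, with the expected results noted in comments.

```cpp
#include <cute/tensor.hpp>
using namespace cute;

// A few layout queries plus functional composition of two layouts.
void layout_algebra_example() {
  auto a = make_layout(make_shape(Int<4>{}, Int<8>{}));           // (4,8):(1,4), compact column-major
  auto b = make_layout(make_shape(Int<8>{}, Int<4>{}),
                       make_stride(Int<4>{}, Int<1>{}));          // (8,4):(4,1), row-major strides
  auto c = composition(a, b);                                     // by definition, c(i,j) == a(b(i,j))
  // rank(a) == 2, depth(a) == 1, size(a) == 32, cosize(a) == 32
  (void)c;
}
```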
What this buys us is the following: what we used to have, on the right-hand side of the slide, are hand-implemented, partitioned views of these really complicated mapping functions, which in CUTLASS 3 get distilled into the very elegant, architecture-primitive representations of these layouts that you can now see on the left-hand side. These two codes represent exactly the same layout, but the one on the left is much more legible, much more maintainable, and, most importantly, uses a vocabulary type that flattens the learning curve for new developers implementing their own kernels.

So with that background on CuTe, let's see how we actually implement CUTLASS 3 itself, using CuTe as its backbone. The goal of CUTLASS 3 was to embrace CuTe throughout, for the reasons I described previously, but it went beyond just using CuTe. We wanted to emphasize composability and productivity at all layers of the hierarchy, with an extreme focus on allowing other people to reuse our components to build custom kernels that we cannot even envision needing to be written two years from now. We also wanted to provide static checking at compile time for every layer in the hierarchy, to make sure that these composable abstractions do not lead to runtime failures when they are incompatible, and instead give you actionable static asserts when something has gone wrong. We also wanted to flatten the learning curve for performance optimization on GPUs by providing a single point of performance tuning: oftentimes within CUTLASS 3 kernels, optimization comes down to careful consideration of which layouts you are designing, because layouts always maintain logical consistency, which makes writing correct kernels mostly trivial and turns performance optimization into a game of picking the right layouts. And we wanted to reduce the API surface area: CUTLASS 2 has a very large number of named types, which can make the learning curve quite steep, so this is something we have very consciously reduced in CUTLASS 3. CUTLASS 3 also has a new matrix multiply hierarchy. In CUTLASS 2 the hierarchy was centered around the hardware, but we found that this does not generalize well as architectures evolve over time, so CUTLASS 3 has a new, conceptual matrix multiply hierarchy.
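As a concrete taste of that "left-hand side" CuTe style from the comparison above, here is a hedged sketch, using real CuTe calls but with tile sizes, a thread arrangement, and names of my own choosing, of partitioning a global operand first into CTA tiles and then across threads:

```cpp
#include <cute/tensor.hpp>
using namespace cute;

// Tile a column-major MxK global tensor into 128x64 CTA tiles, then partition one
// CTA's tile across a 32x8 arrangement of threads. ptr_A, M, K are assumed kernel
// parameters; a real mainloop does the same for B and for the shared memory tiles.
__global__ void partition_sketch(float const* ptr_A, int M, int K) {
  Tensor mA = make_tensor(make_gmem_ptr(ptr_A),
                          make_shape(M, K), make_stride(Int<1>{}, M));        // (M,K), column-major
  Tensor gA = local_tile(mA, make_shape(Int<128>{}, Int<64>{}),
                         make_coord(blockIdx.x, _));                          // (128,64,k) tiles owned by this CTA
  Tensor tAgA = local_partition(gA, make_layout(make_shape(Int<32>{}, Int<8>{})),
                                threadIdx.x);                                 // this thread's slice of each tile
  (void)tAgA;
}
```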
At the very bottom of the hierarchy we start with the atom layer, which includes any architecture primitive that can be described by a single instruction. You can think of this as an instruction provided by the hardware that executes as an atomic unit; conceptually, it is the fewest number of threads that must participate in an instruction for an accelerated math or copy operation. This could be a single-threaded SIMT operation, a Volta quad-pair matrix multiply with eight threads in it, an Ampere or Turing warp-wide matrix multiply with 32 threads in it, or a Hopper matrix multiply instruction, which can have 128 threads participating in it.

On top of this we build the actual core microkernel infrastructure that I alluded to earlier. This is conceptually a spatial tiling of the various atoms, that is, of the architecture instructions, across threads and data, such that the generated code is guaranteed to execute at maximum throughput. So again, this is simply a spatial tiling across threads and data, with permutations folded into it. What this does is abstract away the atom layer, which is architecture specific, from the main loops and outer loops, which are architecture agnostic, and that allows us to build a consistent abstraction layer across a wide range of GPU architectures, all the way from Maxwell to Hopper today.

The collective layer then orchestrates these spatial tilings to compute matrix multiplies over tiles in global memory. The reason this collective layer also has to be architecture specific is that the temporal synchronization patterns for our pipelines depend on the GPU architecture; these collectives simply invoke, in a dynamic loop, the microkernels I described earlier. The kernel and device layers remain the same as in CUTLASS 2, and if you have used CUTLASS 2 they will be familiar to you: the kernel layer is in charge of grid planning logic, load balancing, schedules, outer loops, and, in the case of warp-specialized kernels, thread marshalling; this is also where the kernel is launched. The device layer simply provides a lightweight handle to a matrix multiply.

The API entry-point surface area is also much reduced, and there is a single API entry point per layer of the hierarchy we just discussed. At the microkernel layer you access the API through CuTe's tiled MMA or tiled copy. At the collective layer we have the collective mainloop and the collective epilogue, which describe these collective operations; a GEMM kernel is merely treated as a composition of a collective mainloop and a collective epilogue. The entry point to the kernel layer API is GemmUniversal; this is also one of the backwards-compatible interfaces we have, where a GemmUniversal kernel can have a CUTLASS 2 embodiment or a CUTLASS 3 embodiment. The device layer finally just wraps around the kernel layer, and therefore we call it the GemmUniversalAdapter; this too is a backwards-compatible interface that can contain within itself a CUTLASS 2 or a CUTLASS 3 kernel. As I mentioned earlier, we also provide static asserts throughout the hierarchy, so that if you try to compose things that are in fact incompatible with each other, we give you asserts at compile time. Let's dive a little deeper into the API itself. The kernel layer API is accessed mostly via tag dispatch policies, which also live within the mainloops as their dispatch policies.
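To make those layer names concrete, here is a hedged sketch of how a CUTLASS 3.x kernel type is composed. The type names follow the CUTLASS 3.x headers, but the exact template arguments vary by release, and the collective epilogue type is assumed to be defined elsewhere, so this is a structural sketch rather than a complete, compilable kernel.

```cpp
#include "cutlass/gemm/collective/collective_builder.hpp"
#include "cutlass/gemm/kernel/gemm_universal.hpp"
#include "cutlass/gemm/device/gemm_universal_adapter.h"

// Collective mainloop built by the CollectiveBuilder, with the stage count and
// kernel schedule left to the builder's auto policies.
using CollectiveMainloop = typename cutlass::gemm::collective::CollectiveBuilder<
    cutlass::arch::Sm90, cutlass::arch::OpClassTensorOp,
    cutlass::half_t, cutlass::layout::RowMajor, 8,          // A: element, layout, alignment
    cutlass::half_t, cutlass::layout::ColumnMajor, 8,       // B: element, layout, alignment
    float,                                                   // accumulator element
    cute::Shape<cute::_128, cute::_128, cute::_64>,          // CTA tile shape (M, N, K)
    cute::Shape<cute::_1, cute::_1, cute::_1>,               // cluster shape
    cutlass::gemm::collective::StageCountAuto,               // builder picks the stage count
    cutlass::gemm::collective::KernelScheduleAuto            // builder picks the schedule
  >::CollectiveOp;

// A GEMM kernel is just a composition of a collective mainloop and a collective
// epilogue; CollectiveEpilogue is assumed to have been defined elsewhere.
using GemmKernel = cutlass::gemm::kernel::GemmUniversal<
    cute::Shape<int, int, int, int>,                         // problem shape (M, N, K, L)
    CollectiveMainloop,
    CollectiveEpilogue>;

// The device layer is a lightweight handle that wraps the kernel layer.
using Gemm = cutlass::gemm::device::GemmUniversalAdapter<GemmKernel>;
```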
Generally speaking, a lot of our kernels are composable with many of the mainloops and vice versa, so we have mainloop policies that are used for categorical dispatch, and they contain within themselves kernel schedules that allow for free composability with more than one kernel schedule. On the flip side, the collective API is also specialized on dispatch policies, such that each mainloop opts into the kernel schedules it can be composed with. Dispatch policies provide a very easy way for you to extend and write custom kernels or custom mainloops as well.

One of the other important factors we considered quite heavily when designing CUTLASS 3 was sensible defaults with opt-ins. CUTLASS 2 had default GEMM configurations that would provide you with what we think are the best-performing solutions, but if you wanted to customize something, you couldn't use partial default configurations; you had to go through the expert API directly. This is one of the gaps we have tried to bridge in CUTLASS 3, where we have a new infrastructure for specifying defaults that generates the kernel configuration for you via the auto policies you see on the slide. In the top example I have StageCountAuto and KernelScheduleAuto in the builder API, where most of the user-facing inputs are very straightforward: you simply specify the element types you want to compute on and the architecture you want to target, and the difficult part of tuning the stage count and picking the right kernel and mainloop is done automatically for you. But let's say you wanted to leave most of it as is and just specify which kernel schedule and how many stages to use in your kernel: you can opt into that, so there is an incremental opt-in for customization points. Writing fully custom kernel schedules is also easier. Let's say you wanted to experiment with writing a new mainloop and use an existing kernel layer; in that case you would simply write your new mainloop file along with a new dispatch policy that opts into an existing kernel schedule, and everything will slot in quite gracefully for you. On the other hand, if you wanted to experiment with the outer loops and write a new kernel-layer schedule, but use an existing mainloop that we already provide, you can write yet another kernel, use one of the existing mainloop policies we have, and simply swap out the schedule tag within it. So this allows for a lot of composability and plug-and-play. Okay, with that I would like to hand it back to Aniket to go a little deeper into the actual synchronization semantics and warp specialization.

Thanks, Vijay. Now I'll go over a couple of kernel designs that we have implemented in CUTLASS with the Hopper architecture in mind. First I'll describe the cooperative warp-specialized persistent kernel design. Warp specialization means that sets of threads called warp groups are each assigned a certain role, and in this case we have two types of such warp groups: the producer warp group, which is responsible for fetching data from global memory into shared memory, and the consumer warp group, which is responsible for issuing the MMA, or math, instructions. Here you can see that the producer warp group issues the TMA operation; once the TMA operation has fetched the data into shared memory and the barriers have been updated, the consumer warp group is signaled to proceed.
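Schematically, this producer/consumer split looks roughly like the sketch below. Every helper function here is a hypothetical placeholder rather than CUTLASS API; the real mainloops implement this with TMA, mbarriers, and the pipeline abstractions under include/cutlass/pipeline/.

```cpp
// Hypothetical helpers standing in for the real TMA / barrier / pipeline machinery.
__device__ bool is_producer_warp_group();
__device__ void wait_stage_free(int k);     // wait until the consumer has drained this smem stage
__device__ void tma_load_tile(int k);       // async bulk copy gmem -> smem, arriving on a barrier
__device__ void wait_tile_ready(int k);     // wait on the barrier the TMA load arrives on
__device__ void mma_on_tile(int k);         // warp-group MMA sourcing its operands from smem
__device__ void signal_stage_free(int k);   // let the producer reuse this shared memory stage

__device__ void warp_specialized_mainloop(int k_tiles) {
  if (is_producer_warp_group()) {
    // Producer warp group: keep the shared memory stages filled with upcoming A/B tiles.
    for (int k = 0; k < k_tiles; ++k) {
      wait_stage_free(k);
      tma_load_tile(k);
    }
  } else {
    // Consumer warp group: issue tensor core MMAs on tiles as they become ready,
    // then run the epilogue on the accumulators (not shown).
    for (int k = 0; k < k_tiles; ++k) {
      wait_tile_ready(k);
      mma_on_tile(k);
      signal_stage_free(k);
    }
  }
}
```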
With that signal, the tensor core operations in the consumer warp group go ahead and start working on the data present in shared memory. While the tensor cores are in progress, the producer warp group can work on fetching the data for the next tile from global memory into shared memory, and this data movement can be overlapped with the tensor core operations being done in the consumer warp group. Coming back to the consumer warp group, once the tensor core operations finish, the epilogue is performed, which writes the data from the registers back into global memory, and by the time the consumer warp group is done with the epilogue, the data for the next tile, transferred using TMA, is already present in shared memory, so at that moment the tensor core operations can go ahead and work on the second tile. This is the benefit of having asynchronous data movement and math operations.

Now I'll present a slightly more advanced version of the warp-specialized persistent kernel, which is the ping-pong version. In this case you have one producer warp group but two consumer warp groups. The producer first fetches the data from global memory into shared memory for the first consumer; once the data is present in shared memory, consumer one can go ahead and issue its tensor core operations. While the tensor core operations and the epilogue are happening for consumer one, the data can be fetched for the next tile, which will be computed by the second consumer, and once the data for that second tile has been moved into shared memory, consumer two can start issuing its tensor core operations. This design helps in overlapping the epilogues with the tensor core operations for the next tile, which directly benefits performance, and it is something we have been using in the latest CUTLASS Hopper implementations.

Now I'll present some results for the latest CUTLASS version. Here we can see that CUTLASS provides performance comparable to the closed-source libraries implementing the same GEMM operations. CUTLASS performance is quite comparable across various data types and across various NVIDIA architectures, all while empowering developers to reuse the components present in CUTLASS and providing APIs over the existing functionality.

Now I'll move on to the last part of our presentation, which is the CUTLASS Python interface. As the name suggests, this is a high-level interface to CUTLASS kernels via Python. There are three main goals for the Python interface. First, it allows for easy declaration, emission, and compilation of CUTLASS kernels. Second, it can help catch common compilation or runtime bugs, which helps developer productivity by making debugging easy. And lastly, it provides easy integration of CUTLASS kernels into deep learning frameworks. Talking about the first goal, here you can see the number of template parameters a user has to define to instantiate a C++ CUTLASS kernel, whereas in Python users can get away with just a few parameters to instantiate a kernel. This is really useful for new and less expert users, but if users still want to tune for performance or try advanced features such as Stream-K, they can do so by setting a few properties, such as the swizzling factor in this case, and if users want an element-wise activation to be fused into the kernel, that is also allowed through the Python interface.
Moving on to the second goal: users familiar with C++ templates may have encountered very verbose compilation errors while working with templates, and these errors are sometimes hard to interpret. With the Python interface it becomes a lot easier to figure out those compilation bugs; here you can see that Python provides a very precise compilation error, which helps users debug their program quite quickly. Moving on to the third goal, the Python interface provides easy integration into deep learning frameworks such as PyTorch, which is used as the case study in this slide. Here we can see that, using the CUTLASS emit-PyTorch method, users can automatically emit PyTorch extension code for CUTLASS kernels. This method automatically generates the source needed to run the CUTLASS kernels, provides wrappers to map the inputs from PyTorch data types, and also generates the scripts that can be used to build and install these PyTorch extensions, which allows for easy integration of the extensions into existing PyTorch code.

Finally, to conclude: CUTLASS is a C++ template library for deep learning and HPC. CUTLASS provides various optimized, production-quality implementations of AI and HPC computations and has been the go-to source for Tensor Core programming details. I would like to thank the CUTLASS GitHub community for their contributions over the years. At this moment CUTLASS is integrated into projects such as PyTorch, TVM, AITemplate, and various other projects, so please go check it out. Thank you.