Let me ask you a question: how often have you used the mainframe today? Did you use your credit card to pay for your morning coffee or to order something online? Did you go grocery shopping or get money from an ATM? Then you've probably used an IBM Z system today. My name is Christian Jacobi. I'm the Chief Architect for Z processor design, and today I'm introducing the IBM Telum chip. Telum is the next-generation processor for IBM Z and LinuxONE systems. It's a very exciting moment for me to be able to talk about this chip publicly for the first time. I've seen this chip grow up from rough ideas in the concept phase, through high-level design and the ups and downs of implementing the chip. Now that it's working well on the test floor and we can talk about it here at Hot Chips, it's a major milestone for the project and for me personally.

The Telum design is focused on enterprise-class workloads. It provides the necessary performance, availability, and security, but it also adds a new capability: a built-in AI accelerator geared towards enabling our clients to gain real-time insights from their data as it's being processed. I'll talk you through all those details, but before I do, let me give you a little bit of background on IBM Z. As you can probably tell from my initial questions, IBM Z systems are a central part of large enterprises' IT infrastructure in industries like banking, retail, and logistics. But it's not only large enterprises in those kinds of industries that use IBM Z; the same capabilities are being exploited by startup companies in new areas like digital asset custody. Enterprise workloads are an ever-evolving mix of established and new technologies. Take languages, for example: it's not uncommon for an enterprise workload to be composed of programs written in COBOL, Java, Python, and Node.js. Enterprise workloads combine traditional on-prem data hosting with OpenShift-based hybrid cloud, and they combine traditional transaction and batch processing with artificial intelligence. That last point is particularly interesting. Increasingly, enterprises are using the data they own and process to gain insights with AI models, insights they can then use to optimize their businesses. The IBM Telum chip is designed for such mission-critical workloads, enabling enhancements in both the traditional aspects of enterprise computing and in AI capabilities.

Let me talk through some of the attributes that are traditionally associated with enterprise workloads. First, there's performance and scale. Enterprise workloads are very sensitive to per-thread performance, meaning the ability to finish every single task very quickly, and they are also very sensitive to scalability, so that they can scale up to the sheer number of tasks thrown at these systems every second. The Telum chip has an optimized core pipeline, a brand-new cache hierarchy, and a new multi-chip fabric that I'm going to describe in detail. Enterprise workloads are also very heterogeneous. Banking workloads are very different from, say, logistics workloads, and even within those workloads there are very heterogeneous kinds of programs. Still, some common types of operations occur across a wide range of applications, for example data sorting, compression, and cryptography. IBM Z has a long history of implementing hardware accelerators for such tasks in cooperation with the firmware and software teams, to enable the best possible end-to-end value from those accelerators.
Like I already mentioned, we are now implementing a new AI accelerator, and we have re-optimized all the existing accelerators to work in harmony with the new cache hierarchy and fabric design. Of course, enterprise workloads are also very sensitive to security, and the IBM Telum chip implements a number of innovations in that regard as well. We now implement encrypted memory, and we have a performance-improved trusted execution environment. The trusted execution environment enables clients to run containerized workloads in a way such that the hardware ensures that system administrators and hypervisor administrators cannot get to the data in those containers. That obviously aligns very well with a hybrid cloud operational model. Last but not least, enterprise and mission-critical workloads need the best possible reliability and availability. The IBM z15 predecessor chip already provided seven nines of availability, and with the Telum chip we're moving the ball forward through a number of enhancements, for example a new error correction and sparing mechanism that can recover data even when an entire L2 cache SRAM array has a wipeout error. We can transparently correct the data and then bring in a spare array without the software even noticing.

I'll now take you on a small journey through the chip to talk a little bit more about how the Telum processor achieves its performance and scalability enhancements, before I come back to the AI capabilities. Let's start with the eight cores and L2 caches per chip. We have optimized the core for the best possible performance, and we're investing a lot of silicon real estate into that per-core performance, for example through a very deep, high-frequency out-of-order pipeline and very large structures like the branch prediction tables and the caches. The out-of-order pipeline runs at a base frequency of more than 5 GHz and implements SMT2. A number of enhancements went into the core pipeline. One of the bigger ones is the redesigned branch prediction. We now have an integrated first- and second-level branch prediction pipeline, which allows us to access the second-level BTB with lower latency when branches are not found in the first-level branch predictor. We also implement a new mechanism called dynamic branch prediction entry reconfiguration. It allows us to vary the number of branches we can store in each table entry based on how many branches are in any given instruction cache line and whether those branches are going far away or staying nearby. Depending on that, we need fewer or more bits to store the branch target address, and based on that we can put more or fewer branches into the branch prediction tables. With that design, we can keep more than 270,000 branch targets in every single core's branch prediction tables. That sheer size is a testament to the scale of these enterprise workloads.

On z15, we implemented shared physical level 3 caches on the processor chip, and we had a separate cache chip that implemented a large level 4 cache. On the Telum chip, we are implementing all of that logic in a single chip, and we opted to quadruple the L2 cache to 32 megabytes. Of course, the L2 access latency is very important for the performance of enterprise workloads, so we spent a lot of engineering effort to get that latency as low as we could, and we achieved a 19-cycle load-use latency. That's roughly 3.8 nanoseconds, which already includes the access to the 7000-entry TLB.
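As a quick sanity check on how those two latency figures relate, here is the arithmetic in a couple of lines of Python. The only assumption is a flat 5.0 GHz clock; the talk only says the base frequency is "more than 5 GHz", so the exact cycle time is an approximation.

```python
# Back-of-the-envelope check of the quoted L2 load-use latency.
# Assumption: exactly 5.0 GHz; the talk only says "more than 5 GHz".
base_freq_ghz = 5.0
cycle_time_ns = 1.0 / base_freq_ghz        # 0.2 ns per cycle at 5 GHz
load_use_cycles = 19                       # quoted L2 load-use latency in cycles
print(load_use_cycles * cycle_time_ns)     # 3.8 ns, matching the figure above
```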
We have four pipelines in the L2 that allow overlapping traffic, so that the performance of the L2 does not bog down under load. Now, I mentioned the shared level 3 and level 4 caches that we had on the z15 generation. On the Telum generation, we don't have those as physical caches anymore. Instead, we build virtual level 3 and level 4 caches out of the private L2 caches, and overall we can provide 1.5x the cache per core at improved latencies compared to z15. From a software and software-performance perspective, it still feels like a traditional cache hierarchy, even though everything is built from the L2 caches. That's an important aspect of driving a consistent workload performance gain across a wide range of workloads with the Telum chip.

Let me describe in a little more detail how we achieve these virtual level 3 and level 4 caches. First, we interconnect all the L2 caches on the chip with a ring infrastructure that supports more than 320 gigabytes per second of ring bandwidth. Then, based on that infrastructure, we implement what we call on-chip horizontal cache persistence. What that means is that when one L2 evicts a cache line, it can look around on the chip to find a less busy L2 and push the cache line into that other L2, so that the line stays close by on the chip and, should the workload come back to that data, it's accessible very quickly at on-chip latencies. That way, we achieve a 256-megabyte distributed cache on the chip with an average latency of only 12 nanoseconds. That is faster than the physical L3 we had on z15. We then apply the same mechanism across multiple chips: we can group up to eight Telum chips and form a virtual 2-gigabyte level 4 cache across those eight chips.

Let me describe how we use the Telum chip to build out a large-scale system. Of course, we start with a single chip with its 256 megabytes of cache. The Telum chip is designed to fit on a dual-chip module, so there are two chips with 512 megabytes of cache on one module. Four of those modules get plugged into a four-socket drawer; think of the drawer as the motherboard that can hold four of those dual-chip modules. That gives us eight chips and the virtual level 4 cache of two gigabytes. Then up to four of those drawers can be interconnected into a system with up to 32 chips and eight gigabytes of cache, forming one large-scale, coherent, shared-memory system that enables the scale that our clients' most demanding workloads need.
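To keep that build-up straight, here is the same chip-to-system cache arithmetic written out in a few lines of Python. All figures are the ones quoted above; only the variable names are mine.

```python
# Cache build-up from a single Telum chip to a full 32-chip system,
# using only the figures quoted in the talk (variable names are mine).
l2_mb_per_core = 32                                   # private L2 per core
cores_per_chip = 8
virtual_l3_mb = l2_mb_per_core * cores_per_chip       # 256 MB virtual L3 per chip

chips_per_module = 2                                  # dual-chip module (DCM)
modules_per_drawer = 4                                # four DCMs per drawer
chips_per_drawer = chips_per_module * modules_per_drawer     # 8 chips per drawer
virtual_l4_gb = virtual_l3_mb * chips_per_drawer / 1024      # 2 GB virtual L4 per drawer

drawers_per_system = 4
system_chips = chips_per_drawer * drawers_per_system         # 32 chips per system
system_cache_gb = virtual_l3_mb * system_chips / 1024        # 8 GB of cache in total

print(virtual_l3_mb, virtual_l4_gb, system_chips, system_cache_gb)  # 256 2.0 32 8.0
```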
All of that is enabled by the fabric controllers and the cross-chip interfaces that sit on the perimeter of the chip. There are latency and bandwidth improvements compared to the prior generation along that entire build-up from single chip to full system; I'll just mention two here. The dual-chip module uses a two-cycle transfer path between the sending chip and the receiving chip, meaning we can send data out of one chip and receive it in a latch on the other chip just two clock cycles later. We achieve that by having perfectly synchronized clock grids on the two chips of the DCM. I already mentioned that on z15 we had a dedicated cache chip, and that cache chip also acted as a hub whenever two processor chips needed to communicate with each other, which led to a little bit of added latency through that hub chip. Now, with everything combined into the Telum chip, we can implement a completely flat topology within the drawer, meaning every one of the eight processor chips in the drawer has a direct connection to every other chip in the drawer. That further reduces the latency of that large virtual level 4 cache. Taking all of these enhancements together, the improved fabric controls and on-drawer interfaces, the core design, and the cache hierarchy, we achieve over 40% per-socket performance growth. That's the kind of performance growth our clients need to keep up with the increase in their workloads.

I spent a lot of time describing the details of how we achieve the performance and scale. I now want to switch gears and talk more about the embedded accelerators, and specifically the accelerator for AI. But before I go into the details of the AI accelerator, I want to spend a little bit of time explaining the use cases that we are going after and that shaped some of the design decisions. When I look at enterprise workloads and AI use cases, they roughly fall into two categories. The first category is what I would label business insights, where clients use AI on their business data to derive insights they then use to improve their businesses. Examples include fraud detection on credit cards, customer behavior prediction, or supply chain optimization. The second category I would label intelligent infrastructure, where AI algorithms are used to make the machine itself more efficient. Examples include intelligent workload placement in an operating system, database query plan optimization, or anomaly detection for security.

Let's take credit card processing as an example, and specifically credit card fraud detection. We know from our conversations with clients that when they try to do that with an off-platform inference engine, they cannot achieve the low latency, and the consistency of low latency, they need when sending data from IBM Z to a separate platform. Sending data to a separate platform also creates all sorts of security concerns; that data, after all, is sensitive and often personal. So the data needs to be encrypted, the security standards need to be audited, and those things create additional complexity in an enterprise environment. Based on that, we know from our clients that they would much rather have the ability to run AI directly embedded into the transaction workload, directly on IBM Z. That way, they can score every transaction, 100% of the time, with the best available model they want to use for that task. For that reason, we chose to implement a centralized on-chip accelerator that is directly shared by all the cores. Let me talk through some of the attributes this design point provides and compare them to those basic use cases. First of all, I mentioned we need very low, and just as important, very consistent inference latency. Because the accelerator is accessible by every single core, when a core switches back and forth between non-AI work and AI work, it can use the entire compute capacity of the AI accelerator whenever it performs AI work. That's different from most other server processors, which implement their AI capabilities directly in their vector execution units. In that design point, when a workload switches back and forth between AI and non-AI work, the AI work can only get the portion of the total capacity that belongs to that core.
In our design point, the entire centralized accelerator's capability is available to every core when it needs it. Second, we had to size the AI accelerator's compute capacity to match the total transaction capacity of the Telum chip. We want our clients to be able to perform AI inference as part of every transaction, so we needed to implement sufficient compute capacity for that. The centralized AI accelerator gave us some flexibility in floor planning, in where we placed the accelerator on the chip, and in how much area we could devote to it. Between those two considerations, we implemented the AI accelerator with more than six teraflops of compute capacity. We also know from our clients that they use a wide range of different types of AI models, ranging from traditional machine learning models, like decision trees, to various types of neural networks. We designed the accelerator to accelerate the operation types that occur in those different kinds of AI models. I already mentioned the importance of security and how we avoid sending data off-platform with a built-in accelerator. But of course, it's also important to follow the strong memory virtualization and protection mechanisms that IBM Z implements on its cores; I'll describe how we map that from the core directly onto the accelerator. And last but not least, AI is a fairly new and quickly evolving field, so we designed the accelerator with extensibility in mind. A lot of firmware is involved in how the accelerator works, and that enables us to provide updates and new features and functions with new firmware releases in the future. The hardware design of the accelerator also naturally lends itself to enhancements in future generations of silicon.

So let me go a little more into the details of how the accelerator works. We defined a new instruction called the Neural Network Processing Assist instruction. It is a memory-to-memory assist instruction, meaning the operands, the tensor data, sit directly in user space in a program's memory. So, for example, a program could have two matrices and a destination matrix sitting in memory and call the instruction, and the instruction would perform the matrix multiplication of the two source operands and put the result into the destination operand. The instruction can perform many types of operations, like matrix multiplication, pooling, or activation functions. There is firmware running on both the processor core and the AI accelerator. The processor firmware performs all the address translation, translating the program's virtual addresses to physical addresses, and it also performs access checking as it does that translation. That way, we inherit all the natural virtualization and protection capabilities of the core for the AI accelerator. The core also prefetches the tensor data into the L2 cache so that the data is readily available and distributed. The firmware sitting on the core and on the accelerator then builds a data pipeline that stages the data from the L2 cache into the accelerator and distributes it within the accelerator, to get the maximum efficiency out of the compute engines there.
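To make the memory-to-memory model concrete, here is a small conceptual sketch in Python. This is emphatically not the actual instruction interface or the IBM firmware; the function name, the operation names, and the NumPy stand-in are placeholders meant only to show that the source and destination tensors live in the caller's own memory.

```python
from typing import Optional
import numpy as np

# Conceptual sketch only: the real Neural Network Processing Assist instruction is a
# hardware instruction driven by firmware. This hypothetical function merely mimics
# its defining property: operands sit directly in the program's memory, and the
# result is written back in place to the destination operand.
def nnpa(op: str, src_a: np.ndarray, src_b: Optional[np.ndarray], dst: np.ndarray) -> None:
    if op == "matmul":
        np.matmul(src_a, src_b, out=dst)     # result lands in the caller's buffer
    elif op == "relu":
        np.maximum(src_a, 0.0, out=dst)      # an activation function, also in place
    else:
        raise ValueError(f"unsupported operation: {op}")

a = np.random.rand(64, 128).astype(np.float32)    # source tensor in user memory
b = np.random.rand(128, 32).astype(np.float32)    # second source tensor
c = np.empty((64, 32), dtype=np.float32)          # destination tensor in user memory
nnpa("matmul", a, b, c)                           # memory to memory, no device copies
```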
Speaking of compute performance, as I said, we deliver more than six teraflops per chip, which provides us with over 200 teraflops in a 32-chip system. The compute capacity comes from two compute arrays. The upper array is the matrix array: it consists of 128 processor tiles, each implementing an 8-way SIMD engine for a 16-bit floating-point format. The array is designed as a high-density multiply-accumulate array focused on matrix multiplication and convolution. The second array is the activation array. It consists of 32 processor tiles, each with an 8-way SIMD engine for FP16 that can also perform FP32 operations. That array is optimized for activation functions and other complex operations like LSTM. To make the most efficient use of these arrays, we invested a lot into the data flow surrounding them. We have the intelligent prefetch engine, which is firmware controlled; it receives the translated physical addresses from the core and then fetches the data from the L2 cache, through the ring, into the accelerator, with results going back the same way. It can operate on the ring at about 100 gigabytes per second. The data gets loaded into the scratchpad, from where it is distributed into the input and output stages of the compute arrays themselves. Along that data path, we have data formatters that ensure the data arrives at the compute engines in exactly the format and layout the compute engines need. We can distribute the data at more than 30 gigabytes per second, and through all the firmware coordination between the core, the data movers, and the compute arrays themselves, we maximize the compute efficiency of that compute grid.

Let me step back out from the accelerator and talk about the software ecosystem that enables the exploitation of this accelerator. There is a broad and open software ecosystem that lets our clients build and train models anywhere: they can build models on IBM Z, on IBM Power Systems, or on any other system, using the tools their data scientists are already familiar with. You see a lot of familiar logos on this page. The trained models can then be exported into the Open Neural Network Exchange (ONNX) format, and the IBM Deep Learning Compiler can take the ONNX model and compile and optimize it for direct execution on the AI accelerator.
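As one concrete, hedged illustration of that export step: a model trained in PyTorch can be written out to ONNX with the standard torch.onnx.export API. The little model and file name below are made up for illustration, and the subsequent compilation with the IBM Deep Learning Compiler is only described above, not shown here.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a trained scoring model; the architecture and names are
# illustrative only. The export call itself is the standard PyTorch ONNX exporter.
class FraudScorer(nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = nn.LSTM(input_size=32, hidden_size=64, batch_first=True)
        self.head = nn.Linear(64, 1)

    def forward(self, x):
        out, _ = self.rnn(x)                              # run the transaction sequence
        return torch.sigmoid(self.head(out[:, -1, :]))    # score from the last step

model = FraudScorer().eval()
example_input = torch.randn(1, 16, 32)                    # (batch, sequence, features)
torch.onnx.export(model, example_input, "fraud_scorer.onnx",
                  input_names=["transactions"], output_names=["score"])
# The resulting .onnx file is what a compiler such as the IBM Deep Learning Compiler
# would take as input when targeting the on-chip accelerator.
```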
On the right side, you see a typical enterprise stack consisting of the operating system and container platform, databases, app servers, and applications. As I already mentioned, there are use cases for AI at every layer in that stack: the operating system can benefit from AI for intelligent workload placement, databases can optimize their query plans, and at the application layer, clients can embed AI into their transactions for things like credit card fraud detection or supply chain optimization.

I talked a lot about the goal of achieving low latency, so that AI can be embedded in real time and at scale without slowing down transactions. We built a number of models in cooperation with our clients, proxy models that reflect real-world applications of AI inferencing. On this chart I'm showing one example: a recurrent neural network that we co-developed with a global bank to reflect their credit card fraud scoring models. We ran that model on a single Telum chip, and we can run more than 100,000 inferences every second with a latency of only 1.1 milliseconds. And as we scale the system up from one chip to two, and on up to 8 and 32 chips, we can perform more than 3.5 million inferences per second, and the latency still stays very low, at only 1.2 milliseconds. Now, this is only running the AI inference tasks; we are not actually running the credit card transaction workload. But it does show that Telum's AI accelerator has the capacity to provide low-latency, real-time inference at massive scale, so that it can be directly embedded into transactions at very high bandwidth.

Let me summarize. I introduced the Telum processor chip for the next generation of IBM Z and LinuxONE systems. I explained in some detail how the Telum chip achieves its performance and scale enhancements, and for lack of time I only gave a few examples of how the Telum chip also improves the security and availability characteristics. And then I described in some detail how the embedded AI accelerator will enable our clients to embed AI directly into their enterprise workloads. Of course, this chip is the work of a very large team, spanning the globe and spanning multiple groups inside IBM, from the IBM Systems chip development team to the IBM Research division. Our technology development partner on this project is Samsung; we are manufacturing the Telum chip in Samsung 7-nanometer EUV technology. The entire team is very excited to see this chip come to life, and we can't wait to see how our clients will benefit from all the capabilities we put into the Telum chip. Thank you very much for your attention, and we now have a few minutes for questions and answers.