Okay, my name is Christian König. I've been working for AMD for something like eight and a half years now. I'm one of the maintainers of the amdgpu and Radeon kernel modules, and for the last two years I've also been maintaining TTM. And that's what I want to talk about: this presentation is all about the TTM memory manager in the Linux kernel.

What's TTM? TTM is a memory manager for GPUs with dedicated video memory. So if you've got something like an AMD GPU or an NVIDIA GPU, those drivers, Radeon and amdgpu as well as Nouveau, are using TTM to manage their buffer objects and to move their buffer objects around. The acronym TTM originally stood for translation table manager, but by today's standards that's as inaccurate as it could be, because TTM is managing quite a bunch of things, but not translation tables anymore. It used to manage translation tables for AGP; that's where the name came from.

Yeah, as I said, TTM is directly used by amdgpu, Nouveau, QXL, Radeon and vmwgfx, the VMware graphics driver. And the VMware graphics guys, originally Tungsten Graphics, are the ones who originally wrote it and merged it into the Linux kernel. In addition to that, TTM is used indirectly, through the DRM VRAM helpers, by the AST, Bochs, HiSilicon and Matrox G200 drivers and the VirtualBox video driver.

So how does TTM work? First of all, TTM defines so-called memory domains. Since it is managing the local memory on your graphics hardware, the first thing that TTM defines is the so-called VRAM domain, which represents the memory on the GPU. Then you've got the system domain, which is system memory that is mapped, or available, or accessible to the GPU, so the GPU can directly read and write over PCIe and DMA to the buffer objects in the system domain. And the last thing is the swap domain. The swap domain is what's not directly accessible by the GPU, but still backed by system memory. So the CPU can write to and read from it, but not the GPU anymore. And stuff in the swap domain can, in theory, also be swapped out to disk if necessary.

Those yellow dots, or maybe orange, those orange dots here are buffer objects which are allocated and then backed by memory in the different domains. And the important thing in TTM is that those buffer objects can move from one domain to another. So for example, if you suspend your laptop, buffers from the VRAM domain usually move to the system or swap domain, for the simple reason that the power to the local GPU memory is turned off. If you didn't evacuate those buffers, you would just get random stuff on your screen, or just a black screen, when you turn the laptop on again after a suspend/resume cycle. That's what those red arrows here are good for. When applications then use those buffers, they usually move back into video memory or system memory, and that's what the green arrows represent on this slide.

How does eviction work? First of all, eviction is the process of making room in a specific domain. For example, if you have a lot of applications that allocate video memory, you're sooner or later going to run out of it. So what TTM does is look at the linked list of buffer objects in your video memory domain and start to evict things from that domain, that is, it moves things into other domains. Usually into the system domain; if there isn't enough room there anymore either, it moves on and pushes stuff into the swap domain. Eviction works by locking the LRU, then picking the first entry on the LRU and calling back into the driver to check whether the eviction is valuable (there's a small sketch of this loop below).
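To make the domains and the eviction walk just described a bit more concrete, here is a minimal user-space C sketch. Every name in it (buffer_object, domain, alloc_with_eviction and so on) is made up for illustration and is not TTM's actual code; the only real name referenced is the eviction_valuable driver callback, and the CPU-visible check anticipates the example given right after.

```c
/*
 * Illustrative sketch only, not TTM code: buffer objects live in one of
 * three domains and get evicted (moved) to another domain to make room.
 */
#include <stdbool.h>
#include <stddef.h>

enum domain_id { DOM_VRAM, DOM_SYSTEM, DOM_SWAP };

struct buffer_object {
    struct buffer_object *lru_next; /* next entry on the domain's LRU list */
    enum domain_id domain;
    size_t size;
    bool cpu_visible;               /* e.g. sits in the CPU-visible part of VRAM */
};

struct domain {
    enum domain_id id;
    struct buffer_object *lru_head; /* least recently used entry first */
    size_t free;                    /* free space left in this domain */
};

/* Driver decision (the real TTM callback is called eviction_valuable):
 * does evicting this BO actually help the request that just failed? */
static bool eviction_valuable(const struct buffer_object *bo, bool need_cpu_visible)
{
    return !need_cpu_visible || bo->cpu_visible;
}

/* Move a BO into another domain, e.g. VRAM -> system -> swap. */
static void move_bo(struct buffer_object *bo, struct domain *from, struct domain *to)
{
    from->free += bo->size;
    to->free -= bo->size;           /* may itself need eviction in 'to' */
    bo->domain = to->id;
    bo->lru_next = to->lru_head;    /* put it on the destination's LRU */
    to->lru_head = bo;
}

/* Allocate with eviction: walk the LRU from the head, skip BOs the driver
 * does not consider valuable to evict, move one out, then retry. */
static bool alloc_with_eviction(struct domain *dom, struct domain *fallback,
                                size_t size, bool need_cpu_visible)
{
    while (dom->free < size) {
        struct buffer_object **link = &dom->lru_head;
        struct buffer_object *victim;

        /* rescanned from the head on every retry, which is the
         * inefficiency the LRU cursor work mentioned later avoids */
        while (*link && !eviction_valuable(*link, need_cpu_visible))
            link = &(*link)->lru_next;
        victim = *link;
        if (!victim)
            return false;           /* nothing left to evict */
        *link = victim->lru_next;   /* unlink from this LRU */
        move_bo(victim, dom, fallback);
    }
    dom->free -= size;              /* the retried allocation now succeeds */
    return true;
}
```

The real code of course also has to deal with locking and with the GPU or CPU still using the buffers, which is what the fault handling discussed next is about.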
Coming back to that eviction check: for example, in video memory you usually have a part which is CPU accessible and a part which is not. So the driver can decide, if it needs CPU-accessible video memory only, that a particular eviction might not actually be valuable to do, and then it's skipped. Otherwise we actually evict the buffer, that is move it into another domain, and then we retry the original allocation which failed in the first place. If that succeeds, the algorithm ends, otherwise we repeat from the beginning again. This is actually rather inefficient, because when we, for example, have a lot of buffers at the beginning of the LRU list, we test over and over again whether they are valuable to evict, and that doesn't make much sense, but more on that later.

Other functionality that TTM provides. A big piece of functionality, when you want to move buffers around, is that you need to make sure that nobody is accessing the buffer. When the kernel moves a buffer from A to B, you need to make sure that neither the GPU nor the CPU is still accessing A and writing changes to A while the thing is copied to B. For that we have the CPU page fault handling in TTM. That works by TTM tracking at which virtual address locations buffers are mapped into user space. As soon as we want to move a buffer around, those mappings are invalidated, and if, for example, the CPU then tries to access that memory by writing to or reading from it, we end up in a page fault handler. This page fault handler can identify which buffer object the CPU is trying to access and wait for the move of that buffer object to finish. That's what the CPU page fault handling is doing here.

The other big piece of functionality is the memory page pools. The memory page pools also originated in the AGP days. On AGP, graphics hardware in general could only access uncached system memory. Now, uncached system memory is something rather unusual, because when you want to access something fast with the CPU, you usually want to have it cached, so that one read doesn't result in tons of memory transactions. But making system memory uncached is actually a rather costly process. You first call into the Linux kernel, say get_free_page(), to get a free page, and then you change the caching attributes of that page, which requires a system-wide TLB flush of the kernel page tables, and that can take quite a while. So what the memory page pool does is, when those pages are freed up, their caching policy is not changed back immediately; instead we put them into a pool, so on the next allocation we can serve from the pool instead of asking the core operating system for free pages (a tiny sketch of this pooling idea follows below). That was the first idea, back when it was about AGP. Later on this was extended to provide uncached and write-combined pools for modern hardware, for PCI Express hardware, as well as decrypted pages and huge page and DMA32 page handling. DMA32 pages come into play when you have older graphics hardware which can't address all of system memory.

Yeah. As I said, the whole TTM framework is roughly 11 years old, nearly 11 years old now. It was merged upstream in June 2009, if I remember correctly. So it has quite a bunch of history, and with that history it also acquired quite a bunch of problems, which I want to outline, and discuss how we started solving those problems and what else we should probably do to clean that up a bit.
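Before getting to the problems, the tiny user-space sketch of the page-pool idea mentioned above. All names here are made up; change_caching() stands in for what on x86 is done with set_memory_uc()/set_memory_wc()/set_memory_wb(), each of which implies a system-wide TLB flush.

```c
/* Illustrative sketch, not TTM code: a freed page keeps its expensive
 * caching attribute and is reused on the next allocation, instead of being
 * converted back and handed to the core allocator every time. */
#include <stdlib.h>

enum caching { CACHED, UNCACHED, WRITE_COMBINED };

struct page {                       /* stand-in for the kernel's struct page */
    struct page *next;
    enum caching caching;
};

struct page_pool {
    struct page *free_list;         /* pages kept with attributes intact */
    enum caching caching;
};

static struct page *core_alloc_page(void)   /* get_free_page() stand-in */
{
    return calloc(1, sizeof(struct page));
}

static void change_caching(struct page *p, enum caching c)
{
    p->caching = c;                 /* in the kernel: costly, flushes all TLBs */
}

static struct page *pool_alloc(struct page_pool *pool)
{
    struct page *p = pool->free_list;

    if (p) {                        /* fast path: attribute is already right */
        pool->free_list = p->next;
        return p;
    }
    p = core_alloc_page();          /* slow path: fresh cached page */
    if (p && pool->caching != CACHED)
        change_caching(p, pool->caching);
    return p;
}

static void pool_free(struct page_pool *pool, struct page *p)
{
    p->next = pool->free_list;      /* keep it, don't change caching back */
    pool->free_list = p;
}
```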
The first problem is that after TTM was merged into the upstream Linux kernel, it quickly became a dumping ground for driver-specific features. Whenever a driver maintainer had a feature he wanted to implement, instead of putting it into his driver he put it into TTM, because the assumption was that somebody else could make use of it sometime. Usually that's probably a good idea in open source software. The problem here is that everybody did that without a central maintainer who actually filtered out which features were useful and which were not. So what happened is that TTM became, as I said, a dumping ground, and as we'll see on the next slide, it was rather hard work afterwards to move functionality back into the drivers again.

The next bigger problem is that we actually abuse the DMA API in TTM. As I said before, we have those page pools, and when we ask the core Linux kernel, the core operating system, to get us a free page, we always get cached system memory. Changing the caching attributes of those pages is completely x86 specific, and that's also the reason why the AGP backend in TTM only ever worked on x86. There used to be AGP hardware on PowerPC, for example, but as far as I know we never got that working with the AGP backend in TTM. Additionally, those page pools make assumptions about how the DMA API works internally. That then collides with things like an IOMMU, where you have an MMU block, a memory management unit, sitting in the bridge, which also maps the DMA addresses coming in from the graphics hardware to physical system addresses (there's a small sketch of the proper DMA API pattern a bit further down).

The third item on my list here is that TTM has a horrible mid-layer design. There are things where TTM calls into the driver and the driver calls back into TTM, and at the very bottom of this is a piece of functionality which tries to limit system memory usage for processes using the graphics hardware, which is actually implemented from the wrong side.

So, let's see how we tried to fix that. As I said, I've been one of the TTM maintainers for two years now, and during that time we started to fix those problems up a bit. What you see here is the number of lines of code which the Linux kernel uses for the TTM subsystem. Please note the scaling on the left side, so it's not that bad. As you can see, in the beginning, when we took over, we continued developing new features for TTM and merging them into the upstream Linux kernel, and so the code base first started to rise a bit. And then some of our developers started to ask, why do we have this function here? Who is using that? And then we came around to: hmm, nobody is using that. Somehow we had forgotten to clean things up for roughly 10 years. So, from one day to the next, we removed 13% of the whole code base. That was quite a bit. Well, it wasn't all unused code; it was more like a bunch of features which were used by only one driver. VMware graphics had a huge bunch of those sitting in TTM, but we could just take them and move them over. That was roughly 2,000 lines of code. The same with features used only by Radeon and amdgpu, and since we are also the maintainers of those drivers, it was relatively easy to move those around. The problem is we still have features which are only used by, for example, Nouveau, and Nouveau actually doesn't seem to implement them correctly, or doesn't seem to use them correctly.
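Coming back for a moment to the DMA API point above: with an IOMMU in the path, the address the GPU has to use is whatever the DMA API hands back, not the page's physical address. A minimal kernel-style sketch of the generic pattern follows; this is deliberately not TTM's code, just the plain dma_map_page()/dma_unmap_page() usage that the pools should not be bypassing.

```c
/* Kernel-style sketch of going through the DMA API instead of assuming the
 * device sees physical addresses.  Not TTM code, just the generic pattern. */
#include <linux/dma-mapping.h>
#include <linux/errno.h>
#include <linux/mm.h>

static int map_page_for_gpu(struct device *dev, struct page *page,
                            dma_addr_t *gpu_addr)
{
    dma_addr_t addr = dma_map_page(dev, page, 0, PAGE_SIZE, DMA_BIDIRECTIONAL);

    if (dma_mapping_error(dev, addr))
        return -ENOMEM;

    /* With an IOMMU this can differ from page_to_phys(page); it is the
     * address to program into the GPU's translation tables. */
    *gpu_addr = addr;
    return 0;
}

static void unmap_page_for_gpu(struct device *dev, dma_addr_t gpu_addr)
{
    dma_unmap_page(dev, gpu_addr, PAGE_SIZE, DMA_BIDIRECTIONAL);
}
```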
So, back to those Nouveau-only features: I am trying to sync up with the Nouveau guys to actually get on the same page about what we should do with them, either remove them, move them into Nouveau, or whatever else needs to happen.

My colleague Wei Huang did a talk about bulk moves in TTM; that's a feature we did late last year. That's the little bump here at the end, something like 100 lines of code which got added for the bulk move feature. Later on we cleaned up a bit of stuff we actually didn't use, and so the line count went down quite a bit again. But I think something like 5% of the code base could still be removed without losing much functionality.

Usage of the DMA API. What you see here is the organization of those page pools. You have a page pool for normal system memory, which again has normal cached pages, normal uncached pages and normal write-combined pages, then you have a pool for DMA32 cached pages, uncached, and so on. Those page pools are actually quite important for quick application start-up and good performance, because as I said before, changing the caching attributes of memory takes time, a lot of time. Especially if you don't use huge pages. For those who don't know, the Linux kernel distinguishes between normal pages, huge pages and sometimes so-called giant pages. Normal pages are 4 kilobytes in size on x86, huge pages are 2 megabytes, and giant pages are 1 gigabyte. So if an application happens to allocate 1 gigabyte at a time, you can actually end up getting one really big 1-gigabyte block of memory, and not just thousands of small 4K allocations. That makes the whole thing much, much more efficient because you don't have so much housekeeping overhead.

So what are we abusing here? The problem is that this whole implementation is completely x86 specific. We had a really hard time getting it working on PowerPC as well as ARM. Some, yeah, let's call them desperate, Chinese guys even tried to get our driver working on a MIPS architecture, and we are telling them over and over again that they have platform coherency issues, but they don't really want to listen to that. So that's also one of the problems which ends up in this domain here. The solution to that is moving the whole pooling concept into the DMA API. The DMA API already has the concept of pooling coherent memory allocations; the problem is it doesn't have a concept of uncached and write-combined memory. I've actually already synced up with Christoph Hellwig, one of the core Linux developers in that area, and he wants to provide us with a bunch of new flags so that we can move those allocations back into the DMA API. When that's done, we need to implement memory management callbacks in the DMA API pools, and that will probably solve this problem; there's a small sketch of the existing write-combined DMA helper below. When you look into the pooling code, there's a comment at the top of the file saying that this should probably be part of the DMA API and not part of TTM. That comment has been sitting there for over 10 years now and nobody has fixed it. So yeah, we're just tackling things which have been well known for a very, very long time.

The biggest issue, and that's also why we think that TTM in its current form is more or less a dead end, are the design issues, design issues which you can't easily fix even though it's an internal Linux kernel component with no external API dependencies. The problem shows up when, let's imagine, a user space driver wants to display something on the screen. Not something very unusual, but it might happen.
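As an aside on that DMA API plan (the sketch promised above): what exists in the DMA API today is a write-combined coherent allocation helper, dma_alloc_wc()/dma_free_wc(). What is missing, and what the new flags are meant to add, is pooling of individual uncached and write-combined pages behind the API, so treat this only as an illustration of the direction, not of the future interface.

```c
/* Kernel-style sketch: the DMA API can already hand out write-combined,
 * device-addressable memory.  What TTM needs on top is per-page pooling
 * behind this interface, which is the planned work mentioned in the talk. */
#include <linux/dma-mapping.h>
#include <linux/errno.h>
#include <linux/gfp.h>

struct wc_buffer {
    void *cpu_addr;        /* CPU pointer to a write-combined mapping */
    dma_addr_t dma_addr;   /* address the device uses for DMA */
    size_t size;
};

static int wc_buffer_alloc(struct device *dev, struct wc_buffer *buf, size_t size)
{
    buf->cpu_addr = dma_alloc_wc(dev, size, &buf->dma_addr, GFP_KERNEL);
    if (!buf->cpu_addr)
        return -ENOMEM;
    buf->size = size;
    return 0;
}

static void wc_buffer_free(struct device *dev, struct wc_buffer *buf)
{
    dma_free_wc(dev, buf->size, buf->cpu_addr, buf->dma_addr);
}
```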
So this user space driver calls into its kernel driver and says, hey, I want to allocate three megabytes of video memory to display something. The driver then calls into TTM: hey, I need this buffer backed by three megabytes of video memory. TTM then says, oh, I'm running out of video memory, let's ask the driver what to do. The driver says, okay, evict some buffers to system memory when you're running out of video memory. TTM then says, okay, I can do that, and calls the driver to move those buffers. While moving those buffers, the driver sees, oh, in my system domain I'm running out of space as well, and the driver calls TTM again, and so on and so on.

We actually stumbled over this quite a while after taking over maintainership, just by looking at backtraces. We got backtraces which were this long, and we were thinking, why the heck do we see the driver calling TTM, TTM calling the driver, and so on? And it turned out that TTM is really designed in a way that your calling path, your call chain, just ping-pongs between the driver and TTM. Just to make it clear, this is not recursion, it's just functions calling each other. Yeah, it's good that we have a decent amount of kernel stack, because with a four kilobyte kernel stack this would probably overrun your stack, you'd get stack corruption, and that's probably it for the kernel.

So yeah, we discussed this internally at AMD quite a bit, and it's a big problem, because imagine, for example, you add a new structure and want to protect it with a lock. You take the lock in your driver at the top, then the driver calls TTM and TTM calls the driver again, and somewhere in the middle you need to take the lock again because you don't know who is calling you. That's impossible, that's lock recursion, that won't work. So one possible solution would be to turn the whole thing from its head onto its feet: instead of this calling ping-pong, TTM becomes some sort of component that just tells the driver which buffer to evict, and then the driver is in control of the eviction, no longer TTM, and you don't need this back-and-forth anymore.

Yeah, how to fix it? What comes up often is the idea to kill TTM with fire, just write a replacement and start to use that in drivers. It turned out that is most likely not a good idea. We've prototyped something like that, or at least thought about it, and the reason is that TTM is used by many, many drivers in the kernel, and upstream wants a complete solution. Dave Airlie pretty much said that we want something that works for everybody, not one memory manager for Intel, one memory manager for AMD, one for the Nouveau guys; we need one common memory manager. So that's actually not going to work. In addition to that, TTM is mostly bug-free by now. Over 10 years, I mean, we gathered quite a bunch of dumped-in stuff, but we also hammered out most of the bugs, so it actually works quite reliably.

But a good thing to remove is maybe the AGP support. Hands up, anybody still using AGP hardware? I don't think so. Really? Oh yeah. But you're not using that stuff, so that's not a problem for you. Removing AGP actually wouldn't mean that AGP hardware stops working, because at least the hardware AMD ever sold had not only AGP but also this internal, how do you call that publicly?
I think it's GPU VM, where page tables are handled like on modern hardware and the driver handles the translation tables. So we would just remove the last translation table handling from TTM and move that into the driver, and everything should more or less continue to work as it does; we'd just lose the common AGP translation table management.

The next thing which comes to mind is slow decomposition, and that's what we're already doing: removing stuff which doesn't work or isn't used anymore, and moving stuff which is used by only one driver back into that driver. That already cut the code base down by a good percentage. Then in some places we have unnecessarily complex handling, because TTM was to some degree meant to do something which, by today's standards, it doesn't do anymore, namely managing translation tables. Drivers are managing that themselves now, so we could remove that as well, and it goes along with dropping AGP support properly. The last thing on that sublist is moving code into the DMA API. As I said, I've already discussed that with Christoph Hellwig and I think we have a plan, we just need to implement it. That is something we will probably do in the next year or so, and it will get rid of quite a bunch of code here.

The last bullet on my list here is moving more functionality into new components. Somebody at Red Hat started with that. I didn't mention GEM objects before: GEM objects are your user space communication handles for your driver. You can think of it like this: GEM is the front end of the driver interface, and TTM is the back end of the driver interface, implementing the backing store where the buffers actually are. Previously it was organized so that TTM and GEM were completely separate. Recently, somebody at Red Hat went ahead and made the GEM object the base class of the TTM buffer object. The nice thing is that we can now go ahead and remove duplicated handling. For example, for the same buffer object we had at least three reference counts: one in GEM for the handle to user space, one in TTM for the buffer itself, and another one in TTM for the eviction logic and list handling. With this move and a bit more cleanup we can probably go ahead and merge all of these reference counts into a single one, which will make the handling quite a bit cleaner and quite a bit easier to maintain.

Then, what I've already prototyped together with Daniel Vetter from Intel, is a DMA-buf locking framework. Basically, when you submit commands from your user space application, something like draw a triangle or draw a square, you not only get the commands to draw something, you also get buffer handles: where to draw, what textures to use, what index buffers to use and so on. Those handles are usually submitted to the kernel in a so-called command submission package. That's more or less the same idea for all drivers, for Intel as well as AMD. But what we don't have is a common framework for handling those buffers. Every driver currently implements its own functionality to walk over those lists of buffers and stitch together something which can then be given to TTM to move buffers into place, to evict things and stuff like that. The DMA-buf locking framework is a framework based on DMA-buf which should unify that across all drivers, so that we don't need to implement it in each driver separately anymore. The last thing here is LRU cursor handling.
As I said before, when doing eviction, TTM is relatively inefficient when it comes to picking which buffer to evict. What it currently does is always start at the beginning of the list and then walk over maybe thousands of buffers which are not valuable for eviction, until it finally finds a candidate and evicts it; and if that's not enough, it starts from the beginning again to find the next candidate. That is quite inefficient. What this LRU cursor does is, instead of starting at the head of the LRU list, you keep a cursor on the last item which was not valuable to evict, and then move on from there to the next item and try to find something which you can remove from video memory and move to system memory.

So yeah, that's pretty much it. That fast. Questions?

Yeah. Thank you for the presentation. Sure. So you have this piece of software that has been around forever and is used by a lot of people, and you want to change some part of it, and you said earlier that you removed tons of lines of code overnight. How do you deal with that? How do you make changes to an API that is used by so many people? What is your policy?

Good question. So the question, I should repeat it, is how we make sure that we don't break anybody else, is that correct? Pretty much. Okay. Yeah, first of all, removing the code was pretty easy: you just search whether an exported symbol is used by anybody, and if it turns out it's not used, you can remove the function. That was rather easy. What's more tricky is moving stuff into drivers we don't maintain. For amdgpu and Radeon it's relatively easy, because we have the hardware, we have CI systems, we have QA which can look at the code, test it, and make sure there are no bugs introduced by removing or moving stuff around. So that was relatively easy. For VMware it was relatively easy as well, because we just asked the people there to test it. I sent out a patch to Thomas, Thomas Hellström, about moving stuff into the VMware graphics driver, and he replied: oh yeah, good idea, let's do this, but let's do it a bit differently, and you made a typo here, because of that it doesn't work. And based on that we rather quickly moved the code around. But we didn't manage to do that for, for example, the Nouveau code, because at AMD we don't have the NVIDIA hardware to test it, obviously. And we also didn't move stuff around for QXL and other things; while we have some patches regarding that, it's mostly just removing maybe 10 lines of code and that's not really worth it. So yeah. And general testing for us happens mostly on AMD hardware. Other questions?

Is there a way for multiple applications fighting over VRAM to just thrash, or are they guaranteed to make progress while things are being evicted? Yeah, that's a really good question. So you're basically asking what the policy is and how different applications fighting for VRAM are going to behave. The real idea here is that TTM provides the guarantee of moving forward with a command submission. So if you have sent a list of buffers down to your kernel which you want to have in VRAM, TTM first of all locks down this list; those buffers are not going to move anymore. Then it goes over the list of buffers, and when it finds, oh, something is in swap or something is in system memory but I need it in VRAM, then it moves that buffer in. While doing so, it can happen that other things are evicted.
At the same time, other applications send down requests to move in buffers as well; they start to evict something as well and then they wait on that handle to become free. So at some point you have the guarantee that some submission has moved forward. If you have, for example, 10 submissions, one of them will win and move forward while the others' buffers are evicted again; they retry, you have nine submissions, one of them will win, and so on. So in theory it's guaranteed that command submissions go through. In practice there's a whole bunch of things which can go wrong. That's one of the things which didn't work so well in TTM when we started with it.

[Follow-up question about whether an application can find out where its buffers currently are.] Well, there is functionality to figure out where a buffer currently is. So if the application is expecting a buffer in video memory, but it was moved out to swap or system memory by the kernel because of eviction, then an application could figure that out. The problem is that this usually doesn't justify the extra overhead of asking the kernel. You would need to send a list down to the kernel: hey, give me the locations for those buffers. And usually the kernel says, okay, I'm giving you back that list, but by the next command submission that could have changed already again, so it's actually not that useful. What we do in the amdgpu driver is we have statistics on how many buffers we had to move around for a command submission, and as soon as applications start to fight for memory, you see those buffer moves go up, and then the application can react and back off a bit. I think the Mesa OpenGL driver actually does use that, but I don't think that our internal OpenGL driver does.

Question: you said that there's a way of finding out whether something has been swapped out, et cetera. Is that that kernel interface where you look at the physical page mapping and you get the bits which tell you whether your page is dirty or swapped out or swapped back in, et cetera? Is that the one? No. That's a question about CPU page table mappings. GPUs, at least our older generation GPUs, work a bit differently: they can't handle page faults. So you have page tables, but you need to guarantee that those are filled. You don't have those dirty bits or swapped bits or whatever, so it works a bit differently. The latest AMD generation could actually do page faults, but we currently don't use that for HMM yet because it's kind of hard to implement.

Is that thrashing on eviction? Yeah, it kills your performance completely. Yeah, exactly. To repeat it, that was a statement rather than a question: whenever you get into eviction, it completely kills your performance, and I completely agree with that. So we actually need to improve both TTM as well as GEM to get rid of eviction situations, or to avoid eviction situations, except for things like suspend/resume where you really want to evict everything because it's going to be damaged otherwise.

Yeah. I saw some smarts in TTM to figure out basically clustering on the LRU. So if you have a large buffer that you want to place in VRAM, and you have a lot of small 4K buffers scattered around on the LRU, are you going to evict all of them first and only then figure out, oh, that didn't actually make any space for my buffer? Yeah. So the question is how intelligently TTM evicts things, and that's completely brain-dead.
It just starts from the beginning and evicts until it fits. Yeah, that's not very well implemented, that's correct. Anybody else?

Central memory management code, it seems like there's... What do you mean by central management? The memory management code for Linux, for the CPU. That's the same thing, the same problems, yeah. Okay, the question is basically: could we reuse more of the core Linux memory management? Implementation or ideas? Both, yeah, actually. Yes and no. Yes, because for example with the DMA API we're actually trying to move functionality there. No, because GPUs, at least the older generation GPUs, work differently than the CPU. As I said, they have page tables, but they don't support recoverable page faults, so at all times you need to make sure that your page tables are filled from beginning to end. One very fundamental idea in the Linux memory management is that at all times you can clear the valid bit, or the read or the write bit, in your page tables, get a page fault and handle it. Copy-on-write, for example, is implemented this way: when you call fork, all write bits in your page tables are cleared, and when a page is then written by the CPU it is copied, so the parent uses one copy and the child uses the other one. This won't work with GPUs, because GPUs currently can't handle recoverable page faults. For AMD, Vega is the first generation which can, in theory, do that, but unfortunately they implemented only the Windows requirements, the Microsoft requirements. The Linux requirements are a bit different, and we actually have a hard time using that hardware in this way.

Excuse me? That would probably be something. I think the next big step is going to be stuff like memory limits in cgroups. As I said on this slide, the big reason why we have this mess here is that at the very bottom of it we have a limitation of how much memory applications can use, and this is probably better put into cgroups and into the core system memory management. That's the final goal, yes.

Five minutes left. I don't think so. The first problem is I'm not an expert on the Windows side, and the second problem is I'm not sure whether the Windows side isn't under some NDA. I don't know. Somebody else? Okay, that's it.