Hi, I'm Daniel Stone. I'm the Graphics Lead at Collabora, talking to you this evening, and a couple of weeks ago, from wonderfully sunny London. I'm here to talk about synchronization and all of the weird and wonderful ways that we manage to actually get some pixels into your retinas. In more useful terms, that means we'll be walking through the whole pipeline we use to render things, to transfer all of the graphics we've rendered between different processors and different contexts and devices, how some of that's implemented in the kernel (which is probably a lot more odd than you'd expect), and then looking into some of the work we're doing to support future use cases in terms of users, APIs and hardware, and also what this means for how we actually present in user space, stepping one level back.

Our first starting point is DRM: not Digital Rights Management, but the Direct Rendering Manager. This is our interface between user space and the GPU hardware. We're starting really, really simple, completely dumb. We're assuming we have a single device with one command queue, which is completely FIFO and completely coherent. This is a complete lie as far as hardware goes today, and it's not the first lie I'm going to tell you, but it's easier to build up the model and understand things that way; then we can layer on some of the more complex uses and some of the developments we've had over the past 15 to 20 years.

The thing to know about DRM is that it's how we interface with GPU hardware from user space and also within the kernel. But unlike something like, say, a storage device, where you have NVMe on the bottom end and POSIX APIs on the top end, we have none of that. All of the device operations, everything from allocating memory, executing commands and checking on synchronization, even enumerating device capabilities, is done through device-specific ioctls. So all of the user space you see is device-specific. There's no very low-level bare-metal "just render this triangle", because we don't have that kind of lingua franca for GPUs. In graphics we use Mesa as the canonical example for OpenGL and Vulkan implementations, and we're very insistent on having open user space, preferably in something like Mesa, because without that open user space we can't actually understand how the driver works: it's all tied up in these device-specific functions.

That being said, let's pretend that we have this abstract, very straightforward GPU and have a quick look at how we actually go about using it. Anyone who's ever had to write a bootloader will know that most of your first work is just stepping through, getting increasingly more capable with the ability to address and allocate memory; you have this elaborate bootstrapping routine, and it's exactly the same for graphics. Once you've enumerated the capabilities of the GPU and you know what you want to do with it, the first thing you're ever going to do is allocate yourself some memory. And this isn't malloc or vmalloc or kmalloc or anything; we have specifically carved-out buffers for GPU memory. User space requests that we allocate a certain run of bytes, we allocate it, and we hand back what we call a BO, a buffer object. All a BO is, as far as the kernel's concerned, is just a pointer to a set of pages and a size. That's another lie, but we'll come back to it.
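To make that concrete, here's a minimal sketch of allocating a BO through the one generic allocation path DRM does have, the "dumb buffer" ioctl meant for simple scanout buffers. Real render drivers each have their own device-specific allocation ioctls, so treat this purely as an illustration of the shape of the interface.

```c
/* Sketch: allocate a "dumb" BO via the generic KMS ioctl.
 * Real render drivers (i915, amdgpu, v3d, ...) have their own
 * device-specific allocation ioctls; this is just the one portable example. */
#include <fcntl.h>
#include <stdio.h>
#include <xf86drm.h>

int main(void)
{
    int fd = open("/dev/dri/card0", O_RDWR | O_CLOEXEC);
    if (fd < 0)
        return 1;

    struct drm_mode_create_dumb creq = {
        .width = 640, .height = 480, .bpp = 32,
    };
    if (drmIoctl(fd, DRM_IOCTL_MODE_CREATE_DUMB, &creq) < 0)
        return 1;

    /* The kernel hands back an integer handle, plus the pitch and size
     * it chose for the allocation; the handle is how we refer to this
     * BO in every later ioctl on this device file. */
    printf("handle %u, pitch %u, size %llu\n",
           creq.handle, creq.pitch, (unsigned long long)creq.size);
    return 0;
}
```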
The obvious BOs are your actual pixel data: the textures coming in as your input graphics data, the user-supplied stuff, and then your framebuffers and render targets coming out of whatever the GPU has actually rendered for you. But there are quite a few more types of BO. Things like the GPU state, which is both read and written, and the compiled shader programs that we ask the GPU to execute, are all encapsulated inside BOs. The memory we allocate to back BOs can come, depending on the driver's choice, either from system RAM or from some kind of dedicated video RAM; on some systems it may come from CMA, a physically contiguous carve-out. It doesn't really matter. The allocations are done on a device-global basis, so the device tracks all of the allocations across the entire device and surfaces them to user contexts through an integer handle. Every time user space wants to use a BO, it addresses that BO by the handle it's been given. Underneath, this is usually just, as you'd expect, an array of struct pages or an array of DMA addresses, or however it's been allocated: fundamental memory addresses. It's a dead simple picture: the integer handle refers down to the device's core tracking structure for the BO, and that's what conceptually wraps the pages. The handle is just an identifier.

Right, so now we've allocated all our data: we've got our inputs, we've allocated our outputs, and we've allocated space for the state and the shaders. The second device-specific ioctl you're going to use is command submission, so you can tell the GPU to actually go and do something, please. The command submission ioctls are all fairly different depending on how the GPU is actually built and structured, which varies not only between vendors but wildly between generations as well. So, as is tradition, we end up having all these sub-drivers with APIs that look more or less similar depending on how wildly things have changed. But almost all of them will take a list of buffers, annotated with whether they are input or output buffers, as well as all of the parameters needed to execute something. Once we've submitted the command, it just gets appended to a scheduler queue, and the next time the GPU becomes free, that's the thing it will execute.

And then there's our glorious third step of actually seeing what it is we've rendered. Buffer access is also frequently a device-specific ioctl. Similar to the DMA API, you give it the buffer you want to access via that integer handle, optionally the area of the buffer you want to access, and the access mode as well. It will then materialize a mapping back into the user space process, so it can be accessed as CPU-visible memory, and we've got the triangle that we asked to render sitting there in CPU memory.
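Continuing the dumb-buffer example from above (again, real render drivers expose their own device-specific mapping ioctls), the generic mapping path looks roughly like this:

```c
/* Sketch: map the dumb BO allocated above into the CPU's address space.
 * The MAP_DUMB ioctl gives us a fake offset to pass to mmap() on the DRM fd. */
#include <stdint.h>
#include <sys/mman.h>
#include <xf86drm.h>

void *map_bo(int fd, uint32_t handle, size_t size)
{
    struct drm_mode_map_dumb mreq = { .handle = handle };

    if (drmIoctl(fd, DRM_IOCTL_MODE_MAP_DUMB, &mreq) < 0)
        return NULL;

    /* After this it's just ordinary CPU-visible memory: read back what
     * the GPU rendered, or write new contents in. */
    void *map = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,
                     fd, mreq.offset);
    return map == MAP_FAILED ? NULL : map;
}
```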
That's not particularly compelling, and if you want to find out more about how GPUs actually work there are lots of good guides out there, but the bit we want to get to is synchronization. GPUs are these extremely asynchronous, extremely parallel, very deeply pipelined engines, and the thing omitted from that lightning-quick overview of how you'd go about dealing with them is synchronization, which is kind of the point of the talk.

If we go back to the command submission ioctl I described, it takes a list of buffer objects that the command is going to access, with a mode of read or write. That's there to allow the kernel to reason about exactly what's going to go on, so we can do what we call implicit synchronization. Implicit synchronization is basically a lot of hard work to create the illusion that everything going on is completely synchronous, and that the GPU is one nice FIFO piece of hardware which happens to execute in lock step with the CPU. So when we call that driver-specific ioctl to map our GEM object into the CPU address space, it may quietly stall, possibly for quite a long time, doing a full pipeline stall and waiting for the GPU to complete the last command which touched that buffer before we expose it back up to user space. We don't just use this for serializing CPU against GPU, either. If it were just that, it'd be relatively straightforward, but we also have to consider GPU versus GPU. For better or worse, implicit sync is what ended up getting partly built out and partly accumulated over the last couple of decades, and it's the foundation of how we share anything between processes and contexts: this illusion that everything's just perfectly synchronized for us behind the scenes. The way that's classically been done is that the driver records, for every command, a sequence number or some kind of identifier, and can then query the GPU to see whether that command has fully retired. If it hasn't, it inserts stalls, both CPU-side and GPU-side, to make sure there are no write-against-read or read-against-write hazards, and everything just behaves as if the world were in fact a single FIFO context.

I did say between processes, but I also said that the handles we have are local to the DRM context we created. Rather than having to share contexts, which would end up with your whole system being a single context with multiple clients, where we landed was dma-buf. It's relatively straightforward: it just allows a reference to a BO to be passed between anything, so different contexts, processes, whatever. Conceptually it just hoists that abstraction of referencing a buffer one level further. It's nothing in and of itself; it just points down to an actual material GEM buffer object. We also need to share between devices. This isn't just for things like wanting to decode some media and then access it without copies from the GPU, or streaming your desktop out to show everyone on Twitch how amazing you are. If you look at most ARM systems, the GPU and the display controller are completely separate IP blocks, with completely separate drivers from completely separate vendors. So that's something we needed to get really solidly locked down quite early. When I said GPU memory was special, it's not really that special: it is essentially just pages and addresses. So every subsystem that wants to participate in dma-buf sharing has an import and an export API, which looks something like this. You pass a reference through user space, as a file descriptor to the original buffer, which is materialized by an export ioctl, from DRM in this case.
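As a rough illustration of the DRM side of that export, libdrm wraps the PRIME ioctl that turns a GEM handle into a dma-buf file descriptor; treat the flags and error handling here as a sketch rather than a complete recipe.

```c
/* Sketch: export a GEM BO as a dma-buf fd via DRM PRIME.
 * The resulting fd can be passed over a Unix socket to another process,
 * or handed to another subsystem (V4L2, EGL, Vulkan, ...) for import. */
#include <stdint.h>
#include <xf86drm.h>

int export_bo(int drm_fd, uint32_t handle)
{
    int prime_fd = -1;

    if (drmPrimeHandleToFD(drm_fd, handle, DRM_CLOEXEC | DRM_RDWR,
                           &prime_fd) < 0)
        return -1;

    return prime_fd; /* a dma-buf: just a file descriptor to everyone else */
}
```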
Then for V4L2 we have a similar, mirrored import ioctl, which takes a dma-buf file descriptor and magics it into V4L2's local concept of a buffer, a vb2 buffer or whichever one you're using. Internally, the dma-buf in the kernel has an ops table, so V4L2 can go and query the exporter and ask it for the list of pages, a scatter-gather table of what's actually backing that memory. Once you've done that, there's no real need to communicate between the two; you can largely just use them as if they were native buffer objects.

Despite being really useful, there's a lot of stuff dma-buf can't do. For one, it's not an allocator, and it never will be. It's not a constraints solver either, which is essentially what precludes it from ever being an allocator. If you look at the X.Org Developers Conference talks from years gone by, you'll see a lot of talks about generic allocation and trying to find a pathway to get there, but we're still not there, and even so, allocation remains very much a user-space problem and a multi-subsystem problem to deal with. But it's something. What it has given us is the ability to share buffers between DRM, V4L2, Wayland, X11, EGL, Vulkan, GStreamer, PipeWire, VA-API: anything where you have essentially external engines doing work is going to be addressable via dma-buf. So it's our universal currency for exchanging buffer contents without inserting copies everywhere and destroying our performance. That's still not about synchronization, but it's a nice parallel.

Now that we've exposed to user space this portable concept of a buffer which can contain some stuff and be addressed by hardware, the next logical step is to expose sync operations and sync points. We also did this as a file descriptor. Essentially, your command submission ioctl, or whatever equivalent you have in your subsystem which causes a hardware engine to do work, will return a dma-fence file descriptor, and that represents the completion of the hardware work. It's also portable cross-device and cross-subsystem. You generate one when you queue some work, and you consume them whenever you need to synchronize against the completion of that work. It signals exactly once, when the work is completed, and then it will never be anything but signalled again until its last reference goes away. And it is guaranteed to signal in reasonable time: you can't generate a dma-fence for something that might or might not happen at some point in the future. The answer is usually about five seconds in extremis, but you only materialize the fence when you have really committed the work to the hardware and you're sure that, short of your GPU hanging or something like that, the work is going to happen, and it's going to happen in an amount of time which isn't going to make the user really, really unhappy.

In the kernel, fences work via two paths. There's an optimistic, unshared path where you can create fences and they're essentially just used for accounting and tracking, and there's no performance impact to this kind of internal mode. In that internal mode they're essentially just as they were before: the devices are still synchronizing against themselves, perhaps with some internal fast paths, until we need to break out.
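From user space's point of view, that fence fd behaves like any other pollable file descriptor. A minimal sketch of a CPU-side wait, assuming out_fence_fd came back from some submission ioctl, might look like this:

```c
/* Sketch: CPU-side wait on a dma-fence / sync_file fd.
 * The fd becomes readable exactly once, when the GPU work it
 * represents has fully retired. */
#include <poll.h>
#include <unistd.h>

int wait_on_fence(int out_fence_fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = out_fence_fd, .events = POLLIN };
    int ret = poll(&pfd, 1, timeout_ms);

    if (ret > 0)
        close(out_fence_fd); /* signalled: we no longer need the reference */

    return ret; /* >0 signalled, 0 timed out, <0 error */
}
```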
So say someone's requested a CPU-side wait, because they want to know when the GPU has finished for whatever reason, or you have a cross-device wait, because you have some GPU work that you want to consume from a media codec engine, so you want to speculatively queue it towards the codec engine but tell it to block until the work is ready. That hits the enable-signaling kernel callback in the dma-fence, which tells the driver that someone wants to know as soon as that job has fully retired, so you drop into a slightly less optimal path, switch on the relevant interrupts, and make sure you get that completion notification when it runs. The only other thing user space can do with it, as I was getting at before, apart from importing it, is poll on the file descriptor and be notified when it's ready, so the CPU can sleep nicely until all of the work has retired and you're ready to refill the buffer or do whatever else you want to do with it.

Those two are quite related, and not just in the slightly confusing choice of the DMA name (they don't always imply DMA). Every dma-buf has a dma_resv, a reservation structure. The reservation just ties a dma-buf to a set of dma-fences, and that allows us to extend the implicit synchronization we already have within the same context across different contexts and processes. When you, as a kernel subsystem, receive a dma-buf, it's your responsibility, before you execute or schedule any work against that dma-buf, to check the reservation to see what work other people have already queued against it; and similarly, when you're queuing work, you need to add fences to the reservation as well, so others know to synchronize against you. That lets us extend this whole, now completely fictitious, FIFO concept not only across GPUs, which are way more complex than we're pretending, but across entire devices as well, so they can synchronize against each other. So APIs and subsystems which aren't fence-aware can still exchange buffers with other processes; and for user space, if anything hasn't been updated yet for explicit fencing, or you're bringing something new up, implicit sync maintains that illusion and doesn't completely break all your content.

So why do we still have this kicking around? Why do we keep extending this illusion to multiple devices? Part of the answer is X11: X's model is really not amenable to synchronization, and neither is its code base, to be honest, so it's not an easy one to update. But even Wayland, our glorious shiny ultimate saviour window system, hasn't universally been updated for explicit fence awareness. The reason that hasn't happened is because it turns out explicit fencing as we had it isn't actually anywhere near good enough for some of the things we need to do with synchronization, and this is where things get a bit strange, a little bit unknown, and slightly into hand-waving territory, to be honest.

Windows has a counting semaphore mechanism, what might be called a timeline semaphore over there. On our side, our fences have been binary: they're signalled or they're not. Windows has valued semaphores, where your signalling operation can set the semaphore to an arbitrary value, or increment or decrement it, and then conversely your wait operation can block on a specific value. So you can have one wait which might not be released until ten signal operations have gone past, until they set the right number.
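Purely as an illustration of those semantics, and not any real kernel or Windows API, the model is a monotonically advancing 64-bit payload with waits against threshold values:

```c
/* Illustrative sketch of timeline/counting-semaphore semantics only;
 * not a real API. A signal advances the value, and a wait is released
 * once the value reaches the point it asked for. */
#include <stdbool.h>
#include <stdint.h>

struct timeline {
    uint64_t value; /* last signalled point */
};

static void timeline_signal(struct timeline *t, uint64_t point)
{
    if (point > t->value)
        t->value = point; /* ...and wake anyone waiting for <= point */
}

static bool timeline_wait_satisfied(const struct timeline *t, uint64_t point)
{
    return t->value >= point; /* a wait for point 10 may need many signals */
}
```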
And Windows is a very coherent world, right? It's a single unified stack, so they put that straight into DirectX 12 as a first-class concept: DX12 has not just the same binary fence mechanisms that EGL and Vulkan have, it has full integer counting semaphores. And because it's in DirectX, there are a lot of games that rely on it, and to be fair, for fairly good reason. So now we've got that in Vulkan too, as timeline semaphores. If you want to know more about those, or to go into the details, the two talks I can recommend are Teniel Maders', which covered a lot of the details of the core Windows primitives, and Jason Ekstrand's from a couple of years ago, when we first floated timeline semaphores, which covered them specifically from a graphics point of view. The reason he was able to talk about them then is that they were so invasive, and so different to all of the ways we'd handled synchronization before, that the proposal was actually put out and run past the Linux community before it formally became part of Vulkan, just so we could understand the implications a lot better.

So how bad can that possibly be, right? All we need to do is turn binary into integer. You're not going to need to account for the full integer space of every semaphore, you don't need to blow it out into some giant registry, and we've got good data structures for tracking sets of things, with a new one seemingly every week. So how hard can it be? And didn't we already do this by layering another primitive, the DRM syncobj, on top of dma-fence? Well, yes and no. Syncobjs were mainly created for performance reasons, as well as, as always, file descriptor limits: everything is a file descriptor, but you only get 1024 of them. So a syncobj again gives us two-level handling of synchronization points, where they're first created with a context-local integer ID, and then you can export them to a file descriptor. And it turns out this ended up being a bit of a stalking horse, because a syncobj can contain multiple dma-fences, which became the basis for timeline semaphores: now we can say "give me the dma-fence for this semaphore reaching this particular value", rather than just having an event or not having an event. So that's got us a fair bit of the way there.

The thing is, as well as allowing arbitrary integer values, timeline semaphores also allow you to wait before the signal exists. Games are very, very heavily threaded these days; they throw a lot of stuff at a lot of different cores, and traditionally, if you had world loading in one thread, asset loading in another and local state in another, when you needed to reconcile that you'd basically have a monster tree of CPU-side semaphore waits, and the only thing you'd do after they fired would be to wake up and then flush work down towards the GPU. So, for better or worse, the timeline semaphores we have, and need to support, also allow wait-before-signal. Not only does that completely torpedo everything we said about fences completing in guaranteed time (the property that meant you didn't need to think about synchronizing operations between different processes, because the kernel would insert stalls for you), since a wait might now never signal; that's a giant security issue if nothing else, and definitely a frustration issue for sure. It also makes a bit of a mockery of how we allow user space to hand over a ton of operations and specify their dependencies transitively, just by identifying which buffers they're accessing, and let the kernel figure it out for them.
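For a flavour of what that looks like from the application side, here's a minimal Vulkan 1.2 timeline-semaphore sketch, assuming a VkDevice called device already exists. Note that the wait below can legally be issued before anything anywhere has signalled point 10, which is exactly the wait-before-signal behaviour discussed next.

```c
/* Sketch: create a Vulkan timeline semaphore, then block until it
 * reaches point 10. The wait may be submitted before any signal of
 * that point exists anywhere: wait-before-signal. */
#include <stdint.h>
#include <vulkan/vulkan.h>

VkResult wait_for_point_10(VkDevice device)
{
    VkSemaphoreTypeCreateInfo type_info = {
        .sType = VK_STRUCTURE_TYPE_SEMAPHORE_TYPE_CREATE_INFO,
        .semaphoreType = VK_SEMAPHORE_TYPE_TIMELINE,
        .initialValue = 0,
    };
    VkSemaphoreCreateInfo create_info = {
        .sType = VK_STRUCTURE_TYPE_SEMAPHORE_CREATE_INFO,
        .pNext = &type_info,
    };
    VkSemaphore sem;
    VkResult res = vkCreateSemaphore(device, &create_info, NULL, &sem);
    if (res != VK_SUCCESS)
        return res;

    /* Some other thread or queue submission is expected to signal point
     * 10 eventually, e.g. via vkSignalSemaphore() or a
     * VkTimelineSemaphoreSubmitInfo chained onto a vkQueueSubmit(). */
    uint64_t wait_value = 10;
    VkSemaphoreWaitInfo wait_info = {
        .sType = VK_STRUCTURE_TYPE_SEMAPHORE_WAIT_INFO,
        .semaphoreCount = 1,
        .pSemaphores = &sem,
        .pValues = &wait_value,
    };
    return vkWaitSemaphores(device, &wait_info, UINT64_MAX);
}
```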
That kind of transitive reasoning gets really, really hard once you get into things like wait-before-signal. It becomes far murkier trying to understand which operation is actually going to release which waits when you're doing things like increments, or a sort of reset signalling, where you signal the semaphore at the highest possible value just to try and clear absolutely everything out; the idea that the kernel can schedule jobs for you sort of collapses. Added to this, it gets even worse, because all of DRM's memory management is really tied in with the rest of the kernel's memory management. It turns out GPU memory isn't special enough not to get swapped out, not to be subject to reclaim, or at the very least to have some interactions with reclaim; bonus points if you have something like zswap as well. And that's even before you get to us essentially doing our own swap and paging, where memory can move and migrate between VRAM and system RAM depending on who needs it, because we have no way to force memory to be resident in particular places; we essentially migrate it on demand. And because we've built up this "the kernel will schedule everything for you" stack, where user space doesn't have to overthink it, it doesn't, and all of that gets really interesting, because you can't reason about how things are going to execute any more, which is definitely a problem when you're trying to synchronize things. Everything is going in this direction where it all gets a bit weird and a bit less knowable, and the foot-guns handed to users keep getting bigger and more treacherous.

The direction hardware is going, matched by the direction APIs are going, gives you tools like user-space-controlled memory residency (see the sketch below): you can create very large, sparse virtual textures and populate or page them in and out on demand, which makes your GPU-side code much simpler, but makes the driver author's side a bit less fun, to be honest. Interesting, but maybe less fun. And this is only getting further extended. We moved a long time ago from GPUs having a single command ring; that went down a notch to a single command ring where you could just give it pointers to user-space-submitted batch buffers and tell it, OK, go and execute this batch over here. That wasn't quite full contexts, but now we have real contexts. They're still very much a privileged thing, still very much mediated by the kernel, and all synchronization still globally resolves, but we do have genuinely independent and isolated contexts. It took a very long time, but getting full MMU isolation between contexts did actually happen, so you can use them relatively safely. And now the direction things are going, in a similar vein to io_uring, follows that trend of the kernel not having to be involved in every single operation. Rather than submitting individual calls where you cross the syscall barrier every time and make the kernel do a load of accounting, it's about reducing that syscall overhead, batching off as much as you can, and letting user space free-run, with the kernel and the hardware catching up when necessary and handling their own synchronization.
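As one concrete example of that user-space-controlled residency, Vulkan's sparse binding lets an application bind and unbind physical memory behind regions of a huge buffer or image at its own pace. A rough sketch, assuming the queue, buffer and memory were all created with the appropriate sparse-binding flags, might look like this:

```c
/* Sketch: bind 64 MiB of physical memory behind one region of a large
 * sparse buffer. The rest of the buffer simply has nothing resident
 * behind it until the application decides otherwise. */
#include <vulkan/vulkan.h>

VkResult bind_region(VkQueue queue, VkBuffer sparse_buffer,
                     VkDeviceMemory memory, VkDeviceSize offset)
{
    VkSparseMemoryBind bind = {
        .resourceOffset = offset,          /* where in the buffer */
        .size = 64ull * 1024 * 1024,       /* how much to make resident */
        .memory = memory,                  /* what backs it */
        .memoryOffset = 0,
    };
    VkSparseBufferMemoryBindInfo buffer_bind = {
        .buffer = sparse_buffer,
        .bindCount = 1,
        .pBinds = &bind,
    };
    VkBindSparseInfo bind_info = {
        .sType = VK_STRUCTURE_TYPE_BIND_SPARSE_INFO,
        .bufferBindCount = 1,
        .pBufferBinds = &buffer_bind,
    };
    /* Queue the bind; synchronizing against users of the buffer is, of
     * course, the application's problem, via semaphores or a fence. */
    return vkQueueBindSparse(queue, 1, &bind_info, VK_NULL_HANDLE);
}
```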
So where I'm coming to is that the kernel is doing more and more accounting, adding more and more overhead; the tracking is getting heavier, and the APIs are pushing us to a position where we can't even necessarily do that tracking with any kind of reliability or confidence. The hardware vendors have really embraced that: OK, well, let's just go full throttle over here. The model is very much heading towards the kernel mediating the creation of user contexts, but once they're created, it's all yours. Your free-run scenario should not necessarily involve crossing the syscall boundary: you get arbitrary command submission into the GPU, and your synchronization results come back to you, on a polling basis at least, where you need them. The only time the kernel is involved is for situations where it needs to manipulate process page tables, for actually ensuring some kind of resource security; but in that free-run state the kernel is just not necessarily going to be there any more.

This is largely driven by games rather than window systems. Games genuinely need gigantic throughput, thousands upon thousands of draw calls that they need to put together, and it's getting to a point where the performance overhead isn't manageable. So even if the hardware didn't push us there, we'd have to be working on reducing that overhead no matter what. In almost opposition to games, we also really need to be better at GPGPU and compute. This is a market which is largely being eaten by CUDA, a completely proprietary solution, but a proprietary solution that works really well. It works really well in these workloads where, rather than the 16 or 11 or, if you're fancy, 7 millisecond budget you have to get a frame out in a game, you're closer to weeks for compute jobs: giant long-running programs, absolutely enormous data sets that they might need to page in and out, and, any year now, GPU-triggered demand page faults, plus much more on the preemptibility side as well. They're very, very different uses and demands, but we've got the same hardware at the bottom and the same APIs at the top that both of them hit, and we're nicely sandwiched in the middle. We're not sure what that Goldilocks solution is, and this isn't a presentation of "this is the API of the future, I have all the answers". I don't have any talks to reference apart from the X.Org Developers Conference, which is getting a bit Hitchhiker's Guide to the Galaxy: it will have happened two weeks ago by the time you see this, but it hasn't happened yet as I record it. There might be some discussions at XDC. Almost the entire year so far on dri-devel has been dominated by this, hundreds and hundreds of mails, because we've had all of the vendors come to this same tipping point. And all we've really found out and concluded in all of that is that the hardware implementation we know the most about right now got it wrong, and they don't quite have something workable yet, but all of them are on the cusp of having this. So it's something we need to think about and put a lot of work into.
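None of this is an upstream API today, but purely to illustrate the shape of that free-run model, imagine a ring of commands in memory shared between user space and the GPU, with a doorbell to kick the hardware and a completion value it writes back. Everything below, the structure, the names, the doorbell, is hypothetical:

```c
/* Hypothetical sketch only: no upstream kernel exposes this interface.
 * The point is the shape: user space appends commands and rings a
 * doorbell; the GPU consumes them and writes back a completion value;
 * no syscall is involved on the fast path. */
#include <stdatomic.h>
#include <stdint.h>

struct user_queue {                 /* lives in memory mapped to both sides */
    _Atomic uint32_t head;          /* written by user space */
    _Atomic uint32_t tail;          /* written by the GPU */
    _Atomic uint64_t completed;     /* last retired submission, GPU-written */
    uint64_t cmds[256];             /* packets the GPU understands */
    volatile uint32_t *doorbell;    /* mapped GPU register: "go look at head" */
};

static uint64_t submit(struct user_queue *q, uint64_t cmd, uint64_t seqno)
{
    uint32_t slot = atomic_load(&q->head) % 256;

    q->cmds[slot] = cmd;
    atomic_fetch_add(&q->head, 1);
    *q->doorbell = 1;               /* kick the hardware, no syscall */
    return seqno;                   /* later: spin or poll on q->completed */
}
```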
We've started doing some foundational work around a hybrid model. No matter what the final model ends up looking like, in this kind of fully user-space-controlled, free-running, nobody-knows-anything-about-synchronization-any-more use case, you get that most of the time. It's sort of like dma-fence, where I said before that, because it was a wrapper around a driver object and essentially a set of interop helpers, it let you keep your own best-case internal synchronization when that was all that was needed; but as soon as you needed to break out to different devices or to the CPU side, we flipped the switch, and at that point you take the performance penalty. The thinking is the same for this new synchronization hardware. We let it run ahead at full tilt when it can, but as soon as it needs to deal with something which isn't aware of the fancy new sync, an older media API, or an old window system like X11, or any of these things, that's when the kernel would step in at the margins and take a much less performant path, on the understanding that this is something you'd only need to do once per frame, so the performance properties of synchronization at that point don't matter too much, and we can eat the overhead of mapping the new world into a fencing and implicit-sync world. If you want more performance, then you get to implement the new thing. This is not a design outline, because we're not there yet, the hardware isn't there yet, and user space isn't entirely there either; but we haven't got that much time to figure it out. We will know surprisingly soon, I think, how this is going to shake out and what all of the impacts will be; we at least know the outline in broad strokes, I would say.

Then, switching topic to something which is substantially more straightforward, but also pretty integral: the display rather than the render side of things. These are both conflated in DRM; they have the same memory access and very similar synchronization models, and displaying stuff is pretty useful once you've rendered it. We don't need to worry about this enormously complex, but necessarily enormously performant, synchronization model on the display side, because it's only a couple of transitions per frame, so we've got less to worry about there. But we have, not the opposite, but maybe the reverse concern, because display is in a way backwards to rendering. Your path for getting pixels onto the display obviously runs forwards, from render to display: the data you've rendered moves to the display to be consumed. But where all this ties in with synchronization is how we deal with timing, which flows exactly backwards from there, through the display to the window system to the client. That's how we derive our timing, because we have this fixed display clock, if we pretend that VRR doesn't exist; that's a thing we're working on, but we're not quite there yet, and to be honest most games don't care about VRR, they just want to run flat out the entire time, and they don't care what the rate is, as long as it's as fast as possible. Anyway, in a fixed-refresh-rate world, you know that your display clock will tick every 16, 11, 7, whatever milliseconds, so at each point in time you know exactly when your deadline is for the next frame. And absent tearing, which you can't do on all hardware and looks pretty horrible anyway, you know that if you don't make that fixed deadline in time, you're going to be one frame late. That's pretty horrible for games, obviously, but even things you wouldn't expect, like animation and media, are quite sensitive to jitter; it's really off-putting.
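That display clock is what KMS reports back to user space: each page flip can deliver an event carrying the time it actually hit the screen, which is the anchor all of this backwards-flowing timing hangs off. A minimal libdrm sketch, assuming crtc_id and fb_id already exist, looks roughly like this:

```c
/* Sketch: queue a page flip and receive the vblank-timestamped event
 * telling us exactly when the new frame went live. That timestamp is
 * the fixed clock everything else paces itself from. */
#include <poll.h>
#include <stdint.h>
#include <xf86drm.h>
#include <xf86drmMode.h>

static void on_flip(int fd, unsigned int sequence,
                    unsigned int tv_sec, unsigned int tv_usec, void *data)
{
    /* tv_sec/tv_usec: when the flip completed, i.e. our timing anchor. */
}

int flip_once(int fd, uint32_t crtc_id, uint32_t fb_id)
{
    drmEventContext evctx = {
        .version = DRM_EVENT_CONTEXT_VERSION,
        .page_flip_handler = on_flip,
    };
    struct pollfd pfd = { .fd = fd, .events = POLLIN };

    if (drmModePageFlip(fd, crtc_id, fb_id,
                        DRM_MODE_PAGE_FLIP_EVENT, NULL) < 0)
        return -1;

    poll(&pfd, 1, -1);                 /* wait for the flip to happen... */
    return drmHandleEvent(fd, &evctx); /* ...and dispatch on_flip() */
}
```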
So one of the things we put a lot of effort into squashing was the cause of that jitter, and we did it by having this model where the timing propagates backwards from our known deadline at the display, right the way back to the client, with various intermediate deadlines along the way. This is the vblank model that has existed since the '80s: the vblank hits, you take the interrupt, you draw your stuff, it completes within 16 milliseconds, and it gets displayed the next time around. It is brilliant, if you can produce content slightly faster than the refresh rate, and if you don't mind taking 16 milliseconds of latency. If you search for terminal benchmarks, it turns out that people have opinions about this, and their opinion is that 16 milliseconds is not OK. We also know that Linux and the open display stack are being used in some extremely time-sensitive environments, like neurological imaging, where we have to be really consistent, really accurate and really prompt, and we just can't take that extra latency. So those assumptions that underpin that really straightforward ping-pong vblank model have already gone.

What we did was slightly abstract the concept of vblank. We still obviously have the display's fixed clock, but one of the things we did with Wayland was add a signal to clients to prompt them and say: you should paint now. The easiest implementation of this, the one we started off with in every compositor, was just to send the event on vblank, and the client paints then. But it's deliberately disconnected from vblank: it's designed not to be in phase with vblank, and it may not necessarily even be at the same rate as vblank. One of the things this separation allows us to do is cut down the latency, because we can take the vblank and then, in the middle of the frame, say: OK, client, paint now, and hopefully the half-frame, 8-millisecond margin is enough for you to make it. Because we know exactly when clients are submitting work and how long it's taking, through all of their fences and all of the synchronization information they give us, we can adapt to slower clients: we can give them an earlier start within the frame boundary, so they're still able to hit the next frame that's coming, but with more lead time to do it. If they are just terminally slow, then rather than throttling the whole desktop to the slowest client, as we had been doing, we can pace them out: we give them a lead time of one frame, knowing that they'll be able to make the one after, and we can pace them at a consistent 30 frames per second, which, unless you're a game, always looks much better than juddering around, making odd frames here and there but not at a nice cadence. As an example of that, Michel has been doing a lot of work with Jonas and the other Mutter developers to implement that kind of adaptive scheduling within Mutter and GNOME Shell. That comes in really, really useful because, as I was building up to before, we now have completely arbitrary clients who can issue wait-before-signals, and who can write themselves into completely unschedulable cycles where their jobs will never complete. So we've now got this two-stage model where we've separated the display tick from when we tick clients.
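A toy sketch of that scheduling decision, hypothetical and hugely simplified compared to what a real compositor like Mutter or Weston actually tracks, might look like this:

```c
/* Hypothetical sketch: given the last vblank timestamp and the refresh
 * period, decide when to prompt a client to paint. Fast clients get a
 * late, low-latency wake-up; slow ones get more lead time; terminally
 * slow ones are paced at half rate rather than dragging everyone down. */
#include <stdint.h>

struct client_stats {
    uint64_t avg_render_ns;   /* measured via the fences it hands us */
};

uint64_t next_paint_prompt(uint64_t last_vblank_ns, uint64_t refresh_ns,
                           const struct client_stats *c)
{
    uint64_t deadline = last_vblank_ns + refresh_ns;       /* next scanout */
    uint64_t margin = c->avg_render_ns + refresh_ns / 10;  /* plus headroom */

    if (margin >= refresh_ns) {
        /* Can't make every frame: target every second vblank instead,
         * giving a steady half-rate cadence rather than judder. */
        return (deadline + refresh_ns) - margin;
    }

    /* Otherwise wake the client just early enough to hit the deadline. */
    return deadline - margin;
}
```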
To solve this fundamentally, there's no way around it: since the sync contract has now been broken, we need to add a third point in this frame-by-frame composition cycle, a late-binding decision point for compositors. They can accept a new client rendering before it has completed; but rather than taking the client on complete blind faith that it will finish in time, the compositor wakes up just before its deadline and checks: has the work fully retired yet? If so, great, we use the new content this frame. If not, we stick with the old client content this frame, and hopefully next frame the client will be able to make it.

So we've learned that graphics is surprisingly difficult, and not in the sense of animating hair beautifully, or water effects that are actually realistic and don't take a month to render out. Down in the low-level plumbing we're approaching this horrible uncanny valley, I guess, and a sandwich: on one side we have games that want to run just as well as they do on consoles, want to coexist with desktops where an Electron notification might turn up and suddenly demand you page out a gig of RAM, and want to do all this on cheap laptops. On the other side we have these relatively low-throughput but high-execution-time, high-memory-demand workloads running on much larger, fixed hardware, like we've seen with network offload and RDMA. And now both are demanding the same APIs on the same hardware. The good news is that it has everyone's attention. It's not that these issues aren't known, or that they haven't got priority or focus; they've really grabbed all of us over the last while. Everyone's been waiting for some of the details to shake out, particularly on the hardware side, and there's a lot of foundational work, a lot of bits and pieces which might seem unrelated, but which build us up to this future where synchronization goes from completely guaranteed to completely unknown. The other good news is that we've probably got a while anyway, because no one can buy GPUs right now.

So, thanks a lot for listening. I'm looking forward to any questions, discussions or chat, or to someone pointing out the continuity error because I'm wearing different clothes when I turn up in the chat. And, perhaps unsurprisingly given everything I've discussed and the industry as a whole, we are hiring, so if this is the kind of thing that interests you at all, please get involved. Thanks!