Okay, welcome to the talk, Free Software for The Machine, by Keith Packard, well known for his work on the X Window System. This is about something else that he's been working on for the past two years at HPE. There will be a 10-minute Q&A at the end, and please wait for the microphone. Thanks.

Is that working? Yes. Good afternoon. Thanks for staying late at FOSDEM on a Sunday afternoon; I'm sure we're all very tired and ready to go home. I want to talk to you today about the operating system and other software that we put on some new, interesting hardware. I have a couple of slides on the hardware, and there are extensive additional presentations online if you want to learn more about the machine hardware; today I want to focus on just the software aspects. As he said, there'll be time for questions at the end, but if you have questions in the middle of the talk, feel free to raise your hand and we'll see if we can't get them answered during the presentation as well.

So The Machine is all about a new concept in computing called memory-driven computing. Existing computers are very much centered on the central processing unit. That CPU ends up gating a lot of the communication between the various components, and it becomes a tremendous bottleneck when you're trying to move large amounts of data. In classical computing, how does an application talk to its data store? It goes through the operating system: you have file systems, disk drivers, block buffers, the page cache, and a lot of distance between your application and its data. In memory-driven computing we're trying to collapse that storage hierarchy, get the operating system out of the loop, and have your application communicate directly with the underlying store. That store is a byte-addressable memory store; you can get those today in the form of DDR-T or NVM memories, and the DAX subsystem in Linux is designed to talk to exactly this. So we're starting to see small systems doing memory-driven computing, but where The Machine is trying to go is to take memory-driven computing from a single computer to an entire machine room full of computers, all of which are interconnected at the memory level.

Memory-driven computing needs a bunch of different components. First, fast and persistent memory: if you want to get at your store at CPU speeds, and you want that store to be the final resting place of your data, it needs to be fast and it needs to be persistent. There are a bunch of technologies coming that promise fast and persistent memory: HPE's memristor technology, Western Digital's ReRAM, Intel's 3D XPoint, IBM's spin-torque transfer memory. All of these technologies, we hope, are coming soon to give you fast, persistent memory, which will enable a new generation of application design called memory-driven computing.

Second, you need a fast memory fabric. Right now, the only way the CPU can talk to memory is through its DDR bus, and that really limits how much memory a single CPU can talk to. By replacing the memory interconnect in your system, you can dramatically increase the amount of memory you can attach to your computer. I'm going to talk about a new memory interconnect that we built for The Machine, technology that is now being evaluated by a new consortium building a new memory interconnect.
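This is a minimal sketch, not from the talk, of what that collapsed path can look like to an application on Linux with a DAX-capable filesystem: the program maps a file backed by byte-addressable memory and updates it with ordinary loads and stores. The mount point and file name are hypothetical.

```c
/* Minimal sketch of "collapsing the storage hierarchy": instead of
 * read()/write() through the page cache, the app maps a file on a
 * DAX-capable filesystem and stores bytes straight into the persistent
 * store.  The /mnt/pmem0 mount point and file name are hypothetical. */
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/mnt/pmem0/counter", O_CREAT | O_RDWR, 0644);
    if (fd < 0)
        return 1;
    if (ftruncate(fd, sizeof(uint64_t)) < 0)
        return 1;

    /* Loads and stores below go directly to the byte-addressable store;
     * there is no block layer or page-cache copy in the path. */
    uint64_t *counter = mmap(NULL, sizeof(uint64_t),
                             PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (counter == MAP_FAILED)
        return 1;

    *counter += 1;                              /* ordinary store */
    msync(counter, sizeof(uint64_t), MS_SYNC);  /* conservative flush */

    munmap(counter, sizeof(uint64_t));
    close(fd);
    return 0;
}
```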
The fabric turns out to be the interesting part of The Machine in almost every way. It offers this tremendously broad reach to a huge amount of memory. One of the other interesting aspects of memory-driven computing is the notion that a single kind of general-purpose computer is not going to be the best thing for doing all of your computation. People are already noticing that in a lot of neural-network and other machine-learning applications; we have people using GPUs now. GPUs are terrible at general-purpose computation, but they're great at very specific computations. In the world of memory-driven computing, what we want to do is unleash that capacity to plug different computational elements into the same memory fabric, so that all of your computational elements can be applied directly to the memory. And of course, we're at a software conference: memory-driven computing is a full employment act for software developers, because to do effective memory-driven computing you're going to need a lot of new software to take advantage of the close coupling between the application and the store. A lot of it is vastly simpler software, as you'll see in the presentation today, but it's very different, and it takes a lot of new time and ideas.

Here's The Machine that we built, as a very simple sketch. It's a homogeneous collection of compute nodes. You can see in blue a little Linux computer with DRAM and an SoC; literally, there are four different Linux operating systems running on this slide right here. All of those Linux computers are connected into the fabric, so they each have a communication channel into the memory fabric through a fabric switch. Connected to that fabric switch, near each computer, is a collection of fabric-attached memory. For The Machine prototype, because we don't have this fast persistent memory available to us today, we've connected DRAM. It turns out that when you put 320 terabytes of DRAM in a single rack, the cooling challenge is kind of daunting. What I didn't bring were the pictures of the team doing development on the hardware: they all have ear plugs in and headphones on to cope with the noise of the fans cooling the hardware. DRAM is really not going to scale, mostly because to keep data in DRAM you have to keep the power turned on, whereas with fast persistent memory you don't need any power at all to retain the data. Fast persistent memory offers two tremendous benefits for computing at scale: it gives you the ability to know that your data is safe in the face of power failure, and it gives you the ability to increase your storage without increasing your static memory power consumption. Your power consumption now scales with the amount of data you're accessing, not the amount of data you're storing.

So that's a simple schematic of what we've built. Here's a single node in a little more detail. Within that compute node you have a little DRAM and an SoC that runs Linux, and then you have some new interconnect technology, called our Next Generation Memory Interconnect in this particular system, that connects it into the fabric. Because these are separate Linux computers, they're actually running different operating systems, so the protection domain between the application and the memory now needs to sit underneath the operating system.
The operating system is no longer in the privileged position of providing all the access control to your storage, so we've inserted into the bus an access control mechanism, a little firewall between the operating system and the hardware itself. So there's a hardware firewall. We also need some address mapping. The poor little SoC that we put in this thing does not have enough physical address bits to address all the memory in The Machine. The instance we've constructed will address up to 320 terabytes of memory, which requires 49 bits of addressing. The largest SoCs that we could purchase and evaluate today only offer 48 bits, and the particular SoC we've used only offers 44 bits. So here we have another case where we need an addressing indirection between the application's physical addresses and the hardware, because the addressing of the SoC is not large enough; we've had to put address translation in. That turns out to be kind of a performance cliff, because all of the caches in the SoC are physically tagged, which is to say every cache line in that SoC has a physical address associated with it. So if you're going to remap the physical addresses underneath the SoC, you have to flush the entire cache. Modern SoCs hate that; it takes a long time to flush the cache. You can see here that the memory complex is represented as a single block, and the SoC has access to the entire fabric. Every SoC in The Machine has byte-level load/store access to every byte of memory in the fabric. We put the machine in a single rack right now, which means that every SoC can talk to up to 320 terabytes of memory.

That new interconnect technology has led to HPE helping found and join a new consortium called Gen-Z. It's a new data access technology designed to replace all of the interconnects within your computer (the PCIe, DDR, and QPI links) with a single homogeneous fabric. And that means you can interconnect many SoCs and a lot of memory together. DDR is a great technology and it has served us very well, but it is strictly point-to-point between the SoC and a small selection of DRAM. The wires have to be short and the signal tolerances are very tight, which means you can't plug in very much memory. If you look at the maximum memory you can plug into a typical processor, it's maybe a terabyte, maybe two, maybe ten. HPE makes a machine right now that offers up to 24 terabytes of memory, and the way we do that is by interconnecting SoCs that are each talking to their own DDR memory, so the access latency between an SoC and memory depends on how many SoCs and how much interconnect sit in between. By using a homogeneous fabric, we can flatten the access times for memory across the entire fabric. Gen-Z is a consortium working on this technology; they haven't released any particular hardware yet. We're working with a bunch of industry leaders and partners to develop this new interconnect technology that's going to make it possible to deliver memory-driven computing.

So let's talk about the software on The Machine. When we started The Machine program, the research OS people were very excited. They said, oh, new hardware, you must need a new operating system.
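(A quick arithmetic check of that addressing claim, mine rather than the speaker's: 2^48 bytes is 256 terabytes, which is less than 320, while 2^49 bytes is 512 terabytes, so a flat physical map of 320 TB does indeed need 49 address bits; a 44-bit SoC on its own can only reach 16 TB directly, hence the extra translation step.)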
And then we went out and talked to our partners who wanted to develop applications, and they said, no, you don't get to do a new operating system; at least give us Linux so that we can try it out. And of course the Linux group at HPE was very excited, because it's the opportunity to do fun, new work in Linux. So we built a collection of software, collectively known as the machine distribution, a piece of which is Linux for The Machine. Within that software framework there's a bunch of new APIs for applications to talk to, which I'll talk about in a minute, and then there's the basic node operating system. Off to the side there's another piece of hardware, called the top-of-rack management server, that runs additional global services, one of which is the librarian that I'll be talking about.

Drilling down a little deeper into just the piece called Linux for The Machine, here's the software that runs on an individual node. Within Linux user space we have a bunch of new libraries: we have additional atomic access, we have the standard POSIX APIs, and we have an additional library for managing the caching issues that arise because the memory fabric is not cache-coherent. The PMEM name here actually comes from pmem.io, which is a collection of people working on NVM APIs. One of the things they've added is the ability to persist memory operations: you do a store, and that store is going to land in your cache; if you actually want to get it out to the NVM, you have to flush it out of the cache, and the PMEM library offers an API for that. We have extended that library with the ability to invalidate cache lines as well. Because the memory fabric is going to span many, many machines with a lot of memory, the current fabric we've implemented for The Machine is not cache-coherent across machines. Which is to say, if I want to communicate a byte store from one machine to another, I have to store the data into memory with a regular store instruction, I have to flush the cache with a regular CLFLUSH instruction, and on the other end I have to invalidate the cache and then read the data. The PMEM library encapsulates all of that in a nice little API so you don't have to know which processor you're running on.

Within the kernel we've added a bunch of new drivers, because we like to hack on the kernel. There's the atomics driver, which lets you do atomic operations to the fabric memory without doing all the cache manipulation yourself; it does it all for you. There's the cache flush support that I talked about, and then there's the new file system. When we started the program a couple of years ago, the researchers came to us and said: we want to be able to manage the storage in The Machine, we want to provide collections of blocks of memory, and we want to provide naming and access control, so we've invented this new API that's going to do all the storage management for us. The Linux team looked at it and said, you've invented a file system API; we have those already. How about this nice POSIX file system API and the nice existing VFS layer in the kernel? And they were like, huh, that kind of does look the same. Yeah. Researchers, you know. They think they've invented something new, and what they've done is reinvent UNIX yet again.
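As a sketch of that non-coherent store/flush/invalidate pattern, here is roughly what the producer and consumer sides look like in C. pmem_persist() is the real libpmem call for flushing stores out of the CPU caches; pmem_invalidate() stands in for the cache-invalidation extension mentioned in the talk, whose real name I have not verified, so it is declared here as a placeholder.

```c
/* Sketch of the non-coherent producer/consumer pattern described above,
 * assuming `buf` points into a mapping of fabric-attached memory. */
#include <libpmem.h>
#include <stddef.h>
#include <string.h>

/* Placeholder for the invalidation extension described in the talk;
 * the actual function name in the machine's libpmem may differ. */
void pmem_invalidate(const void *addr, size_t len);

void producer(char *buf, const char *msg)
{
    size_t len = strlen(msg) + 1;
    memcpy(buf, msg, len);      /* ordinary stores into fabric memory    */
    pmem_persist(buf, len);     /* push the dirty cache lines to the FAM */
}

void consumer(char *buf, char *out, size_t len)
{
    pmem_invalidate(buf, len);  /* drop any stale cached copy first      */
    memcpy(out, buf, len);      /* ordinary loads now see fresh data     */
}
```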
So we created a file system to manage the bulk storage of data within The Machine. We call it the librarian, and the reason for that name is this: the bulk storage management within The Machine manages a collection of pages, and what do you call a collection of pages? A book, right? So we have books, and the storage granularity within The Machine is that each book is eight gigabytes of memory. That's your block size in this file system: nice small eight-gigabyte chunks. And when you manage a collection of books, well, you manage them in a library, so that's why we call it the librarian. In between books and the library, a collection of books is a shelf. Unfortunately, the least usefully named thing in our little world is the shelf, which turns out to be the same thing as a file, because it's a collection of books and each book is a collection of pages. So at that level a file is the same thing as a shelf, and a library is a collection of shelves. Yeah, the metaphor doesn't work great, but the nice thing is that we got a bunch of unique names: when we talk about shelves, we know that means a file within the librarian file system. It let our conversations work a little better, which is what naming is supposed to do, so that was nice.

Here's what the librarian file system looks like. Outside of each node of The Machine, another machine we call the top-of-rack management server runs the actual metadata manager for the file system, called the librarian. It stores its metadata in a very sophisticated file structure called an SQLite database, for extra performance. Remember, the books are 8 gigabytes in size and the total store of The Machine is 320 terabytes, so there are only about 40,000 books in The Machine, which means the metadata management problem in this particular instance is pretty small: I only have to remember 40,000 objects. The librarian is written in Python, it uses that SQLite database for the metadata storage, and it uses regular TLS connections to talk to the nodes, to this LFS proxy, which is a user-space application running on the node, also written in Python, that uses a hacked-up version of FUSE to talk to the kernel. So I have a regular application talking to a regular file system layer; that file system layer looks a lot like FUSE, and all the metadata operations pop out of the kernel into this LFS proxy, get sent over the network to the global librarian on another machine entirely, the transactions get logged in the SQLite database, and the return results wend their way back through this whole chain. We were a little concerned about performance, but in real applications it turns out that, at 8 gigabytes per chunk, applications are doing maybe a couple hundred metadata operations per second, so it actually works out pretty well.

Another big piece of The Machine is security; security is supposed to be built in by default. I told you about the hardware firewall that protects the fabric from the nodes. In order to manage that firewall, you have to manage its contents on each node. Well, in their infinite wisdom, the hardware designers put the SoC itself in control of the firewall that is supposed to be protecting the fabric from the SoC.
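To make the shelf/book terminology concrete, here is a hypothetical sketch, not taken from the talk, of what using the librarian file system might look like from an application: a shelf is just a file, sized in whole books, that the application maps and then touches directly. The /lfs mount point and shelf name are invented, and details such as sizes having to be multiples of the 8 GB book size are assumptions.

```c
/* Hypothetical application-side use of the librarian file system (LFS):
 * create a shelf, size it in whole 8 GiB books, map it, and access
 * fabric-attached memory with ordinary loads and stores. */
#include <fcntl.h>
#include <stdio.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#define BOOK_SIZE (8ULL << 30)              /* 8 GiB per book */

int main(void)
{
    int fd = open("/lfs/my_shelf", O_CREAT | O_RDWR, 0666);
    if (fd < 0) { perror("open"); return 1; }

    /* Ask the librarian (via the LFS proxy) for two books of capacity. */
    if (ftruncate(fd, 2 * BOOK_SIZE) < 0) { perror("ftruncate"); return 1; }

    /* Map the shelf; from here on there are no read()/write() calls,
     * just direct access to fabric-attached memory. */
    uint8_t *shelf = mmap(NULL, 2 * BOOK_SIZE, PROT_READ | PROT_WRITE,
                          MAP_SHARED, fd, 0);
    if (shelf == MAP_FAILED) { perror("mmap"); return 1; }

    shelf[0] = 42;                          /* a byte store into the fabric */

    munmap(shelf, 2 * BOOK_SIZE);
    close(fd);
    return 0;
}
```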
Well, okay, that's how they wired it up. So what we did is stick the control for that firewall inside the ARM TrustZone, which is a kind of miserable security enclave within the ARM ecosystem, but it is secured from the operating system, and then we have a firewall proxy up in user space that passes the firewall commands along. It's kind of a kludge, but it's a research prototype; it doesn't have to be really good, it just has to work. So we have a secure system, we have a performant-enough system, and we have something which is relatively easy to prototype, relatively easy to develop, manage, maintain, and replace if we need to, and it all works pretty well.

Another big part of the system is the application libraries on top of the kernel. If all you give the application is a memory-mapped file, and you want to be able to share data structures across nodes in The Machine, there aren't any real APIs today that do a good job of sharing data structures across a machine like this. Remember, the goal here is to have the application directly manipulating its data structures in persistent memory, and to have those data structures shared across nodes of The Machine. We're trying to collapse the storage hierarchy: we get rid of serialization of data, we get rid of the external database server that we'd use in a SQL environment, we get rid of all that stuff and have the application manipulate the data directly. So up in user space we've constructed a couple of different APIs to explore this area. The one I want to share with you today is called Managed Data Structures. We're doing a lot of ongoing development here, so it's a pretty active part of our research program right now.

If you look at the way a traditional database works: the application sits on top of a relational API, which talks through a database client API to an external database server, and that database server records the persistent data in the file system. It's a deep stack, right? Every time you do some manipulation in the application, you're going over the network to the database server, and it's transacting stuff down to the file system. By doing all the data structure manipulation within a single address space, directly to persistent memory, we just have the application sitting on top of the Managed Data Structures runtime, and we get a tremendous performance improvement from doing that directly. What MDS actually does is provide data structure APIs in user space, manipulated directly by the application. The application can create data structures like a linked list or a hash table or whatever it wants, and the operations on that data structure are performed directly in the persistent store and are directly visible across the fabric to other nodes in The Machine. So you get a wide variety of the little data structures you're familiar with from regular C or whatever language you're used to, we get rid of all these additional abstractions, and we're just working on raw data structures. It's very efficient, and we're able to share the data: MDS actually offers both Java and C++ APIs, so you can write a Java application and a C++ application that directly access the same data using simple abstractions common to both languages, and they share the data together. So here's what... actually, I'm going to go off and talk about our emulation story now.
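To make "directly manipulating a data structure in the persistent store" concrete, here is a small illustration of the general technique in plain C; it is not the MDS implementation. Links are stored as offsets from the base of the mapped region rather than raw pointers, so different nodes (or different runs of the program) can map the region at different virtual addresses and still walk the same list.

```c
/* Not MDS itself: an illustration of keeping a data structure directly
 * in a mapped persistent region, with offset-based links instead of
 * raw pointers so the mapping address does not matter. */
#include <stdint.h>
#include <stddef.h>

#define NIL ((uint64_t)0)               /* offset 0 reserved as "null" */

struct plist_node {
    uint64_t next;                      /* offset of next node, or NIL */
    int64_t  value;
};

struct plist_head {
    uint64_t first;                     /* offset of first node        */
};

static inline void *off2ptr(void *base, uint64_t off)
{
    return off == NIL ? NULL : (char *)base + off;
}

/* Sum the list by chasing offsets; `base` is wherever this node happened
 * to map the persistent region. */
int64_t plist_sum(void *base, struct plist_head *head)
{
    int64_t sum = 0;
    for (struct plist_node *n = off2ptr(base, head->first);
         n != NULL;
         n = off2ptr(base, n->next))
        sum += n->value;
    return sum;
}
```

The Atlas library mentioned later in the Q&A takes the other approach: it lines up virtual addresses across nodes so that raw pointers stored in the region stay valid.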
Linux for The Machine, of course, operates across the cluster; to Linux it looks like a cluster. You have nodes; here I have just two nodes on the screen, although you can have as many as you like, and in a single rack we can get 80 of these. So here you have node one with its user space and its librarian file system in the kernel, talking to the real hardware, and the same on the other node. Then the top-of-rack management server is running the librarian on top of a kernel, and it has none of the magic fabric hardware. In the hardware we've built right now, each node is running an ARM processor and the management server is running on an x86 processor, but that doesn't really matter. The nodes, of course, communicate through the fabric. You'll note that there's a distinctly missing line: the management server has no connection to the fabric. The management server is a regular x86 server, so it can't actually see the data that it's managing. It only knows, oh, this guy has that book and that guy has that book, and at least they're not colliding; but the management server can't store data into the persistent memory and it can't read data out of it. So when you're operating on the real hardware, that's how it works.

Now, before we had the real hardware we needed to be able to start doing application development, so we wanted to take existing hardware and construct a synthetic version of The Machine. We did two things. The first thing we did was take an old ARM emulator that we had and put in register-level simulations of all the fancy new hardware: a simulated CPU, a simulated Z-bridge, and a simulated fabric. So on a giant machine we were executing machine applications at roughly one one-thousandth of the speed of the native hardware. We could do kernel development, we could do driver development, we even did the low-level BIOS and firmware development in that environment. But when you try to actually run applications, it's really way too slow. The other problem, of course, is that you need a really big machine in order to simulate even a reasonably sized cluster, because the computational cost is huge. We were using our Superdome X servers, which have 16 giant processors and 24 terabytes of memory, to simulate a modest-sized machine, and that's a multimillion-dollar hardware investment if all you want to do is application development.

So we came up with a really kind of a cute hack. One of our kernel engineers, Rocky Craig, went off and discovered that QEMU has this magic device that it exposes to virtual machines within the Linux environment, called the ivshmem device. Ivshmem actually lets you share memory between virtual machines: it lets you take a chunk of memory from the hypervisor and map it into the address space of multiple virtual machines. That sounds a lot like our fabric, doesn't it? So what we've done is construct a cluster emulation where we take multiple virtual machines (these are all running on the same underlying physical machine) and construct within the hypervisor this piece of ivshmem, the inter-VM shared memory, and we make that memory available to each of the VMs. Those VMs, running Linux, can then pretend that it's fabric-attached memory and manage it as if it were fabric-attached memory.
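For the curious, here is a rough guest-side sketch, mine rather than the project's code, of how an ivshmem region can be reached from inside a VM: the device shows up as a PCI device whose BAR 2 is the shared memory, and Linux exposes that BAR in sysfs. The PCI address below is an example, and the real emulation plumbs this through the kernel so applications see fabric-attached memory rather than a sysfs file.

```c
/* Guest-side sketch: map the ivshmem shared region (PCI BAR 2) from
 * sysfs.  Every VM that maps it sees the same host memory, so it can
 * stand in for (non-coherent) fabric-attached memory during testing. */
#include <fcntl.h>
#include <stdio.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    const char *bar2 = "/sys/bus/pci/devices/0000:00:04.0/resource2";
    int fd = open(bar2, O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;                     /* st_size is the BAR size */
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    volatile uint8_t *fam = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                                 MAP_SHARED, fd, 0);
    if (fam == MAP_FAILED) { perror("mmap"); return 1; }

    printf("shared region: %lld bytes at %p\n",
           (long long)st.st_size, (void *)fam);
    munmap((void *)fam, st.st_size);
    close(fd);
    return 0;
}
```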
So now all I need to do is stick another virtual machine in there, pretending to be the top-of-rack management server, with the librarian stuff running in it, and all of a sudden I can do storage management for this inter-VM shared memory and get a synthetic machine running. I have it running on my laptop, so I'm able to use all the APIs for The Machine on a single laptop. All of the source code necessary for this is up on GitHub right now, and I have a Debian repository that's public, so you can take a Debian machine, install a couple of VMs with Debian unstable on them, and just install the necessary software to run a completely synthetic machine. I talked to a customer on Friday who wants to play with The Machine, and when I showed them in about five minutes how they could take three VMs and get the machine APIs running, they were pretty excited.

Do I have ten minutes, you said? Ten minutes, okay, about ten minutes. Slightly more. I had to reboot my machine just before the talk; I may give the demo a try, we'll see. In any case, I created a simple little application, and I'll show you the source code to it in a minute in Emacs. It just uses the basic primitives of The Machine: the FAM atomics library to do atomic operations and libpmem to do the data cache management, and it's a simple chat program. You type words in one window and they appear in another window, so you can see how that kind of application works. Actually, why don't I go through the rest of the talk, and then if we have time I can do the demo.

I just wanted a couple of slides here to show you where the free software for The Machine is available; here are all the links. github.com/FabricAttachedMemory has all the necessary Linux bits to run the fabric-attached memory emulator. github.com/HewlettPackard has a bunch of the other APIs we're using: it has the Managed Data Structures library that I talked about; it has the multi-process GC (MPGC), which does garbage-collected memory allocation across multiple processes, which is kind of tricky because processes end up crashing a lot, and MPGC takes care of that; and we have Atlas, which makes accessing fabric-attached memory reliable, with kind of a transactional model that makes fabric-attached memory easy to program. As I said, we're trying to build a wealth of APIs to help developers learn how to do memory-driven computing, and all of these are public. And the Debian repository containing all the bits that come out of this is at downloads.linux.hpe.com.

Let's see. I wanted to show you some pretty pictures of the hardware we built, because I like hardware. Here's a single node of The Machine. On the far left you'll see the SoC; in the middle you'll see its local DRAM. It just has a little bit, 256 gigabytes of local DRAM, and it runs Linux completely out of memory: it has no persistent storage at all. On the right side you'll see the four terabytes of fabric-attached memory with the memory controllers, and the next-generation memory interconnect that goes between that and the rest of the system. On the back side there's a huge number of optical and electrical connectors to connect the fabric to the rest of the system. Here's another lovely piece of hardware.
This is the box that we slide all the nodes into; it has the networking interconnects, which is pretty cool. Those connectors in the back are optical connectors. And here's the other really cool piece of tech that we have. Here's a piece of silicon, a single chip, and within that piece of silicon there's a laser, a little ring laser that emits light out the top of itself. Then you use this optical connector to stick a piece of optical fiber right over the top of the silicon, and now I have an optical interconnect coming directly off the silicon. It's called silicon photonics, it's amazing technology, and it's actually in the machine prototypes, so we're really doing optical stuff, which is pretty cool.

And if we have a minute, I can show you the demo. It's not that exciting; I think I'll show you the code instead, to show you how it actually works, because it's pretty fun. Oh, thank you. Let me turn off the presentation here and put this on the other screen. Oops. Now I'm going to try to manage this. That's the entire program for communicating between two machines. Is that too small? I see a bunch of people squinting. I don't know if I can even make the text bigger on that other screen; I have to pull it back over here so I can manage it. Let's see if this works a little better. I put a couple of equals signs in there just for good measure. So at the top you can see it's just using the FAM atomic library to do an atomic read: you check the status of this little semaphore, and when the semaphore is in the right state, it either reads or writes the data, and then the other side just does the opposite thing. So you run this program on two nodes of The Machine; like I said, I'm sorry I had to reboot my machine just before I started the presentation, or I could show it running here. But within the virtual machine manager, within libvirt, you can actually see that I have a ToRMS machine and a couple of nodes, and when those are all up and running you can do communication between them. But I wanted to leave time for questions and comments, so I think I'll end the presentation there. Make sense?

So you did say that different applications, like Java and C++, can share memory. Yep. So how is this supposed to work if Java has a managed heap and C++ has unmanaged memory? How can they share? Or is it not the Java heap, just a piece of shared memory that both of them use?

Yeah, so the Managed Data Structures library provides these data structures to the application. The Managed Data Structures library is written in C++ and it's exposed to Java through JNI. So when you want to create a new object in Java, you go through the Managed Data Structures system. As I said, that sits on top of the MPGC library, the multi-process GC library. So when the JVM loses a reference to the object, the JNI layer tells the MPGC system that the reference is lost, and it goes ahead and collects the memory for that object. There are hooks within the JVM that allow the JNI layer to know when an object is no longer referenced, and we have garbage collection within the MDS system that is connected between those two. So you're actually able to do garbage-collected memory within C++ and garbage-collected memory within Java using this library. So yes, there is a library between you and the actual data.
You aren't doing a new in C++, you're calling the MDS system to allocate new memory; you're not doing a new in Java, you're calling the MDS library to allocate new objects. So it's managed data structures, not native data structures.

So I was going to ask a question about that: that implies that you're not sharing page tables, then? No, they do not need to share page tables; they're just sharing the raw underlying memory. Okay, but if I have an existing library... I worked on GPUs for a long time, and the biggest complaint was that I couldn't just take a piece of C code, run it on the GPU and run it on the CPU, and pass a pointer to a data structure. You're not going to be able to do any of that unless you use your managed data structures?

It depends on how you manage things. If you map things at the same address on both nodes, then you can share pointers, and the Atlas library does that in particular: it allows you to make sure that the memory lands at the same address so that you can actually share things. It doesn't matter whether you actually share page tables, as long as you're sharing the same virtual addresses; the operating system may have to do some extra management in order to keep those virtual memory mappings lined up. We aren't doing any paging, of course, because it's all physical memory, so all you need to do is make sure that the virtual addresses are lined up, and that's really simple to do with the mmap system call; it's actually fairly easy to arrange for the addresses to be the same. The big problem we have is that the CPU has a very limited address space: the ARM64 virtual address space is only 48 bits right now, which means you have to be careful, when you're running the application across multiple nodes, to reserve the appropriate portions of your virtual address space so that you can get the virtual addresses to be the same, just because you don't have enough address space to make it easy.

One question over here. What kind of memory chips do you use? What I got was that it's non-volatile. Is it some kind of ferromagnetic RAM, something you build, or something you get off the shelf? Because we couldn't get any fast non-volatile memory right now, the prototype we've built uses regular DRAM, which is non-volatile until you turn the power off. From a research prototype perspective it allows us to do all of the research we need to do into memory-driven computing, in terms of application development, API development, all that kind of stuff, and we just don't turn the power off.

Other question over here? Is there any plan for a cheap version of The Machine that a hacker can buy and use at home to write software? An affordable version for hackers to just buy and use at home, rather than only enterprise users buying The Machine? Well, this machine itself, as a research prototype, is not a product at all. We have, I think, a hundred of those nodes on the planet right now, and we aren't really planning on building more of this particular prototype. Whether any of the technology in The Machine research program gets into products is something that the product groups at HPE are working on; I'm part of the research team building the prototypes, so in terms of what part of this technology is going to make it into products, I don't have any inside information right now. Thank you for the talk.
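Here is a minimal sketch, mine and not the Atlas library, of the "same virtual address on every node" idea from that exchange: each node maps the shared region at an agreed fixed address, so raw pointers stored inside it mean the same thing everywhere. The address, path, and exact flags are illustrative only.

```c
/* Map a shared shelf at an agreed-upon fixed virtual address so that
 * raw pointers stored inside the region stay valid on every node.
 * MAP_FIXED_NOREPLACE asks for exactly this address but fails rather
 * than clobbering an existing mapping. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define SHARED_BASE ((void *)0x200000000000ULL)   /* agreed per-app base */
#define SHARED_SIZE (1ULL << 33)                  /* 8 GiB: one book     */

int main(void)
{
    int fd = open("/lfs/shared_heap", O_RDWR);    /* hypothetical shelf  */
    if (fd < 0) { perror("open"); return 1; }

    void *p = mmap(SHARED_BASE, SHARED_SIZE, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_FIXED_NOREPLACE, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    printf("shared heap mapped at %p on this node\n", p);
    munmap(p, SHARED_SIZE);
    close(fd);
    return 0;
}
```

With only a 48-bit virtual address space, every node has to reserve the same range up front, which is the bookkeeping burden the answer alludes to.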
I don't know if you were in the previous talk about RISC-V, but could you implement The Machine with the new RISC-V CPUs? I don't know how much memory you can address with those, though.

Well, that is one of the explicit goals of the memory-driven computing initiative: to allow you to plug whatever processing elements you need into a heterogeneous environment. So you could have ARM and x86 and RISC-V and POWER processors all connected to the same memory fabric. The only thing you need to do is put into the processor the interconnect from that processor's memory bus to the next-generation memory interconnect, or in the future the Gen-Z system, because Gen-Z is an open consortium of which IBM, ARM, a bunch of the ARM licensees, and AMD are all members. You can imagine that we'll get a wide ecosystem of different processors. The specifications for Gen-Z are all open, and some of the sample implementations, the VHDL for the sample chips that HPE is working on, are also open. So I can easily imagine the RISC-V people working within the Gen-Z consortium, taking those specifications, adding that interconnect to their processor, and enabling this kind of technology there. Absolutely. I don't know exactly what the licenses are; it's a royalty-free license for using the technology, and I haven't looked at the Gen-Z licensing in detail, but I know that it is very possible for RISC-V to connect to Gen-Z without joining the consortium and without paying any royalties.

Hello. Yep. What are the difficulties of using SSD, perhaps with a cache, rather than waiting for more exotic memory technologies? Well, in fact that's similar to how NVM works today. You can buy NVDIMM memory, which is just regular RAM with flash on the back side, so you can imagine building a system that uses DRAM backed by almost any persistent storage technology, absolutely. Again, one of the benefits of using truly persistent memory is not just that it's persistent, but that it takes no power to retain the contents. So you can get a much larger amount of memory in the same physical space, because you don't have the thermal constraints of DRAM. Right now our 320-terabyte version of The Machine is technically air cooled; when you push air at those speeds, is it still a gas? I don't know. There's a lot of air being moved through that machine just to cool the DRAM. If we could get truly persistent memory, we'd be able to lower the power budget dramatically and increase the amount of storage you can get within the same physical volume, and the cooling requirements go away. But yes, there are lots of options for people to emulate that persistent memory using DRAM with other, slower persistent memory behind it.

I have a question myself. Regarding the semaphores, are they being dealt with in hardware as well? Yes. In The Machine hardware, because that fabric-attached memory is external to the processor and shared by multiple processors, the atomics are actually performed in the memory controller itself. So when you do an atomic operation, it's actually a transaction across the next-generation memory interconnect fabric between the processor and the memory controller, so that it's globally atomic within the entire machine, not just within a single node.
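As an illustration of that, here is a rough sketch of the ping-pong pattern the chat demo described earlier, built on a one-word semaphore in fabric-attached memory. The fam_atomic_* names are modeled on the FAM atomics library mentioned in the talk but are unverified placeholders (declared below so the sketch is self-contained), and the libpmem cache-management calls are omitted for brevity.

```c
/* Ping-pong chat sketch: a one-word semaphore in fabric-attached memory
 * decides whose turn it is to write.  The atomic read/write operations
 * are carried out by the memory controller, so they are atomic across
 * every node on the fabric, not just within one SoC. */
#include <stdint.h>
#include <string.h>

/* Placeholder declarations standing in for the real FAM atomics API. */
uint64_t fam_atomic_64_read(uint64_t *fam_addr);
void     fam_atomic_64_write(uint64_t *fam_addr, uint64_t value);

enum { OWNER_A = 0, OWNER_B = 1 };

struct chat_channel {
    uint64_t owner;             /* whose turn it is to write            */
    char     text[256];         /* the message, living in fabric memory */
};

/* Spin until it is `me`'s turn, publish a message, then hand the
 * semaphore to the other side. */
void chat_send(struct chat_channel *ch, uint64_t me, const char *msg)
{
    while (fam_atomic_64_read(&ch->owner) != me)
        ;                       /* busy-wait for our turn */
    strncpy(ch->text, msg, sizeof(ch->text) - 1);
    ch->text[sizeof(ch->text) - 1] = '\0';
    fam_atomic_64_write(&ch->owner, me ^ 1);   /* pass the turn */
}
```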
So we can't just use the atomic operations within the SoC itself, because those are only atomic with respect to other threads running on the same SoC; they're not atomic with respect to other SoCs on the fabric. So yes, it's actually done within the memory hardware itself.

A question: have you run lmbench to compare local memory access times with the fabric's memory access times? The current implementation of the memory controllers and the fabric is all in FPGAs, so it's not running as fast as it should; FPGA performance is not the same as ASICs doing DDR access. We know what the memory performance is likely to be, and right now it's dramatically lower, but that's simply because we're using FPGAs for the early implementation, so that we can explore different ideas in the interconnect and so that we could do the board design in parallel with the hardware design. We got the boards back and the FPGA firmware loaded on, and those two schedules were aligned, so we didn't have to do the ASIC design and then the board design; we did a bunch of stuff in parallel like that. As a result, the hardware is a demonstration of how it can work, but it's not a complete performance simulation of what an eventual piece of memory-driven computing hardware would be like. So no, it would be a lot slower. I think the FPGAs are running at less than 500 megahertz, so do the math; it's not very fast. It's a lot faster than SSDs, a lot faster than NVMe, but it's not as fast as DRAM, because the interconnect technology is being done in a gate array.

Over here. You'll have to wait for the microphone... there you are. Sorry if I missed this piece of information, but will there be any API or any way to schedule processes onto particular nodes, to control the locality of a process, so that the computation runs on a CPU located as close as possible to its data? Oh yeah, we've done a couple of things for this. One of the things we did within the file system: each file within the librarian file system has extended attributes that direct the librarian as to where the memory should be allocated within the system, so you can define where the allocations occur. And then, of course, each node is a separate Linux instance, so if you want to control where in The Machine your application runs, you can direct it to a specific node directly; that's entirely under your control. You can imagine extending things like MPI to have more abstract control over where things are run. Right now we don't; right now it's SSH into the node and start your application. But as I said, we have the allocation control, so you can control where the librarian allocates the memory.

Question behind you? So is there any support to fork a process on a different node? Each node is running a different instance of Linux, so if you want to run a process on another node, you can SSH over there and run it, but there's no support for that in Linux for The Machine. Nope, we haven't done any of that work yet; it's still to be done. We got to the point where we could actually run applications across the fabric for API analysis, and now there's more research to be done. We just need more money, right? That's what researchers always say: hey, I learned a bunch of stuff, and now I've got a bunch of questions that you need to pay me to solve.

I have one more: what do you think will be the problems that are best suited for this architecture?
So we've actually worked on a couple of different problems. One of the interesting ones we've come up with is large data. Right now, the way you solve large data problems is to partition the data across a lot of computers, which means you have to figure out which node to go ask for the information. With memory-driven computing you're able to access all that data symmetrically, so image search is a promising area of possible research. Another area we've looked at is graph analytics, where you have a large graph and balancing the data within the graph requires access to a lot of different data across the entire fabric. By using the fabric, each node can compute the values for one graph node by looking at the neighboring values, irrespective of where those neighbors are located in the fabric. So there are a bunch of places where, right now, communication dominates the cost over computation, and those are a lot of the algorithms that are faster in this architecture. Basically, anything that fits within the memory of The Machine is going to be a lot faster on The Machine than it would be on nodes with a collection of separate DRAM, where you have to spend a bunch of time communicating.

You talked about using SSH to control another node; do you have a way of running IP over the memory fabric, or is that a separate network? That's a good question. In the machine prototype, let's go over...