Okay, hello everybody, please get seated. So our first talk, our first regular talk of the official DebConf schedule, will be by Keith Packard from Hewlett Packard Enterprise and the Debian project, and he'll be talking about delivering software for memory-driven computing. Please give him applause. Good morning everybody, I've been here in Montreal for about six hours. To answer the most pressing question that everybody's been asking me: where is Bdale? Bdale just got off of a boat in Vancouver, Canada and is heading for the airport. So I'll let you know how his flight goes during the day, and we should see him here tomorrow. [Audience comment.] Thank you, I appreciate that, I do appreciate that, but that really is the only question I have been asked so far today, so I figured I should probably answer it first. Okay, so I'm going to talk about how we're using Debian to deliver software for memory-driven computing, and I want to first give the 30,000-foot view of what we think memory-driven computing is. I'll show a little demo, a little video of a demo that we did. It's really boring, it's literally a video of a computer monitor, so I apologize for that. But it's memory-driven computing: it's a giant box, it's got a bunch of LEDs on the front, they blink a little bit, and that's all you get to see. My life in computer graphics, 30 years in computer graphics, it was really easy to do demos, and now I'm doing memory-driven computing and it's like, what do you do for a demo? You have a web page. So what is memory-driven computing? It's our vision of the future of computing. Why are we doing this stuff? The basic motivation is this gap that we've seen, and I'm sorry for the corporate slides, but this really does give an introduction to why we're doing this, why we think it's important, and then I'll show you what we're actually doing. So we're seeing a gap between the amount of data that's being generated and the processing speed of computing. This is the basic gap. Moore's law has ended, computers aren't getting any faster, we haven't seen significant improvements in processor speed for, what, 10 years now? It's been amazing. And meanwhile, all of us seem to be uploading cat videos at an ever faster rate. The amount of data that we're capturing continues to increase. Network bandwidths continue to go up; I mean, we're shipping 100 gigabit networking now. Who would have thought? I lived through the early years of Ethernet, and we went from 3 megabits to 10 megabits in like 12 years. And now networking is getting faster at a fairly good clip. But our computers, our ability to compute, is not keeping up. And the basic problem is one of computer architecture. We've had essentially the same computer architecture for 60 years. There's the processor, and there's some memory, and the only way you can talk to that memory is from that processor. You have offline data storage, and you have networks so you can communicate in other fashions. But all your communication is very structured and very much software-driven. There's no hardware assist for the communication. If you want to communicate with another computer, you have to put together a little network packet and spit it out on your network interface. We would like to think about large-scale systems as being much more tightly integrated than that.
We're trying to get to the point where you can actually put a lot of data in one place and apply arbitrary amounts of computing power to it. So we're trying to move away from this notion where the memory is kind of a peripheral of the processor. Well, the memory is where all your data is. We're trying to take the memory out of that peripheral role and put it into the center of the computer. So right now we have kind of two notions of computing. We have scale-up computing, where you get a bigger and bigger and bigger processor and more and more and more memory. Well, Moore's law kind of says that's over; we're not able to do a lot more of that. Or you have this distributed mode where you take the notions we built in the 90s, the old clusters and that kind of thing, and you just make a bigger and bigger and bigger cluster. One problem with a bigger and bigger cluster is that now all of a sudden your data is getting sharded into tiny, tiny little fragments, so every processor can only see a tiny fraction of the overall problem. So you have to work very hard to get your problem to the point where you can actually do useful work on a thousandth of one percent of your data, or less. As your data sets get larger, the only way you're able to satisfy that is by breaking them into smaller and smaller pieces. So sharing everything is very difficult. From a hardware perspective, we have tried to do that. HPE has a couple of products. We have the Superdome X platform, which goes up to 24 terabytes of memory; the units are hard in my world, I apologize if I misspeak them. And then we recently acquired SGI, and their MC990X hardware has been brought into the HPE product line, and that one goes even further, up to, you know, 60 or 100 terabytes of memory. So we're starting to get pretty big scale-up systems, but those are really reaching the ends of viability, and so shared-everything is not very viable either. So what we're trying to do is get something in the middle: something that has the characteristics of scale-up computing, in terms of every processor being able to touch all of the memory, and something of the characteristics of scale-out computing, where you can add arbitrary amounts of computation and have it integrate into the system smoothly. So we're trying to find a new computer architecture somewhere between these two extremes. That's what memory-driven computing is: we're putting memory in the center of the computer and then attaching computing to it. And so we built something called the memory fabric testbed (MFT) in our Machine research program, and that's kind of an example of that. And I'm going to show you what that looks like; I have cool pictures. You can see this is actually in Fort Collins, in the office where my co-workers work. You can see the rack; it's a very deep rack, and the hardware that we built doesn't fit in this very deep rack: it sticks out the front and it sticks out the back. So it's kind of maximally sized for that. The original plan was to actually fill an entire rack, but we had power and cooling issues. And we had actually another interesting issue: it's very difficult to take this new computing architecture and immediately find problems to attack with it.
So we've spent 50 or 60 years sharding our programs for scale-out computing. All the supercomputers on the planet are scale-out computers. All the problems in the world are designed to be scale-out systems. So it's actually difficult to find problems, big problems, that need this kind of hardware today. And that's what we're actually spending this year doing: going out and finding some new, interesting, big problems. But we do have hardware working, and it is pretty cool. Here's one of those little nodes. There were 40 of these in that rack; this is just one of them. It has four terabytes of RAM and a many, many, many core ARM64 processor on it. And we put 40 of these in a rack. We designed it to scale up to 80 in a single rack. So the system that we built is 160 terabytes, and you can obviously scale beyond that. And the way all these nodes connect is through this interesting new memory interconnect. We did this memory interconnect as a prototype of a new system interconnect called Gen-Z. This is not Gen-Z, but a lot of the same ideas are in the two systems. It's designed to be a load-store fabric, which means you plug all these 40 nodes together and you can execute instructions on the processor and fetch data over the fabric. So unlike a network, unlike even RoCE or enhanced Ethernet where you have DMA capabilities, which still require software intervention for every transaction over the fabric, with this architecture you're literally just executing CPU instructions and the hardware goes and fetches data over the fabric. So your latency is very low, your bandwidth is obviously very high, and the complexity of the software is very, very low. You don't have to do any complicated mediation of buffering data or figuring out where the data is going; you can just execute instructions and take advantage of the enormous amount of memory. This is one of the network backplanes; you can see it has a lot of wires in it. There's a combination of copper interconnects and optical interconnects. Obviously, we're doing a lot of work with optics to make sure that we can reach beyond rack scale and go to data-center scale with the same kind of bandwidth and latencies. You can't get there with copper: it's too big, it's too slow, and it takes a lot of power, and so we're doing a bunch of work with optics. One of the chips that we made is this cool little X1 optical interconnect. It's literally a piece of silicon that has lasers etched onto it. So we're using silicon fabrication to build these little ring lasers. I went to a tech talk last year, an HPE internal talk about this, and I was just blown away by what they're able to do in terms of improving the performance and getting the bandwidth out of the chip. So it's a multi-ring laser on a single piece of silicon that all feeds into this single optical fiber coming off of it. It's just like, you can do that? It's really cool. So we take advantage of that, of course. And then we try to find problems. One of the problems that HPE has these days is that we run a big network, we have a lot of data coming out of our systems, and we have a lot of people trying to attack that network. And trying to do analytics on the patterns of the attacks that are coming in is very difficult, because you have to be doing it in real time. This is not a batch process. You really have to be updating your database in real time.
You really have to be doing analysis of the data as it's coming in, and one of the techniques that we're using is large-scale graph inference. People use this sort of thing on social network graphs, they use it to figure out advertising, they use it for all kinds of data where what you have is a bunch of independent agents that have relationships, and you want to find out how those relationships affect what those agents are doing. And so what you want to do is build this enormous graph. The thing about large-scale graph inference is that you can do local computations, but every iteration of that local computation changes what your locality is, which is to say at every step over the graph you have to completely change the data that you're operating on. And in a traditional scale-out architecture, that means you do a little step of your inference, and then you shuffle data across the network, and then you do another step. The steps in this problem are very small, which means you're spending most of your time doing communication in a traditional scale-out architecture. Memory-driven computing gives us the ability to just do a step and then say, oh, I need different data; well, let me just go fetch the data, it's out in the memory pool. I don't have to do any distribution and shuffling of the data. Let's see, the problem that we're working on here, we have some numbers for you: three and a half billion web pages, hundreds of billions of connections. So, you know, the scale of the problem is starting to get interesting. The particular problem that we were working on here is actually a security analytics problem at HPE, analyzing a single hour of data. Let me see if I can get this right and not tell lies about the numbers here; of course, I can't see my screen. So we have a single hour's worth of data, which is 20 billion points in a data set. And the problem is, if you're looking at traffic analysis of a single hour, you're missing any sort of long-term patterns. What we really need to be able to do, we think, is get more out of the data, so you take an hour, extend it out to a whole week, and you're dealing with significantly more data. So this particular video that I'm going to show you here is just a single hour's worth of data, and you can see how little of the memory fabric testbed we actually need to use. Let me actually go find the video here. Okay, then. If I drag it here, of course, external monitors, the delight of modern high-resolution graphics. See if I can actually make it not overfill the external monitor. I should be able to hit this key. Here we go. Okay, so this is just going to show you our little memory fabric testbed executing this problem with an hour's worth of network trace data. So you can see I have 40 nodes, 160 terabytes of memory, and you're going to see just how much memory is required, what percentage of the system memory is required for this problem. So there are 20 million data points, there are 55 million connections, and I have an enormous amount of memory, and you can see here the little blue spots, those indicate the data that we actually are using for this problem, and they're scattered across the entire machine, and it's going to execute this graph inference problem. So the algorithm is obviously computing.
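A minimal sketch of the locality problem just described, with made-up data structures rather than anything from the real security-analytics code: each iteration only needs its neighbors' current values, but which neighbors those are keeps changing, so a scale-out cluster would have to shuffle state over the network before every step, while a shared-memory system just loads whatever it needs.

    import numpy as np

    # Toy graph: belief[i] is a node's current estimate, neighbors[i] lists
    # the vertices it depends on.  On fabric-attached memory these arrays
    # would live in one shared pool visible to every compute node.
    rng = np.random.default_rng(0)
    n = 1_000
    belief = rng.random(n)
    neighbors = [rng.integers(0, n, size=4) for _ in range(n)]

    for step in range(50):
        new = np.empty_like(belief)
        for i in range(n):
            # Shared-memory model: just load the neighbors' values, wherever
            # they happen to live.  A scale-out version would have to ship
            # these values across the network before every iteration.
            new[i] = 0.5 * belief[i] + 0.5 * belief[neighbors[i]].mean()
        delta = np.abs(new - belief).max()
        belief = new
        if delta < 1e-6:    # the weights stop changing: converged
            break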
You can see it talking between the nodes to find data across the fabric; it's going and collecting data from various points, and it's doing a single step here. It's not terribly exciting; like I said, what do you do when you have a computer where the only thing it does is fetch from memory and store back to memory? It doesn't do a lot of fun stuff. So you can see here the problem is actually starting to reach out and use the fabric, taking advantage of the fact that it can fetch data from all the way around the system and not have to do any complicated communication. And so this is actually showing each computing node going out, touching memory from the other nodes, and fetching it for the analysis step of the problem. You can see how much of the memory bandwidth of the system we're using right now: 0.61%. So there's a lot of headroom available here, and this is an hour's worth of data, so we think we can comfortably do a week's worth of data in the system without too much trouble, which is pretty cool. And so that's kind of one of the problems we're looking at right now. So the bar charts here are showing you the convergence of the problem, showing you the actual solution, and it's getting better and better and better. We're converging, you know, 32%. That is to say, as you do each iteration, the graph gets weighted and rebalanced, and you do another iteration, the weights update, and it slowly converges on a solution. And you can see it's nearly converged by now, and it's done a lot of computation, and that's the excitement of demos on this system. This particular graphic right here, we've since updated the underlying system; this summer we've had a couple of interns making that software more reliable, and I'll show you what they've been doing. Oh, it's almost converged, it's so exciting! We got to show this demo over and over and over again in Las Vegas a couple of months ago. Yeah, that's the excitement of demos on the machine. So you can see we're getting a significant speed-up, though. We ran the problem on a scale-out architecture, and the scale-out architecture is literally 128 times slower, so we really have a pretty impressive speed-up. It really is all about getting rid of that communication overhead and getting to the place where you can take your computation and apply it to the data, rather than moving your data to the computation. Okay, now I actually want to talk about the software that we built. Oh, I have another one. Do I have a Monte Carlo simulation? Here's a Monte Carlo financial simulation. So with a Monte Carlo simulation, obviously, one of the goals is to be able to do a bunch of random analysis on your data. Well, it turns out that if you pre-compute a bunch of the data in this financial model and use interpolation within your pre-computed set, you can generate results a lot faster. And in fact, if you pre-compute a lot, like 100 terabytes of data, it's about 10,000 times faster than computing it from scratch every time. Who would have thought? So the availability of an enormous amount of memory, just an enormous amount of memory without a huge amount of computation, can speed up some problems dramatically. You kind of look at the problem a different way. Instead of thinking of memory as a tiny thing, like 10 or 20 terabytes, you start thinking of the reasonable scale of memory as a couple hundred terabytes.
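The precompute-and-interpolate trick is easy to show in miniature. The real financial model and its 100-terabyte table are obviously nothing like this toy, where slow_model, the grid size, and the tolerances are all made up; it only illustrates trading an expensive function for a big table plus interpolation.

    import numpy as np

    def slow_model(x):
        # Stand-in for an expensive financial valuation.
        return np.sin(x) * np.exp(-x / 10.0)

    # Precompute once into (what would be) a huge pool of memory...
    grid = np.linspace(0.0, 100.0, 1_000_000)
    table = slow_model(grid)

    def fast_model(x):
        # ...then every later query is just interpolation into the table,
        # which is why "cache beats compute" once the memory is big enough.
        return np.interp(x, grid, table)

    queries = np.random.default_rng(1).uniform(0.0, 100.0, 10_000)
    assert np.allclose(fast_model(queries), slow_model(queries), atol=1e-3)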
You can start really computing some stuff in advance, and that changes a problem from having to operate in batch mode to being able to operate in real time. So we did some financial risk modeling, and we were able to take it from something which used to take several hours down to just a few seconds. It was literally 10,000 times faster, just by having a machine with an enormous amount of memory. If you had to do that in a cluster, the problem is you'd have to distribute the problem to the entire cluster and somehow figure out which parts of that data were relevant for that particular request, and that would take a bunch of time to transmit that data; in that cluster model it's faster to compute than to cache, but in a memory-driven model it's faster to cache than to compute. I want to get beyond this stuff and talk about what I actually came here to talk about, which is Linux for memory-driven computing. Debian is all we use for memory-driven computing, because what else would one use? It's the universal operating system. Obviously it scales from my watch to the biggest computer on the planet. So all the software I'm talking about right now is off on GitHub, and we're trying to do all the development in the open. It's really hard to take a corporate structure and move it from development in a little closed silo to saying, actually, your commits are visible on GitHub all the time. So we're teaching people how to do that right now. The last stumbling block we have is that we have some Jenkins infrastructure, tied to our GitHub Enterprise instance inside the firewall, which automatically does our CI, our continuous integration and testing stuff. We haven't got that replicated externally, so the developers are like, I'm not going to push it there, it's not getting tested. And I really hate telling people that they have to move external and they don't get testing anymore. So we're fixing that. But that's literally the only stumbling block we have at this point: getting to a place where we have continuous integration and testing outside our firewall. Most of our stuff is being done externally. This is the system we built. We have hardware, we have Linux, we have a bunch of libraries, and I'm going to talk about some of those. So we came to DebConf three years ago in Portland — four years ago, three years ago, I think it was three years ago — to talk about HLinux and what we were doing with Linux at HPE. HLinux was something we were building for our Helion system. Helion was kind of an OpenStack deployment vehicle, and we were taking Debian and customizing it for that. It took a bunch of work to make Debian suitable for that, not because Debian wasn't ready, but because OpenStack has specific dependencies on a lot of different packages. So we would actually take Debian and take various versions of upstream packages, anything from really stale to really brand new, to construct a horizontal stack that could support OpenStack above it. And that's what HLinux is all about: how do we take a Debian system and make it very purpose-built for supporting a specific OpenStack deployment? And we did that. Helion OpenStack recently got sold by my employer off to Micro Focus and SUSE, and it's all very complicated right now. So HLinux no longer really has a role in our organization in terms of supporting the Helion system.
And so what we're doing right now is transitioning away from that HLinux base, which is how we started Linux for the machine, because we had a Debian system and we needed to support another architecture, so we built on top of HLinux. We're transitioning from that purpose-built horizontal Debian distribution to just running Debian and adding a small pile of packages on top. So we have a Debian unstable system, and then we have probably 15 or 20 packages that we've built, including some new kernel modules and new device drivers and that kind of stuff. So it really is just Debian running in most of our environments. We're not running this on the MFT yet, because we need those kernel bits and a bunch of newer stuff that isn't quite even in Debian unstable yet. But we're getting closer and closer. So we're making this transition from our very purpose-built horizontal HLinux distribution to crunching it down and saying, okay, we're just running Debian and we're just going to add a few packages. I want to talk about the packages that we're adding. So we have two kinds of systems that we need to run Debian on. We have our external management system, which runs management services and our file system metadata management; this is where all the interesting packaging stuff, and a lot of the stuff that I'm going to be talking about in a few minutes, lives. And then each of the nodes, each of those 40 things with the green lights, runs another Debian system, running entirely in memory. So this is kind of like an NFS-root environment, but kind of not, because we don't have that: we're actually just building an initramfs, and that's what you run. The nodes are entirely stateless, from the perspective of the individual node. Of course, they have access to this fabric-attached memory, this enormous pool of memory which is persistent and persists beyond the life of the node. So it's kind of a weird little world that it lives in. We have these very ephemeral, entire operating system instances. In a lot of ways it kind of looks like a container sort of thing: we spring this node into existence, it runs an operating system for a while, and then it gets shut down. And the data that it has computed is stored in this persistent memory. And so we needed to build a way of getting these things up and running. And I'm hopeful that this piece of technology will inspire somebody to think, wait a minute, that's a neat hack, I wonder if I could use that for what I need to do over here. So I wanted to talk about that. We have three hardware targets that we're targeting for this system. We obviously have the memory fabric testbed, that giant piece of hardware that I showed you. We have an emulated environment that I can run on my laptop; I can actually build a machine on my laptop. Obviously it doesn't have 160 terabytes of memory, but it does have all the same architectural characteristics. And I did a whole bunch of testing for a program that we'll be showcasing later this fall with a German research institute; I did a whole bunch of prototyping of that on my laptop. It's like, okay, that's kind of cool, I can do memory-driven computing development on the airplane. That's always nice. And another piece of hardware that we have is this MC990X, which is this scale-up-ish computer from the SGI division.
And it is normally delivered as a pure scale-up system, up to 128 processors, up to a bazillion bytes of memory; I really don't know how much memory it can hold. But it turns out that you can actually take this thing and partition it, break it up into individual little virtual computers, each with a collection of processors, a networking interface, and a bit of its own little local memory. And then they can communicate over the fabric that is in the hardware with the other nodes in the same box. So we can build something that looks very much like memory-driven computing with hardware that we're shipping today. And that means we can actually do a bunch of memory-driven computing research with hardware that we have available today. Newer hardware that's coming out will scale that up a little bit bigger, and it has some other interesting characteristics. But it lets us do the software and systems development research we need with hardware that we're shipping today, as well as with this prototype hardware that we have. Obviously, with the prototype hardware, the big problem is availability: there are, like, you know, four of them. Whereas with this MC990X, you can make as many as you want, because it's a shipping product. So those are the three targets we have. One of the other things we lost when all of our software resources went off to Micro Focus was our system integration and development team. We no longer have anybody working with us that maintains our build system. We don't have anybody working for us that maintains our build hardware, in particular. So what we're trying to do is build a kind of virtual build environment using containers. We actually had some interns over the summer build us a container that you drop on a random machine running a random operating system; it supports SUSE, Fedora, Red Hat, Ubuntu. You just drop it onto a random machine, it springs up a Debian container, you say go, and it goes and fetches all of the software that we need for our system, downloads it from git, compiles it, and emits debs. So it's kind of a build-in-a-box: you just dump it on a random machine and push go. That was done by a couple of interns this summer. That's been really useful, because it means that we can just go out and find a random piece of hardware somewhere that happens to be working today and get our software built. We don't have to depend upon having the magic build server that's off in the corner and is, you know, gold-plated and never touched by anybody. That's been very useful to us. And then we also have another little container that we run that can actually stand up and deliver Debian packages. That's a container that's got aptly packaged in it. You just hand it a pile of debs and say go, and it constructs a repo and serves Debian bits out of that, which is kind of cool. Did I see a question? No, sorry. Okay. And then we have external management services that run on this external server, and we actually have another container that runs all of those as well, oddly. We have the librarian, which is our metadata service that I'll talk about, our file system metadata service. We have this manifesting thing, and that's the service we built that actually constructs these initramfs images, takes packages, builds them, and customizes them for each node.
And that's kind of the software that I want to spend a bunch of time talking about in a few minutes. And then we have that pretty dashboard which shows all this stuff, and then we have the usual selection of random network services. So this is where all the bits live; they're out on GitHub. All the packages are there. I think there are probably two or three packages that aren't currently being developed externally; we're mirroring them externally, but we're not actually actively developing them out there. And we're trying, as I said, to get to the point where we can do that, because obviously then other people can contribute, and until we do that it's more difficult for other people to contribute. That's our current plan. So this is all the stuff that we're shipping, and I'm going to go through a bunch of these and tell you what they are. The ones with stars here are the ones where that really is where the project lives; there's no clone elsewhere. And we're working on making that happen for all of them. So this is the little container I talked about, the build container. It's just a Docker container for building all of our packages. It's just got a script that runs through, git-clones everything, and then debuilds it. All the packages that get emitted are unsigned, of course, because there's no signing authority here. So it's useful for testing. We need to figure out how to make it useful for actual deployment, how we can actually get the packages signed, and how to make it part of the continuous integration system as well. So this is what we're going to replace our currently creaky internal Jenkins instance with, which runs on a box that has no sysadmin for it right now; God, I hope it doesn't crash. So this builds all the packages that are necessary to stand up one of our little instances. It doesn't build a bunch of random stuff that we don't need for testing. It was done by an intern this summer; Austin worked on this. It is really cool to work at a corporation that has a very strong history and policy of bringing in high school and college interns. I think we had eight or ten of them. We have a group of eight engineers, and we had like eight or ten interns working with us. It was really cool: all of a sudden our group doubled in size in the summer with a bunch of high school and college students, which was great to see. Here's the repo container that David did. It's either called debserve, which is the name he started with, and then some marketing guy, I guess, got at him and said we have to put our corporate branding on that, so it's called the l4fame repo container. This is not specific to l4fame or our memory-driven computing stuff at all. It's literally just a container that you throw a pile of debs at, and it stands up a Debian package repository for them. It automatically generates indices and automatically starts up an Apache instance that will serve them out. So it's kind of convenient if you just want to build a bunch of debs and you don't want to have to go through the hassle. I use mini-something, I don't even remember what it's called anymore, to do that on my local laptop, but this looks like it's probably even easier than that, because it's all automated and all you have to do is build the debs and hand them to it. Very convenient. When you update the debs, it tracks changes: oh, look at that, a .deb changed, let me go rebuild the index.
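For anyone who hasn't touched aptly, the core of what that repo container automates looks roughly like this; the repository name, distribution, and debs/ directory here are made up, and the real container wraps this in change-watching and an Apache front end as described above.

    import subprocess

    def run(*cmd):
        subprocess.run(cmd, check=True)

    # Create a local repo, drop freshly built .debs into it, and publish it
    # as a plain directory tree (under aptly's public/ root) that any web
    # server can export.
    run("aptly", "repo", "create", "-distribution=unstable", "l4fame")
    run("aptly", "repo", "add", "l4fame", "debs/")
    run("aptly", "publish", "repo", "-skip-signing", "l4fame")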
Mini-dinstall, I think it's called; that's the thing I was using. The repo container uses aptly, of course, to generate all this data, and it's all nicely automated and packaged. And the other container, which a couple of interns, Mai and Madison, worked on: all of our management services are now in containers. They pull down the debs, automatically build up a little container, and you just say go, and now you have all the management services. This means that we can stand up test instances or infrastructure really quickly, without having anything customized on your own box. So if you want to do memory-driven computing development, you can get this ToRMS-in-a-box, stick that container on a machine, instantiate a couple of VMs that run the nodes, and get l4fame, our fabric-attached memory emulation, running on your laptop in a matter of minutes without touching your base operating system at all. So for people who want to just come and toy with it a little bit and see what it's like, we're working on making that very convenient. Another thing, worked on by Lillian and Annie this summer, is our TM dashboard. That dashboard is that pretty UI you saw before. Trying to generate actionable intelligence about the state of a memory-driven fabric is really hard: trying to capture what's going on and figure out where the bandwidth problems are, where the performance issues are, where your application bottlenecks are. So we're trying to build some infrastructure for that, and this is a little web service — it's all web-based — that goes and touches all of the nodes and all of the infrastructure and asks, what's going on with you? We have a bunch of monitoring and logging hooks there, and it captures all that data and presents it in a pretty little web UI that looks like this. You saw that before. So we're trying to generate data that shows the user what's going on in real time. And that was done by a couple of interns this summer, the new version of that; the old version from the demo was done by a team in Bristol, who are now doing something else. So it's nice to have that kind of very customized system brought in-house. The library monitoring protocol is what this management tool uses: it goes out and touches all the nodes, brings back the data, and shares that out to the dashboard. Then there's the emulation shell script. This is kind of the first stuff we released. It's literally just a shell script that takes the Debian packages that we deliver and constructs a synthetic set of memory-driven computing nodes by using a QEMU/KVM hack called the inter-VM shared memory (ivshmem) device. So you take a pile of memory on the host machine and you make it visible as a device in all of the VMs. Now all the VMs can touch this memory. Well, that looks a lot like memory-driven computing to me; it looks a lot like fabric-attached memory. So that's the environment that we use to do all of our development for all of these tools. The initial project we put together is some simple shell scripts that generate these nodes, and we're working on improving that to the point where you can automatically stand up a memory-driven computing test infrastructure on a single machine. So the librarian is kind of the heart of the way that we take this fabric-attached memory and present it to applications. Normally applications think about memory in terms of malloc, or maybe mmap, as a way of getting at memory.
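As an aside on that emulation trick: inside one of the guest VMs, the ivshmem region shows up as an ordinary PCI device, so a crude way to poke at it looks like the sketch below. This assumes a stock ivshmem device (PCI ID 1af4:1110, shared memory in BAR 2) and root access in the guest; it is a generic ivshmem illustration, not the actual FAME scripts or drivers.

    import glob, mmap, os

    # QEMU's ivshmem device (vendor 0x1af4, device 0x1110) exposes the shared
    # host memory as PCI BAR 2, which Linux publishes as a "resource2" file.
    def find_ivshmem_bar():
        for dev in glob.glob("/sys/bus/pci/devices/*"):
            with open(os.path.join(dev, "vendor")) as v, \
                 open(os.path.join(dev, "device")) as d:
                if v.read().strip() == "0x1af4" and d.read().strip() == "0x1110":
                    return os.path.join(dev, "resource2")
        raise FileNotFoundError("no ivshmem device in this guest")

    path = find_ivshmem_bar()
    fd = os.open(path, os.O_RDWR)
    size = os.fstat(fd).st_size
    shared = mmap.mmap(fd, size)

    # Plain loads and stores: whatever this VM writes here is immediately
    # visible to every other VM mapping the same host memory.
    shared[0:5] = b"hello"
    print(bytes(shared[0:5]))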
Well, mmap takes files, and when we started this project a couple of years ago the researchers came and said, what we really want is memory that persists across reboots, chunks of memory that are resizable and have names and have access rights. So we built this really complicated new system called — what did they call it? — the wholesale memory broker, and it had this API where you pass it a name, you pass this little mask and the access rights, and it would map that memory into your process. I looked at that and said, you know, that looks a lot like something we have in the POSIX world called a file. No, no, no, it's not a file, it's memory. And I'm like, yeah, it has storage and it has extents and it has a name and it has access rights; it looks like a file to me. So we created a file system, and the file system is entirely in memory, but it's in persistent memory, and it's distributed across all these nodes, and that's what the librarian manages. We group chunks of memory into groups of pages, and a group of pages is called a book. Our books are a fixed size, the smallest allocatable unit of memory in the MFT, which is only 8 gigabytes; and then you can collect a bunch of those, so that you have a substantial amount of memory, into something called a shelf, and a shelf is the same thing as a file. So when we talk about shelves and files, sometimes you'll see that a shelf is described as just a file; a shelf is a specific kind of file, a file in fabric-attached memory. The file system is visible across all the nodes, so now we have a distributed file system. Well, what's the easiest way to write a distributed file system? The easiest way is to have a single central server that serves out all the metadata about the file system. So that's how we built our first instance, and that's what the librarian is. The cool thing about the librarian is that it doesn't actually care about the data, because the data itself is all just memory-mapped; it only has to be able to do allocations. Sure, you can have book 27; I can't see book 27, but you can play with it. So the librarian actually runs outside of the fabric-attached memory, on a separate machine; it doesn't even have to run within the machine itself. All it does is serve the metadata: it serves allocations and access rights, and says oh sure, you can have access to that. There's hardware support in the MFT to actually prevent the nodes from accessing memory they're not supposed to go into. And that means the librarian, running externally, is literally able to control access to the fabric-attached memory. The way that we built this, it's written in Python — Python 3, of course — and the way that it hooks into the node operating system is by using FUSE. Obviously the read and write paths don't go through the old FUSE paths, they go directly to memory, and we also added mmap support, which FUSE implementations don't typically bother to do. And so it's literally a fork of the FUSE code: we forked the kernel bits, we forked the library bits, we forked the Python library bits, and we created our own parallel universe of FUSE that does this librarian thing, and that's what the librarian file system looks like. It uses the VFS layer, it uses the FUSE bits to talk to the daemon, and the awesome part is that the metadata is stored in a little SQLite database, to make sure it's persistent and transactional and that kind of stuff.
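Putting those pieces together, using a shelf from an application really is just ordinary POSIX file handling plus mmap. A minimal sketch, assuming the librarian file system is mounted somewhere like /lfs; the mount point, shelf name, and sizes here are placeholders, not part of the real API.

    import mmap, os

    # A "shelf" is just a file in the librarian file system, backed by
    # fixed-size "books" of fabric-attached memory instead of disk blocks.
    shelf = os.open("/lfs/results", os.O_CREAT | os.O_RDWR, 0o600)
    os.ftruncate(shelf, 1 << 30)      # ask the librarian for ~1 GB of books

    data = mmap.mmap(shelf, 1 << 30)

    # From here on these are plain loads and stores; on the MFT they go
    # straight over the fabric to wherever those books physically live,
    # and they persist after this node (or its whole OS image) goes away.
    data[0:16] = b"persisted result"
    data.flush()
    os.close(shelf)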
And I am running out of time, and I apologize: I spent way too much time showing you how cool the computing is and not enough on what we're actually doing here. We have a little atomics library, we have a Hello World, and the thing I wanted to spend a bit of time talking about — which I'll spend the rest of the time on — is this manifesting thing. So what we do for manifesting is generate a kernel and ramdisk for the nodes, and that's the only data a node gets: the node does not have a real root file system anywhere, it runs right out of the ramdisk. The manifesting service is a little RESTful service that runs somewhere on the network; you talk to it to generate these images. It also talks to BOOTP and TFTP and all the usual boot services, and it talks to a DNS service to get the names bound and a DHCP service to get the IP addresses allocated, and then we have a little CLI application that talks to this service. It stores the kernels and ramdisks for PXE to use. So what we do is create a golden image that contains a pretty usual Debian install, a pretty usual Debian initramfs, and then we modify the image for each node: we unpack the golden image, go in and play with it to set the hostname, to set all of its IP addresses, and to give it all of the TLS keys it's going to need to be able to operate in the environment, and then we just hand that ramdisk out to the node, and so the node is able to operate in the environment with whatever customization it needs. Oh, you can also add new packages to that: it's like, oh, this node needs to have an Apache server, what the heck. And so you can actually put into the manifest — which is why we call it manifesting — the packages that you need and the customizations that you require for that system. Let's see: obviously we don't have local storage, so the nodes run entirely out of RAM. So on the ToRMS we have the DHCP server and the TFTP server, and it just serves the stuff out to the nodes over PXE; it's a pretty simple technique. Obviously a usual system at that point would get that initramfs and use it only to go find its real root file system: it's going to NFS-mount it, or it's going to go find some device, or do whatever it wants to do. But ours is really, actually, truly diskless; it's just running out of RAM. The operating system can just crash as well; its state is persisted in whatever fabric-attached-memory operations it has done. So it runs entirely out of RAM, which is pretty cool. So, manifesting overview: it uses vmdebootstrap — we're just standing on the shoulders of giants, as usual. It's got this little RESTful service, you can send it commands: oh, I need to build a manifest for this, and I want that manifest to be run on these 27 nodes. So it really lets you quickly configure the entire machine to run the software that you need, and then you can quickly transition the machine from one state to another, with each of the nodes rebooting and getting the system reconfigured very rapidly. So the goal is to be able to transition from one project to the next by just rebooting all the nodes and having them come up with all the new software that we need. And that's about all I have time for today. Thank you all very much for coming out this morning, and I hope we have a great week this week. I know I'm certainly looking forward to playing with everyone here.
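To make that manifesting step concrete: the per-node customization described above amounts to unpacking the golden initramfs, editing a handful of files, and repacking it for PXE. Here is a rough sketch driving the usual shell tools from Python; the paths, function names, and the exact set of customizations are illustrative, not the real manifesting code.

    import os, subprocess, tempfile

    def customize_initramfs(golden, node_name, out_dir):
        """Unpack a gzipped cpio golden image, set the hostname, repack."""
        work = tempfile.mkdtemp(prefix=node_name + ".")

        # Unpack: gzip -dc golden.cpio.gz | cpio -idm
        # (run as root if the image contains device nodes)
        with open(golden, "rb") as src:
            gz = subprocess.Popen(["gzip", "-dc"], stdin=src,
                                  stdout=subprocess.PIPE)
            subprocess.run(["cpio", "-idm"], stdin=gz.stdout,
                           cwd=work, check=True)
            gz.wait()

        # Per-node tweaks: hostname here; the real service also drops in
        # network config, keys, and any extra packages the manifest asks for.
        with open(os.path.join(work, "etc/hostname"), "w") as f:
            f.write(node_name + "\n")

        # Repack: find . | cpio -o -H newc | gzip > node.cpio.gz
        out = os.path.join(out_dir, node_name + ".cpio.gz")
        with open(out, "wb") as dst:
            find = subprocess.Popen(["find", "."], stdout=subprocess.PIPE,
                                    cwd=work)
            cpio = subprocess.Popen(["cpio", "-o", "-H", "newc"],
                                    stdin=find.stdout,
                                    stdout=subprocess.PIPE, cwd=work)
            subprocess.run(["gzip"], stdin=cpio.stdout, stdout=dst, check=True)
            find.wait()
            cpio.wait()
        return out   # hand this path to the TFTP/PXE server for that node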
Thank you. If you have questions, please line up at the microphone.

Hello, I have several questions; I may not have time for all of them. I'm probably digging into some of the stuff you were skipping over. I was curious about forward-looking aspects of the hardware. Because of this innovative new design — often in engineering we find we've just shifted our choke points from one place to another — in this fabric-attached memory, do you find that the memory controller, or controllers, threaten to be your design choke point in the future?

It's very likely, and the reason, of course, is that if you have a lot of contention for data in the same location, then you can have a lot of people accessing it. It's a true fabric with enough bandwidth to handle anybody-to-anybody at full speed, but if everybody is focused on a single piece of memory, then obviously you're going to be limited by the bandwidth of that single piece of memory. The answer is to spread the data across the fabric so you don't have a single choke point like that, and then you get fairly flat access; but obviously now you have to design your data so that it's spread across the fabric to avoid those kinds of choke points.

And then you run into replication challenges. Yep. I'll yield the floor; I'm sure I can't be the only one to have thought about this.

For applications, it must be incredibly quick for doing password cracking and stuff, you would think; how big a table can you get?

The other problem that I've looked at is chess endings, a very similar problem: any problem where you have an enormous amount of data and you don't know which of that data you're going to need today. Thank you. Obviously I should go into password cracking; what could go wrong, what could go wrong?

Okay, going forward, following this question: all of computer science, all data structures and so on, assume that memory comes in hierarchies, with caches and so on, so we have slow memory and fast memory. All the databases, for example, all the indices, are designed to deal with slow storage; RAM is fast when everything fits in RAM. As far as I understood, there is no penalty to access this persistent memory. Do you think we are on the verge of a revolution, where databases will need to behave differently?

Well, we've gotten rid of a bunch of the storage hierarchy: we've gotten rid of your disks and your networks, but we still have a lot of caches. Of course, the MFT is not cache-coherent between nodes, so you have explicit points where you're introducing latency delays in the system to synchronize data across the fabric. So there are new challenges, and plenty of employment for our computer scientists going forward; so fear not.

Okay, I think that's the last question we have time for. I'm really sorry; I'll be here all week if you have more questions. Really great to see all of you, and thanks for coming again.