Okay, hi all, welcome. It's about the hour, the scheduled time, so we have to start. My name is Jeffrey, and I'm here for the OpenEBS project today. I'm hoping to explain a little bit about what we are trying to do, and mostly also to get your opinion on things.

So, the genesis of the OpenEBS project was a fundamental question: what if the storage for containers — the storage system itself — actually ran as containers? Basically, we decompose the complete storage system and run it in containers, so that storage becomes a first-class citizen in the cluster. That gives you flexibility per workload: for this particular workload you can say, oh, we're just doing that — just an NVMe device, nothing fancier than that — and we're done. This is how we came up with the term Container Attached Storage: the storage is attached directly to the container workload.

Another fundamental thing that we did — and still do — is this: storage developers, and I come from a storage background myself, typically reason from the disk up. We basically inverted that: how does the end user — the persona you see there — actually work, and how can we make his life easier by giving him tools and features out of the storage, instead of friction?

Because originally, storage consolidation was done for efficiency. Back in the day you had lots of small storage silos sitting around, and storage devices were very small — in the megabyte range — so they sat there underutilized in terms of capacity and performance. So all that excess storage was consolidated into one storage system. That obviously also made data management easier, in particular backups — tape backups, back in the day.

It originated in the local area network, and Novell was a pioneer in this area. You had to have a special TSR, a Terminate and Stay Resident tool — a daemon, basically, although it wasn't called a daemon because this was DOS. You logged into the system and then the volume became available to you. This was also the time when TCP/IP was not established yet; IPX/SPX was still very much in use back then. And even in Windows 95 and Windows NT you had to add additional drivers — not shipped with Windows, by the way — that needed to be installed in order to attach to these storage systems.

Fast forward 20 years, and we had a dedicated network: a real storage area network, or network-attached storage systems. These were very specialized hardware appliances, with very specific hardware that was not otherwise available to the average computer shop. It was also a separate network — a segregated, different network fabric. It was Fibre Channel; there were no IP packets moving back then, it was all fibre. Very specialized, with special switches and special interconnects between data centers, so that you could do failover across data centers as well.

So how does this look in a picture? A picture says more than a thousand words, I suppose. This is just a random storage picture that I googled. If you look at the whole picture, it starts out at the bottom with the shared storage.
And shared storage here is not shared in the sense of, hey, I have this system and I'm sharing it out over NFS or whatever — this was shared at the Fibre Channel level. The actual disks were visible on both controllers: when you did an lsblk or something, both machines would see that disk. So "shared" applies to the actual disks. They were connected through the Fibre Channel switches all the way up to the controllers.

This is basically a front-end/back-end system. The front end serves out the protocols that we all love and hate: iSCSI, NFS, SMB, or what have you. At the lower end we have the storage-side protocols. These front-end systems had, for example, NVRAM to accelerate write caching — that alludes a little bit to the special hardware. You could not get, say, PCIe flash cards for yourself; the NVRAM devices looked a bit like that, but it was not the same as an NVMe device today. And the CPUs were relatively small. I think around 2007 they were still using Pentium-class CPUs in this type of system, because it wasn't really CPU-bound at all.

Another fundamental property of these systems was HA failover: storage can never go down, because if storage goes down, we can all go home. Failover typically needed to complete within 180 seconds. That was a hard limit, so to speak, configured on the actual clients — you had to install special drivers and set timeouts in the Windows registry or the equivalent on whatever operating system you were using. So they were fine-tuned for every operating system you were running.

Unified, so block and file serving. Later also VTL, virtual tape libraries, for backups. And when all your data was on that particular storage system, you could do some interesting stuff with it in terms of storage features, like snapshots, clones, dedupe, and replication.

The significant downside is that these systems were huge and hugely expensive — millions of dollars or euros. And not only that, they really imposed the way you operate storage onto the upper layers. The specific features the storage system could or could not do bubbled all the way up to the end user. That was, I still think, a significant downside: it had a profound impact on what you could do with the software.

So, software-defined storage. Software is eating the world, and software-defined storage happened. Commodity hardware evolved, became more powerful, quicker, and what have you. Systems like Lustre and Sheepdog and all these open-source, software-defined systems came along, all built on commodity hardware. Fibre Channel luckily died. The industry tried Fibre Channel over Ethernet, and luckily that was born dead, because it was really just a very expensive way to keep people on the Fibre Channel Kool-Aid so they could keep charging a lot of money. So, lucky for software-defined systems.

Devices became faster, obviously. Solid state devices arrived, and NVRAM was not needed anymore because now we had devices like the ZeusRAM: it looked like an SSD connected to the SAS bus, but it was actually RAM inside. So you could really optimize for write-heavy workloads in certain aspects. But the architecture, however, did not change.
The separation between compute and storage was still there. And this is where friction happens when we look further down the road — and that is when containers re-happened. I emphasize the "re", because containers are not new at all: they originated, I think, in the BSD space with jails, later Solaris zones, and on Linux with LXC containers.

One fundamental thing is that the way we develop, build, deploy, and run applications has changed a lot. Like I said, we inverted it: we look from the app all the way down to the disk. So what can we do there, and what are the observations we're seeing? Modern apps have resilience built in natively: they have load balancers, they have database sharding, they use sophisticated consensus algorithms like Paxos and Raft. So you don't really need a whole lot of storage complexity to achieve scalability.

As for the way we deploy them: Kubernetes is based on Google's Borg paper. If you haven't read it, it's a very interesting paper that explains where Kubernetes comes from. But if you look at how Kubernetes operates, with workloads that can come and go at any given time, persistent or not, you kind of wonder: okay, so how do I make that work with my storage system, let alone with different storage systems in different regions and different locations? It's a big pain point. With Kubernetes, a workload is basically a tarball plus declared compute, storage, and networking — everything confined in that package, more or less.

People also got smarter. Developers accounted for failure — you should always write your software such that it can handle failures — but people also started to write software that is DC-aware and even region-aware. This was very much fueled by Amazon, by AWS with its regions and so on. People now develop their software so that if I can't get it here, I'll try to get it somewhere else — we'll see more about that later. So high availability has been built into the applications themselves. Not so much: here's my app, I don't know how it works, just put it on your storage system, and you'd better hope that your storage system never fails.

So the friction between teams is still there, probably even more so, because the workloads keep coming. Typically you had one VM with one mount point, and that's okay. But now, if you have a hundred thousand containers spinning up, and two weeks later they're all gone, and two weeks later they come up again, you have all these mount points — and those systems have a limit on the number of mount points they can serve. Also, your storage admin gets a little bit nervous; he feels he has no control. So typically what happens is that the legacy storage systems in particular try to duct-tape this with plugins, as they call them: instead of doing an API call to something sensible, they have this interceptor that intercepts the API call for provisioning and forwards it to the legacy storage system, or whatever. People generally don't really like that, so the friction is still there.

And the other thing is that although you are able to mount something from your storage system, that did not unlock the storage features that you might want to use — for example, knowing that a particular dataset won't benefit from compression.
So, don't bother to compress it — in my YAML I can just say no to that, because the developer knows that.

So how does this work? This is a small animation — I didn't make it myself, but I think it's a pretty good one. The idea is that you have your operator who expresses intent in YAML. Is it human readable? Yes — whether it always makes sense is another matter, but it's human readable. That gets executed by the environment, and below you have your containers. It's magical, right? But these workloads are stateless, so it's easy to understand how this works for stateless things.

If we take the stateful aspect into account, we have our storage systems, and then we try to somehow, some way, volume-plugin our way into making this work with storage again. But the problem is that this is a very tight coupling. You lose the flexibility of moving things around, spinning them up in one place, taking them back down — all these nice properties that we're used to. The other thing is that in this blender you have all these workloads, and you don't really know how your workload behaves on the storage system, because it all gets mashed together into murky water. And with all the observability and tracing tools that are available these days, you really don't like that.

So the alternative approach is kind of the same idea, but now, instead of going through that I/O blender, we actually spin up a dedicated storage container next to your workload. It's very simple — well, a little more complex to implement — but the genesis of the idea, like I mentioned, is that each container workload gets its own small storage controller: microservices for storage, basically.

So, practical things: how do you do that? You have this kubectl utility that you run, and you apply your manifest to the OpenEBS operator — this is the database that we want to spin up, for example — so we apply that to the Kubernetes cluster. Then Kubernetes does its work, and we get those persistent volume claims, as they call them. And that's it. What we really want to achieve is that storage itself just fades away as a concern. It becomes, you know, the DevOps way, agile. Data agility is really important for these people.

So that's a little bit about the control plane; I went over it really quickly. There's a lot more to it in the way that we integrate with Kubernetes and all that — that's a talk of its own. But I also want to talk now about what OpenEBS is not, and how we reached these conclusions, because we did our research looking at these types of applications and how they are developed.

OpenEBS is not a distributed file system. There are various reasons why it's not. One of them is that distributed systems are generally relatively hard to manage in production. I have worked on two proprietary distributed systems and, you know, you have to hand-hold them a little bit. Depending on how big the installation is, you probably need at least a small team to keep it operational on a day-to-day basis. And to really unleash the full potential of a distributed file system, you need special drivers: you cannot just take your petabyte-scale, multi-gigabyte-per-second-throughput system, connect with CIFS, and expect all that data to come out of the pipe. You need special clients to unleash the potential of such a system.
Containerized workloads are segregated. What I mean by that is that instead of having this big data-gravity blob sitting in your data center, you have many small workloads with relatively small data sets. The sum is pretty big, but individually they are relatively small. We're talking about, say, a Mongo database with, I don't know, 300 gigs — that's relatively small. If you keep it that small, it becomes very simple to manage, and it becomes easy and cheap to replicate. Replicating half a petabyte is a problem; replicating 300 gigs, and being able to control that per workload, really becomes feasible.

Another aspect is that the hardware demands a change in the way we do things — we finally can change the actual architecture, because until now we have kept this same model of segregating storage from compute. A single NVMe device can do up to 450,000, half a million IOPS. That's just one device. So do I really need a distributed system to get that performance? Likewise, do I need to scale to petabytes? Like I mentioned, probably not, because the individual workloads are relatively small.

And one final aspect, and this is really fundamental: when you're running containerized applications, you are already operating a distributed system — all these microservices working together to achieve one goal. Given the complexity of distributed systems, adding another distributed system to that, making it a mesh of two different distributed systems — I don't think that's the right approach. I'm not saying it's never the right approach, but in general you might want to try something else.

So, the comparison — and, you know, kudos for the graphics: the one on the right side is actually mine, you can probably tell. It kind of looks the same, but per application. That's the fundamental idea. So I'll go over the components a little bit.

First of all, there's the controller — our front end, if you like. This is what you connect to. We are starting out with block. We could do file-based if we wanted to, but block is a lot simpler, a lot easier to deploy, because you don't have to deal with permissions and masks and things like that; it's more of a one-to-one mapping to the container. iSER — that's basically iSCSI over RDMA/InfiniBand — and, obviously, NVMe over Fabrics. They're actually working on NVMe over Fabrics for TCP; I'm not sure what the timeline is, but if that actually gets released, that would be really, really great.

This controller is pluggable: we can just swap out those services, because they are containers, after all. And that allows us to have a pluggable back end, so we can change the logic that actually persists the data onto disk. If you have a database workload, for example, you might not want a copy-on-write file system underneath, because your database inherently already does write-ahead logging and things like that — so it's kind of doubling up. So depending on your workload, you may want to define your YAML a little differently.

In that declarative state you put the number of replicas, the snapshot schedules, the retention of your snapshots, whether you want to replicate, and — also important — the consistency levels. By that I mean: if you have, say, a three-way replica, you can say, once I have a quorum, I acknowledge the write back to the client — to speed up writes, obviously.
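To make that consistency-level idea a bit more concrete, here is a minimal sketch of quorum-acknowledged writes. It is purely illustrative, not OpenEBS code: the replica count and the replica_write() helper are invented for the example, and a real implementation would issue the replica writes in parallel and finish the stragglers in the background.

```c
/* Toy sketch of quorum-acknowledged replication: acknowledge the client as
 * soon as a majority of replicas has confirmed the write. Illustrative only;
 * not OpenEBS code. */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define NUM_REPLICAS 3
#define QUORUM       (NUM_REPLICAS / 2 + 1)    /* 2 out of 3 */

/* Stand-in for sending the block to one replica; pretend it always succeeds. */
static bool replica_write(int replica, const void *buf, size_t len)
{
    (void)replica; (void)buf; (void)len;
    return true;
}

/* Returns true once enough replicas acknowledged the write. */
static bool quorum_write(const void *buf, size_t len)
{
    int acks = 0;
    for (int r = 0; r < NUM_REPLICAS; r++) {
        if (replica_write(r, buf, len) && ++acks >= QUORUM)
            return true;               /* quorum reached: ack the client now */
    }
    return false;                      /* quorum not reached: fail the write */
}

int main(void)
{
    char block[4096] = {0};
    printf("write %s\n", quorum_write(block, sizeof block) ? "acked" : "failed");
    return 0;
}
```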
And obviously this also needs to be DC-aware. Region-aware is a little more difficult, because regions typically have a lot of distance between them, but DC-aware is certainly a possibility.

So now we get a little more into the meat of the storage system itself. Like I said, the replicas are pluggable, and we have several. Today I'd like to talk about one that we are working on — or at least trying to figure out whether it would be a good idea — and that is reusing the DMU layer of ZFS. I don't know if people are familiar with ZFS; if you're not, you should definitely read up on it. It's a well-proven, battle-tested file system with a very interesting design — it's modeled a bit like a database. We basically took that component of ZFS and run it in user space.

One of the properties it has is that each write is assigned to a transaction, and these transactions are batched into transaction groups, which are flushed to stable storage atomically: either all writes succeed, or none of them do. This also means there is no file system check. More often than not, when my laptop dies and I boot up Linux, it does an fsck or whatever. That's not necessary with this system. This is important for two reasons. First, if you replicate data, you want to make sure you replicate consistent data. Second, when you start up containers based off replicated data, you don't want the system to go off and check the file system for 20 minutes or whatever, and then say, oh, I can't repair it. Well, thank you, file system — that doesn't help.

Then there's the pooled storage model. This is a virtual-memory-like allocation scheme: you assign devices to a pool and the system allocates blocks from that pool. This was famously called a rampant layering violation back in 2006 by the Linux community, because unlike other systems it combines several layers. On Linux you need to partition drives, do LVM things, and then put a file system on top; that's all taken care of by this system. So it's a very easy way to consolidate storage devices and allocate from them.

Another thing is end-to-end data integrity — very, very important in a file system. It does this by checksumming all writes. It sounds very simple, but actually it's not, because it's also a copy-on-write file system. But we all know this, because ZFS has been around for, I don't know, 10, 15 years. The difference is that now it runs in a container, in user space.

And the upside of this, if you move your system from cloud to cloud: ZFS is not native in the Linux kernel. It has a different license — CDDL versus GPL — so it can't be shipped with the kernel. You can have it, but you have to install it yourself. Now, when you move between clouds and you want to scale up and scale down, you cannot afford to first compile a kernel module, cross your fingers, load it, and then try to import your data. So: no kernel dependencies, no DKMS, dynamic kernel module support, or whatever it's called. The other thing is that it does not taint the kernel — if you load a non-GPL module, your Linux kernel will say: hey, tainted. Maybe that's not a big issue, but if you have a support contract with, I don't know, let's say Red Hat, that might actually become a problem.

So with the capabilities that gives us, it allows us to do things like this. The logo on the next slide says it all, right?
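Before the mobility slide, a toy sketch of that end-to-end integrity idea: a checksum is computed when a block is written and re-verified on every read, so silent corruption is caught instead of being returned to the application. This is illustrative only — real ZFS keeps the checksum in the parent block pointer and uses Fletcher or SHA-256, not the stand-in hash used here.

```c
/* Toy illustration of end-to-end data integrity. Purely illustrative;
 * not the ZFS/DMU code. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct block {
    uint8_t  data[512];
    uint64_t checksum;   /* in real ZFS this lives in the parent block pointer */
};

static uint64_t checksum64(const uint8_t *p, size_t n)
{
    uint64_t h = 1469598103934665603ULL;           /* FNV-1a as a stand-in */
    for (size_t i = 0; i < n; i++) { h ^= p[i]; h *= 1099511628211ULL; }
    return h;
}

static void write_block(struct block *b, const uint8_t *buf)
{
    memcpy(b->data, buf, sizeof b->data);
    b->checksum = checksum64(b->data, sizeof b->data);
}

static int read_block(const struct block *b, uint8_t *buf)
{
    if (checksum64(b->data, sizeof b->data) != b->checksum)
        return -1;                                  /* corruption detected */
    memcpy(buf, b->data, sizeof b->data);
    return 0;
}

int main(void)
{
    struct block b;
    uint8_t in[512] = "important data", out[512];
    write_block(&b, in);
    b.data[0] ^= 0xff;                              /* simulate a bit flip on disk */
    printf("read: %s\n", read_block(&b, out) == 0 ? "ok" : "checksum mismatch");
    return 0;
}
```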
That's our mule — a cross between a horse and a donkey. I didn't know that either; mules can't make babies themselves. So: you have your workload in US Central and it's running CentOS — however you want to pronounce that. You move your data, or rather your compute — everything is in containers — over to your Ubuntu machine, and then you decide you want to move to Amazon. Normally this was really hard, because you would be tied to either a Google Compute volume or an Amazon volume. Now we have this additional abstraction running in a container that makes this possible. So, again, data mobility becomes a real feature.

But there is a problem — a big problem, actually. Perhaps it's the elephant in the room. Anybody want to guess who said "people who think that userspace filesystems are realistic for anything but toys are just misguided"? Anyone have an idea who said that? Yes, yes — he kind of knows a thing or two about computers, right? But like I said, that was 2011, so I purposely put the date there. Fast forward: it's 2018, seven years down the road, and the hardware, like I said, forces a change in the way we do things. Good for us. Microsoft could be good for you — NTFS? Yeah, okay, could be; I do not have the finances to acquire a license.

So, how do we achieve high performance numbers in user space? Because we have a lot of context switches: every time I take a mutex lock in user space, the kernel comes in — context switch — and we lose performance. The other thing is: how do we do direct memory access transfers from user space? We don't want to copy the data; we want to immediately tell the device, hey, there's the data, just grab it and move it, and we're happy.

The other aspect is that the kernel itself actually becomes a bottleneck. We are moving towards 100-gigabit networking, and if that is interrupt-driven, the CPU — or rather the kernel — can't really make forward progress; it gets interrupted all the time, and although Linux is very concurrent, there are still code paths that have to be sequential. So the kernel really can't keep up. The same thing goes for NVMe devices, and with 3D XPoint coming the problem only gets worse. The other observation is that clock frequency stays relatively flat while the core count goes up — if you buy a Mac, a laptop even, they have something like 18, 19 cores these days. It's just amazing. So I see a lot of idle cores sitting around.

And, you know, this is not just me claiming this. This is a graph from Intel: on one axis you see the number of SSDs, on the other the number of IOPS. Like I mentioned, a single device does around 450,000, half a million IOPS — and when you add more devices, the total number of IOPS does not increase. That's not the biggest bang for your buck, to put it mildly. On the right side — and this is a very contrived example, I have to admit — is a simple web server, because you can write web servers in three, four lines of code these days. One uses the kernel network stack, and the other uses user-level networking, so it completely bypasses the kernel to process the packets. You can't do anything else with that network card — it's dedicated — but it does show you that if you do things differently, you can get a lot more performance. So, how do we do that?
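As an aside, here is a minimal sketch of what "doing things differently" looks like at the lowest level: busy-polling a completion queue instead of waiting for interrupts, which is the idea behind the poll-mode drivers discussed next. Everything in it — the queue layout, the device_post() stand-in — is invented for illustration; the real thing is, for example, SPDK's NVMe poll-mode driver.

```c
/* Minimal illustration of a poll-mode completion loop: a core spins on a
 * completion queue in user space instead of sleeping until the device
 * raises an interrupt. Illustrative only. */
#include <stdint.h>
#include <stdio.h>

#define CQ_SIZE 8

struct completion {
    uint16_t io_id;
    volatile uint16_t phase;   /* flipped by the "device" when the entry is valid */
};

static struct completion cq[CQ_SIZE];

/* Stand-in for the device finishing an IO and posting a completion entry. */
static void device_post(uint32_t slot, uint16_t io_id)
{
    cq[slot].io_id = io_id;
    cq[slot].phase = 1;
}

int main(void)
{
    uint32_t head = 0;
    uint16_t expected_phase = 1;

    device_post(0, 42);                          /* pretend IO 42 just finished */

    for (long spins = 0; spins < 1000000; spins++) {
        struct completion *c = &cq[head];
        if (c->phase != expected_phase)
            continue;                            /* nothing new yet: keep spinning */
        printf("IO %u completed after %ld polls\n", (unsigned)c->io_id, spins);
        if (++head == CQ_SIZE) {                 /* wrap and flip the expected phase */
            head = 0;
            expected_phase ^= 1;
        }
        break;                                   /* demo: stop after one completion */
    }
    return 0;
}
```

The price, as the talk notes next, is that the polling core shows up as fully busy even when there is nothing to do.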
So, as I mentioned, we bypass the kernel — and it's a paradox, but in order to do that, we actually need the kernel. We instruct the kernel to detach from those devices: give me those addresses and I'll work with them myself. So there's a trust relationship there. And then we use a framework called SPDK, and with that we can create containers that do the actual IO work. I don't know if this is a clear picture, but we map the device registers into user space, and then SPDK does its IO directly against those addresses.

Poll-mode drivers — what that means is that we take a CPU and, instead of using interrupts, we constantly poll for work. So if you look at your system, you will have a core that is 100 percent busy. But that gives you the ability to scale to those millions of IOPS. You can imagine, though, that if I have a Kubernetes node running, I don't know, 100 containers, and all of them tried to burn 100 percent of a CPU polling, that's a problem. So what we did is we containerized all the IO devices, and we call this the IOC, the IO container: instead of submitting the IO to the kernel, you submit the IO to that container. It's a user-space devfs, basically.

But then we have this problem: okay, so how do I tell my MongoDB instance to do its IO against your IO container? MongoDB is not going to change just because we had this great idea, even if I say so myself. So, picture — how does this look? This is the IO container: we put in there the NICs, the SSDs, and CPU, and through the cgroup mechanism we can control a little bit which CPUs it gets — there are still some tunables left there — but the idea is that I allocate these resources for storage on my node. And like I said, we have plenty of cores these days in these high-core-count boxes, and then we have this bus, so to speak, that we communicate over. It actually really looks like a microservice, right?

So how do we communicate? The idea here is that we reuse proven technology from the virtualization world, called vhost and virtio-scsi, done entirely in user space. The IO container basically serves out block devices through this vhost protocol. Just like you would normally say "use this device", you basically specify "use this socket", because what we do is use shared memory: you connect to that shared memory segment, and that's how we exchange data. And because this is a standard vhost interface, we do not have to change any of the software that's already there — it's completely abstracted away.

So the replica containers connect using this virtio-scsi mechanism. They expose sockets, like I mentioned, but there is no standalone libvirtio-scsi available — the protocol implementations are embedded in the hypervisors; there is no library that you can just check out and use. So we are currently developing that. It looks good; we've moved some bits already, so that's really promising.

Another important aspect is memory management. Like I said, we allocate the memory that we hand to the IO container, and we use huge pages for this — a somewhat overlooked feature of the Linux kernel. These huge pages are pinned in memory, which means the kernel doesn't move them around or swap them out. And because they're pinned, we can actually DMA straight into the NVMe device from the virtio-scsi IO, and that's how we keep the performance up.
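To make the huge-page point a bit more concrete, here is a small sketch of allocating a 2 MiB huge-page-backed buffer — the kind of pinned memory that SPDK and vhost-user exchange between processes. The sizes and flags are just for illustration; for actual cross-container sharing you would map a file on hugetlbfs (or a memfd) with MAP_SHARED rather than the private anonymous mapping used here.

```c
/* Sketch: allocate an IO buffer backed by a 2 MiB huge page. Huge pages are
 * pinned (the kernel will not swap or relocate them), which is what makes it
 * safe to hand their addresses to a device for DMA. Illustrative only. */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define HUGE_2MB (2UL * 1024 * 1024)

int main(void)
{
    void *buf = mmap(NULL, HUGE_2MB, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (buf == MAP_FAILED) {
        /* Fails unless huge pages are reserved, e.g. via vm.nr_hugepages. */
        perror("mmap(MAP_HUGETLB)");
        return 1;
    }
    memset(buf, 0, HUGE_2MB);   /* touch the page so it is actually backed */
    printf("2 MiB huge-page buffer mapped at %p\n", buf);
    munmap(buf, HUGE_2MB);
    return 0;
}
```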
There's also some future work going on. I actually checked last year's schedule, and there was FD.io — or "fido", as they say. They basically take this user-space networking to the next level with something called VPP, and on a single box they can achieve something like a terabit per second. That's the bandwidth side of things: when we have these devices doing half a million IOPS each, we also need a lot of bandwidth, and that's what FD.io can deliver. It comes out of Cisco, and what we are interested in in particular is VPP VCL. VPP stands for Vector Packet Processing, and VCL is its communications library, or something like that. It gets a little bit complex, but the idea — and this is pretty cool — is that they emulate the native BSD sockets API: socket(), send(), receive(), that kind of stuff. So you can LD_PRELOAD it under your existing software and automatically get the benefits of the vector packet processing. It won't go as fast as if you rewrote everything from scratch — but, you know, you don't want to rewrite everything from scratch, obviously.

To make the host and user-space sides a little clearer: we have the host; we have the shared huge pages, which are containerized, which is what the second blue line is supposed to indicate; and then we build a virtqueue and expose that virtqueue through an IPC mechanism — it's just shared memory, following the regular virtio specification. Nothing magical in that sense, but it allows us to communicate as fast as possible.

Final slide and summary. OpenEBS wants to bring advanced storage features to individual containers — and by advanced storage features I mean things like copy-on-write and data integrity, because I truly believe you need to make sure that the data you read back is actually the data you wrote. Cloud native, using the same software development paradigms: instead of cloud-washing an existing system and duct-taping it on, we actually rebuilt the whole thing from scratch. Implemented fully in user space, so that we avoid the congestion in the kernel like I mentioned, and because of that it's also multi-cloud: move your data in and out of different cloud providers — and not just cloud providers, but also your own storage system or whatever you have in the office. Declarative provisioning and protection policies that remove the friction between the teams — no more "hey, I need a LUN" and things like that. And the other thing is that we are not a clustered storage instance; instead we are a cluster of storage instances, and that is a fundamental difference.

Okay, with that I'm done. Thank you, and if there are any questions, please feel free.

So the question is about the data in the containers: what happens when we reschedule the container because it crashed, or for whatever reason. I actually mentioned it briefly — the way we deploy the replicas, or rather the storage pods, we make sure that the application lives on the same node as the storage pod, and when the application dies we give the scheduler affinity hints to reschedule it there. If for whatever reason you can't do that, we reschedule the application to another node that has a replica, and if you can't do that either — I'm not sure we've implemented that already — then we have to copy the data over to the machine that is taking over. But the controller running in the cluster takes care of that.

Yes, yes, yes. Right, so the question is whether we can do multi-writer, read-write-many. The sheer nature of iSCSI and block-based protocols is that they are not meant to be shared.
In that picture that I showed, with the big box with the shared storage, both controllers can see the disks, but only one controller actually accesses them at a time, and they manage this with SCSI reservations and shoot-the-other-node-in-the-head-like paradigms to avoid conflicts. So in essence you could, if you want to work with SCSI reservations — but most people, ehh, not really. So ideally, what we're looking at for that would be a file back end that you can share a little bit better. We don't have a file-based endpoint right now, so to speak, only a block-based controller, but we could add that relatively easily. I don't know — does that answer your question? Yeah. But you could, for example, just run a file server in a container on top of it, and that's one way people are doing it.

So the question is: on top of CentOS servers, with some virtual machines using KVM and some Linux containers, would you be able to use this paradigm with OpenEBS? Well, it depends — of course, everything is doable. One of the things with SPDK is that it has different IO back ends, so I write the code once but I can select one during configuration: if you are writing to a regular rotational device, that would be the AIO back end, and then the kernel is still involved; then there is the user-space NVMe way, and there are other plugins, and we could even augment it. So that would be one way. The other way is that you would, I don't know, just point your vhost to an OpenEBS vhost target.

Yes — you would like to get rid of the kernel storage stack, and maybe even run ZFS in user space. Yes. Well, that's the trend, right? Yes. User space is the new kernel. So, thank you — my time is up.