Okay, hi guys. Thank you for joining me. My name is Itai Bogner. I'm the founder and CEO of Stratoscale. We'll talk a bit about hyperconvergence and OpenStack. I built a presentation which is more educational; okay, so I'm not trying to sell you a product today. If you want me to sell you something, come tomorrow: I have 20 minutes to sell you a product. So I hope that you'll find the session educational. Hyperconvergence is an emerging segment of the market, and I'm biased, of course: I believe that hyperconvergence is the next architecture in data centers. For those who are not familiar with hyperconvergence, I'll spend a couple of slides on what it means, and then I'll cover what needs to be done to OpenStack, the changes, etc., to make it suitable as a hyperconverged infrastructure solution.

So, a really brief history of data center architecture; it's really a couple of seconds. It started with single servers, obviously, so the storage was server-side storage. The workloads ran on the same server, and that worked great; it was reasonably good in terms of pricing. The problem usually was the reliability of the storage: not because the disks weren't okay, but because if the node went down, you lost all the storage. So the next phase of data center architectures was the split infrastructure. Again, this is a high-level brief history, so let's not argue about all the phases in between. Split infrastructure was all about having the storage on one end, EMC, what's called a SAN, on a different fabric, and the compute workloads running on servers. They used two different fabrics; they didn't share anything besides the fact that you can access storage from the workload. That is working really great today, but it's very expensive, both on the capex side, because storage systems are very expensive, and on the opex side: you have to have a storage admin, and you need to manage LUNs and all that stuff with storage.
And basically, this is the dominant architecture in data centers that we see today. The next generation, as I said, and I'm biased, is hyper-converged infrastructure. In this architecture, we're basically going back to server-side storage. The technological trend we are adding on top is the ability to build a cluster of servers connected with a 10 GigE, 10 gigabit Ethernet, interconnect, and to manage this cluster of servers as one holistic solution, serving the storage needs, the compute needs, and obviously all the networking that makes things happen.

A very brief history of hyper-convergence, because it's really an emerging segment and there are very few solutions out there today. Today's solutions are basically marrying two distinct subsystems. You usually take a distributed storage system; probably most of you are familiar with Ceph, and although Ceph wasn't exactly designed to be deployed on a hyper-converged architecture, it is a distributed storage system. And a hypervisor, which can be KVM, VMware, or any of those solutions that are available today. You install them on the same servers, and this is a hyper-converged solution. The problem is that those two subsystems are basically black boxes to one another. They are using the same fabric, the 10 GigE interconnect, so it's not a split infrastructure anymore, but they are managing their resources independently, meaning that if a virtual machine running on KVM, VMware, or whatever suddenly needs more CPU cycles, more memory, or more network bandwidth, it may interfere with the storage solution that is running on the same node. So this is what I call first-generation hyper-convergence. It's an emerging segment that is working very well, and there are very successful companies in this space, despite the limitations that I outlined. Now, what I want to cover in this session is: let's build a great hyper-converged OpenStack solution.
And as I said, I hope to cover most of the issues that we tackled at Stratoscale. I'll try not to insert a lot of shameless plugs about my company, but I'm not promising anything; from time to time I will mention my company. So let's start. In terms of the requirements, we are seeking to build a software-only solution. It's a single infrastructure that serves all the needs of a cloud solution: you can store everything on that cloud, and you can run anything on that cloud, VMs, containers, whatever. Obviously, you would like it to be performant, you would like it to be reliable, and you would like it to be efficient.

So there are many considerations when building a hyper-converged solution. Again, I won't be able to cover everything in this session, but there are a couple of points you can see on the slide. The storage dictates the failure domain: if you are looking at a couple of racks, and I'll cover that in a bit, you need to understand the topology of the network and distribute the storage according to that topology. We'll cover that. Hardware heterogeneity: if you have a small cluster, you would probably like the nodes to be very similar, because the storage is being served from those same servers. The bigger the cluster, the more heterogeneity you can introduce into it, with bigger nodes with more storage on them, or nodes with more compute on them. And then, as the cluster grows and you feel that the cluster lacks, let's say, enough storage capacity, you can introduce more nodes that are oriented to be storage nodes rather than compute nodes. But again, remember, in all cases we are running the workloads on all the nodes, and all the nodes are serving storage requests. Other considerations are performance and resource utilization. As I said before, the storage sub-system is also using memory and CPU cycles, obviously, and it's using networking; it's using bandwidth on the same fabric.
Workloads are using those same resources as well, and we need to somehow compensate, or manage, as we call it, the interference. We'll cover that as well. So here I'm going to cover the three basic building blocks: the storage, networking, and compute sub-systems.

In the storage sub-system, basically you take all the server-side storage, all the locally attached hard drives that reside on the servers, and you aggregate all the blocks under a single namespace, okay, so you basically have one big, huge block device. And per VM, per workload, you carve out volumes, where each volume is obviously a subset of the blocks on that block device. There's no need for a storage admin and no need for managing LUNs. The workloads running inside the hyperconverged infrastructure see, from their perspective, a locally attached hard drive. The performance is great because all the IOPS are actually distributed across the cluster. Blocks reside on multiple nodes, so you can do a lot of load balancing there; you don't have to read the same block from the same node all the time, because you keep replicas inside. Speaking of replicas, the storage sub-system obviously needs to be reliable, so the distributed storage sub-system needs to keep two or three replicas of each storage block on multiple nodes, because you don't want to be susceptible to node failures. And that's what makes the storage sub-system reliable.

The compute sub-system. The key message I want you to take away about the compute sub-system is that you need very fine-grained control of what the hypervisor is doing. Okay, let's say there is a VM that suddenly needs more CPU cycles: I want to be aware of that very, very quickly. The same goes for memory, and so on and so forth.
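Going back to the storage sub-system for a moment, the carve-out idea can be made concrete with a toy sketch: every locally attached drive contributes blocks to one flat namespace, and a per-workload volume is just a subset of those blocks. This is my own illustrative code, not Stratoscale's implementation; the class, names, and block size are made up.

```python
# Toy sketch (illustrative only): pool every locally attached drive into one
# flat block namespace, then carve per-VM volumes out of that pool, so no
# storage admin or LUN management is needed.
BLOCK_SIZE = 4096  # bytes per block; an assumption for the sketch

class BlockPool:
    def __init__(self, total_blocks):
        self.free = list(range(total_blocks))  # global block ids, one namespace
        self.volumes = {}                      # volume name -> list of block ids

    def carve_volume(self, name, size_bytes):
        """Carve a volume: just a subset of blocks from the shared namespace."""
        nblocks = -(-size_bytes // BLOCK_SIZE)  # ceiling division
        if nblocks > len(self.free):
            raise RuntimeError("pool exhausted")
        self.volumes[name] = [self.free.pop() for _ in range(nblocks)]
        return self.volumes[name]
```

A workload then sees its volume as a plain locally attached disk, while the blocks behind it may physically live anywhere in the cluster.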
The reason is that in a hyperconverged infrastructure, again, you are susceptible to interference between the subsystems that run on the same server, and you would like to do some kind of cluster-wide load balancing of the workloads and resources, to prevent or manage the interference, collisions, and contention points that you are susceptible to when running on a hyperconverged infrastructure.

The networking. Again, we are running on a single shared fabric. This fabric is hosting, or transferring, a couple of different types of traffic: guest traffic, management traffic, storage, live migration, and so on and so forth. So obviously there needs to be a sub-system doing quality of service and traffic shaping on this internal fabric. Again, in a first-generation hyperconverged solution there's a lot of interference between the various subsystems, specifically the storage and the compute, because there's zero awareness of resource utilization between those two subsystems. In what we are building today, and this is obviously what Stratoscale has built, we are building a hyperconverged, vertically integrated solution. All the subsystems are fully aware of each other, and we know how to do very good resource balancing inside the cluster.

So, a very high-level trace of the solution from the control plane. We would like to have a very scalable installation process; I'm literally talking about the ability to install hundreds of servers very easily. We envision that future data centers, especially at ISPs, are going to just bring in more racks of servers, which are going to be connected to the same cluster in real time, meaning during operations, and we would like to add those nodes to the cluster. The same goes for the best practices of distributed systems: you want the system to be reliable, you don't want it to have a single point of failure, and so on and so forth.
We mentioned a couple of times the fine-grained control of resource utilization in the subsystems. Once you have that ability, you can do very efficient cluster-wide resource balancing. Again, if there's a VM interfering with another VM, or a VM interfering with the storage on a specific node, you need to be able to either move the VM or do traffic shaping, whatever you need to do in order to prevent this contention. So this is, supposedly, the downside of hyperconvergence: you need to manage this interference.

So, we talked about the installer. We are using a single image. This single image includes all the services that the hyperconverged solution requires; specifically, in OpenStack terms, it's Cinder, Nova, Keystone, and the rest of the subsystems. We deploy the same image on all of the servers in the cluster, so all of the servers actually contain the same image. We are not appointing any specific server to be a manager of the cluster; all the decisions are made by consensus. And there's no need to do hardware sizing: you don't need very big hardware, with a lot of memory and CPU cycles, just to manage the cluster.

The distributed system. I mentioned in the previous slide that the image is the same everywhere. Now, what you see here is sort of like a table; on the left-hand side you see services. Each service has a different scale-out factor. For instance, we know that Cinder needs three times more instances of its service than Keystone, or whatever the measurement is. So every service has a different scaling factor, and the distributed system knows how to start and stop instances of those specific services on specific nodes according to a lot of runtime insights that we are collecting inside the cluster.
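The per-service scaling factors above can be sketched as a simple sizing rule. The factor values and the baseline rule here are invented for illustration (the talk only says, for example, that Cinder needs roughly three times as many instances as Keystone); a real system would feed runtime insights into the same kind of decision.

```python
# Hedged sketch: per-service scale-out factors. The numbers and the sizing
# rule are my assumptions, not the actual Stratoscale configuration.
SCALE_FACTORS = {"cinder": 3, "nova": 2, "keystone": 1}

def desired_instances(num_nodes, factors=SCALE_FACTORS, base_per_nodes=10):
    """Derive how many instances of each service to run on a given cluster."""
    base = max(1, num_nodes // base_per_nodes)  # baseline instance count
    # Never run more instances of a service than there are nodes to host them.
    return {svc: min(num_nodes, base * factor) for svc, factor in factors.items()}
```

On a 20-node cluster this toy plan would run 6 Cinder, 4 Nova, and 2 Keystone instances, preserving the 3:2:1 ratio while capping everything at the node count.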
So again, managing the system itself requires CPU cycles, memory, and networking, and we know how to balance our own services so they do not compete for resources with the workloads and the storage subsystem. So I hope the message is clear: we're building a vertically integrated solution and managing all the resources in a very holistic manner.

Specifically on the storage, this is a great example. The storage, again, is a distributed system, so every node is both a storage client, here you see the block storage client, and a block storage server. The block storage client serves the volumes to the workloads, and when it gets an I/O operation, it knows immediately how to go to the correct server to serve the block. There's no metadata server in this architecture, please note, okay? Every block of storage is one hop away from the workload, meaning that if we are doing a live migration of the workloads, we are not copying storage and we do not need to update any metadata. Everything is distributed by nature, and it's a single hop: there's a mathematical function, if you will, by which every storage client knows where the appropriate block for a read or write operation resides inside the cluster. We also know how to do storage tiering. The nodes that I drew here on the slide do not contain the same amount of storage; they have a different number of hard drives, let's say. We know how to carve out tiers, we know how to use flash and hard drives, and we know how to expose that, very efficiently, as different storage tiers with different capabilities.

Networking, as we mentioned before: traffic shaping and managing the interconnect. Again, there's a single interconnect; it doesn't matter whether you literally have multiple links. The interconnect is a shared resource, and this shared resource is being used for guest traffic, public traffic, management, live migration, storage, and so on and so forth.
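The "mathematical function" placement idea can be illustrated with rendezvous (HRW) hashing, one well-known way to achieve it; I'm not claiming this is the actual algorithm used, just a sketch of the principle: any client computes a block's replica holders locally, with no metadata server and no lookup hop.

```python
import hashlib

# Illustration of metadata-server-free placement via rendezvous (HRW)
# hashing: every client ranks all nodes per block and takes the top k,
# so all clients independently agree on where each replica lives.
def owners(block_id, nodes, replicas=3):
    ranked = sorted(
        nodes,
        key=lambda n: hashlib.sha1(f"{block_id}|{n}".encode()).hexdigest(),
        reverse=True,
    )
    return ranked[:replicas]
```

Because every client computes the same answer, each read or write is one network hop; and when a VM live-migrates, nothing needs copying or re-registering, since the destination host runs the same function.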
You need to manage the traffic there, and you need to adhere to the SLAs that customers apply to workloads. So we take everything into account; we sort of compile, if you will, some kind of rule base for traffic shaping, queues, etc., and this is an integral part of the system. Maybe it's even the most sensitive part of a hyper-converged infrastructure, again, because it's a single shared resource, and it's imperative that you manage that single resource while adhering to your policies.

We talked a bit about failure domains and what dictates the failure domain. In the topology here, the two blue boxes are racks residing in data center A, or data center one, and there's another rack, let's say, in data center two. The distributed system needs to know how to work inside racks, across racks, and across data centers. So there's a lot of logic about where you place blocks of storage, and you need to adhere to policies, failure-domain rules, affinity rules, and anti-affinity rules. It's a very complex subject; again, I'm giving you a taste here of the complexity of the solution.

We also mentioned load balancing. There is an analytics layer, and I mentioned a couple of times having fine-grained control of the resource utilization of all the subsystems and collecting runtime insights about the system very efficiently. So we are collecting a lot of runtime insights about the system, a lot of metrics, and we are processing them at various levels of intensity. There are some algorithms that work on a timescale of seconds to minutes, and there are other algorithms that literally work at sub-second latency to make decisions that prevent or solve interference problems on the spot. That requires fine-grained control of CPU scheduling, fine-grained throttling of networking, and very efficient, low-latency live migration of virtual machines.
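In spirit, the interference logic above could look like this tiny sketch: find a saturated node and pick a cheap VM to live-migrate to the least-loaded node. The threshold, the metrics, and the policy itself are all my assumptions for illustration, not what the talk describes in detail.

```python
# Hypothetical noisy-neighbor decision; all numbers and the policy are
# invented for illustration.
def pick_migration(node_cpu, vms_by_node, threshold=0.9):
    """node_cpu: node -> CPU utilization (0..1);
    vms_by_node: node -> {vm name: CPU share}.
    Returns (vm, source, target), or None if nothing is saturated."""
    hot = [n for n, util in node_cpu.items() if util > threshold]
    if not hot:
        return None
    src = max(hot, key=node_cpu.get)       # most saturated node
    dst = min(node_cpu, key=node_cpu.get)  # least loaded node
    vm = min(vms_by_node[src], key=vms_by_node[src].get)  # cheapest VM to move
    return (vm, src, dst)
```

A real system would also weigh memory footprint (which dominates migration cost) and decide whether moving the offender or a neighbor relieves more pressure.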
Again, one of the things that we do in order to solve interference is that we might decide to move VMs. It can be the offending VM that actually consumes a lot of CPU cycles on a specific node, let's say, or the neighboring VMs, in order to allow that offending VM to continue to work. So there are also a lot of algorithms that we are developing in order to make the best decisions in this cluster-wide load balancing. There's an admission control process: when we have a workload, it's analyzed for a couple of seconds, which is an initial profiling, and then we continue profiling the workload, all the time making fine adjustments to how that specific workload runs, how many CPU cycles to give it, how much memory to give it, and so on and so forth.

Other technical aspects of the system, at a really high level. Internally, we use Consul. I'm not sure whether everyone is familiar with Consul; it's sort of similar to etcd, which comes from CoreOS, while Consul comes from HashiCorp. It's a distributed key-value store, if you will. The nice thing about it is that it exposes parts of its registry, if you will, over DNS, and we are using it for deploying our internal services in a highly available and load-balanced manner. There is a problem with it: it's strongly consistent, meaning that in today's version of Consul, if too many nodes fall, it gets into a lock-up situation, and we are solving that. So we know how to work in a split-brain environment, and that's what we did. Again, if you look at a multiple-rack scenario, the likelihood of the uplink between the racks going down is higher than the likelihood of the links inside the rack being disconnected. If a link inside the rack is disconnected, you lose one server; if an uplink is disconnected, you get a split-brain situation, but those two racks can continue to work, because presumably each has all of the storage and all of the compute, and this is part of the placement algorithms that I mentioned as well.
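The rack and data-center awareness in the placement algorithms might be sketched as a filter with relaxing constraints: insist on distinct data centers first, then distinct racks, then fill in whatever is left, so that each side of a potential split keeps usable replicas. A toy illustration under made-up names, not the actual placement algorithm.

```python
# Toy failure-domain-aware replica placement (names and policy invented).
def place_replicas(candidates, topology, replicas=3):
    """candidates: preference-ordered node names;
    topology: node -> (datacenter, rack)."""
    chosen, used_dcs, used_racks = [], set(), set()
    for strictness in ("dc", "rack", "any"):  # relax constraints pass by pass
        for node in candidates:
            if len(chosen) == replicas:
                return chosen
            if node in chosen:
                continue
            dc, rack = topology[node]
            if strictness == "dc" and dc in used_dcs:
                continue
            if strictness == "rack" and rack in used_racks:
                continue
            chosen.append(node)
            used_dcs.add(dc)
            used_racks.add(rack)
    return chosen
```

With this kind of spread, losing a whole rack, or even an uplink between data centers, still leaves live replicas on the surviving side.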
To give you an example of how to work with a service, and it doesn't matter which service: in a typical environment there is a request, and when this request comes in it usually goes to an HAProxy, which does load balancing for you, and then it reaches the instances of the subsystem. The way it works with Consul, and what we did, is sharding. You can use sharding as long as you control the request side; in a typical public web server you do not control the requests, so you cannot use sharding. In our scenario, all the subsystems are talking internally, okay: Nova is talking to Cinder, and so on and so forth, and everyone is talking with Keystone. So we are using Consul with its ability to resolve DNS and do health checks on the services. We give Consul a script, and it knows how to do health checks on the various instances of each service and how to build the DNS responses. So once, let's say, Nova is trying to connect to Keystone, we resolve a DNS name like keystone.service.datacenter, those are implementation details, and we get one of the instances of Keystone inside the cluster. Everything is done on the client side with the help of Consul, and we don't have another middleware that we need to take care of. So this is one of the ways to make a system highly available, reliable, and load-balanced.
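Consul's real DNS interface answers service names with only the instances that pass their health checks; here is a toy stand-in, explicitly not the Consul API, that captures the client-side pattern described above: a health-checked registry, a name lookup, and no proxy tier in the middle.

```python
import random

# Toy stand-in for Consul-style discovery (this is NOT the Consul API):
# a registry tracks instances and health-check results per service, and
# resolution returns one healthy instance, chosen on the client side.
class Registry:
    def __init__(self):
        self.services = {}  # service name -> {address: healthy?}

    def register(self, service, addr):
        self.services.setdefault(service, {})[addr] = True

    def set_health(self, service, addr, healthy):
        self.services[service][addr] = healthy  # result of a health-check script

    def resolve(self, service):
        """Like resolving keystone.service.<domain>: one healthy instance."""
        healthy = [a for a, ok in self.services.get(service, {}).items() if ok]
        if not healthy:
            raise LookupError(f"no healthy instance of {service}")
        return random.choice(healthy)
```

Nova would then "resolve" Keystone and talk to whichever healthy instance it got, so failover and load balancing happen in the resolver, with no HAProxy tier to operate.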
Another example is self-healing. What we wanted to achieve is to avoid keeping a lot of context on the requester side. Let's say Nova is asking Cinder to do some operation. There are many, many points at which the operation can fail, so the question is who is responsible for the garbage collection, let's say when the create-volume fails in the middle. In our case, we built a system where, again, we use sharding: we get an instance to do the operation on our behalf, and if that instance fails, okay, we automatically go over to the second instance and continue from there. When we get a response, we do a publish-and-subscribe with a message queue to the other interested parties who want to listen to the result of the operation. But the interesting part is that we do the garbage collection on the subsystem side, which is entirely self-contained, and not on the requester side. The reason is that the subsystem has enough metadata and knowledge, and it knows how to do better garbage collection than an external entity like the requester. So in the case of, let's say, Nova and Cinder, where someone asks Cinder to mount a volume or whatever, if there needs to be some garbage collection inside the block device, the subsystem managing the block device does it, and not the requester.

Wrapping up. I hope that I showed you how to build a hyper-converged OpenStack solution. Again, the idea is that you take a cluster of servers connected with a 10 GigE interconnect, and you serve all the cloud resource needs, the storage, the compute, and the networking, from within that cloud. You can use external storage, but you do not need it. You can use any other subsystems, but in a hyper-converged solution everything is internal, everything is reliable, and everything is self-healing. Okay, thank you very much.
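To round off the self-healing example above with code: a minimal sketch of the pattern, try service instances in order, and let the subsystem clean up its own partial state on failure so the requester carries no garbage-collection context. All classes and names here are hypothetical.

```python
# Hypothetical sketch: the subsystem, not the requester, garbage-collects
# partial state when an operation fails mid-flight.
class VolumeService:
    def __init__(self, healthy=True):
        self.healthy = healthy
        self.volumes = set()

    def create_volume(self, name):
        self.volumes.add(name)           # partial state appears first
        if not self.healthy:
            self.volumes.discard(name)   # subsystem-side garbage collection
            raise RuntimeError("instance failed mid-operation")
        return name

def call_with_failover(instances, op, *args):
    """The requester keeps no cleanup context: just try the next shard."""
    last_error = None
    for inst in instances:
        try:
            return getattr(inst, op)(*args)
        except RuntimeError as err:
            last_error = err             # this instance already cleaned itself up
    raise last_error
```

The requester's logic stays trivial because every failed instance leaves nothing behind; the result would then be published on a message queue to any interested subscribers, as described above.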