Hello, good morning everyone, and thanks for joining our presentation. My name is Lima Pandey, and Vish and I will be talking about building an enterprise cloud for WebScale IT.

A brief background about myself: I led product management for the software stack at Nebula, and I have been working in the OpenStack and infrastructure space for a number of years, leading product management at companies including Yahoo, Oracle, and EMC. And Vish needs no introduction — he was the first person to start coding on OpenStack. Vish, why don't you?

So, for those of you who missed my talk on Monday or who have never seen me before: I was one of the original authors of Nova. At NASA I helped create the project, I was the technical lead for the first couple of years of Nova, and I've been on the technical committee and the board of directors of OpenStack. So, a lot of background work for many companies actually deploying this stuff and finding out what works and what doesn't.

We want to say that Nebula was a pioneer in the OpenStack space, and we are extremely proud to be a part of that journey. Today we are going to talk about building the enterprise cloud for WebScale, and at the end of the talk we will show you a short demo of the different components of enterprise WebScale IT.

In the last decade or so we have seen a trend toward cloud computing and applications being delivered as a service. This trend is driving a new class of computing called WebScale IT. The term WebScale signifies the massive scale of the software infrastructure, the data repositories, and the hardware platform. In this new era, a program or service is composed of ten or more small programs which interact to implement a complex end-user service such as email, Salesforce CRM, maps, and so on. These programs are implemented and maintained by different teams across different geographic locations, or they may be open source components.

To run these kinds of massive applications we need a computing platform that can support these massive workloads, and that is the goal of computing in today's modern data center. The computing infrastructure should be able to run applications and workloads while providing scalability, high availability, and fault tolerance. The next important thing is that the applications must run efficiently, meaning they meet their service-level performance and availability targets. The computing infrastructure must be cost-efficient too: it should help organizations generate higher revenue while operating at a lower cost.

One thing I would like to call out is that WebScale IT is not just for the Googles and Facebooks of the world. This kind of computing is equally important for a broad range of corporations and research institutions. The attractive economics of commodity servers puts clusters of hundreds of nodes within the reach of many corporations.
Combine this with a large number of cores per socket, and a single rack of servers can have thousands of threads, which is almost like a small data center. This unprecedented amount of compute power requires that we design our data centers and IT with this WebScale infrastructure in mind.

There are different atoms of computation on which applications can run. We are very familiar with running applications on physical servers. In the last several years, VMs have gained a lot of popularity, and a lot of data centers have deployed them. The next atom of compute is containers, and we will talk about how containers fit into the overall data center. A container is an abstraction of the operating system, and multiple containers can run on one operating system. Each of these kinds of compute has its own trade-offs among agility, performance, and security, and we will talk about some of those trade-offs during this talk.

So let's look at virtual machines and containers side by side. Most of you, I'm assuming, have seen this kind of diagram. Each virtual machine runs a copy of an entire guest operating system: each image must contain the base image, plus all the libraries and packages needed for the application. The hypervisor provides an abstraction of the hardware, and the guest operating system runs within the VM. In contrast, containers are isolated instances made possible by operating-system virtualization. Containers are OS abstractions: they run directly on the operating system, and many containers can run on the same OS, sharing it underneath. They can be much more efficient because you eliminate one layer — the hypervisor — and you do not carry the baggage of running an entire OS inside each container.

With a virtual machine, when a package is updated, the entire image needs to be rebuilt and the application has to be restarted. In a containerized approach, application components are partitioned into separate containers, so updating a package or component requires updating just one container. Containers can be started in a fraction of a second, as opposed to perhaps minutes for VMs, which makes it possible for containerized systems to rapidly respond to failures or changes in the workload.

So let's compare and contrast containers and VMs across a few dimensions. As we said, containers add a new level of agility to software development. A Docker-based application runs exactly the same on your laptop as in the production environment, because Docker — or any container runtime, for that matter — encapsulates the entire state around an application; you do not need to worry about missing dependencies or bugs due to differences in the underlying OS version. So in terms of managing the life cycle of applications, we have given containers a green and virtual machines a red.
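The talk doesn't name a specific tool for this "one component per container" idea, but a docker-compose file (the fig-style format of that era) is a compact way to illustrate it. This is a hedged sketch: the image names and the api service are hypothetical.

```yaml
# Hypothetical docker-compose sketch: one application, three independently
# updatable containers. Rebuilding "api" leaves "web" and "redis" untouched.
web:
  image: nginx:1.7          # front-end proxy
  ports:
    - "80:80"
  links:
    - api
api:
  image: myorg/api:2.3      # assumed application image (hypothetical)
  links:
    - redis
redis:
  image: redis:2.8          # stock Redis; upgrade by bumping the tag
```

Running `docker-compose up -d api` after bumping the api image tag replaces just that one container in seconds, which is the lifecycle agility being described here.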
In terms of performance, containers eliminate the additional hypervisor overhead, so container performance is close to bare-metal performance. Containers also decouple applications from the host OS and abstract it away, which makes them portable across any system that supports, say, LXC. Users can have a clean, minimal base Linux OS and run everything in containers. But there is another aspect of portability to keep in mind when looking at containers: the base OS and the containers all run Linux. With a VM, different guest OSes can run on the hypervisor, which is not the case with containers. So we have given portability a qualified rating: it's very portable for your applications, but if you want a guest OS different from the host OS, containers cannot give you that kind of portability.

However, the most important distinction is security. VMs offer stronger security and isolation guarantees than containers, so we'll spend at least some time talking about container security. As I mentioned, containers have isolation based on namespaces and cgroups, but they can be vulnerable to users or applications with root privileges: an application in one container can potentially access data in a neighboring container on the same host. This is an active area — there's a lot of research going on right now in the container community on how to make containers secure. So far, the strong recommendation from the Googles of the world, who have been deploying containers for a long time, is to put containers inside VMs and run them there, so that you do not have to worry about containers from two different customers running on the same physical host. The other recommendation is to run containers from separate security domains on different hosts: if you have two such containers running on bare metal, make sure they run on two different hosts.

So, we've talked about containers and VMs — what are we trying to build? Going back to our earlier point, this is about building WebScale IT infrastructure in the data center, so we will be looking at the building blocks for distributed systems. How do we build distributed systems at this massive scale, supporting scalability, availability, and fault tolerance? These are the common building blocks, and we'll talk about how components that have been prevalent in the VM world apply to containers. Operating systems, containers, networking, orchestration and management, storage — all the building blocks we have seen in the VM world now need an analogous solution in the container world.

I want to say that there are a number of vendors and open source projects being worked on in the container ecosystem. Some I wanted to call out: CoreOS, Rocket, Kubernetes, Google Container Engine, the Amazon container service (which has a cool logo), and of course Docker — you all know Docker. There are a number of vendors actively working in this space right now.

And so, CoreOS — Vish, do you want to talk about CoreOS? Sure: how many people have heard of CoreOS Linux before? Oh, wow. Okay, never mind, I'll skip this slide. There are a number of efforts built around making a very lean and mean OS just for running containers. Red Hat has Atomic. One group that started up to create a Linux OS from scratch, built off some of the isolation pieces used in Chromebooks, is CoreOS. And they have a few very nice features that allow you to run a containerized OS very easily.
In particular, when using it on top of OpenStack, CoreOS has really good integration with something called cloud-init, which essentially lets you inject configuration as you're launching the VM, so you can create all the pieces you need quite easily. But one of their big selling points is the automatic update feature. CoreOS has two root partitions, and in the background, if you're subscribed to updates, it will download the new version of the OS and then trigger a reboot — similar to the way your phone updates, or the way a Chromebook updates — which allows you to keep up to date with kernel patches much more quickly.

This is important because, as Lima mentioned, there are security vulnerabilities when you're dealing with containers. The problem with running stuff inside a container is that in most cases you're exposing the entire Linux kernel to that container, in which case, if there's a kernel vulnerability, you run the risk of someone breaking out of the container and compromising the host OS. The interesting thing, though, is that there are a couple of different ways to think about security. One way is that you want to make sure new exploits don't compromise your container: if there's a zero-day, for example — someone figures out a Linux kernel vulnerability — they're going to be able to break into your container. You can counteract that by running inside a VM, so you don't expose the Linux kernel, or you can counteract it by having a system that updates very quickly, so new vulnerabilities are patched as fast as they come out. And sometimes it's actually the older vulnerabilities that cost you a lot more than the newer ones, so keeping your systems up to date is, in many cases, more important from a security perspective than not exposing the kernel. So this is a great way to mitigate some of the security concerns of running things inside containers, because you have the ability to keep your OS up to date very quickly.

The next thing — did you want to start this? — so, one of the main things that CoreOS released is this service called etcd. How many people in here have tried using ZooKeeper before? Okay, cool. When you're running a distributed system, you generally want some source of truth for that system. In the old days people would just throw everything in a database, but databases themselves aren't necessarily very good at being distributed, and the idea of modern web-scale services is that you want to keep things highly available all the time. One of the best-known solutions for that is the algorithm called Paxos, which is what ZooKeeper is based on, and it's a great algorithm for keeping a distributed system highly available. Unfortunately, there are about five people in the world who actually understand how Paxos works — which means if you've tried writing something like ZooKeeper, or even just using ZooKeeper, you've run into a lot of issues with how to configure it and get it running. etcd is built on a much simpler algorithm called Raft. The author of Raft created the algorithm as part of his PhD on distributed consensus, because he discovered that Paxos was really hard to understand and therefore hard to implement. So etcd uses the Raft algorithm, and you can think of it essentially as a replacement for ZooKeeper.
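To tie the CoreOS pieces together, here's a hedged sketch of the kind of cloud-config you would pass through OpenStack when booting a CoreOS VM, from roughly the etcd2 era of the project. The discovery token is a placeholder you'd generate yourself, and exact keys varied between CoreOS releases.

```yaml
#cloud-config
# Hypothetical CoreOS cloud-config: joins a 3-node etcd cluster and enables
# the automatic, phone-style update behavior described above.
coreos:
  etcd2:
    discovery: https://discovery.etcd.io/<your-token>   # placeholder token
    advertise-client-urls: http://$private_ipv4:2379
    initial-advertise-peer-urls: http://$private_ipv4:2380
    listen-client-urls: http://0.0.0.0:2379
    listen-peer-urls: http://$private_ipv4:2380
  update:
    reboot-strategy: etcd-lock   # take an etcd lock before rebooting, so
                                 # auto-updates never reboot the whole cluster at once
  units:
    - name: etcd2.service
      command: start
    - name: flanneld.service
      command: start
```

The `reboot-strategy: etcd-lock` setting is what makes the background update safe on a cluster: each host coordinates through etcd so only one machine reboots at a time.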
Most of the distributed systems being built today, including the ones we'll talk about like Kubernetes, are built on some sort of distributed consensus, so they need something like this running in the background. Essentially, you have a cluster of machines — generally three or five — running this distributed consensus algorithm, so that if one of them fails, you can still get to the source of truth for your data.

Now, let's talk about container networking. Docker's built-in container networking has some bottlenecks. Each container is assigned an IP address that can be used to communicate with other containers on the same host, but for communicating over the network, containers are tied to the IP address of the host machine and must rely on port mapping to reach the desired container. So services running inside containers have to advertise the host's external IP and port — information that is not directly available to them — and Docker's solutions for this, with port mapping and NAT and all those things, are very complex.

There is a different solution, Flannel, which we think is a better approach to container networking. Flannel solves the problem by giving each container an IP that can be used directly for container-to-container communication. It uses packet encapsulation to create a virtual overlay network that spans the whole cluster. Basically, Flannel gives each host an IP subnet, and the Docker daemon on that host is then able to allocate IPs out of that subnet to the individual containers (a sketch of Flannel's configuration follows below). We believe this is a much better solution for container networking.

That brings us to the distributed systems concept we talked about: Kubernetes, an open source project by Google. It is a system for managing containerized applications across multiple hosts, providing basic mechanisms for deployment, maintenance, and scaling of applications. Some of the important features of Kubernetes: it provides self-healing mechanisms such as auto-restarting, rescheduling, and replicating containers, and it supports robust declarative primitives for maintaining the desired state the user asked for. You don't have to imperatively say "start three containers" — you just specify what you want, and Kubernetes takes care of it.

Kubernetes is primarily targeted at applications composed of multiple containers: elastic, distributed microservices. Kubernetes lets users ask a cluster to run a set of containers, and the system automatically chooses hosts to run those containers on. The Kubernetes scheduler is currently very simple, but scheduling in general is a policy-rich, topology-aware, workload-specific function that significantly impacts availability, performance, and capacity. The scheduler needs to take into account individual and collective resource requirements, quality-of-service requirements, hardware and software policy constraints, affinity and anti-affinity specifications, data locality, and any other constraints you need to specify. Workload-specific requirements will be exposed through the API as needed.

So there are different components, and since Vish has been working on the demo, it will be fun to have him walk through some of them.
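Before the component walkthrough, here is the hedged sketch of Flannel's configuration promised above. In that era, flannel read a single JSON document out of etcd (by default at the key /coreos.com/network/config); the CIDRs below are made up, and JSON is shown directly since it is also valid YAML.

```yaml
# Hypothetical flannel network config, written into etcd, e.g.:
#   etcdctl set /coreos.com/network/config '<the document below>'
#
# "Network"   - the cluster-wide overlay range
# "SubnetLen" - each host leases a /24 slice of it for its local Docker bridge
# "Backend"   - "udp" was the simple default; "vxlan" was also available
{
  "Network": "10.1.0.0/16",
  "SubnetLen": 24,
  "Backend": { "Type": "udp" }
}
```

With that in place, flanneld on each host leases a free /24, points the Docker bridge at it, and the Docker daemon hands out per-container IPs from that slice — roughly the behavior Lima described.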
Yes — so, funnily enough, Kubernetes just recently renamed one of its major components from minions to nodes, just to confuse everyone. Essentially, there's the Kubernetes master with the API server; there's the scheduler; there are the Kubernetes nodes, which is where the workloads run; and then there are the things that run on top of those nodes, which are pods, replication controllers, and services. On the master node, in addition to the scheduler, there's also a component that manages the replication controllers.

Essentially, when you're defining what you want to run, you start with a pod holding a single container — you can think of it as "here's a container that I want to run" — and you can go directly to Kubernetes and say "run this pod for me over there." That's not incredibly useful on its own, because you could just do that by going to the machine and doing a docker run or something along those lines.

What's a little more useful is that a pod can be a group of related containers that share the same namespaces. You might wonder why one would do that. Well, it turns out Google has a lot of experience in this space: they wrote something a while ago called Borg, which is what they use to deploy all of their internal Google services onto their fleet of machines, and you can think of Kubernetes as sort of a rewrite of Borg with a bunch of lessons learned. One of the things they discovered is that while you want to isolate code into a particular package, like a container, there often are things that need to run together but aren't necessarily deployed together, and so you need them to share the same namespaces. The classic example is a database service running alongside a backup service that needs to access that database. You might want to deploy the backup service separately from the main database service — you want it in its own isolated container so you can redeploy it separately — but you need the two to share the same data, or the same network namespace, and so on. So a pod is a way to define a group of containers that run together.

In addition to that, rather than thinking about one copy of something running somewhere, you generally want to be able to say "I'd like to run three of these." The thing Kubernetes does particularly well is let you go in and say: I would like three of these running — you decide where they go, and if one of them fails, restart it somewhere else and make all the connections between them work. (When I show you the diagram of the Kubernetes components, that might make a little more sense.) You do that via what's called a replication controller, which essentially means "run X copies of this pod." A replication controller also allows you to scale up and down the number of copies of that pod: you could say "give me three," and later "I'd like to scale up to ten," and it will automatically create enough pods to reach ten. If any one of them fails, it will restart the failed one on another node, respecting all of your constraints around affinity, anti-affinity, and so on.
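Here's a hedged sketch of what a replication controller definition looked like in the Kubernetes API of the time (the v1beta3 schema; field names shifted between releases, and the image and labels here are made up).

```yaml
# Hypothetical replication controller: keep 3 copies of a single-container
# pod running somewhere in the cluster, restarting them elsewhere on failure.
apiVersion: v1beta3
kind: ReplicationController
metadata:
  name: frontend
spec:
  replicas: 3                  # "I'd like three of these running"
  selector:
    app: frontend              # manage any pod carrying this label
  template:                    # the pod to stamp out
    metadata:
      labels:
        app: frontend
    spec:
      containers:
      - name: frontend
        image: nginx:1.7       # assumed image
        ports:
        - containerPort: 80
```

Scaling to ten later is just a matter of changing replicas to 10 — or, in releases of that era, something like `kubectl resize --replicas=10 rc frontend` — and the controller manager creates or deletes pods until the observed count matches.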
So the diagram of the Kubernetes architecture looks something like this. Essentially, the master node or nodes are responsible for distributing workloads across the group of minions, and there's another piece which is quite interesting: an internal overlay load balancer, and this is where services come in. After you create a group of replication controllers, you then create a service that says "these replication controllers embody, say, my front-end service, or my web service, or my back-end service," and any of the nodes — minions — in the system knows how to route to the actual containers running on the overlay network. So when something says "I need to go talk to the back-end service," with maybe three copies running somewhere in the cloud, the load balancer will automatically direct that traffic to wherever the correct back-end component is. If one of them dies and gets launched somewhere else, it will again know how to route. And the service always has a stable set of IPs that can be talked to, which are automatically routed on the back end to the right component. That kind of feature becomes really, really useful when you're running a whole bunch of copies of things: for internal communications, it means you don't have to know where everything is or have a discovery mechanism to associate things together. There are some limitations to this, which I'll get to when we talk about some of the drawbacks of the current state of Kubernetes and the container ecosystem.

Should we take this time to show the Kubernetes demo? Sure, we can do that now. It might help you understand a little better if I show you how you would get something like this running on OpenStack. This is not going to be in great detail, because we don't have a lot of time.

There is one piece that's currently missing from how Kubernetes works if you're not running it on Google Compute Engine, and the piece that's missing is that the load balancer only works internally. You can run a service called SkyDNS, and there's an add-on piece in Kubernetes called kube2sky which essentially watches the Kubernetes API server for updates, and when it gets one, it automatically creates a record in SkyDNS for you. Now, SkyDNS is pretty interesting in that, in addition to plain naming, it will also serve SRV records. You may not know what an SRV record in DNS is — it's actually been around for quite a while, but not a lot of things use it. The really interesting thing about an SRV record is that in addition to telling you what IP a service is on, it also gives you a port. So it actually works really well for things like service discovery. In this model, SkyDNS allows you, instead of having to know the IP of the thing you need to talk to, to talk internally to different things by using their DNS names.

The one piece that's missing from the integrated load balancer and DNS services is talking to the services inside the Kubernetes cluster from outside. On Google Compute Engine you have this great feature — they have a load balancer built in that you can use, plus external DNS — and they expect you to manage that part outside the cluster. The problem is that if you're running on OpenStack, or running internally, you don't necessarily have access to those things. So there's sort of a bit of a hack in Kubernetes right now.
It's these things called external IPs, or public IPs. If you look at my service definition here, which is just a YAML file — I don't know if you can read it — there's a section called publicIPs. What does publicIPs do? Remember how I said there was a routing layer internal to Kubernetes? If you try to hit one of the internal overlay IPs assigned by Flannel, it will automatically send the traffic to the right node. That doesn't work if you're hitting the external IP of the node.

For example, what I have here is an install of OpenStack, and you can see there are three Kubernetes controllers. I ran a three-controller environment, so there are three etcd servers; one of them is also running the Kubernetes master components, and the other two are just etcd members, to make sure I have a running cluster. These IP addresses are the IP addresses of the virtual machines. If I hit one of those IP addresses — or, likewise, one of the IP addresses of the minions once those are up — I won't actually get routed to the internal services running in containers. So it's not incredibly useful: I can talk to services internally, but I can't talk to them from outside. Without a load balancer, there's no way to say that traffic arriving on this IP and this port should be routed into the internal container service.

So there's this addition called publicIPs. The publicIPs field says: if I get traffic on this IP address for a port in the service record, forward it the same way as if it had arrived on an internal service IP. And essentially what I've done is make a little patch to Kubernetes that allows the SkyDNS service to also create DNS records for external IPs. So in addition to being able to route traffic inside, via SkyDNS or via the load balancer, you can also get a name from the outside, which actually ends up being a neat little way to do it. Unfortunately, Kubernetes is rewriting all of these components and fixing the public-IP thing, so when I proposed my patch upstream they said, "well, we're rewriting that, we'll accept it later" — which is another thing I'll get to in the drawbacks section. But I wanted to show it working so you have something concrete.

This is the definition of the service record I showed you — the SkyDNS service. It's pretty simple: you basically say here's the port, here's the protocol, you put it in a namespace — in this case the default namespace — and you give it a name. Now, when you create a service, you'll notice these things at the bottom: labels and a selector. The way it knows where to route the traffic is that your replication controllers and pods have labels on them, and these labels allow you to group the pods; your service identifies some labels, saying "route traffic to anything with this label." So if I have something labeled internally — in this case, the SkyDNS controller — then it will route traffic to the SkyDNS controller when traffic comes in on this IP address. In this particular case I'm doing something quite simple, which is essentially just making the SkyDNS service discoverable from the outside. So I just create the SkyDNS service via kubectl, and what's going to happen — assuming my demo hasn't broken while sitting on my closed laptop — happens outside of the VM, which, of all things, is running DevStack in an Ubuntu VM on my local Mac, because I don't trust networks for demos.
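Since the slide is hard to read, here is a hedged reconstruction of roughly what that service definition would have looked like (v1beta3-era schema; the labels, namespace, and addresses are made up, and the publicIPs field was later renamed and deprecated).

```yaml
# Hypothetical service definition: expose SkyDNS inside the cluster and,
# via publicIPs, on the external address of a node as well.
apiVersion: v1beta3
kind: Service
metadata:
  name: skydns
  namespace: default
  labels:
    k8s-app: skydns
spec:
  selector:
    k8s-app: skydns        # route to any pod carrying this label
  ports:
  - name: dns
    port: 53
    protocol: UDP
  - name: dns-tcp
    port: 53
    protocol: TCP
  publicIPs:               # era-specific field: treat traffic arriving on this
  - 192.168.206.2          # node IP (made up) as if it hit the service IP
```

The selector/label pairing is the grouping mechanism Vish describes: the service never knows pod locations, it just forwards to whatever currently matches the label.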
So what I should be able to do is ask for an SRV record — and of course I've lost my resolv.conf because I changed networks, so I'll query the SkyDNS name server at 192.168… directly. What you can see is that it has created SRV records for Kubernetes automatically, so I can look up Kubernetes to find out where it is and see that it's on ports 443 and 80.

So what will happen, now that I've created that new service record, is that a DNS entry appears. SkyDNS is actually quite interesting: it basically uses etcd as the backing store for a DNS server. All that's happening is there's a little service connecting to the Kubernetes API, looking to see if anything has been created — it's actually a little more complicated than polling: it registers a watch and gets notified — and as soon as something gets created and it notices, it goes to etcd and writes an SRV record into it. And hopefully now you'll see that the SkyDNS records have shown up. The very interesting thing is that it supports wildcard lookups, which is something not a lot of DNS servers do. So if you create multiple SkyDNS records — say I had three different public IPs in the SkyDNS record I just created — I would have 0.skydns, 1.skydns, and 2.skydns, and if I just do a lookup on skydns.nova.local, it will give me a round-robin across all three of the services. So you get a little bit of automatic round-robinning in addition to the automatic naming.

This kind of thing is a really useful way to bridge the gap between the rest of the enterprise and the Kubernetes cluster, which I think is a really important piece that is still being developed in these ecosystems, because you're not going to have everything running inside your cluster initially. As soon as the enterprise gets involved, they have a bunch of existing IT infrastructure sitting there, and you have to bridge the gap between the two. That means being able to deal with external load balancers, external DNS servers, and so on, and these are some of the things in Kubernetes that are still being worked on.

In terms of what it takes to actually use Kubernetes itself, it's actually quite simple: you're basically defining YAML documents. If you look at some of the examples in the Kubernetes source code, don't be afraid, because it's actually quite easy. You define the pod, you give it a name, you give it the Docker container you want it to run, and it will just download the Docker container and run it for you. So you get the same ease of use as running on a single system, but it allows you to have this distributed control. We'll come back to some of the drawbacks — the problems, and why, despite the incredible hype, this hasn't totally taken over the world yet. So let's go back to the slides. Hopefully that just gave you a little taste; there are all sorts of great demos in the Kubernetes repo itself that you can look at, and a minimal pod sketch follows below.
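A hedged sketch of the minimal pod definition Vish is describing — a name, a Docker image, and not much else (v1beta3-era schema; the image choice is arbitrary):

```yaml
# Hypothetical minimal pod: give it a name and a Docker image, and Kubernetes
# pulls the image and runs the container somewhere in the cluster.
apiVersion: v1beta3
kind: Pod
metadata:
  name: redis
spec:
  containers:
  - name: redis
    image: redis:2.8       # any public Docker image works here
    ports:
    - containerPort: 6379
```

Launching it is one command — `kubectl create -f redis-pod.yaml` — which is the same ease of use as a docker run, but cluster-wide.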
Limitations? Let's go back to limitations first. So there are a bunch of issues that need to be dealt with. The first is that Kubernetes itself needs to be made highly available, and there are a couple of ways to think about that. First of all, there's the etcd cluster: if you have three or five etcd nodes running and one of them dies, that's fine at first, but if there are three and one dies, you've got two left, and if another one dies, your whole cluster is unavailable. So there needs to be some way to recover a busted etcd node when that happens. It's even worse with the Kubernetes master, because you essentially have a single replication manager, and it needs to be started somewhere else if it dies. So you actually need something to monitor Kubernetes itself. You can have that be a person; you can set up the Kubernetes master with DRBD and some of the old high-availability techniques. But this is one advantage of running this stuff inside virtual machines: it makes that restart process automatable. You could, for example, have a Heat template manage it for you, or you could use the project that came out of Cloud Foundry called BOSH, which is actually quite good at restarting VMs when they fail, reconnecting data, and generally keeping things running. So once you've got the Kubernetes cluster up and running, it manages the stuff on top of it, but something still has to manage Kubernetes.

Another big drawback of Kubernetes and of running things in containers right now is that the overlay network and the TCP load balancing they do tend to hinder network performance. High-performance networking in this model doesn't really work today; you're trading some low-level performance for high availability and ease of use. And the funny thing is that a lot of people are moving to containers because they're concerned about performance, and then they find out, oh, okay, there's a drawback here. We'll get into some numbers on that in a bit.

The other thing is that while running stateless services on Kubernetes is great — if you have a web service with no state, it's super easy to deploy, restart, scale up, scale down; it's awesome — stateful services are not so easy, and there are a number of reasons for that. One of the biggest is that the volume story is not fully fleshed out. Once again, if you're on Google Compute Engine, they have external persistent volumes you can use: you can put your data there and reattach it somewhere else when you need it. There's no built-in volume provisioning service on top of something like OpenStack. You can use Cinder, mount a volume into your container, and put your data there — that's one option — but it's definitely something that hasn't been fully thought through.

As I mentioned before, having an external load balancer and an external DNS server is not really part of the project at the moment, so those are problems you have to solve on your own. The other thing is operational management of these pieces: how do you monitor Kubernetes when it's up, and how do you figure out when you need to restart things? Some of the approaches I mentioned, like using Heat, are not really fully fleshed out either.

The other big problem is that it's all changing really fast. Kubernetes is like the early days of OpenStack.
The velocity with which things are moving is so fast that the demo I just showed you — which I first built a month ago — I had to rewrite entirely in the last month. So there's this constant pressure to pull in the new changes, because there's stuff missing, but each new change means that your deployment has to change slightly. And this is true at every layer: etcd is changing, Kubernetes is changing, so it's very hard to find a stable base that's going to work over a long period of time. That, I think, is the biggest problem. My view on this stuff is that it's great for experimental things, and it's great to look at as the start of a long-term plan, but I don't think there's going to be a lot of production stuff running on this for another year or more, because there is so much change happening.
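As a footnote to the high-availability discussion above: here is a hedged sketch of the kind of OpenStack Heat template Vish alluded to for keeping the Kubernetes master VM alive. The resource names, image, and flavor are all made up, and real self-healing would also need an alarm/restart policy (for example, Heat's HARestarter plus Ceilometer alarms in that era) wired in on top.

```yaml
heat_template_version: 2013-05-23
description: >
  Hypothetical sketch: boot the Kubernetes master on CoreOS via Heat, so the
  orchestrator (rather than a person) owns recreating it after a failure.
resources:
  kube_master:
    type: OS::Nova::Server
    properties:
      name: kube-master
      image: coreos-stable          # assumed Glance image name
      flavor: m1.medium             # assumed flavor
      user_data_format: RAW
      user_data:
        get_file: master-cloud-config.yaml   # e.g. the cloud-config shown earlier
outputs:
  master_ip:
    description: Address to point kubectl and the minions at
    value: { get_attr: [kube_master, first_address] }
```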