We're not very many. But anyway, I'm here to present Ganeti. It's a product that we developed and use at Google to do cluster virtualization management on top of Xen or other hypervisors. How many of you have been on ganeti.org before? Have you used it? Know anything about it? One, two, some, a bit. OK, so I'll try to be fast in the first part and get to some of the internals or other things that are a bit weirder, although I don't know how much we'll be able to do today, exactly.

So first, big news: we have a new logo. We designed it exactly in time. Well, we designed it for Google I/O last week. We didn't make it to Google I/O last week. So I'm presenting it today for the first time worldwide. That's the new Ganeti logo, the little thing. It's not too bad. I like it.

OK, so let's see what Ganeti is, what the latest features are, what directions we're going towards, how to use it in practice, and how we deploy it at Google. Then I also have a whole stack of slides about internals. Last time I gave that part, the whole audience was shocked and nobody wanted to talk to me for half an hour. So I'm not sure I'll give it, but we'll see. If you want it, there's a recorded version from FOSDEM, or you can talk to me and we can discuss it as well.

So what can Ganeti do? Well, basically we manage a cluster of physical machines and we schedule virtual machines on top of them. We've been doing this for quite a few years. The idea is that we wanted to do everything you normally expect from a virtual machine manager: live migration, data redundancy over multiple nodes without the need for storage networks or things like that, so native support for DRBD or other distributed block devices, cluster balancing (an easy way to compute what the optimal state of a cluster is), and easy hardware repairs, so you can remove nodes from the cluster, repair them, and so on. We want to do this with the entry barrier as low as possible: just take a bunch of physical machines, deploy Ganeti on them and use them, no need for specialized hardware, external or internal to the node. So just take your old computers, really. That's what they gave us at Google; they had their own old computers, they gave them to us and we did virtualization on them. We wanted to scale quite a lot, though not to huge amounts: we didn't want to manage 10,000 physical machines with it, but we wanted to be able to manage on the order of 100 machines inside the same cluster. I believe with the latest version it's actually more than 200, as we run it in production. And basically, we wanted to make the harder features easy, so you don't need to configure your DRBD devices by hand; we'll do it for you.

The other thing we wanted is to be good as an open source product. We do this very differently from many other products made by corporations: everything is done in public on a mailing list. You can see the design discussions, you can see code reviews of patches floating by. And the same happens if you want to contribute a patch: you just send it to the same mailing list, we discuss it, and then we apply it, because the actual repository we commit to is closed to non-Google people. But besides that, it's all in the open; you can read the archives anytime. And yeah, we want to be cooperative with other people who use it and want to contribute their changes, even at small scale, actually, or big scale.

So just a little terminology. A node is just a physical host. A node group is a group of nodes.
An instance is a guest virtual machine. A cluster is the whole set of nodes, possibly divided into node groups. And any kind of operation is expressed as a job, except queries, which are a bit special: any operation that changes the cluster is a job, but you can get information from the cluster without a job.

We just use normal technologies; we tried not to reinvent the wheel as much as we could. For example, we did reinvent the wheel with respect to libvirt, but we started before libvirt existed. For most other things, we just reuse whatever Linux or other open source products give us: Linux, bridge-utils, the normal utilities, Open vSwitch now, KVM, Xen and LXC as hypervisors, DRBD, LVM, a SAN or plain files to back your disks. The latest version also has RBD support; I don't remember if that's 2.5 or 2.6, but I'll find out later because it's in my slides. Python and a few Python modules, everything packaged for Debian, so quite easy. socat, the nifty tool I was talking about this morning, for exchanging flows between network sockets and Unix sockets, or standard I/O and network sockets, things like that. And Haskell; it's optional, but it's becoming less optional. We have quite a lot of code in Haskell, especially for the cluster balancing part, and we're adding more Haskell as we speak. No, well, I hope not. But yeah, there's quite a part of the code written in Haskell, for performance reasons and for verification reasons.

Node roles: every node is the same, but not really. We have a master node where all operations are run. We have master candidates that keep a full copy of the config and can become master; they run two daemons, confd, to answer queries from the config, and noded, to actually perform the operations. Regular nodes only run noded and cannot become master until they get promoted to master candidates first. Of course, that means that if you lose all your master candidates you're in a bad place, and you want to have backups; we usually keep 10 of them, so hopefully you won't lose all of them at the same time. If you do, you have bigger problems: usually you're also going to lose some virtual machines and things like that. You can change that number, of course, so you can say 15 or 5 as you wish. Then there are regular nodes, and offline nodes that are being repaired, so you can't talk to them because they're broken, hopefully.

Then you have node roles at the instance-hosting level. Nodes can host machines, or be drained, or they could actually not be VM capable, so they cannot host machines at all. Why is this useful? Well, I can for example have a master candidate that is in a different data center: the configuration gets copied to it, but I still don't want to migrate machines to it. Or maybe a master candidate is actually a virtual machine, so I don't want a virtual machine migrated on top of a virtual machine; although KVM kind of supports that, but still. So that's just to mark that some nodes shouldn't host virtual machines. Then we can drain a node, saying, well, this node is going to go away at some point, let's drain it; or we can offline it, and in that case it can't do anything at all.

New features. Well, I had a slide about 2.4 and removed it; hopefully everybody is on 2.4 by now. We have 2.5 in wheezy and indeed in backports, so you can use it on squeeze nowadays. 2.5 brings a few things. We have better node groups: there are now commands that affect a whole node group, evacuate a whole node group and things like that, and node groups scale better at this stage.
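As a rough illustration of what that looks like in practice, here is a minimal sketch driving the command-line tools from Python on the master node; the group name and node names are made up, and the exact gnt-group subcommands may vary slightly between Ganeti versions, so check gnt-group --help on yours.

    # Sketch: create a per-rack node group and move some nodes into it.
    # Assumes it runs as root on the master node; "rack1" and the node
    # names are example values.
    import subprocess

    def gnt(*args):
        """Run a Ganeti command, raising if it fails."""
        subprocess.check_call(list(args))

    gnt("gnt-group", "add", "rack1")
    gnt("gnt-group", "assign-nodes", "rack1",
        "node4.example.com", "node5.example.com")
    gnt("gnt-group", "list")
    # DRBD instances keep both replicas inside one group, so grouping
    # nodes per rack keeps the replication traffic off inter-rack links.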
Master IP turnup customization. Until now we had this feature, the master IP, and when the master role was moved from one node to another, Ganeti always did exactly one thing, which was bringing up the IP address of the master on the new node. From 2.5 you can actually replace this with your own command. So if, rather than bringing up the IP, you already have the IP and you want to advertise it through a routing daemon, you use this turnup script to talk to your routing daemon, for example, or to update a MySQL database, or whatever you need to do when you fail over your master.

We have full SPICE support in this version, so you can access your KVM virtual machines with SPICE and have better remote desktops. And we have OOB support for node health. So if you have... yes? SPICE is a protocol that some people invented for a better remote desktop. Basically it does the remote desktop rendering on the server side and only sends the changes to the client, rather than VNC where you actually transmit everything. So if you're going over the network, it should work a lot better. It also has support for USB redirection and a few other things. It's a nifty thing. If you go to FOSDEM, you hear a lot about it; I think they only talk about it at FOSDEM, I have no clue why. But yeah, try it. I use a virtual workstation from my laptop with SPICE and it works quite well. There are SPICE clients packaged in Debian.

So, node health: if you have some kind of health system attached to the nodes, for example a virtual serial port, or your power switch is remotely programmable, or IPMI or something, you can use it from Ganeti to say, well, kill this node, for example, because I can't reach it anymore and I don't want it to come back and ruin things with a rogue virtual machine on it or something.

2.6: we've just frozen 2.6. We have RBD support in 2.6, so you can use that instead of DRBD if you want. Memory ballooning for KVM and Xen: until now the memory size was fixed, now we can actually reduce and increase it. Right now it's pretty much manual; in 2.7 and onwards we plan to have more automation for that. There's also some support for CPU pinning. The direction we're going here is trying to make sure we can easily partition a node between virtual machines; I'll explain this later. OVF export and import, in case you want to integrate with VMware and other evil proprietary people. And more customization of the DRBD and disk parameters: until now it was all whatever we gave you, and you could change some constants in the source files, but those affected all virtual machines; now you can decide those per VM. Finally, we have policies to better model your resources, so you can say, look, virtual machines should be at most this size, at minimum this size, the standard virtual machine is this, and then we can tell you, look, you can only fit this number of your biggest virtual machines, so if you plan to have more than that, you have to do something about it.

2.7 is still open; we're really just starting on it. We have network management and IP allocation scheduled to be merged; it's actually an outside contributor's patch. Open vSwitch support, a small initial version; hopefully we'll be able to do more here in 2.8. Fast queries, and a split of queries and jobs.
The idea there is that we want queries not to go through the same path that jobs go through, and to perform well even if there are long-running jobs on the cluster, so you can always get your status, even if there's something else going on. And in general, better partitioning of resources. Right now we always run on a shared, best-effort machine model; we want to be able to say, well, we have this big, big server, and really these two cores and these two disks and the RAM associated with these two cores are all for you, nobody's going to steal them from you, and we make sure this is allocated to you fully, which we didn't have until now. We also want to provide better metrics for measuring what's going on, how much uptime your machines have had over the last month and all these kinds of things.

What do we want to do in the future? Full dynamic memory, so dynamic resizing of your machines as we need to add new ones and things like that. Better instance networking customization. A rolling reboot, to do updates over the whole fleet. Better automation, more self-healing inside Ganeti; this has been a problem, because we have some kind of self-healing implemented on top of Ganeti inside the Google proprietary part, but we don't have it inside Ganeti, and we think we should move as much of it as we can into Ganeti. Better scalability. KVM block device migration; I heard today that Citrix is doing block device migration too, or they call it differently, but it's the same thing, right? So basically we want to support it for both, of course. And improving the OS installation: right now the OS installation is mostly something that runs on the node; it gets given block devices that are already initialized and puts something on them. We want something that is detached from the node, at a different level, so running already inside the virtual machine. You can do it today, but there's no way in Ganeti to say, okay, this virtual machine is about to be installed, so run it with different parameters so that it hits the PXE server, for example. That would be great, so that you only run the virtual machine against the PXE server when it's actually being reinstalled, and not all the time. And different hypervisors, natively, as they become stable, and other tools as they become available.

So, you can easily initialize your cluster: you have your Ganeti nodes and your Ganeti master, and then you can perform operations at the cluster level, ask for information about your cluster, or modify parameters that will affect all of your virtual machines. Then you can modify things at the node group level, so that the machines on those nodes behave differently, or indeed at the virtual machine level, so this machine has these special parameters. You can verify that your cluster is all healthy (well, not all of it, but the parts we check), fail over your master role between nodes, and execute commands or copy files to all nodes; these are really just small helper tools, nothing different from your normal dsh.

Adding nodes: well, node add. You need to have SSH authentication between the nodes, and the node gets added. We may want to add a mode in which the node comes pre-configured and can be added without SSH authentication, for example directly through noded if you have cross-noded authentication, but we don't have that yet; for now, you need SSH to add a node.
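To make that concrete, here is a minimal sketch of adding a node, again just driving the command-line tools from Python on the master; the node name is an example, and it assumes root SSH between the master and the new node is already set up, as described above.

    # Sketch: join a freshly installed machine to an existing cluster.
    # Assumes root SSH from the master to the new node already works.
    import subprocess

    NEW_NODE = "node9.example.com"  # example name

    # Add the node; Ganeti copies its configuration over SSH.
    subprocess.check_call(["gnt-node", "add", NEW_NODE])

    # Re-check the whole cluster afterwards (daemons, certificates,
    # DRBD/LVM state and so on).
    subprocess.check_call(["gnt-cluster", "verify"])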
Adding instances: you can see what instances you have, install new ones, add a new instance; then it comes up, you can ping it, you can SSH into it, and if you use DRBD, its block devices are replicated on two nodes. We have per-node operations, so remove nodes, modify nodes with the parameters we were talking about before, evacuate, fail over, migrate a node, or power cycle it with the OOB operations we were mentioning. Instance operations: start and stop, modify, info, migrate, access the console, either through the Xen console or through KVM; we actually made it easy to access the KVM console, while the KVM people made it harder (well, they tried, but we simplified it again later). And indeed reinstall instances and things like that. All these tools come with full man pages and online help. The online help is always a bit more up to date than the man pages; we try to keep all man pages current, but sometimes something slips, right?

DRBD is just replication between nodes: you have an instance living on a node, you have local storage, replication over the network, and then on the secondary node the replicated data. How do you recover from failure? Say node3 died, and it had an instance on it. If we knew that it could die, and we could know because we monitor, we might notice that the memory had corruption errors, or the CPU or some disks were going bad, so we could live migrate the instance out. In this case it died while the instance was still on it, so the only thing we can do is offline the node and fail over the instance, which basically means rebooting it on its secondary storage, then replace disks, which means recreating the RAID-1 over the network. And now we have a cluster that is basically independent of node3, and our instance is failure tolerant again. gnt-backup is a little tool that allows you to export instances and basically back them up, or check what they contain.

OK, so htools. These are tools that allow you to move instances around and balance your cluster. If you have, for example, a new instance, where should you put it? Or where do you move an instance now that its node is dead? hail is the allocator, the Haskell allocator, and will answer those questions so you don't have to think about it yourself. Thinking about it with three nodes is very easy; with 200 it starts to become a problem. Most of the time you don't care, but sometimes you do care, and when your cluster says, I'm not N+1 tolerant, so if this node dies I don't know where to put this instance, then you don't want to take pen and paper and calculate how to make it N+1 tolerant; hail will solve that. hspace will tell you how many virtual machines you can fit before you need to buy new hardware or move some into the cloud. And hbal will rebalance the cluster to make it more failure tolerant, N+1 again, and indeed let you fit more virtual machines, things like that.

So you can control Ganeti over the command line; these are the examples. You can use a web manager that was developed originally for Ganeti and is now maintained by the Oregon State University Open Source Lab. The web manager itself talks to Ganeti through a RESTful HTTP interface, which is kept compatible between versions and is basically the programmable way to talk to a Ganeti cluster. Finally, on the cluster itself there's the LUXI interface; it's just JSON over a Unix socket that the command line tools, the remote API daemon and the htools use to talk to the master. They're all quite programmable.
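As an illustration of how programmable this is, here is a minimal sketch that talks to the remote API from Python with the requests library; it assumes a RAPI daemon on the master at its default port 5080, a self-signed certificate, and that a RAPI user with write access has been configured for the reboot call. The host, user and instance names below are just examples.

    # Sketch: list instances and poll a job over the Ganeti remote API.
    # MASTER, the credentials and the instance name are example values.
    import requests

    MASTER = "https://cluster1.example.com:5080"
    session = requests.Session()
    session.verify = False  # self-signed cert; verify properly in real use
    session.auth = ("rapi-user", "secret")  # needed for write operations

    # Read-only query: list the instances the cluster knows about.
    for inst in session.get(MASTER + "/2/instances").json():
        print(inst["id"])

    # Write operations return a job id that you can poll later, even from
    # another machine or the day after.
    job_id = session.post(MASTER + "/2/instances/web1.example.com/reboot").json()
    job = session.get(MASTER + "/2/jobs/%s" % job_id).json()
    print(job["status"])  # e.g. "queued", "running", "success", "error"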
As with the command line, you can write shell scripts over them and things like that. Job queue operations: you can check active jobs, instance reinstalls in progress or evacuations in progress, stop some of them, watch the progress of a job, or see why it failed, things like that. So for example you can start an operation, log out of the cluster, and then check the result a day later; or submit an operation through the web interface or the remote API daemon and then check on the cluster what went wrong, if something went wrong.

Managing node groups: adding them, moving nodes between node groups. This basically allows you to divide your cluster into separate groups of nodes that are near each other. Ganeti, for example, will never schedule a virtual machine across node groups; it will keep it inside one node group if it has secondary storage. This helps with things like: you have a rack, you have a switch, and the link out of the switch is smaller than what you have inside, for example one gigabit within the rack and one gigabit out. So you don't want all your DRBD replication going between this switch and another switch; you want to keep most of it inside the rack. This allows you to do that. But it can be used in many other ways. For example, if you want to make sure that your primaries and secondaries are not in the same rack, you can create two node groups that actually span racks and make sure that doesn't happen. Or, in that case, better to modify hail to enforce some of these policies.

So what's missing if you want to run Ganeti in production? Well, this allows you to do a lot, but then you probably want monitoring: check host disks, check the memory state, check the load of the hosts, both physical and virtual, and move things around or do stuff as things happen. You may want to trigger events: evacuate nodes, send them to repairs, re-add or rebalance. And then you may want to automate lots of things through configuration management: you don't want, when a node comes up, to manually apply everything it needs before joining it to the cluster; you want your node to just come up and know what it needs to be, and we didn't want to reinvent configuration management, right? And then you want self-service use. We give you programmable interfaces, but in order to have full IaaS support, you want your users to be able to add or remove machines, or to reboot their machines. Now, we don't know who your users are; we don't have access to your database of users or machine ownership. And we didn't want you to have to use a full stack in which you need all our components for everything. So we decided to keep it simple and let you have a component that manages this association outside of Ganeti.

So how do we use Ganeti in a data center? Just, what's the time? 24 past, okay. Should we go through Ganeti at Google, or would you rather do internals? Ganeti at Google, hands up? Internals? Nobody's for internals, good. So, how do we use Ganeti in a Google data center? We have this cluster; the remote API and SSH access are the two ways we talk to the master node, which allows us to do operations on the cluster. And then we have, what does that say? Well, those are basically node groups, one per rack. We have monitoring coming out of every node, so monitoring checks the status of every node and gets this information out. There's some fleet management; the fleet management tools are fleet-wide.
They run in our closed-source, evil production environment, unfortunately. They talk to all the clusters and make sure that virtual machines run, and they submit jobs. They talk to the Google user database and handle the association between Google users and the owners of virtual machines. And they allow users to run operations on virtual machines, what I was telling you about before. They can also move virtual machines from one cluster to another. Ganeti has some features in order to do that, but you need an orchestrator outside that tells the two clusters to start talking; then the two clusters do it between themselves. You don't need to move the virtual machine to a central system and then to the other cluster, but you do need to tell the two clusters to cooperate, from a central system that speaks to our API.

How do you provision instances? Well, an allocation request comes from a user. This can be a ticket, or some other kind of internal Google system. It talks to a tool we call Virgil, which will actually update the Google machine database and take the capacity information from monitoring; monitoring knows the capacity of all the clusters. These are tools that are universally useful, but in this case we already have them at Google, so we need to integrate with them rather than reimplement them from scratch for Ganeti. This means we implemented Virgil to translate between these tools and the open source world that Ganeti lives in. Virgil then knows which cluster has more space among the ones of the right type. For example, we have general Ganeti clusters for any kind of virtual machine, and Ubiquity, which runs user virtual workstations. So if the allocation request is for a workstation, it will go to a Ubiquity cluster; if it is for a general service, it will go to a general cluster, and it will choose one of them according to capacity.

How do we run repairs? Monitoring detects that a node is broken and tells Euripides, or Euripides finds out from an alert manager that there are alerts from a node, and tells Virgil: please send this machine to repairs. Virgil tells the cluster that there is a broken machine, so the broken machine gets evacuated: virtual machines get live migrated out if we caught it in time, or failed over if we didn't and the machine is actually down. The machine database gets updated and sets the machine to an in-repair status, and then people with skateboards will go there, do stuff, and give us the machine back running, without us knowing what exactly went on. The machine comes back already installed and we can add it to the fleet. Oh, here it is. So when the people fixing the machine fix it, they mark this in the machine database; they say, this machine is now okay. Virgil notices and just waits for 24 hours, does nothing and just watches the machine, because chances are it will break again, because they pretended to fix it; well, maybe they didn't know, or maybe the new hardware was still faulty. So we just watch it. If after 24 hours with no workload there's nothing wrong with the machine, then Virgil basically reintegrates it. Virgil first runs DRAILIS, which is basically a configuration management tool: it knows about the node, it knows which network interfaces have to be configured and things like that. This could easily be Puppet; actually DRAILIS is reimplemented on top of Puppet nowadays. It didn't use to be, but that's just historical cruft. And finally, through the remote API, it tells the cluster to add this node again; the cluster adds it, and now we can run a rebalance and schedule virtual machines on top of it.
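To give an idea of that last step, here is a minimal sketch of the rebalance, run on the master node; hbal is part of htools, -L makes it talk to the local LUXI socket, and -X executes the moves it computes, so nothing here is Google specific.

    # Sketch: after re-adding a repaired node, rebalance the cluster.
    import subprocess

    # Dry run: print the instance moves hbal would make and the score gain.
    subprocess.check_call(["hbal", "-L"])

    # Execute the moves (migrations, failovers, replace-disks) via LUXI.
    subprocess.check_call(["hbal", "-L", "-X"])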
So who runs Ganeti? We do. GRNET does. DSA does, you heard about it before; they're experimenting with it to run virtual machines for Debian. FSF France, the Oregon State University Open Source Lab, and actually quite a lot more people with smaller deployments. We get lots of requests on the mailing list where we have no clue who they're from or what they do with it; we get a question, then we get another question after a few months, so we guess that they continue to use it and somehow it works for them. It's particularly good if you don't want to invest a lot into building a huge infrastructure. We've all seen the OpenStack diagram today: you need to set up these 25 components and make sure they all run together, and we all had headaches at some point. This is a lot simpler: just a couple of commands and you're up and running and you can actually use it.

You can check this out at code.google.com/p/ganeti, or search for Ganeti on Google. Try it, submit patches if you can. Talk to us, give us suggestions. We have some big changes going on; the development team is going to change office and things like that, so a lot will change, but hopefully a lot will stay the same, as in, hopefully we'll keep it open source, keep it good and easy to use, and implement new features that will make it even more useful. Any questions, feedback, ideas, flames? Note that if you don't have any, I might move on to the how-we-did-it internals part, so you may want to ask some questions. Anybody? No? Not really.

One more announcement, by the way. If you can code in Python and Haskell and you're interested in this, we're hiring software developers in the Google Munich office for Ganeti. If you have friends who are interested in this, tell them to contact me, send a CV. It's actually quite great: you get to work for Google and on an open source product; not everybody can do that. Don't quote me to the press. Oh wait, I'm on record.

Okay, let's see what the time is. It's 32 past; let's do some of the internals, why not. Are you sure? No questions? No, okay. So what are the main Ganeti components? We said that we have our users, and we have our web interface that speaks REST to the remote API daemon. Both the remote API daemon and the client command line tools speak the LUXI protocol to masterd. Masterd contains the job queue. Finally, masterd uses an RPC protocol, which is basically JSON-based RPC I believe, to speak to the nodes. Let's go on until it's not too complicated. It's already too complicated. It's 5 p.m.; let's stop here if you don't have particular questions. This basically shows you, and if you go back to my FOSDEM talk you can see better, where the various components fit into each other, and if you want to submit a patch, this should help you understand what to modify and things. But yeah, done. Thank you.