Thank you, it's great to be here. My name is Vangelis Koukis, and I'll be talking about Synnefo. Synnefo is Greek for "cloud" (OK, that's more or less a standard name for a cloud stack), and it is a complete cloud stack over Google Ganeti. Have you ever heard about Google Ganeti? Have you ever used it? If you've heard about it, even better; we have a few introductory slides about it, and there's an excellent talk tomorrow afternoon by Michele, who is a member of the Ganeti team, where you can learn all about it.

So Synnefo is our cloud stack: a thin cloud layer we've written over Ganeti. Our motivation was a public cloud service that we've built and have run since 2011. I work for GRNET, the Greek Research and Technology Network. It's the Greek research and education network, essentially the ISP for Greek universities. We provide a public cloud service for researchers, students, professors, and the IT departments of the interconnected institutions. The service has been in production since July 2011. It currently runs more than 5,500 virtual machines and has more than 3,500 users. In total we've spawned more than 160,000 VMs and more than 44,000 virtual networks. These are pretty good numbers; I don't know if you have experience with running a public cloud service, but these are pretty good numbers.

This presentation is about the software we've written to support this cloud service, our experiences, and how you could perhaps benefit from this software. If there's anything you want to ask, or something doesn't make much sense, please feel free to interrupt me, ask a question, and we'll talk about it on the spot, OK?

So, what were our choices in building the ~okeanos service? We wanted to build an Amazon Web Services-like service: compute capabilities (virtual machines), network capabilities (virtual networks), and storage capabilities (virtual volumes for these VMs). One important difference, which is not what most people expect from a cloud service: we wanted persistent VMs. VMs that would survive hardware failures, that wouldn't be volatile, that would always be there, that would live-migrate between physical hosts, for example. Our clients would be, say, the network operations center of a university. They want to run their mail server on us. How can they do that if the VM disappears when a physical node goes down? And how can you do it over commodity hardware?

So: a production-quality infrastructure and service, and everything we would write would be open source; everything we have written is open source right now. A super simple UI: our users are students and researchers who don't necessarily have prior exposure to cloud technologies. How easy is it for them to build a virtual machine and work on it? And how can you do all this? What kind of software is out there? How can you combine open source components to build that kind of service?

Why is it difficult to build such a cloud service? Because you need to be stable; you need persistent VMs. Many people will say that VMs in the cloud world are cattle. You've got a thousand cows, they go out in the field, they do whatever they do, and they bring milk back to you. If one cow dies, you don't care, because you'll get another cow, and that cow will go out in the field and bring you milk. But this requires the applications to be written in a very specific way. It puts all the weight on the shoulders of the application developer.
When our users start to use a cloud service, they expect their VM to be there. They expect their mail server to keep running if something happens to a physical node or to part of the data center. How can you do that? These VMs are pets: people love them, treat them, take care of them. They don't want them to die suddenly. They care about them, they have feelings for them. And that's how we feel about our VMs as well.

These pets should run over commodity hardware. You can't scale if you have to buy big storage arrays, which are single points of failure, or big networking equipment. We wanted it to run on as much commodity hardware as possible. Then scalability, the obvious need, and manageability: how do you upgrade the infrastructure? How do you roll out upgrades? How do you upgrade the kernel, the lower layers? Ganeti has helped us immensely in that regard.

So, how we did it, and the components I'm going to be talking about for the next half hour. We built our own cloud software; we call it Synnefo. This software runs over Google Ganeti, which is software for managing clusters of VMs. We use DRBD, a replicated storage solution over two nodes, for managing the actual disks of the VMs. We also use Ceph, and the object store of Ceph, RADOS; the decision is up to the user, and we'll discuss how different workloads need different kinds of storage on the cloud. And we decided to implement the OpenStack APIs for our cloud platform. So if you have an OpenStack client, or a library that speaks the OpenStack APIs, you can actually come and talk to Synnefo. We don't share any code with OpenStack; we only implement the specification of the APIs, and we've tested our implementation against jclouds, for example, and it works. More about that later on.

We support whatever virtualization Ganeti supports; Ganeti runs over Xen and KVM. Our production uses KVM, and most of the things we test, we test on KVM. The components we use have been tested over Xen as well, but we don't run Xen in production. Still, if you decide to deploy Synnefo for your own installation, or decide to test it, you could go with Xen if you feel it has a specific advantage for you.

This is a graph of the number of virtual machines on the service since we went into production, around August 2011. The service proved to be quite popular. The graph is a bit out of date, because the counting stops at April 2013, at about 4,000 VMs; we are at about 5,500 VMs right now. It's a pretty big number. We run on about 200 physical nodes.

Now the question that comes to mind is: why not use one of the very well-known cloud stacks? I'm sure you've all heard of the many well-known cloud stacks out there. There are many reasons for that, and I'd like to make a comparison between the best-known software, OpenStack, and our approach. In building the service we recognized six distinct layers. At the lowest level is the hypervisor; the hypervisor manages a single VM, as in, KVM creates a VM. Then there's the node layer: software that knows there are many VMs on a single node. Then there's the cluster layer: software that knows there are many physical nodes, that there are VMs on these physical nodes, and that they can migrate from node to node and so on. Then there's the cloud layer, which knows that there are many users, that there is sharing, that there are many clusters, that there are relations among users; this kind of thing.
Then there's the API specification layer, and then there's the UI. We believe there's a big difference in the mindset needed to implement the cluster and lower layers versus the cloud and upper layers. They are two different worlds. It's one thing to manage virtual machines, to do locking, migrations, process control, all that kind of thing, and a different thing to implement cloud APIs: the web stuff, JSON, XML, many concurrent requests, users, resource sharing, ACLs. You need a different skill set, and in our approach, different code.

(Question from the audience: this is a compute-oriented approach, but you also said the service covers compute, storage and network?) Yes. We followed a similar approach for the storage part, in the sense that we offloaded the really hard stuff to either the hardware or to RADOS, for example, and then we implement the storage APIs in a much thinner, much easier to debug, much easier to deploy layer. I'll say more about that layer later, but what follows is the compute-centric view.

So we chose the OpenStack API specifications, and we chose KVM as the low-level hypervisor. OpenStack, on the other hand, kind of crosses this management barrier: you have the same software, the same database, holding data for all your VMs, and if you want to do migrations you go right into this very specific database, which also holds data on who owns this VM, how many resources they have, what kind of image the VM was created from, and so on and so forth. These are two distinct things. You don't want the same kind of software doing both. You don't want to upgrade everything at once. How do you do that? What happens to your VMs when you do that? How have upgrades from the Bexar release to the Cactus release to the Diablo release worked out?

We have been able to keep VMs up and running since 2011. The same VMs have been running, apart from, let's say, power failures, since 2011. We did database migrations for our software and distinct migrations for the Ganeti software, and this has kept our operations running smoothly. So at the lower levels you need software that does VM handling, VM management, and Ganeti is great software for doing that; it has been developed since 2006. It's mature, it knows its job, administrators can use it independently of our software, and our software implements the cloud layer and the user interface layer. That's the main idea of what we do.

And it makes things much simpler. If something is buggy at the upper layers, the administrators don't care; the VMs are still up and running. They can immediately tell whether a problem is Ganeti-based or Synnefo-based. Guess which kind of problem appears most frequently, but anyway. They can upgrade things separately. They can trust the lower layer and experiment with the upper layer, because the upper layer changes much faster; cloud technologies evolve constantly. But they can trust that the VMs are going to be there, up and running.

So at this point I'd like to do a live demonstration of the basic concepts of the software, so that you know exactly what we're talking about later on, when we go under the hood to discuss exactly how these things are implemented. OK? So I have my virtual machine here. It's running Ubuntu, and this is a demo installation of the software. I'm already logged in; anyway, let me log out.
You can actually go right now, get your own account, and play with the system if you want to. So I'm logging into the system, and I'll check out the storage part first. An important concept which I didn't mention: Synnefo provides unified handling of storage for files, images, snapshots and, eventually, volumes. We'll discuss how this happens later on.

This is the storage service. I've got my containers here, my directories here. This is all implemented over the OpenStack Swift API: a web-based client for the OpenStack Swift API, plus custom extensions for sharing. OK? I've got some nice images here. I can view them; it works. And I can also share files. I can say I want to add a new user. Do I know any users? Let's add this user. OK, so this user now has read-only access. Or I can share a file publicly and get a public link, so that everybody who knows the link can download the file. I can use this same sharing mechanism for images and snapshots, because the images, the templates that VMs get made from, and the snapshots of VMs, all get stored inside the same storage system. Here are files being shared with me: there is this user, images@demo.synnefo.org, that shares big 13 GB files with me. These are the actual images that VMs are built from, and they are just files in my storage system.

If I move to the VM part: this is a web-based client that implements the OpenStack Nova, Cinder and, what else, Glance APIs. And I've got a VM running. This is a Windows 2008 VM; we run Windows in production. Windows installations get customized from start to finish: administrator password, disk resizes, everything works. You are given a password, and you can log into your VM instantly.

So I'll go create a new machine. These options here are the images I can use to create this new machine, and they are the files we saw before, enhanced with metadata, like what I can use it for and the description of the image, things like that. So I can go create a Debian VM, for example a Wheezy VM. I've got system images, which are images provided by the administrator. I've got my own images, which I can upload over the storage service; that means I can use the syncing client it provides, which uploads only the differences from a previous image I may have uploaded. Imagine uploading 10 GB images over your DSL line or your cell phone connection. Have you ever tried uploading an image over such a connection? How can it work? It works because it only uploads the differences from previous images that you, or other users, may have uploaded. I'll talk more about it later on. Then there are images shared with me (none at the moment), and public images. And if I try to use an image uploaded by another user, I get a big fat warning that this is untrusted and I'd better know what I'm doing.

So I'll go create a Wheezy VM. The system asks me what kind of hardware configuration I'd like for my VM, and I also have a choice of the storage layer. I can do standard DRBD, where my data will be replicated on two physical nodes. Synnefo does not care about the storage layer; it uses whatever Ganeti supports, and Ganeti supports multiple storage layers out of the box. Right? Standard is DRBD. In this specific deployment we've enabled local LVM-based storage, block devices over RADOS, standard DRBD, and our own custom storage layer that does very quick clones from files on the storage side itself.
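As an aside (not part of the talk itself): since this storage service implements the OpenStack Swift API, everything the web client just did corresponds to plain HTTP calls. Here is a minimal sketch in Python; the endpoint URL, account, container, object name and token are all placeholders for illustration.

    # Sketch of talking to an OpenStack Swift-compatible storage service,
    # like the one being demoed, over plain HTTP with Python's requests.
    # Endpoint, container, object name and token are placeholders.
    import requests

    ENDPOINT = "https://storage.example.org/v1/AUTH_demo"   # hypothetical
    TOKEN = {"X-Auth-Token": "0123456789abcdef"}            # hypothetical

    # List the account's containers (Swift returns one name per line).
    r = requests.get(ENDPOINT, headers=TOKEN)
    print(r.text)

    # Upload an object into a container.
    with open("wheezy.diskdump", "rb") as f:
        requests.put(ENDPOINT + "/images/wheezy.diskdump",
                     headers=TOKEN, data=f)

    # Make the container publicly readable; the web UI's "share" button
    # ultimately boils down to ACL header tweaks of this kind.
    requests.post(ENDPOINT + "/images",
                  headers={**TOKEN, "X-Container-Read": ".r:*"})

The point of implementing the published API, rather than sharing code, is exactly this: any Swift client or library should be able to issue these calls unchanged.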
So the user uploads an image, and no data gets moved for VMs to be created: we reuse the exact same blocks, if that's the storage type the user wants. Let's create a standard VM. I can inject my SSH keys; the infrastructure, Synnefo, will take care of injecting the keys inside the image. This happens in an isolated way, because this is untrusted user image data. And the VM is being created; I've got an initial password; that's it. Now the infrastructure will copy image data from the storage side to the actual disks, because this is DRBD, and this is going to be a long-running server, a mail server say, so it makes sense for the user to ask for this kind of storage. (It's taking quite a bit longer than it should; anyway.) But what if it were a volatile VM? What if I wanted to start 10 or 20 VMs? It wouldn't make sense to create them all as DRBD VMs, right?

It should start... I don't know why it's not starting. Anyway, I've got other nice views, like this one here, and it should start; perhaps the system is a bit busy right now. Let's move on, and it will start.

Networking: in the current version we can do private networks, layer 2 networks, with multiple ways of implementing them. How can you do a thousand, two thousand, five thousand virtual networks over physical infrastructure? Do you use a different VLAN for each user? That cannot scale: our very expensive Cisco Nexus switches cannot scale past 300 or 400 VLANs over all trunk ports. So how can we do it? It's pluggable on the Ganeti side; we have pluggable implementations, and I'll talk more about it later.

And this one has started, so let's see what's going on. It copied the image, which took about 13 seconds, and then it started a special customization VM; we can actually watch it as it happens, customizing the VM: inserting passwords and files, enabling users and so on. And the machine is up and running, so I can connect to it and see that everything works. This is my machine now.

Let's create an Archipelago machine this time. I'll do a thin clone: I'll provision the volume as a thin clone of an image. Nothing changes, except that no image copying is actually going to take place; the volume is provisioned instantly, and then customization runs. Starting the image copy at 13 minutes 52 seconds... and the image copy is finished. Why? Because no copy actually took place. And imagine doing that for 10 or 15 or 20 VMs. Actually, you don't have to imagine, because I have a demo from the command line that does exactly that.

This is the user dashboard, provided by the identity component. We've got multiple identity methods; I'll talk more about them later. Let's see how I can access the system: I have a Keystone-style URL, essentially, and I've got a token. I'll use the command line client, which we call kamaki, to list the servers. (Is this font readable? I'm not sure; let me change it to something more appropriate. Is this better?)
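Again as a sketch rather than the project's actual code: because the compute side implements the OpenStack Compute (Nova) API, a "create server" like the ones in this demo is a single authenticated POST, which is what kamaki does under the hood. The endpoint, token, image UUID and flavor ID below are placeholders.

    # Sketch: creating servers through an OpenStack Compute-compatible
    # API. Endpoint, token, image and flavor identifiers are placeholders.
    import requests

    COMPUTE = "https://compute.example.org/v2"        # hypothetical
    HEADERS = {"X-Auth-Token": "0123456789abcdef",    # hypothetical
               "Content-Type": "application/json"}

    def create_server(name, image_uuid, flavor_id):
        body = {"server": {"name": name,
                           "imageRef": image_uuid,   # image, as listed by the API
                           "flavorRef": flavor_id}}  # hardware configuration
        r = requests.post(COMPUTE + "/servers", headers=HEADERS, json=body)
        r.raise_for_status()
        server = r.json()["server"]
        # The create response carries the generated admin password.
        return server["id"], server["adminPass"]

    # Spawning a ten-node virtual cluster is just ten such requests;
    # the cloud layer queues and executes them asynchronously.
    for i in range(10):
        print(create_server("node-%d" % i, "IMAGE-UUID", "1"))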
This is the list of virtual hardware configurations I can have, so I'll pick this one here, a small Archipelago machine: 1 CPU, 500 MB of RAM, 20 GB of disk. I'll have a look at the images that are available for me to build VMs from, and I'll use this one, a nice Wheezy image. So I'll create 10 machines from this same hardware configuration and this image, and I can even inject my SSH key so I can log in afterwards. And... it doesn't work, because I should have said "server create". kamaki takes care of issuing the ten creation requests, and I can see the web interface getting updated. We didn't even have time to switch windows, but it happened anyway: the VMs get created pretty fast. All requests are queued and executed concurrently, so even while some machines are still being initialized, other machines are already running. This happened in, what, 5 seconds? Machines are already running. I've created a virtual cluster of about 10 machines in, what, one or two minutes, over Archipelago, over our custom storage infrastructure. And finally, 3 machines remain building... 2 machines remain building... anyway, I'll destroy everything now. (There's another demo we could run in half the time, from the development version of the software, which can also take snapshots and then clone a snapshot into a running VM.)

So how does this all happen? Ganeti manages VMs. Ganeti comes with support for multiple storage backends out of the box: it supports LVM; DRBD; local or shared files, let's say on NFS; and the RADOS block device, support which our group actually contributed to Ganeti. It also supports the external storage interface, which allows the administrator to write a small set of custom scripts to manage any kind of external storage, a storage array for example; this has also been contributed by us. And it's easy to integrate with Ganeti: there's a nice remote API over HTTP, and that's what Synnefo uses.

That's the overall view of the architecture. This is Synnefo at the cloud layer. The compute part manages multiple Ganeti clusters; the storage part has either an NFS or a RADOS backend, and these are pluggable, distinct drivers. So we've separated the cloud implementation from the lower-level cluster implementation, and RADOS, to your earlier question, takes care of managing the lower-level objects, replicating them and so on. And the administrator always has a side path to go in and manage the VMs directly.
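To give a feel for how thin that integration can be: Ganeti's remote API (RAPI) is JSON over HTTPS, by default on port 5080, and changes are submitted as asynchronous jobs. A minimal sketch, with placeholder host and credentials (a self-signed certificate is assumed, hence verify=False):

    # Sketch: talking to Ganeti's remote API (RAPI), the HTTP interface
    # the cloud layer builds on. Host and credentials are placeholders.
    import requests

    RAPI = "https://ganeti-master.example.org:5080/2"   # RAPI default port
    AUTH = ("rapi-user", "secret")                      # hypothetical account

    # List all instances on the cluster, with details.
    instances = requests.get(RAPI + "/instances?bulk=1",
                             auth=AUTH, verify=False).json()
    for inst in instances:
        print(inst["name"], inst["status"])

    # Submitting a change returns a job ID; Ganeti executes jobs
    # asynchronously, and the caller polls /2/jobs/<id> for the result.
    job_id = requests.post(RAPI + "/instances/node-1/reboot",
                           auth=AUTH, verify=False).json()
    print("submitted job", job_id)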
[To a question from the audience:] No, everything works the way Ganeti knows how to make it work. If you choose a DRBD backend for your VM, Ganeti will choose a primary and a secondary node, and you can do all the replace-disks operations, whatever you may need, to keep the VM up and running. Our administrators go and manage DRBD from this side path every day, as nodes go up and down, without Synnefo needing to know anything about it.

Our identity component: you can log in with a standard username and password; it has LDAP integration; it has Shibboleth integration if you want to do federated logins; and Google, LinkedIn and Facebook as third-party providers. We've written these as proofs of concept; they work, and we've actually enabled them on our demo infrastructure, so you can go check them out. There's a single dashboard for users to view their quotas and their profile, enable or disable authentication methods, and so on.

Our compute component is a thin layer, written in Python and Django, over Google Ganeti, over multiple Google Ganeti clusters. The whole networking implementation is a separate, pluggable part, implemented with scripts around Ganeti. We've actually tested, and run in production: virtual networks as distinct physical VLANs; virtual networks over a single VLAN with filtering based on MAC addresses, which is a nice hack; and we've also integrated a custom VXLAN solution, again without Cyclades, the compute part, knowing about it. It's all pluggable inside Ganeti. (You're asking what the VTEP, the virtual tunnel endpoint, is, and where it sits?) We've tested with a userspace implementation; it was more or less a proof-of-concept thing. It runs on the hosts, and it uses multicast for endpoint discovery. No, we haven't run this one in production; it's our own implementation, pluggable inside Ganeti. I'm mentioning it to show how one can plug in distinct implementations: the administrator can even write custom scripts, very easy stuff, and plug them into Ganeti without the cloud layer, or even Ganeti itself, knowing.

How do we interact with Ganeti? When new requests come in, we issue them to the Ganeti clusters. When things change on a Ganeti cluster, either because our requests take effect or because the administrators manage the VMs directly, Synnefo is notified, the information flows all the way up to the user, and the user sees the effect of their requests, or of the administrator's actions.

Our storage part: files, images, snapshots, everything is stored on the storage part. Everything that goes into the storage part is chopped up into blocks of 4 MB, and these blocks are content-addressable, so we can do efficient partial file transfers. On the client side, if everything is content-addressable, the client can hash its local data and ask for the creation of a specific file on the server, and the server only has to reply with the missing blocks. Right? I mean: I want to upload a 10 GB file; I tell the server it comprises these hashes; the server replies "I've already got most of the file, I'm only missing this part and this part"; and the client uploads only the missing parts. This works for files and for images, and it can work in the opposite direction for downloading, a snapshot perhaps, for an off-site backup or something. It's this part here. We've got two drivers: one is NFS-based, so you can use your existing infrastructure, and one is RADOS-based, so you can run over Ceph.

Now, how does it all tie together with volumes, and how do we do thin clones over such images? This is the image of a VM: a frozen VM state, and we spawn it into a VM.
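The hash-based partial upload just described is easy to sketch. The 4 MB block size is the one from the talk, but the endpoint, token and wire format below are invented for illustration; the real protocol belongs to the storage service.

    # Sketch of the content-addressable partial upload described above:
    # hash the file in 4 MB blocks, ask the server which hashes it lacks,
    # and upload only those. Endpoint, token and response format are
    # placeholders, not the service's actual protocol.
    import hashlib
    import requests

    BLOCK = 4 * 1024 * 1024                          # 4 MB, as in the talk
    API = "https://storage.example.org"              # hypothetical
    TOKEN = {"X-Auth-Token": "0123456789abcdef"}     # hypothetical

    def block_hashes(path):
        """Return the ordered list of (hash, offset) for a file's blocks."""
        hashes, offset = [], 0
        with open(path, "rb") as f:
            while True:
                data = f.read(BLOCK)
                if not data:
                    break
                hashes.append((hashlib.sha256(data).hexdigest(), offset))
                offset += len(data)
        return hashes

    hashes = block_hashes("wheezy.diskdump")

    # Ask the server to create the file from this hashmap; it answers
    # with the hashes it does not already have.
    r = requests.post(API + "/images/wheezy.diskdump?hashmap",
                      headers=TOKEN,
                      json={"hashes": [h for h, _ in hashes]})
    missing = set(r.json()["missing"])

    # Upload only the missing blocks.
    with open("wheezy.diskdump", "rb") as f:
        for h, offset in hashes:
            if h in missing:
                f.seek(offset)
                requests.put(API + "/blocks/" + h, headers=TOKEN,
                             data=f.read(BLOCK))

If another user has already uploaded a similar image, most blocks are deduplicated away and the "missing" set is small; that is why a 10 GB upload over a DSL line can be feasible.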
Then the VM has a life of its own, and later we can freeze it back into an image of its own. But it all comes down to storage: we've got a snapshot of a root disk, we clone it, and then we can snapshot it back into a frozen disk plus custom data. How does this all happen over commodity hardware? And why do it? Because I'm a researcher, I want to run a parallel application on 10 or 20 nodes, and I can spawn all these nodes from a single golden image that I have created and uploaded (partially, with the hash-based mechanism).

These are the virtual disks of two VMs, and they contain multiple blocks; every disk is a linearly addressed set of blocks. How do I go from these blocks to my storage, whatever that may be? We've chosen to use RADOS in production, but this wasn't always the case: we used to run over NFS for a while, because back then we felt it was more stable for us. And what if you have some other storage solution? How do you go from virtual disks to whatever storage solution you may have? We've written a custom layer, we call it Archipelago, that maps these blocks to individual objects on the storage layer. If it's NFS, they're files on NFS; if it's RADOS, they're objects on RADOS. And you can migrate from one storage solution to another by migrating these objects, without the VMs knowing. The actual maps, the actual information about how blocks are mapped to objects, are also stored inside your storage part. (On your question: I think the current code does not have support for this, but there's no design reason not to; you could.)

So this is a more technical view. This is the compute part of the cloud; this is the storage part; this is the RADOS cluster, the monitors and the RADOS storage nodes. The VM disks and the actual file contents are both stored as objects on RADOS, so users can upload their image, have it become objects on RADOS, and then we thinly create the volumes that the VMs use. More or less, Archipelago is unified storage for files, images, and volumes created from these images; and then we can snapshot a machine back into a file that the user can view, download, copy, share, do whatever the user likes to do.

This is the URL for the software. You can read the documentation, download it, and try it out if you want; please provide feedback if you actually do try it out. Another feature that's under development right now is snapshotting, so let's go see it work on another installation. This is an installation of the development version of the software; it hasn't been released yet. I'll use two distinct accounts, my GRNET account and my Gmail account, to demonstrate snapshot sharing.

I'll create a new virtual machine. This release has support both for images and for snapshots, and again, snapshots are either system-provided, my own, shared with me by other people, or public. And because we have the same storage layer underneath, we can reuse the same code and the same sharing mechanism. So I'll create a new Debian machine. I don't have to write down the password, because I've already injected my SSH keys. The machine gets created almost instantly; it's done. So I'll connect to it (provided I can copy and paste properly) and do some sort of customization. Say I'll go mess with the message of the day, and write "this is a, I don't know, volume to be snapshotted, welcome". OK, that's my machine; I've made some sort of custom change. Now I'll come back to the compute control panel and create a snapshot of it.
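The map-based thin clone deserves a tiny illustration. This is not Archipelago's code or on-disk format, just the concept: a volume is a map from block index to object name, a clone starts out sharing the image's objects, and a write copies only the object it touches.

    # Illustration of the block-to-object mapping idea behind a layer
    # like Archipelago (invented for this sketch, not its actual code).
    BLOCK = 4 * 1024 * 1024

    class Volume:
        def __init__(self, object_map, store):
            self.map = list(object_map)   # block index -> object name
            self.store = store            # dict-like: object name -> bytes

        @classmethod
        def thin_clone(cls, image):
            # No data copied: the clone references the same objects.
            return cls(image.map, image.store)

        def read(self, block_index):
            return self.store[self.map[block_index]]

        def write(self, block_index, data):
            # Copy-on-write: store data under a new name, remap one entry.
            new_name = "vol-%d-%d" % (id(self), block_index)
            self.store[new_name] = data
            self.map[block_index] = new_name

    # Usage: a 3-block "golden image" and two instant clones of it.
    store = {"img-0": b"a" * BLOCK, "img-1": b"b" * BLOCK,
             "img-2": b"c" * BLOCK}
    image = Volume(["img-0", "img-1", "img-2"], store)
    vm1, vm2 = Volume.thin_clone(image), Volume.thin_clone(image)
    vm1.write(1, b"x" * BLOCK)            # only vm1's map changes
    assert vm2.read(1) == b"b" * BLOCK    # vm2 still sees the image block

Snapshotting is the same trick in reverse: freeze the current map under a new name, which is why, in the demo, snapshots appear to be created instantly.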
This snapshot will have a name; it will have a description, like "I want to share this with my GRNET account" (I'm logged in with my Gmail account right now); it will have a nice name; and that's it, the snapshot got created. Now, where does the snapshot live? The snapshot gets created instantly, because you only have to copy the map, and it lives on the storage part. So I could even sync it in a Dropbox-like manner: we have syncing clients, you put your own files, images, whatever in, and you can download your own snapshots out. And this is the snapshot I just created, it must be this one, so I can go share it, reusing the same mechanism we use for any other kind of file, and share it with my other account.

Now I log out of the system, log in with my other account, and, predictably, go create a new VM. (I had already tried this before; that's why there is a VM here already.) System snapshots: none. My own snapshots: one, from a previous run; that's not the one I just created. Shared with me: this is the one I want to use. There's a big fat warning, "do not do this if you don't know what you're doing", and it's shared by my Gmail account. I'll create the machine. When it's done, and it was up in what, 14 seconds, I'll connect to it. It takes a while to boot, and this will be a clone of a machine snapshotted by a different user and shared with me; or at least that's what we hope is going to happen. And the system shows that this is indeed a clone created from a snapshot that another user shared with me.

So that's more or less it. If there are things you'd like to talk about or to ask, please feel free to do so. We have an administrator's guide, and a quick installation guide on two nodes, for example. It's currently based on Squeeze; the repositories contain Squeeze packages, with instructions for installing, verifying the operation, and eventually having your own compute and storage cloud on one or two nodes. We also have a live CD that you can download and run if you want to test out the software instantly. It's all on synnefo.org; I think I showed the URL earlier, so it should work; and if anything doesn't work, please contact our support mailing list and tell us what doesn't work.

Yes, please. [Question about what happens when a physical node fails.] It depends on the kind of migration you achieve at the Ganeti level. If you live-migrate the VM before the physical node goes down, nobody's ever going to notice anything, because Ganeti will take care of live-migrating the VM to another node. The VM will keep all its state and continue to run happily; it may miss one or two packets over TCP or something, that's it. If the physical node goes down because you've pulled the power plug, then you cannot keep the state, because there is currently no mechanism you can run in production that keeps the CPU registers synchronized among nodes. But then you can ask Ganeti to fail over the VM to another node, and the VM will boot there in a few minutes or so. (There's a small sketch of these two Ganeti calls just below.) And you can imagine the performance of keeping a virtual CPU synchronized over the network; I mean, even DRBD-synchronized storage over the network has quite some latency, right? I can't imagine live-syncing CPU registers over the network. There have been research approaches, but I'm sure nothing like that runs in production currently.

Any other question? [Question about propagating image changes to clones.] This is an infrastructure-as-a-service cloud, so if you snapshot a virtual disk and then create new virtual disks from it, there is no way one could synchronize the contents of virtual disks that have lives of their own with the contents of the original virtual disk. It's all blocks; the infrastructure knows nothing about the VMs, has no direct access inside them, and cannot SSH into them. So there is no way we can propagate block changes, and no way we would want to, because stuff would break horribly.
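Picking up the failover and live-migration answer: at the Ganeti level those two recovery paths are just two different job submissions. A hedged sketch, with the same placeholder host and credentials as the earlier RAPI example:

    # Sketch: the two recovery paths discussed above, as Ganeti RAPI
    # calls. Live migration keeps the running state; failover restarts
    # the instance on its secondary node. Placeholder host/credentials.
    import requests

    RAPI = "https://ganeti-master.example.org:5080/2"
    AUTH = ("rapi-user", "secret")

    def migrate(instance):
        # Planned maintenance: move the running VM, state intact.
        return requests.post(RAPI + "/instances/%s/migrate" % instance,
                             auth=AUTH, verify=False).json()

    def failover(instance):
        # Node already dead: boot the VM from its replicated (DRBD)
        # disks on the secondary node; running state is lost.
        return requests.post(RAPI + "/instances/%s/failover" % instance,
                             auth=AUTH, verify=False).json()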
[To an audience remark:] Exactly. Of course: you create the VMs, then you SSH into the nodes, you install stuff, you use Puppet, and then you're happy. We've got software called snf-image-creator that will analyze your system, find out which distribution you run, where your root file system is, and what kind of users you have. [To a question:] No, that cannot work, because it won't allow you to create a 4 GB virtual disk from a bigger snapshot; it will never truncate the disk. If you have no metadata about the snapshot, it will just fill the first 20 GB of data, let's say, and leave the rest empty. If you do provide the metadata for the snapshot, if you say this is the root file system and it's a Debian machine, it can resize the file system so that it fills all the space.

Yes, please. [Question about administering Ganeti directly.] No, and that's the good thing: Ganeti only knows about VMs. It knows nothing about users, nothing about the relations between VMs, nothing. So you can use any Ganeti management tool you like. We prefer the command line, but there are graphical tools, let's say Ganeti Web Manager. You can do whatever you like as an administrator. Ganeti keeps its own state; it's not kept in our database. It has its own state-keeping mechanism, and it takes care of synchronizing this state among what Ganeti calls master candidates.

It's super easy to scale linearly, just by adding Ganeti clusters. We're running about 15 clusters in production right now. If we had the hardware (that's our main problem right now), we could scale linearly to, let's say, 30 clusters. You just say "I want to add a new backend", you set up your cluster, and that's it. Any other question? We run 15 clusters in production right now; that's how we scale. And you can have distinct clusters with distinct performance or quality-of-service characteristics: you can route your users to, say, clusters that are SSD-enabled, or that run one core per VM; you can route specific users to specific clusters. (A toy sketch of such routing follows below.)

And it's interesting, I forgot to mention this and I should have: running Synnefo in production, combined with Ganeti, we were able to do rolling hardware and software upgrades. We upgraded kernels, we upgraded Ganeti itself, we upgraded Synnefo, and the users didn't notice. I mean, they did notice when Synnefo was down, but that was just the control path. We moved VMs from data center to data center; we abused the Ganeti cluster to span both data centers at one point, and the VMs left Intel machines and found themselves on AMD machines in a different data center. We did on-the-fly migration from NFS storage to RADOS storage, because they're just objects that Archipelago manages. We renumbered all VMs. Imagine this kind of thing happening on an OpenStack or other deployment, and what kind of hacks you'd have to do in the database, for example. I've heard a few horror stories; I've never actually run OpenStack myself in production, so I'd like to hear more about your experiences with that, of course.

[Question: how would Synnefo compare with OpenStack?] I mean, as an approach, I prefer the Ganeti approach, because it's more self-contained. It does a simple thing, it does it more or less perfectly, it does what it promises; everything else is other people's problems, and you can treat them separately. If you have a single piece of software that spans everything from the hypervisor all the way up to the user, it's much more difficult to manage.
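That "scale by adding clusters" model implies a small allocation decision in the cloud layer: every create request must be routed to one of the Ganeti backends. A toy policy, for illustration only; the names and the policy are invented here and are not Synnefo's actual allocator.

    # Toy illustration of routing VM-creation requests to one of several
    # Ganeti clusters ("backends"). Invented names and policy.
    BACKENDS = [
        {"name": "cluster-ssd", "tags": {"ssd"}, "free_ram": 512000},
        {"name": "cluster-01",  "tags": set(),   "free_ram": 1024000},
        {"name": "cluster-02",  "tags": set(),   "free_ram": 256000},
    ]

    def pick_backend(flavor_ram, required_tags=frozenset()):
        """Pick the eligible backend with the most free RAM (in MB)."""
        eligible = [b for b in BACKENDS
                    if required_tags <= b["tags"]
                    and b["free_ram"] >= flavor_ram]
        if not eligible:
            raise RuntimeError("no backend can host this flavor")
        return max(eligible, key=lambda b: b["free_ram"])

    print(pick_backend(8192)["name"])           # -> cluster-01
    print(pick_backend(4096, {"ssd"})["name"])  # -> cluster-ssd

The point is that the policy lives entirely above Ganeti: adding capacity, or a cluster with special characteristics, means adding an entry here, not touching the clusters that are already running.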
That's what our experience has been from evaluating OpenStack, both in the early days and more recently; though, as I said, I haven't actually run OpenStack in production myself (we were too afraid to do so, essentially). All these modules around Ganeti are, again, self-contained, and each does one very specific thing, so if something breaks, you more or less know where to look. So how do you scale? You add new Ganeti clusters. Why is that good? Because you can have different network and storage backends for different workloads, and this choice goes all the way up to the user.

And that's more or less it. If there's anything you may want to ask, please find me after the talk and we can discuss it. Thank you for being here.