Hello everyone, thanks for being here. It's a real pleasure to speak on the last day of the OpenStack Summit; I know it has been a tough week with all the parties, so thank you very much for coming. I'm Sébastien and I work as a cloud engineer at eNovance. eNovance is a multi-cloud provider: we design, build and run cloud platforms. We have several domains of expertise, among them OpenStack and Ceph. My daily job is mainly focused on OpenStack and Ceph, and I rotate between the operations, development and pre-sales teams. Apart from that, I devote a third of my time to blogging, so here are the details of my personal blog and the company blog; don't hesitate to have a look at them.

During the next 30 minutes we will discuss the Ceph integration into OpenStack, so let me briefly introduce Ceph, or do a quick review for those who are not familiar with it. Ceph is a unified, distributed storage system that started in 2006 during Sage Weil's PhD. It's open source under the LGPL license, so there is no vendor lock-in. It's mainly written in C++, and it is basically building the future of storage on commodity hardware, which is quite nice because there are no hardware restrictions: you can pick fairly diverse hardware to build your first cluster and let it evolve according to your own needs, and it's also fairly easy to run a proof of concept and do tests.

Ceph has numerous key features, such as being self-managing and self-healing. The main point is that it's a really dynamic cluster: if something goes wrong, if you lose a node or a disk, Ceph triggers a recovery process, because there are tons of health checks between the components, so as soon as the cluster detects that something is wrong it heals itself. It's self-balancing, which means that as soon as you add a new disk or a new node, the cluster is dynamically rebalanced and the data moves. Scaling is fairly painless because adding a disk or a node is easy, and thanks to the Puppet modules for Ceph and to ceph-deploy, it's now really easy to deploy and to scale Ceph.

Ceph is also unique because of a feature called CRUSH, which handles data placement. CRUSH stands for Controlled Replication Under Scalable Hashing. It's a pseudo-random placement algorithm based on fast calculation: every time we want to store an object in the cluster we compute its location, nothing is stored in a lookup table, so the placement is deterministic. It gives a statistically uniform distribution: as mentioned earlier, as soon as you add a new node the whole cluster gets rebalanced, so it's easy to take advantage of all the hardware. And it's a rule-based configuration.
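To make the "computed, not looked up" point concrete, here is a toy Python sketch. This is not the actual CRUSH algorithm, just a minimal illustration of the idea that placement is a pure function of the object name and the cluster description, so any client can recompute an object's location without a central lookup table; the OSD names and replica count are made up.

```python
import hashlib

# Toy illustration only (NOT the real CRUSH algorithm): placement is a pure
# function of the object name and the list of OSDs, so every client computes
# the same answer instead of asking a central lookup table.
def toy_placement(object_name, osds, replicas=3):
    # Rank OSDs by a hash of (object name, osd id) and keep the top "replicas".
    ranked = sorted(osds, key=lambda osd: hashlib.sha1(
        f"{object_name}:{osd}".encode()).hexdigest())
    return ranked[:replicas]

osds = [f"osd.{i}" for i in range(12)]
print(toy_placement("volume-1234/chunk-0007", osds))
# Adding an OSD only reshuffles part of the mapping; the real CRUSH map also
# encodes hosts, racks and the rules mentioned below.
```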
The really cool thing with CRUSH is that you can logically reflect your physical infrastructure. CRUSH is a map, and within the map you describe the topology of your physical infrastructure, your nodes and your disks, so it's topology-aware. On top of that you build rules, and within those rules you can specify the replication count and things like that. That's something really unique, and the nice thing is that if you have diverse hardware, say SSD-based systems and SATA-based systems, you can simply say: for this pool, store all the objects on the SSDs, and for that pool, store all the objects on the SATA disks. That's why it's so useful to have a really good placement algorithm.

Just to give you the final big picture of Ceph, this is what it looks like. Everything is built upon the RADOS object store, so everything is stored as an object, and on top of that we build several components, several ways to access and store your data. First you have librados, a library to access the RADOS cluster, so you can build your own application and read and write objects directly. It's fairly easy to plug into because it has several language bindings: Python, C++, Ruby, Java and many more.

The first component on top of it is the RADOS Gateway, a RESTful API roughly equivalent to what Amazon S3 and OpenStack Swift provide. It has multi-tenancy capabilities, supports quotas, and offers geo-replication and disaster recovery features. The second component is RBD, which stands for RADOS Block Device, and it is divided into two pieces. The first one is a kernel module, part of the mainline kernel, so you can create a device and map it on your machine and get a new logical hard drive, much in the same way iSCSI works, if you like. The second piece is a QEMU/KVM driver, so you can create images that are thin provisioned and support snapshots and copy-on-write clones, full or incremental, and it's well integrated with Xen and KVM. The last part of Ceph is the distributed file system, CephFS, a POSIX-compliant file system that supports snapshots as well.

It's worth mentioning that all the pieces are really robust, except CephFS, which is not production-ready yet. Almost, as someone said: almost awesome, because everything else is already really solid, so we are almost there to have a fully unified storage system.
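As a side note on the librados bindings mentioned above, here is a minimal sketch with the Python binding. It assumes a readable /etc/ceph/ceph.conf and an existing pool named "data"; both are assumptions for the example.

```python
import rados

# Minimal librados sketch (Python binding): connect, write an object, read it back.
# Assumes /etc/ceph/ceph.conf is readable and a pool called "data" already exists.
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    ioctx = cluster.open_ioctx('data')        # I/O context bound to the "data" pool
    ioctx.write_full('hello-object', b'hello from librados')
    print(ioctx.read('hello-object'))         # -> b'hello from librados'
    ioctx.close()
finally:
    cluster.shutdown()
```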
Now, some first considerations when building your first Ceph cluster. This part is really performance-oriented, but it's a general methodology for a first cluster.

The first thing is how to start. You need a use case, and within this use case you have to establish several things. Ideally you should be able to tell whether you are mostly doing IOPS, mostly doing bandwidth, or a mix of both, because this radically changes the way you build your cluster: if you want more I/O you will chase IOPS, and if you want bandwidth you might want a really large network pipe, for example. You might also want to establish a sort of guaranteed IOPS: ideally you would like to be able to say, I want to deliver this amount of IOPS and this amount of bandwidth to each of my customers individually. Obviously that's really difficult to state, but if you can, do it.

You definitely want to know whether you use Ceph as a standalone solution or combined with a software stack, with OpenStack or another cloud solution for example, because if at some point you have performance issues, you really want to know how Ceph works and how it is integrated with that stack, so that when something goes wrong you know where to look. Then you need to establish the amount of data you want to start with: usable data, not raw. Ceph does the replication and you specify a replica count, so you have to decide whether you want to start with two, three, four or more.

Ideally you would also like to establish a failure ratio, which means that when you build a cluster you don't really want to build high-density nodes. If you have 100 terabytes, for example, you don't really want three nodes with 33 terabytes each, because if you lose one node you have a lot of data to rebalance. So establish a percentage of the data that you are willing to rebalance when something goes wrong, according to your performance targets, because during recovery the cluster has to write quite a bit more while the clients keep writing too, and that is definitely something you want to consider. Ideally you would also have a data growth plan: if you know that every six months you are getting 10 or 100 more terabytes, that will definitely change the way you build the initial cluster; maybe you spend a bit more money up front, but every six months it's way easier to scale. And obviously you need a budget, so I won't go through budget considerations, but you definitely need something that backs all these requirements.
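Here is a back-of-the-envelope sketch of the failure-ratio point. All numbers are purely illustrative, not recommendations; it just shows why a 100 TB cluster on three dense nodes rebalances far more data per failure than the same capacity spread over many smaller nodes.

```python
# Back-of-the-envelope sketch of the "failure ratio" consideration above.
# All numbers are illustrative, not recommendations.
usable_tb = 100      # data you actually plan to store
replicas  = 3
raw_tb    = usable_tb * replicas

def rebalance_share(node_count):
    """Rough fraction of the raw data that has to move if one node dies."""
    return 1.0 / node_count

for nodes in (3, 10, 20):
    per_node_tb = raw_tb / nodes
    print(f"{nodes:2d} nodes: {per_node_tb:6.1f} TB per node, "
          f"{rebalance_share(nodes):.0%} of the cluster to rebalance on a node failure")
# Three dense nodes mean roughly a third of the cluster moves while clients keep
# writing; twenty smaller nodes mean only ~5% moves, so recovery is far less disruptive.
```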
Now, things that you should not do. Don't get me wrong, this is really performance-oriented and everything here is technically doable, but if you want to avoid unnecessary troubleshooting you might want to follow these recommendations. Usually you don't want to put RAID underneath your OSDs. The OSD is the object storage daemon, and the general recommendation is one disk per OSD. Ceph already does the replication, so doing more replication below it is rather pointless: you lose space and it's not efficient. And it's not only RAID 1 arrays; you could also do RAID 0 if you really want to boost performance, but a degraded RAID kills performance, and if you don't have the right tooling to monitor everything you can get into trouble, because the speed of your cluster tends to be the speed of the slowest disk in the entire cluster. So if you don't want to drag down all the performance, or get spikes, just don't do it.

As mentioned earlier, you don't really want to build high-density nodes for a tiny cluster, because you may have a lot of data to rebalance and potentially fill up the cluster if too much data has to move. We could argue about the last one: don't run Ceph on your hypervisors. As I said, this is doable, and at some point you might think you could get more performance, because if your storage layer and your hypervisor layer live on the same boxes you can access the cluster directly: the first read is really fast because you read locally, and the second one a bit less. But my main concern here is memory, and also consistency of the platform. Usually storage servers only do storage and hypervisors only do compute. Ceph needs memory as well, because the more memory you have, the more the system can cache, so in this setup both of them compete: Ceph wants memory and the hypervisor wants memory, and in the end you get a really bad memory situation. But that's mainly an assumption on my part.

Now let's dive into the state of the integration in Havana. What makes Ceph so good with OpenStack is that it unifies all the components. Originally it was present in Glance, then in Cinder, and more recently in Nova. So you have a single layer of storage and all the components plug into it, which is quite good because you don't need different storage solutions for one component or another; you get the same storage abstraction everywhere.

As for Havana's additions: first of all, there was a complete refactor of the Cinder driver, which now uses librados and librbd. This is really cool because we get much better error handling out of it; thanks to Josh for doing that, by the way. We have new features like flattening volumes created from snapshots: what happens in the background is that if Cinder detects that Ceph is also the backend for Glance, creating a new volume produces a clone of the image, so if you don't want too much dependency in the chain of snapshots and clones, you can flatten the snapshot every time you create a volume. There is also a new policy about clone depth, which is the same idea: as you create more and more clones, at some point you say stop, flatten back to the original image, and then continue cloning from there.

Then there is Cinder backup. It was already present in Grizzly, but the only backend was Swift; now we can do backups from Ceph to Ceph as well. You can back up within the same pool, which is not recommended because it's the same machines, so you don't isolate anything across failure domains. But if you have different pools, pointing to different machines, you do get that isolation, and ideally you use this feature for DR, with another Ceph cluster running in another data center. It supports striped RBD images, and the really important thing is that it's differential: we already do incremental backups when backing up from one Ceph cluster to another. I know there was a discussion yesterday about implementing an incremental API for backups, but it's already there if you use Ceph.
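To give a feel for the differential mechanism behind those incremental backups, here is a minimal sketch using the Python rbd binding. It is not the actual Cinder Ceph backup driver, just an illustration of listing the extents that changed since a previous snapshot; the pool name, image name and snapshot name are assumptions.

```python
import rados
import rbd

# Sketch of the RBD differential mechanism behind incremental backups.
# Pool "volumes", image "volume-1234" and snapshot "backup.1" are assumed to exist.
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('volumes')
image = rbd.Image(ioctx, 'volume-1234')

changed = []
def record_extent(offset, length, exists):
    # Called once per extent that differs since the "from" snapshot.
    changed.append((offset, length, exists))

# Extents modified between snapshot "backup.1" and the current image head:
image.diff_iterate(0, image.size(), 'backup.1', record_extent)
print(f"{len(changed)} changed extents to ship to the backup cluster")

image.close()
ioctx.close()
cluster.shutdown()
```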
And, at least for me, one of the biggest additions in Havana around Ceph is the Nova libvirt image type. Originally this flag is set to file, which means that every time you create a new virtual machine you get a file on the filesystem under /var/lib/nova/instances/<instance-uuid>, and that file is the root disk of your virtual machine. There is also an LVM implementation: you specify a volume group, and every time you boot a machine it creates a new logical volume and attaches it to the KVM process. Now, with the RBD image type, you specify a Ceph pool, Nova creates a new RBD image, and the KVM process is pointed straight at it. So you boot all your VMs directly inside Ceph. This is completely transparent; the client or the user doesn't know anything about it. It was a really big request from the community, and from our customers too: can I just boot everything inside Ceph instead of always doing boot-from-volume? Boot-from-volume is somewhat hard to automate and orchestrate, so now we can simply boot everything inside Ceph, which makes operations like live migration much easier. Yes? It's only for KVM, yes. The other question was whether it is also compatible with Xen, and no, sorry.

It's not only a Nova addition, it's part of Nova and Cinder, and it also brings QoS support, which is quite good because Ceph itself doesn't do any QoS at the moment. The I/O requests are throttled from the hypervisor itself, so this is quite useful to allocate a certain amount of IOPS or bandwidth per guest, and it's bound to Cinder volume types, so that's good.

That's the big picture of today's Havana integration. We can boot a VM and it goes into Ceph; we can attach a volume, which calls Cinder; and we can also do nova evacuate, which was the point just before the question. Live migration is much easier when everything is in Ceph, because you only have to move the KVM process and reconnect it to the RBD image, so it's really fast. It's also fairly easy to trigger nova evacuate: if you lose a compute node you can do either nova evacuate or host-evacuate, because the disk is already in the Ceph cluster, whereas if the disk lives directly on the hypervisor it's quite hard to reboot the virtual machine elsewhere. And the workflow is just as I explained earlier: we have multi-backend capabilities, so as soon as we create a volume we do a copy-on-write clone, and we do our RBD incremental backups to the second location.
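To make the libvirt image type flag a bit more concrete, here is a hedged sketch of the Havana-era nova.conf options involved. The option names are from memory and should be checked against your Nova version; the pool name "vms" is an assumption.

```python
# Hedged sketch of the Havana-era nova.conf flags behind "boot everything in Ceph".
# Option names are from memory (check them against your Nova release); the pool
# name "vms" is an assumption for the example.
NOVA_RBD_FLAGS = {
    "libvirt_images_type": "rbd",                          # instead of file or lvm
    "libvirt_images_rbd_pool": "vms",                      # Ceph pool holding instance disks
    "libvirt_images_rbd_ceph_conf": "/etc/ceph/ceph.conf", # cluster to talk to
}

# With something like this in place, each new instance disk becomes an RBD image
# in the "vms" pool and QEMU/KVM talks to it over librbd, so live migration only
# has to move the KVM process, not the disk.
for key, value in NOVA_RBD_FLAGS.items():
    print(f"{key} = {value}")
```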
But is Havana the perfect stack? Unfortunately, we are almost there, I would say; we are missing a few really small features. The problem is that we were about to submit a new patch, and it was rejected because we were just after the feature freeze. So currently, when you create a new VM, Glance downloads the image and streams it to the compute node, and then it has to be imported into Ceph, which is quite inefficient. But Josh already has a patch in the pipeline: the idea is to do the same thing we already do with Cinder, so when we create a new VM and the image is already in Glance, we just do a copy-on-write clone, which makes booting a new VM really fast. That will probably land in a bugfix release of Havana.

Another thing that is not implemented is Ceph-level snapshotting. Right now, even if you boot a VM inside Ceph, taking a snapshot of the instance goes through the usual QEMU snapshot path: the image is snapshotted locally on the compute node and then uploaded to Glance. In the future we could simply call an RBD snapshot, and the operation would be nearly instant. If you are in a hurry to go into production and really want to patch everything, I think there are only three bugs, and George already built a branch for that, but that's only if you really want to fix everything today.

A little bit about the roadmap: the Icehouse roadmap and beyond. Once again this is fairly personal, but this could be the Ceph integration for Icehouse, and maybe for the J release. One thing that is missing is the ability to store snapshots and images in different pools, because at some point you may want a replica count of two for the images, for example, while snapshots potentially contain customer data, so you might want a higher replica count, like three. That's something we want to address. The copy-on-write clone work is already in the pipeline, so it's not worth listing it again for the Icehouse roadmap. Something we would like to see is volume migration support, because currently Ceph doesn't support volume migration in Cinder; volume migration is when you want to migrate a volume from one backend to another, say from an NFS backend to some other backend, and it's not supported when you use Ceph. Something we could easily implement is support for Nova bare metal. Bare metal is when you boot what looks like a VM but is actually not really a VM; it's a dedicated physical machine for your customer. Thanks to the kernel module, we could simply load it, create a new RBD device and map it to the physical host, so that could be really easy to do. There is also this LFS implementation going on, which is a storage-agnostic RESTful API that can talk to RADOS and to a Swift cluster. Ideally you would have this API in front of your Ceph cluster; it's not really a replacement for Swift, but it means OpenStack could also use the Ceph object store from the dashboard or anywhere else, so you would get a completely unified storage solution, because Ceph would be just about everywhere.
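Coming back to the copy-on-write clone and flatten mechanics mentioned above, here is a minimal sketch of that workflow with the Python rbd binding. It is only an illustration of the mechanism, not the Nova or Cinder code; the pool names, image name and clone name are made up, and the parent is assumed to be a format-2 RBD image.

```python
import rados
import rbd

# Sketch of the copy-on-write clone mechanism, using the Python rbd binding.
# Pool and image names are invented; the parent must be a format-2 RBD image.
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
images = cluster.open_ioctx('images')   # Glance pool (assumed name)
vms = cluster.open_ioctx('vms')         # Nova pool (assumed name)

# Glance side: snapshot the uploaded image and protect the snapshot.
base = rbd.Image(images, 'ubuntu-12.04')
base.create_snap('snap')
base.protect_snap('snap')

# Nova/Cinder side: each new disk is an instant copy-on-write clone of that snapshot.
rbd.RBD().clone(images, 'ubuntu-12.04', 'snap', vms, 'instance-0001_disk',
                features=rbd.RBD_FEATURE_LAYERING)

# Optionally break the dependency chain; this is what the flatten and
# clone-depth options in the Havana Cinder driver are about.
disk = rbd.Image(vms, 'instance-0001_disk')
disk.flatten()

disk.close()
base.close()
vms.close()
images.close()
cluster.shutdown()
```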
And potentially, Manila support. Manila is an initiative from NetApp, I believe, and it's a file-system-as-a-service solution. We could also add a new driver for CephFS and create distributed file systems for our customers, because a distributed file system is still a really big requirement for legacy applications.

This is the Icehouse roadmap slide; it's basically what I just said, summarized in a picture, so you have it as a reminder.

What's coming up in Ceph for the next releases? We don't have that many fancy new features for Emperor; the big ones land in Firefly, which should arrive around February 2014. We get the tiering functionality, which gives you a notion of hot and cold storage: you can have a pool backed by a bunch of SSDs, everything goes into that pool first, and periodically the less-requested data is flushed to a backend pool on SATA disks. We get erasure coding, which is more or less like RAID 5 done in a software-defined-storage fashion, so you can really shrink the raw space your data occupies. There is also ZFS support for the filesystem under the OSDs. It's a really deep implementation detail, but a good one, because it lets us do parallel writes with the journal, the same thing we are supposed to get with Btrfs; since Btrfs is not production-ready we can't use it, so it's definitely good to have the ZFS support. And obviously we will make every effort to fully support the OpenStack Icehouse release; this is both the Inktank and the community roadmap.
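As rough capacity arithmetic for the erasure coding point, here is an illustrative comparison of raw space needed under triple replication versus a k=4, m=2 erasure-coded layout; the numbers are examples, not a recommendation.

```python
# Rough capacity arithmetic for the erasure coding point above (illustrative only).
def raw_needed(usable_tb, scheme):
    if scheme["type"] == "replication":
        return usable_tb * scheme["copies"]
    # Erasure coding: each object is split into k data chunks plus m coding chunks.
    k, m = scheme["k"], scheme["m"]
    return usable_tb * (k + m) / k

usable = 100  # TB of user data
print("3x replication:", raw_needed(usable, {"type": "replication", "copies": 3}), "TB raw")
print("EC k=4, m=2:   ", raw_needed(usable, {"type": "ec", "k": 4, "m": 2}), "TB raw")
# Roughly 300 TB raw versus 150 TB raw: erasure coding trades CPU and latency for
# capacity, which is why it pairs naturally with the cold tier mentioned above.
```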
That's it for me, so I'd like to thank you for your kind attention, and now it's time for questions. I went through it pretty fast, so we have about 15 minutes. Yes?

It's more on the Nova side, but yes. It doesn't support QCOW2: every time we do this you must ensure the image is in raw format; the image is already in Glance and it has to be raw. But yes, it behaves like copy-on-write, because we do a clone: as soon as we store the image in Glance, the image is snapshotted and the snapshot protected, and then from that specific snapshot we just create a bunch of clones every time we create a new virtual machine.

Yes? I'm not really familiar with LFS, and if someone knows more about it, feel free to jump in and explain, but basically it's an agnostic RESTful API that can talk to whatever object storage backend you have, so you issue a request to it and behind it you can have either a Swift object store or a RADOS object store. It's just a way to unify the APIs and have a single abstraction layer, no matter which object store you run. And no, I really have no idea; I don't know the status or the progress of the implementation, so you should ask the GlusterFS guys, because they initiated it.

Yes, you first and then you. Encryption? I know there is something, but it's fairly specific, and there is also some other work on encryption that I have barely looked at, so I don't have a good answer about encryption, sorry. I know there is a module that may have landed for Havana, but I'm not sure what it actually does. It's always really complex, because you don't really know where to store the keys and everything; I know there is a project for that, Barbican or something, which does key management.

Okay, you and then... Well, that's a good question. I mainly work on Debian-based systems, so I don't have any particular kernel recommendation. Maybe you know, Sage? Sage is definitely the person to ask if you have more questions on that.

Yes? What is the single point of failure? There is no single point of failure. It doesn't work like Swift, for example, where you have a proxy in front of the object storage: you talk directly to the object servers, so there is no single entry point to retrieve your data. The question was whether there is any SPOF in Ceph, yes.

Largest installation? As far as I know it's around five petabytes, more or less.

Any more questions? Yes, sorry: what do people mostly use it for, is there one predominant use case? No, it's really flexible, so there is no single predominant use case.

Then: how many disks should I put into a single machine? Well, you have to take many things into consideration. What's your network bandwidth? What do you want to achieve? If you are after IOPS, you can pack a lot of IOPS onto a gigabit link; even a fraction of one SSD will fill a gigabit link, and with 10 GbE you obviously have more headroom. In terms of bandwidth, keep in mind that a single enterprise SATA hard drive can pretty much saturate a whole gigabit link, and on a 10-gigabit network you can deliver roughly 1.2 gigabytes per second at most. Also, Ceph has a specific design: writes first hit the journal and are then flushed to the data disk, so this more or less cuts the per-disk IOPS and bandwidth in half. The general recommendation is 12 disks per machine, but you can go to 24; in theory, with 24 disks you can fill the entire bandwidth of the server, just make sure the RAID controller keeps up. If you use an SSD for the journal, you avoid the penalty: with the journal on the same spinning disk you write at, say, 50 megabytes per second to the journal and flush at 50, whereas an SSD journal just absorbs it, as long as you don't put too many OSD journals on a single SSD. Based on a rough calculation, an enterprise SSD can deliver around 500 megabytes per second of sequential writes, and the journal is purely sequential, so with four OSDs per SSD each one gets a little more than 100. You need to establish your own ratio, but traditionally I recommend around six OSDs per journal SSD, which leads to 12 disks in the end, and at that point you have already filled your gigabit bandwidth. I don't know that much about InfiniBand, I haven't worked with it, so that would change everything, but with a gigabit link, that's the answer.
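The sizing answer above, as back-of-the-envelope numbers in a short sketch; every figure is illustrative and should be replaced with measurements from your own hardware.

```python
# Back-of-the-envelope version of the sizing answer above (all figures illustrative).
GIGABIT_MBPS     = 125    # ~1 Gbit/s of usable bandwidth, in MB/s
TEN_GIG_MBPS     = 1250   # ~10 Gbit/s, roughly the "1.2 GB/s" mentioned above
SATA_DISK_MBPS   = 120    # one enterprise SATA disk, sequential
SSD_JOURNAL_MBPS = 500    # sequential write rate of one journal SSD

# Journal on the same spinning disk: every byte is written twice (journal + data),
# so the effective client throughput of that disk is roughly halved.
print("co-located journal:", SATA_DISK_MBPS / 2, "MB/s per OSD")

# One SSD journal shared by several OSDs: each OSD gets a slice of the SSD.
for osds_per_ssd in (4, 6):
    print(f"{osds_per_ssd} OSDs per SSD journal:", SSD_JOURNAL_MBPS / osds_per_ssd, "MB/s each")

# How many half-speed spinning disks it takes to fill each network link.
print("disks needed to fill 1 GbE :", GIGABIT_MBPS / (SATA_DISK_MBPS / 2))
print("disks needed to fill 10 GbE:", TEN_GIG_MBPS / (SATA_DISK_MBPS / 2))
```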
The slides are going to be online, yes. I'm going to post them on SlideShare and eNovance will definitely tweet the link, so you will find them easily; I can grab your email if you want. So yes, they will be on the internet soon.

Yes, erasure coding support? Erasure, yes, okay, the erasure coding, that's what you're asking about. And sorry, what's the question, what are we doing about it? I'm not working on it myself; I'm not the main developer of the erasure coding and I'm not sure exactly where it stands, but once again, Sage knows about it.

We have two more minutes if you have any more questions. Okay, thank you very much, everyone.