 Okay. Hello, everyone. What a fantastic turnout. Thanks for making it This presentation is about open stack and stuff yet again Six months have passed since Hong Kong and well, apparently I still have something to say so that's good for everyone So let's let's get into it For those of you who don't know me, I'm Sebastian and I work for Inovance. We are a cloud company My role at Inovance is being a cloud architect. So I mainly build design Build design and will have maintained cloud platforms too My daily activities are mainly focused on both open stack CF and also performance aspect of both of them Apart from this I devote a third part of my time to two blogging. So These are my personal blog and company blog. So don't hesitate to to check them out. We have a pretty good We nice content My assumption for this talk is that you kind of already familiar with both open stack and staff But just to let you know stuff is unified distributed massively scalable open source storage solution that allows you to Access store and consume your data through several ways such as object blocks and file system So well for this presentation, even if you're not familiar with stuff This is basically the only thing that you have to be aware of I'll do the rest during the presentation Well, this presentation is about the end state of the integration of safety to open stacks So well, I prefer to start with some kind of a bag news To give you a little bit of background During the the event a cycle we introduced a new driver to store virtual machine disks Because by default when you boot VM The the root disk of this VM ends up being a file on the fast system So there are several mechanisms inside Nova to change this behavior So the first one is a file. The second one is it was LVM so you can directly expose an LVM block to a virtual machine and Then doing the Havana's Havana cycle we introduced a driver for RBD So basically you can seamlessly boot a virtual machine within Seth The drawback of this implementation was that we were still using the old fashion way to boot a VM Which is you have to you have to take the VM from Glance, which well in this case is part of Seth so Seth is a back-end for Glance So you have to pull the image from Seth Stream it through the compute node store it into volume Nova instances and just go base And then you have to re-import it into Seth So that's that's extremely inefficient that makes the world process the world booting process really slow and This is why we ended up with Another another mechanism that is that that we are already using with Cinder for example When you want to create a volume from an image If the image is already stored in Glance By default when you upload an image into Glance and where the backing store is Seth This image is being snapshot and protected So now when you want to create a volume from an image what we do is we simply do a copy on white clone This is extremely fast and while we we can save a lot of space from this We tried in Havana to get this code, but we well we failed basically and Yet again and fortunately during the ISO cycle cycle. We we didn't make it in time The code went through feature freeze, but handed up being rejected because of really tiny bugs So fortunately Dimitri has a branch for this so if you take the branch Take the last comments you can apply all the patches and you get the copy on right cloning working when you boot a virtual machine There are packages already available for this we tested them well What proven so you can easily go into production with these packages? we this implementation is Has a Is using a really really tiny portion of the overnova code so it's not a big change for the entire Nova so no worries So if you're not using really precise or trusty you have to go through this external repository So then what's really new in ice house? Well, not that much in mainstream, but During the while this cycle we we had a kind of an issue while using a firmer images or creating a Volume from an image. Let's say you want to achieve a boot from volume We clone the image, but if the image Doesn't have a quick a row format then you will end up with something like come boot and can't find any device to boot the virtual machine It's just because Seth doesn't support QCAL to Seth already has his own implementation of sparseness for seen us of the images. So Basically, you always need to upload a row in row images Into glance so you can take advantage of the copy and write cloning It was well somehow a problem because from a public cloud perspective You can't really force all of your users to To upload row images, so we had to find a workaround for this. So now when you want either when you want to use the fmr back end or Create a volume from an image you What we do is we if we detect that the image is not into a row format Then we on the flight while converted. It's not really on the flight because we we have to download the image and then re-import it So this is why for private cloud usage. I'll recommend to always upload row Roo format images. Otherwise, you you can't really you can't really benefit from the copy on my clones Basically, if you always if you always import QCAL to images Then we we have to create a new image from this and it's just a flat volume. So that's really inefficient in terms of space another another addition that That that came with the with the eyes us really is is the ability to use a specific user To to control Nova well to access than the The staff storage so you have a specific right-limited user that has only access to a specific pool And then all of the images will be stored in this pool. So just from a security perspective. This is This is quite a good thing Unify all the things We I had previously had this this picture when I gave my presentation in Hong Kong It's just to you like for you like a little reminder that We have been doing a continuous effort to integrate self into open stack since the very beginning So as you can see it was already there in Diablo then Came with sx for Cinder well actually it was already available with the old Nova volume then as mentioned earlier we We made this work for for Nova during Havana and now I'm really happy to say that we finally closed the loop because Now we implemented this at the swift level. So I'll get into this into a second for more details, but The cool thing is that I've been wanting so much for this feature You have now you can have this unified storage layer for the objects for the block and you only have basically one technology to maintain So that's for me. That's incredible Now getting into swift Swift has a one multi back-end engine functionality By default as most of the object storage system it stores all the objects as the as a file in the first system using XFS or whatever But thanks to the Gluster guys that led the initiative to push a new what this multi engine functionality They that they called disk file. We were able to plug the rados back end. So quickly rados is the object store for first step and Well, that's that's basically how we implemented it You want if you download the latest version of swift you won't find this piece of code just because the swift guys want to renounce and enforce the the api support and they just They just say that If you want to add an extra back-end it has to live Outside the the main swift repository. So just like Gluster FS if you want to get this code you have to grab it from from stack forge But then how does it work? It's fairly simple. Um, we haven't done any modification at the swift proxy or everything Is operating at the swift object server level. So Well, basically, it's just a multi engine that is plugged Under swift object server So you can still take advantage of every swift function every swift functionality such as Meter wires api support if your application is already swift compliant, then well, you won't notice anything The main difference is the main difference here is that Uh, the way to store the object has changed obviously so You have to configure swift with a single replica because you will let self handling handling the replication in the background some of the pros and cons uh about this implementation, um Obviously if you not everyone start starts with uh an object server, uh an object need actually so you might have a production open stack cluster with With objects with with block story So you might have a already an up and running self cluster that act as a back end for glance for cinder for nova And then well all of a sudden you have a new application coming and then Well, you have to you need a swift For that so you can reuse your existing cluster, which is one kind of kind of a good thing In terms of distribution distribution supporting velocity, um The guys from the boom to with trusty made a really nice job To integrate seph uh into cloud archive So basically the latest version of swift of seph sorry that came out last week firefly the lts version Is part of trusty and will be maintained. So, uh, you just benefit from the open to ci all the tests and everything so Sometimes some operators. They just don't want to add external repositories Then in this case, that's so that's a really good thing for them. Of course, uh You must well, you definitely want to use the iris recording from seph that came with the latest release So if you want to build an api rest object server With swift you connect it to seph and then you use your iris recording Pool so you have a really big compression of your all of your data At some point you might want to have an object an atomic object store. So Bronson counts once again for me, it's kind of a good pretty good advantage to have an atomic object store It's not you don't always use object stores for archiving So and even if you do you might want to be sure that all of your data are well written and consistent as well Um, obviously you only have a single storage layer to maintain and then you don't have to hire Swift engineers and seph engineers if you have a Pretty good seph team already then you don't need to hire any other engineers for that. So that's so That's a good point as well and a single technology to maintain One thing that I didn't mention about seph is crush Crush is definitely what makes seph so unique Crush stands for controlled replication under scalable hashing It's just a deterministic algorithm that That decides where all the objects should be stored So we don't do any look up on hashing tables. Everything is done by calculations So this is first extremely efficient and you can always retrieve the location of an object even if the cluster is moving Crush is also topology where so Basically, even if you already have a seph cluster running, but you have object needs You can buy new hardware, but integrate it into your current seph cluster So and you just tell swift you just tell crush. Okay. I have to spool this set of hardware that have Really high density Really a lot of discs a lot of space and then I have an ssd of ssd machines for for cinder so you can say that's going to be for swift that's going to be for for cinder Someone maybe disadvantages, of course, you need to know seph and performance But I'll get into the performance and some of the benchmark that I run into a second About the state of the implementation of this new backend I'm quite happy to say that we have 100% coverage for functional tests and unit tests I've been playing with this feature for several weeks now and To me it looks like really production ready. So we are just about to start some pilots with some customers and well So I would put this into production We identified several use cases for this If you only have one location one data center you You just need to You have your seph cluster you have your swift entity And then you just need to configure swift with a single replica and you just need to configure seph With n replicas. So in this situation you let seph handling the replication second use case we identified is that You have several locations like real geo replication One cluster in the US one cluster in Europe one cluster in Asia In this case the configuration is a little bit different. You just use the Swift geo replication capabilities for this. So you configure swift with three replicas in this case and then you have standalone clusters in each region And they are configured with either one or two replicas Will be more efficient to have two of course. So you don't have to fetch an object if If one of the good goes down Another use case that we we identified is that you might have an existing Swift cluster, but you don't have any compute now and you just You really want to use seph because you made the good decision You want to use seph for your block for open stack? And well, you really want to migrate everything. So you can basically Start by by tweaking all of the swift object servers and connect them to your seph cluster And then play a little bit with the ring and then the buttons everything. So that's That's one of the use case that we have A little reminder before we dive into performance considerations You might be aware that swift is eventually consistent. So well Basically, it's not atomic And one storage solution is atomic seph is atomic seph is synchronous And the other is not the right method that they use and this is why one is Not atomic is that swiftly using berford aisles It's just like a really common operation when you write on your Linux machine When you do an IO it goes through the page cache, which is basically the memory and then it gets flushed by the kernel later on With seph they use all direct For the sake of simplicity, I'll say that they use all direct, but it's a little bit much complex than that more complex than that I'll say that they use all direct. So basically they just Bypass the page cache and they directly write into the disk A little bit about the object placement Swift does all of the placement thanks to the well hash algorithm also And they use the the proxy for that so all the proxy just routes all of the IO requests to all of the Swift object service In seph the design is quite a bit different because that's the client that decides where They will put and store all of the objects. It's just because they locally have They locally have the crush algorithm and that and crush once again is responsible for that About the acknowledgement If you start with three replicas Swift will Wait for two replicas to be written to say hey, okay. I'm done and I'm good Otherwise seph will wait for all of the replicas to be written to consider the operation the IO operation has done This is the platform that I used to play with this implementation and to run some benchmarks So I had one swift proxy and one seph monitor and then on the back end I had five Five burnable machines with six OSDs each. I know is the the OSD the objects storage demon So it's just responsible for storing and replicating all of the data I had a total of 30 OSDs and well When I we used a swift bench for that so the the main problem was The the swift proxy was the bottleneck in this situation So we weren't well, we weren't able to to get all the platform capabilities from this benchmark and even by replacing one object server With a swift proxy and adding a lot balancer. We weren't able to saturate also the storage side So we were kind of kind of stuck between 400 and 500 poods per seconds And well, that was a little bit frustrating for us So we we ended up with Another benchmarking tool So we were we wanted to To ensure that we didn't introduced introduced any latency and any overhead by by implemented rados for For backing all of the objects. So basically what we did is We took one swift object server and we assigned it one disk One replica we did the same for sef. So we had also a swift object server That talks to rados on the same machine with one monitor one OSD one replica And then we started to directly inject requests into these two processes Separately of course, and this is what we got first of all, of course, we measured the performance of The native the native performance of the disk. So we got 471 IOPS per seconds for 4k writes With sef, we we had almost let's say 300 For swift default, we had almost three times The sef number But once again as mentioned earlier, it's kind of obvious that you would get more performance by using the swift local storage Than the sef because it's using o-direct. So Obviously running into the memory. It's always faster than running into an order of disk But once again to ensure that We didn't introduce any overhead. We just wanted to make swift behaving just like sef So what we did is we just modified the code from swift And just you used o-direct for all the operations Then what we got is almost the exact same number Then we had with with with sef. So yes, this will be slower if you use sef for storing all of your objects But in the meantime, well, it's kind of a consensus to you If you want synchronous, uh, atomic transactions, you have to pay the price for it So for me the good thing from these from these results is that We don't have any overhead by using the the RADOS implementation and now And now you think okay. Oh my god. What is guy saying is so cool? I like to try this or how can I test it and then the good thing is that While playing with that we also built ansible playbooks for that. So If you go there, you just clone it and you just vagrant up. We use the vagrant provisioning system for ansible and then you'll got One machine well the almost the exam the exact same setup that I showed you before so one swift proxy one sef monitor and then end storage nodes, so Yeah, that's that's pretty cool Then if you want to work with real men on bare metal You can almost fight the code available on stack for it. It's currently being under review now Should be hopefully available next week. I don't know but in the meantime we we have a repository over there for this So you can just play with it. It's just a couple of files. So it's not that much to set up Now from a from a design perspective If I had to build such platform, I would certainly Go with such design Uh, so how we use usual components such as keep alive d to maintain the virtual Virtual ip and then hf proxy to load balancing and route all the requests And then I'll collocate always swift proxy with a sef monitor obviously You want to start with three nodes because uh, sef monitors are based on paxos. So paxos is a consensus algorithm. So when you want when you speak about consensus that that means crew room as well and You always need a odd number for this And well on the storage side you can just start with n storage servers But once again always collocate swift object demons with sef osd's Because as I said, uh, since the well crush algorithm is deterministic We had in a previous setup like 30 osd's Then we more or less had one chance over 30 to have a local a local heat Saying that the request goes through Wow, basically we can imagine it like this You have the client over there somewhere and then it goes through the vip Through this hf proxy the hf proxy talks to the swift proxy and then the swift proxy decides Okay, I'm going to store the object on this swift object server And the object server based on the name of the pool the name of the pg and well a lot of things from sef a lot of calculations You might get these disks this disk storing the object. So that's um Well, you can easily benefit from that And yes that that was well Uh A single a single data center, uh, as I said before a single location A single replica for swift and and replicas for sef. So sef is handling the replication in this situation Now if I had to build like a geo replication cluster I will basically use the exact same setup But individually in each location and I will just add some dns magic in the middle just to to as a client, uh Write at the the nearest possible, uh, swift proxy So as mentioned as well before The setup is changing a little bit here because swift is now handling the replication So you have to configure we have three zones three regions here. So What you do is you just set three replicas And you let sef on the background handling the replication and obviously you can still take advantage of zones and infinities with swift because Everything is happening at the proxy level Some issues that we have currently Just to be clear and don't get me wrong This is not a problem from our implementation, but it's just more about the current state of the disk file inside swift Swift needs account and db's to store metadata information about The objects and where they are stored so Basically you still need to to to set up arcing to just replicate everything across all the nodes Uh, there is a patch going on already. Uh, that is supposed to introduce a multi backend functionality as well So eventually we will be able to store to store all of the objects Well, not objects, sorry, but db's and accounts Within sef as well. So we will have the complete integration and well everything will be stored in sef Dev stack sef Have been struggling since january to get sef into dev stack Basically what I had that what I got is a please refactor dev stack and You will get your patch merged because dev stack is not flexible enough for now What we want to end up with is a directory when you where you have a new driver And then you just put the file there and then you got your new driver. So dev stack can do this now We have a session tomorrow to discuss this and hopefully we will end up with something pretty good during the geo cycle In the meantime this Uh, this works. So you can still get this patch. It will configure you glance It will deploy a sef cluster an olacool machine, of course Configure glance with it sender send the backup Eventually nova when we will get this merged So well feel free to play with it. So at least if you want to know how to configure this for another cluster The roadmap a little bit about the roadmap. We had Fantastic session on tuesday Three hours long session to discuss what we would like to see and well to be honest Back then in hong kong. We were a little bit Too much optimistic about all the features that we wanted to see into iso. So this time we tried to To be a little bit more realistic in terms of the feature that we want to To see in in juno. So these are More or less the commitments that we made and hopefully we will we will success so Dimitri has an ongoing work from for the q-cow clones. So if we just continue on this and maybe start something else afterwards We bladic will take care of using the arbitrary snapshots The thing is now when you want to snapshot a virtual machine. This is kind of a disruptive Intervention so Basically it stops the vm and then it copies the disk Locally and then it uploads it uploads it again. Well stream stream it into glance and goes into saff again We want what we want to achieve here is more or less the exact same thing as The copian right cloning so Yeah, even if it's clone you can snapshot it and then directly store it in lead So everything is happening at the seff at the seff layer and nothing is going up and forth From glens to nova and so and so on I'll continue my work on dev stack seff and hopefully we'll get this done before juno Soon as i'm ready for with this I'll try to build the ci system Within the gate just to to have a Seff job that's running and of course based on dev stack I truly hope that thanks to this we'll be able to get our patches merged more quickly And finally the this is the only feature that we are missing In terms of compatibility with all of the features available from cinder. It's the the volume migration. So josh will take care of Of this so the goal for us is if you use the multi backend we have like there are two ways to implement it Because there are two use cases The one is you use the cinder multi backend functionality and then you have multiple seff pools with several capabilities And then you want to migrate from one pool to another that's kind of the easy one and the other one is like migrating from lvm To seff or solid fire from seff or whatever Uh, merci This is a little hint just to remind you that the next open stack summit will be in paris and so you you better get ready Uh Thanks for stopping by, uh, thanks for your kind attention and now i'll be happy to to take questions So if you have one question you can go on the mic Please one question And by the way, I have my swift backup over there. So if uh, I'm not able to answer any swift question If I don't have any question, I will just assume that that was a really perfect presentation. So Thank you very much. Oh, you have a question. No, no, no question Yeah Yeah, that's kind of epic fail. I was about to do a QR code, but I forgot So you were supposed to scan this and get the presentation The presentation will be online Right away after the presentation Yeah, I'll tweet. Um, look at the innovants twitter and you will see the the link Going So how does one question? Okay, how does the um, swift, uh, gateway functionality for object compare with the native rados gateway How do you see that? What do you see the rolls there? I mean, uh, the thing is, um They are like come to to complete to complete separate functions. Uh The When you want to store an object with swift to have to go through this is a path through so you have to go through the the swift proxy and then through the object server, but um With rados gateway everything happens at the proxy at the well, let's call it the proxy as well And it directly stores all of the objects from here. So you have one on up less, uh, at least But we all agreed also during the session that this should have been implemented at the swift proxy level So if you can just Implement this at the swift as the swift proxy level. You don't have to go through all the machines or other processes at least Oh, okay. So you mean as far as um, just changing something within the swift implementation and being able to get more directly into the Yes, yes. So each if we could just directly connect the library that in those all of this Directly at the swift proxy level. We won't have to go through an object server and then Have to To store it afterwards. Thank you. Thank you. Don't see any more questions. Sorry. Thank you