Hello everyone. I'm going to talk about OpenStack and XenServer. The crux of this talk is to set the scene for the BoF later: we're going to discuss how we look at evolving XenServer upgrades and deployment, the particular problems with contributing such code, and that kind of thing. To set the scene, I first want to talk about how we deploy OpenStack at Rackspace. A brief overview.
We have a pre-production environment to test everything, and in there we're deploying basically a Python virtual environment that contains all the Python code to run OpenStack and all its dependencies. So the Python code we can rev quite quickly, and that's contained in this tarball. In terms of actually distributing that to all the different hosts, we actually use BitTorrent to push out this tarball to all the different machines. At that point we can have a directory with the old code and a directory with the new code, and we have the ability to switch symlinks between those as and when the rest of the system decides to do that. Cool. So the next thing is: cool, you've got new code and old code, but it's not really doing a lot, it's just using disk space. So what we need to do is have a way of orchestrating this switch-over. I'll go into this in a bit more detail, but basically we have to turn the old stuff off, upgrade the DB, and turn the new stuff on. It's kind of painful right now, so we're working on mitigating those issues within OpenStack, and I'll touch on those a bit later. But fundamentally we're using MCollective to control these nodes: to turn off the old stuff, switch symlinks, and, when we're ready, turn stuff back on. And OpenStack, the way it's built, is quite loosely coupled with queues, so it can actually deal relatively well with doing this kind of thing. There are some interesting properties of the system that we're making use of to make that a plausible thing to do. Cool. So that's a kind of vague idea of what's happening. So I wanted to look more specifically at how OpenStack modifies the XenServer host and what that looks like. That's where we dig into this picture here. This whole thing is a representation of one of the XenServer boxes in our private cloud.
So I want to see what that looks like. First of all, it's a XenServer box. It has a dom0, which is good; this is what you expect. This is generally speaking pretty much a standard installation of XenServer, albeit PXE-booted and automated. Now, OpenStack has a sort of problem in that it needs to do certain things to the XenServer platform that aren't in the XAPI interface. So we use XAPI plugins, which are Python scripts, to extend what XAPI can do. That way we can run code within dom0 to do really useful things like download the VHD image and put it into the SR, talk to the guest agent through XenStore, talk to XenStore live before that was possible through XAPI, and all such things. Well, let's be honest, they're workarounds. So we need to work out how to get this kind of thing upstream into XenServer and get the innovation happening in the right place. Because in a similar way to how XenServer goes, "oh hell, we just need LVM to change in this tiny little way, so we'll patch LVM", we just need this tiny bit of extra functionality in XenServer, so we have a XAPI plugin. Which is kind of cool. It's separate, but it's also a couple of files of Python code that we can't really unit test very well, sitting in the Nova tree basically pissing people off. So we kind of need to do something about that. Sure, we should probably unit test it, so we need to do that first; we've got to work on that. So that's the dom0 piece. That's what we're doing with dom0: we're putting some plugins in, so it's not too invasive, generally speaking. Apart from the fact that we might install Python 2.6, because one of the plugins might need it; but let's not talk about that. The next piece is that right now in OpenStack each hypervisor has its own Python process called nova-compute that does the controlling of that hypervisor. It listens on a queue, an RPC queue, generally speaking RabbitMQ or some similar thing.
It sits listening on RPC, waiting for requests to come in, such as start VM, shut down VM, suspend VM; you get the idea. In the XenServer world, what we do is we actually have this in a domU VM. So a standard VM, a standard PV VM usually, running on the XenServer host, runs the Python processes and some supporting monitoring stuff. The great thing is that if we, for whatever reason, crash the domU, we don't lose any instances, right? So there's a good separation there. It's almost a bit like stub domains, apart from it's not actually talking to Xen directly; it's just talking to XAPI. So it doesn't have any special Xen privileges as such; it's just running there looking after the world. That's where a lot of the tarball stuff comes in: we get the new code that goes in there, we restart nova-compute, and it starts talking to the new RPC interfaces and comes alive, and at the same time we'll be talking to dom0 to update the XAPI plugins that are living in there. Cool. So when I was putting this presentation together (this is taken from some other slides) I was playing with Prezi, because it's kind of fun, because you can zoom in on stuff. So I'm trying to represent how we layer our systems here by zooming in a lot; I'm sorry if people get travel sick. Anywho. So this, I said, was a XenServer host running in our private cloud (we call it Inova, I can't remember why) that runs our control plane. So inside here, if I remember what I did in my slides, we actually run the control plane in the domUs that are running in that private cloud. So our rabbit queues, the MySQL servers, the API nodes, the cells controlling things; the detail is a little bit irrelevant, but there are lots of different services running there in the domUs.
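The property that makes that restart dance workable, namely that the RPC queue buffers requests while nova-compute is down, can be modelled in a few lines. This is a toy in-memory stand-in for the rabbit queue, not real OpenStack code:

```python
from collections import deque

class ComputeService:
    """Toy nova-compute: consumes RPC messages from a queue while
    running; while stopped, messages simply accumulate in the queue."""

    def __init__(self, queue):
        self.queue = queue
        self.running = False
        self.handled = []

    def start(self):
        self.running = True
        # Drain everything that queued up while we were down.
        while self.queue:
            self.handled.append(self.queue.popleft())

    def stop(self):
        self.running = False

queue = deque()
compute = ComputeService(queue)
compute.stop()                              # taken down for a code rev
queue.extend(["start_vm", "suspend_vm"])    # requests keep arriving
compute.start()                             # new code drains the backlog
```

Up to the queue's capacity and the callers' patience, nothing is lost across the upgrade window; the new code simply picks up where the old code left off.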
Now, basically what that means is that when a customer API request comes in, it goes through several things: it gets to the Nova API, goes through Nova cells in the standard way, and eventually it gets down to the nova-compute node I spoke about. These nova-compute nodes don't run inside this private cloud, because they have to run on the particular hypervisor we want to control. So everything apart from these nova-computes down the bottom is running in the private cloud, and then nova-compute is running on the same hypervisor that the customer VMs run on. Hopefully you're following me in this sort of inception-style view of our DC. Anyway, let's zoom a bit further, because Prezi would let me. So inside here, we're actually going back to that original picture, because this is a XenServer host that has the dom0 plugins in and has the compute domU, and that's communicating with the rabbit queues and the databases that are in that other private cloud. So that's what's happening. Are the blank expressions because that makes sense, or because I'm talking crazy rubbish? Bit of both. Awesome. So what you've got is several collections of hypervisors, all in cells: you've got one selection of hypervisors, with plugins and domUs in; over here, you've got another selection of hypervisors that are basically the same as the first ones, but on there you run the control plane that's controlling the other ones. And there is somewhere a machine running the control plane for those ones, but I don't really talk about that, because it's confusing enough anyway. And, to excuse the next bit, basically because Prezi would let me again: when we test OpenStack, we actually test it by running instances, spinning up QEMU, talking to libvirt, and those instances run on the HP and Rackspace clouds. So in fact, in here, we have another version of OpenStack running the test cases. Just to confuse everything, really.
That was the travel sickness thing. Sorry. So now I've kind of established what we're doing today in this world. When it comes to deploying new Nova code, we pull quite regularly from trunk. So the stuff that's in our DCs now is not the next release of OpenStack, which we just started developing last week sometime, but it's kind of on the last RC builds that are maybe released now or about to be released, I can't remember. So we're on about that level in production. We try not to slip more than a month behind, because otherwise it's much harder to get the latest bug fixes, and harder to get the latest features and everything else. So we're aiming to redeploy production probably about once a week. We don't quite manage that right now because, as I said, we're pulling down services and you have downtime for the DB migrations; but if you wait too long, that downtime becomes really long, with all the DB migrations running over huge instance tables. But anyway, the idea is to move at this kind of speed. So when you look at the DC, there's a difference between the XenServer that got deployed here when we launched, a couple of years ago or so, and the XenServer here, which we deployed last week, which is a new version; there are many differences. And the thing I'm thinking about is how we work with the XenServer platform to try and reduce the differences to as small an amount as possible. So when we're trying to ship a new feature or release a bug fix across all these versions, how can we change the platform in a way that makes this more consistent? It's an open question; I'm here to discuss it. I don't know where the slice should be. Should the slice be at the XAPI layer, and we just update to different versions of XAPI that port fixes, possibly? Should we, between certain versions, be able to keep the same version of XAPI on different versions of Xen, possibly? But that's the kind of thing.
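On the DB migration downtime point: a common mitigation, sketched generically here rather than as what Nova actually does, is to migrate big tables in small batches so that no single transaction holds locks, or the service offline, for long:

```python
def migrate_in_batches(rows, migrate_row, batch_size=1000):
    """Apply `migrate_row` to every row, a batch at a time.

    With a huge instances table, doing the whole migration in one
    transaction means long locks and long downtime. Batching keeps
    each unit of work short, at the cost of the running code having
    to tolerate a half-migrated table while this is in progress.
    """
    out = []
    batch = []
    for row in rows:
        batch.append(migrate_row(row))
        if len(batch) >= batch_size:
            out.extend(batch)   # in real life: commit this batch here
            batch = []
    out.extend(batch)           # commit the final partial batch
    return out
```

The trade-off is exactly the open question from the talk: either the downtime of one big offline migration, or code that understands a certain number of back versions of the schema while an online migration runs.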
So I'm thinking: let's go dreaming and see what we can do. Okay, hence the painting-of-the-rainbow kind of thing. I wanted to keep this as positive as possible. What I would love XenServer to be, and I tried to come up with a term for it, is a platform for innovation, or an innovation platform; that's the kind of thing I was thinking. So how do we stop treating XenServer like a black box? How do we make it easier for people to start contributing and using different bits and pieces, and maybe mixing and matching? There are ideas here, and I think some of us know a lot of the ideas and we're getting there. I think the question here is how do we get more involved, and how easy is it to get involved? We're talking about this in the Xen session, really, but it applies to XenServer too, right? We don't know when the next release is. We're not really sure what's in it. I mean, that's kind of always true in software, but this is a very extreme case of it, right? I don't want to start working on something only to find someone else is working on it. I can check on the xen-devel mailing list, but I don't want to keep firing off ideas; I need to find out which ones are being worked on and which aren't, whether someone's already doing something, and we need to get better at that, I think, as a community, and we'd love to get involved in that. So: open roadmap, open bug tracker, that kind of thing. You Google for an error, GitHub shows you a lonely commit message, and it tells you that this patch is going to fix your world, and that it's because of CA-something-something-something, which involves effectively trying to find a back channel to get someone to relay the bits of the ticket that aren't public. Which is nice; these things actually do kind of work, but it's a bit unfortunate, let's say. So we need to work, as a community, on getting better at that.
And then it probably comes to a set of things which is more about how, as an OpenStack community consuming XenServer, we can be better citizens in the XenServer world. Right now we're sort of hacking around the edges going, "we've found a new way of making it work faster and it's really cool; and look, it keeps breaking, that's odd". This is a problematic way of going about things, right? So we as a community are now working on this. I sent an email to Dave Scott earlier, because I forgot to send it before now, but we're reviewing ways of getting disk import and export working. This is one way we can make the interface between XenServer and OpenStack much cleaner, and we need a way in which our two communities can collaborate on this much better. It's not quite working right now; it just feels wrong. Another thing I suggested on the mailing list, though more in an "I've been thinking about things, I had a nice shower this morning and I've got a really cunning plan" kind of email, which is not very useful (I need to get more involved than that), is that storage pluggability is kind of an issue in the OpenStack context. From the Rackspace perspective, in the immediate future we've got the bits and pieces we need in our current product, which you can tell because it's working today and you can go buy it now. But from an OpenStack perspective, lots of people are coming in from the KVM world and writing Cinder drivers for all these exotic storage platforms, and this is cool. The thing is, it kind of works with upstream Xen because bits of it are in QEMU, so it just kind of works.
Bits of it are just exposing raw LUNs, but right now we have to go through tapdisk and take the performance hit, and there are all these little mismatches that are just not aligning properly. So there are ways we can improve this. It's about making it an innovation platform: with xenserver-core there's great promise, because we can now look at using the upstream kernel modules to get that LUN connected to your crazy, who-knows, someone-invented-it-last-week storage system, which you can then start playing with and really get the innovation going. I mean, we're working to make sure that in the code base, in the XenAPI driver, we don't have stupid roadblocks. Right now the volume code is a little bit messy; we want to refactor that so you can easily plug in new things, so it's easy to just wire stuff up. It's not just that we need XenServer to change; we need to collaborate and get this interface working so we can get this innovation happening. One of the things, when you're playing with networking, is that XenServer is using OVS, which is a great platform for innovation; it's coming with all sorts of cool things. But one of the problems is that you start setting up these tunnels and everything else, and XAPI sometimes thinks it knows what it's doing for you, and then you try and do a live migrate and you have to go and set up everything else on the other side, and they're not quite talking as friends a lot of the time. I tend to say that XAPI has a "master of the universe" kind of problem.
There are a few cases where, the way XAPI was originally built, it totally made sense for XAPI to make sure that you didn't have loose VHDs lying about beyond the ones that were needed, and to just look after stuff for you. It's really cool like that. But when you come to try and import a VHD underneath it, it goes and deletes it for you, because it doesn't think it should be there. Which is unfortunate. It's that kind of interface discussion that we need to get into, to work out what's the best way for both sides of the party, right? We don't want to make it an innovation platform that no one can install and no one can actually use, where you have to do all the looking-after and housekeeping yourself; that's crazy, right? But we need to layer stuff so we can make all those pieces fit together. That's my miniature rant; I'll get off the soapbox, possibly. So the specific thing that's on my mind right now is how we deal with smoother upgrades. We're battling this problem in Nova, and I'll come to a few things we've hit in Nova as a nice comparison, an example to look at to see what the concrete problems are. When we're running in the cloud, we've got our customer VMs on a hypervisor installed some time ago, and our customers seem to complain if we turn those hosts off, reboot them, and then restart their VMs, right? It's kind of rude. So we want a way that we can update the whole of XenServer, but without any guest downtime. That's when we start thinking: well, we've got this hotfix system, and we can start patching bits of XenServer, so we get the bug fixes through the hotfix system. And this is step A. Some of the hotfixes want to reboot your host, so you have to be a bit insistent to stop them doing that, and we're fighting the system again. Things are a problem because, in the cloud world, the person who owns the hypervisor doesn't own the VMs on it, and that's one of these obvious points that nonetheless has knock-on effects and implications. So it's about making that work better; we need to get better at doing that. And as I was saying before, a lot of this is about the drive for consistency. Any differences mean that we have to have those differences in our test environment, then we have to check against those differences, and the matrix just starts ballooning in all directions, right? So where we can reduce that, and reduce the risk of all these things, that seems like a really good thing. Okay. So: I'm a Nova guy, I'm in the room, and I figured I should probably show my wares and describe how Nova is working; and in the spirit of fairness, I'm going to rip Nova apart and show how it falls apart when it tries to do upgrades, because we're trying to evolve this as a community, so let's share our ideas here. To walk through this, let's just imagine what happens if you make a request. You come along as someone who wants a VM, and it's all very good, and you call the CLI or go to the control panel and request a VM. So there (is this actually visible? Vaguely. Apologies to those on the video, it's probably completely invisible to you), anyway, the request comes in. It's a REST request that gets processed, this all makes sense, and goes to Nova API. So we're doing quite a good job in Nova of making sure that we don't break the backwards compatibility of the API; kind of an obvious thing to say. I mean, we're working on the next version, so at some point there's going to be a breaking change, but we're trying to make that clear and bring new functionality, so that it makes sense that you'd want to change to this new API. With REST API versioning, people roughly agree on what's happening. One concession we make in Nova is that there are lots of people writing to the XML API, and they get really annoyed if you treat it like a JSON thing and start lobbing extra attributes in, because it kind of breaks their parsers sometimes, with strict parsers. So we have to make sure that
any time we add stuff in, we try and expose it to the end user. Right now we do that by saying there's an extra extension, and it's going to change the return format, by the way, so you can at least query it in advance, if you were being that sort of paranoid. In the v3 API we've got proper versioning, we've got point versions, so it's much easier: each individual extension has a version that we can bump, and we'll make it clear what that means. So that's what we've learnt from our mistakes on that front, mostly because adding an extension is another maybe 150 lines of code that's really just not needed. Anyway, what happens next is that Nova API makes an RPC request to the scheduler to say: go find me a node. So you've requested this new VM, that API request has been recorded in the database, and you make an RPC request to the scheduler: when you're ready, deal with this message, go find somewhere. So this is an internal interface, and originally we were just saying that everyone will cooperate, we're all consenting adults, nothing will possibly go wrong, because we're all Python developers. So there was no versioning or anything. That was a disaster. So all these RPC interfaces are now versioned in a nice backwards-compatible way. You've got a major version; if you bump that, everything behind is broken, but at least we've got the option if we screwed something up. There's also work on having certificates on both sides, so you can tell where a request came from, that it must be someone with my key that made it; but that's too much detail. The idea is that we're versioning those APIs, so we've got the ability to update different bits of these components out of sync. We can update the APIs first, potentially, and we can find orderings in which we can upgrade these components out of sync and leave it all working. That's the theory. The blocker is the database, because you either need all your code to know about a certain number of back versions of the DB, or we need to do online migrations, and that's an open question. It's a big issue, mostly because you don't want the downtime while you're waiting, because it's significant; that's where the problem comes. So there's a whole lot more RPC calls with exactly the same problems, and then we get down to the XenServer case. Now, one recent patch from Citrix, who are still working on OpenStack and XenServer, which is great, one of the patches from Bob there, is actually to add versioning to those plugins I was talking about. So now we've got a versioned interface between the plugins and the nova-compute code; before that, there were all sorts of weird errors where these got out of sync and people just got confused: you got missing-parameter errors and nothing worked. Anyway, there are a few issues we've seen in Nova that I wanted to share, and some of the things we've done to mitigate them, and some of the ideas. The kind of thing we're struggling with is that the looseness means we have the options for these things; we just haven't really followed through yet. So, some interesting properties of this system: if you want to rev nova-compute, while we've turned it off the RPC queue can, up to a point, queue up all the messages, so that when the compute comes back on it just picks up where it left off and works through its queue. That's what I was saying about having the API nodes: in theory, the API nodes could keep talking to the DB and putting messages on the scheduler queue, and we may choose not to consume them all the way down, but people can have read-only access, and can update queues and queue work, just by using the inherent properties of the system. Those are the kinds of things we're playing with. Anyway, this certainly gives you some leeway. And in the OpenStack world there's not just Nova: we get our
images from Glance, we get our block device connection parameters from Cinder, and we get the way OVS needs to be configured, effectively, by running a Neutron agent on the hypervisor. So there are lots of different interactions there, but they're a little bit more standard, generally through REST APIs that are versioned in the way you would expect. Okay, so this is a quick lead-in to the BoF, with some ideas; I didn't want to come here with loads of complaints and no ideas. Let's get talking and have a think. I'll just rush through this quickly, because it's basically the next session. The first idea, which I was mentioning earlier, is having a new version of XAPI on an old version of Xen. This may be a completely crazy thing to do, but it may be a really interesting way of being able to rev bits of the toolstack without having to interrupt the execution of your VMs in the cloud, right? So we can get new toolstacks. It's not easy today, because it's kind of hard to work out where the versioning is, and I know people are thinking about versioning the different components, but I guess the question is: if we do have an old version of Xen and a new version of XAPI, where do we draw the line? I don't know. Do we draw the line at libxl, and trust upstream libxl to be backwards compatible? That kind of sounds sensible; maybe that's too low. And that's clearly not all of the story, because there's also tapdisk, and talking to that, and talking to the various other bits that I've not listed here. What do we do with Storage Manager? Do we leave that alone, do we patch it? I think we patch it, but we need to decide what that means. Or do we leave xenopsd as an old version that talks to the old interfaces? Right now we're not on libxl, so do we make all the new xenopsds of a certain version talk to the older versions of xenopsd, and potentially patch those, so we can then use xenopsd to hide what everything wants to talk to underneath? I don't know, it's an idea. The other thing I was thinking about, and I've kind of alluded to it already: do we upgrade the datapath? What do we do there? We already upgrade OVS today, which is possibly a little bit dangerous, but the newer OVS versions have all sorts of goodness that we need at our scale, to be able to scale further and further; the newer OVS versions are exactly what we need. But if we had driver domains, could we have the old datapath and the new datapath, do a localhost migration, and wave our hands a bit and hope that kind of works? It's just an idea. And if we go crazy: can we try and upgrade Xen without the guests knowing? Well, probably not, but why not ask the question, eh? Cool. So that's the crux of what I wanted to talk about. The next session, the BoF, will dig into this a bit more. But are there any questions about how we deploy OpenStack on XenServer and all that kind of thing, or any ideas, if you want to go into that world?

I've got just a basic question about one of the things you had up there early on, about this import and export. Are you talking about virtual-format to virtual-format transposition, or are you talking about P2V? What exactly is your requirement or thought?

That's a very good point. So, in disk import and export terms, what happens today in OpenStack is that we get hold of a compressed VHD file, we pop it into the file system, and then we tell XAPI to rescan and hope it finds it. What we really need is just a way of importing and exporting the virtual disk images such that XAPI collaborates with us: an API that allows us to do that. Dave Scott has started some work on prototypes. In terms of conversions, from my OpenStack perspective I don't worry about that being in XenServer so much, because it's one of the tools we're trying to build in Glance. Basically, with Glance, you import your images into it and it records where they're stored, so you can go to Glance saying: this is a VHD, I actually need it to be a QCOW2 now. I mean, there's a whole other world of problems in importing, not just in conversion, but in terms of just getting data from customers' existing VMs into the cloud, it seems like a really sensible way of going about things; even if it's got a kernel and everything that we can't boot, it's just getting that data in. So that kind of thing, I think, is going to be covered by what Glance is doing. Having said that, it's kind of hard to do those conversions today with the existing tooling and APIs we've got available. So having libraries or utilities that let us easily convert between VHD and RAW (that's sort of there, but not really documented), and RAW to QCOW2, or QCOW2 to VHD, or VMDK, and the other directions, and having those available to play with much more easily, would be really useful for Glance. But that's kind of a follow-on thing, I guess.

Any other questions before starting the actual BoF portion? Cool. Thank you all for your time.