Okay, so hello everybody. We're here today to talk about InfraCloud, which is a community cloud managed by the project infrastructure team. Next slide. So who are we? Hi, I'm Ricardo Carrillo; many people just call me Ricky. I work as an OpenStack software engineer at HPE, and I do mostly infra and Ansible stuff. My name is Paul Belanger; I work at Red Hat, and I'm fortunate enough to work day-to-day upstream on the OpenStack infrastructure project. My name is Colleen Murphy, and I've been working for about two years in OpenStack on the OpenStack Puppet modules team and the infrastructure team. For people who don't know, the infrastructure team, or Infra, is responsible for all of these things you see on the screen: everything from Zuul and Nodepool to Gerrit. If you are a developer or contributor to OpenStack, we are the humans behind all the services that help keep that running. Our talk today isn't really about that, we've talked about that before, but I wanted to lay the groundwork for what the infrastructure is. As for how the infrastructure works, there's supposed to be a nice little graph here. Hey, there we go. It's a very complex piece of machinery. We have Nodepool and Zuul; these are the parts that launch resources onto the clouds that are donated to us, which we use in the CI pipeline. Those resources are spread across multiple clouds, such as OVH, Internap, the OSIC cloud, and so on. Today we're going to talk about the cloud we've built on some hardware to provide resources for that. On cloud resources: at the moment, as I was saying, Rackspace, OVH, Bluebox, Vexxhost, and Internap have donated all of these cloud resources, and we're very appreciative. Before InfraCloud, our quota was about 600 total VMs that we could run at once, and that works out to about 20,000 VMs a day that we're launching and creating on the clouds.
So basically, about two and a half years ago, HPE had this hardware that they were willing to donate to the infrastructure project, and I want to say it was at a mid-cycle in Portland that we hatched the plan that maybe it was time for us to run our own cloud. This is the mission statement: InfraCloud's mission is to take the raw donated hardware and expand the capacity of OpenStack Infra for the purposes of Zuul, which feeds the testing nodes that developers use and consume. The hardware lives in an HPE data center, and our secondary mission is to start dogfooding OpenStack in general. As for the timeline, I'll go through this quickly. Around Vancouver in 2015 we got the hardware and inventoried it. This isn't brand new hardware; it was used, so there were hard drives that needed replacing, misconfigured networks, boot loaders, and so on. In Tokyo we started hacking on everything that's running now, writing Puppet modules, Ansible playbooks, and so forth. Between Tokyo and Austin we had a mid-cycle where we brought the first cloud online and ran a job on it, but shortly thereafter we had to move the hardware to another data center. So by the time Austin came around, the hardware was unfortunately still relocating; we had a bit of a blip on the radar. Between Austin and now we had a mid-cycle in Germany, and that's where we really got our first cloud online, which we refer to as vanilla. As of today we also have chocolate, our second cloud, online; we brought it online almost two weeks ago, I would say. And our third cloud, which we'll talk a little more about, is called strawberry; it's pending, and we're in the process of bootstrapping it to bring it online.
So right now, as it stands, we have two clouds online, two regions, with Nodepool already consuming them. The next step is to upgrade to Newton, because we're running Mitaka, and basically that's it, right? We have all this and we're done. Well, I'm done speaking, but I'm going to pass it off to Colleen to dive a little deeper into the actual implementation of some of this. So I'll talk a little bit about the technical details of what our cloud looks like. We use pretty much the bare minimum to get a basic compute cloud running: Ironic for bare-metal provisioning, then Keystone, Nova, Neutron, and Glance for the cloud itself, and then basics like RabbitMQ and MySQL for the foundation, no fancy message queues and no replication for MySQL. Some of the other configuration choices we've made are things like using Linux bridge with provider networks, and using config drive rather than the EC2 metadata service. Our choices are all about keeping things as simple as possible, running as few services as possible, and minimizing the surface area we have to debug. Some people have asked me: why aren't you using Cinder or Swift, don't you need some sort of storage in this cloud? The answer is that for our compute-driven workload we really only need ephemeral storage. We use the file backend for Glance; at some point we might want to try something more performant, but for now it's working just fine. We do have package mirrors, but the mirror contents are stored in other clouds, and we use a distributed file system called AFS to distribute that data into InfraCloud. I don't fully understand how that works, I think it's just magic, but I'm told it does work, so for right now we don't need any storage in this cloud. We don't use any HA technologies either; we don't have HAProxy or Pacemaker. Our high availability is based on having redundant clouds.
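To make that service inventory concrete, here is a rough sketch of the minimal layout just described. This is my own illustrative summary, not the team's actual manifests; the role names and exact service breakdown are assumptions.

```yaml
# Sketch of the minimal InfraCloud service layout (illustrative only).
controller:
  services:
    - mysql            # single instance, no replication
    - rabbitmq         # single instance, no fancy queueing
    - keystone
    - glance           # file backend for image storage
    - nova-api
    - nova-scheduler
    - nova-conductor
    - neutron-server   # Linux bridge with provider networks
compute:
  services:
    - nova-compute                 # ephemeral (local) storage only
    - neutron-linuxbridge-agent
# Deliberately absent: cinder, swift, haproxy, pacemaker.
# Redundancy comes from having multiple independent clouds.
```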
We have lots of different clouds, so if one cloud goes down, we have other clouds that can still support the workload. It doesn't really matter if one node or one cloud goes down, and that's why we're not focused immediately on making this cloud super robust; that's obviously on the roadmap, but it's not an immediate priority right now. So let's talk a little about how we deploy the cloud. We have two basic deployment technologies. The first is Bifrost, which is a deployer for Ironic written in Ansible. It runs Ironic as a standalone service, so we don't need Keystone credentials, the Compute API, or the Neutron API. What it does is install Ironic, build images with diskimage-builder, enroll all the hardware into the Ironic inventory, and then deploy to the whole inventory. If you're wondering, this is a Minecraft representation of Bifrost; in Norse mythology it's a rainbow bridge, so I thought that was fun. Our main control plane is deployed using the OpenStack Puppet modules, one of the deployment tools in the big tent. The Puppet modules by default use the Ubuntu Cloud Archive packages, so we're using the stable versions of those packages, following the release-based upgrade model rather than chasing master at the moment. You might wonder, with lots of different deployment tools under the big tent, why we would choose Puppet. I have a strong background in Puppet and I could talk about all the great things about it, but really the main reason we chose Puppet is that it's what we were already using for the rest of our infrastructure. We didn't want to pick a brand new technology, distro, or config management tool and have to learn it while also learning how to run our own OpenStack. So that was the main motivator for choosing Puppet. Now I can go into a little detail about how these two deployment tools fit together.
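Bifrost's enrollment step described above is driven by a hardware inventory file; a minimal sketch of one entry might look like the following. The hostname, addresses, credentials, and hardware sizes here are all made up for illustration, not the project's real inventory.

```yaml
# Hypothetical Bifrost inventory entry (all values are illustrative).
compute001:
  driver: agent_ipmitool
  driver_info:
    power:
      ipmi_address: 10.0.0.101    # BMC address Ironic uses for power control
      ipmi_username: admin
  nics:
    - mac: "aa:bb:cc:dd:ee:01"    # provisioning NIC
  properties:
    cpus: 24
    ram: 98304        # MB
    disk_size: 500    # GB
  ipv4_address: 10.0.1.101        # address the deployed node will get
```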
Because we have Bifrost written in Ansible, plus Puppet, we have, for historical reasons, a machine called the puppetmaster. It doesn't actually run a puppet master, though; what it does is act as the orchestrator for Ansible. First it calls out to the bare-metal server and runs Puppet there, which installs Bifrost. Bifrost is of course Ansible itself, so it then kicks off a Bifrost run, and that does the hard work of installing Ironic, building images, enrolling all the hardware nodes, and deploying them. Once we have that, Ansible again calls out to Puppet, runs Puppet on our controller, and that sets up our control plane: MySQL, RabbitMQ, Keystone, all the cloud services. After that is done, we start running Puppet on the computes, which installs the compute agent and the Neutron agent. Ansible is great because it gives us the ability to say: run the controller first, then the computes, because the computes need RabbitMQ and Keystone and all of that set up before they can start running properly. To dive a little deeper into how this works: we have a simple version of the Puppet code to deploy our Bifrost server, and it is probably the most interesting of our Puppet code because we wrote it all from scratch. All it's doing is installing Ansible, installing MySQL, installing the Bifrost source code, setting up the config file and our bare-metal inventory file, and then running a shell command to kick off the Bifrost run. The actual control plane is a lot less interesting because it's just gluing together the OpenStack Puppet modules: laying out the basics, setting up the package repository, setting up the databases, and then setting up the basic services, our Keystone, our Neutron, our Glance, our Nova.
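The orchestration sequence just described, Ansible on the puppetmaster driving ordered Puppet runs, could be sketched as a playbook like this. The host group names and the `puppet` role are illustrative assumptions, not the actual infra playbook.

```yaml
# Sketch of the puppetmaster orchestration flow (names are illustrative).
- hosts: baremetal      # Puppet here installs Bifrost; Bifrost's own
  roles:                # Ansible run then enrolls and deploys the hardware
    - puppet
- hosts: controller     # control plane: MySQL, RabbitMQ, Keystone, Nova,
  roles:                # Neutron, Glance
    - puppet
- hosts: compute        # computes last, because they need RabbitMQ and
  roles:                # Keystone up before their agents can register
    - puppet
```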
With the compute nodes it's even simpler: we're just setting up the compute service and our Linux bridge Neutron agent. After that we have our Ansible playbook, which just kicks everything off one by one, so it's really pretty simple. Ricky will tell us how we actually use this cloud in our CI infrastructure. So now we have our InfraCloud, but obviously we need to configure it somehow for CI usage, and as you can imagine that means typical things like flavors, images, that kind of thing. We like to standardize as much as possible and treat all our clouds the same, and that's the main reason we wrote a tool to do exactly that: create the OpenStack resources that will be used for CI in our cloud. We call it the cloud launcher. It's actually just an Ansible role that processes a YAML file modeling all the resources you want in your cloud. It lets you define those resources per cloud, or you can create a profile containing resources and then apply that profile to a cloud, so you can reuse the same layout of resources in multiple clouds. Here's how that YAML model file looks. We have a profiles section, where you can have a list of profiles. In this example we have a profile for creating projects, which is standard in all our clouds: two projects, one for the CI plane that contains our mirrors, and one just for Nodepool to spin up the VMs that run tests. In this other example we have a profile that defines our flavors; this is how our Nodepool VMs look from the flavor perspective, this much RAM, these vCPUs, and it's the same flavor in all our clouds, because as I said before, we try to standardize everything. Then we have a section about which clouds we have and what resources we want in those clouds: in this case, our InfraCloud vanilla, which is made up of the resources defined in the profiles I showed earlier.
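The model file described above might be sketched roughly like this; the profile name, project names, and flavor sizes are illustrative stand-ins, not the production values.

```yaml
# Illustrative sketch of a cloud-launcher model file.
profiles:
  - name: openstackci          # reusable bundle of resources
    projects:
      - name: ci-mirrors       # CI control plane: package mirrors, etc.
      - name: nodepool         # project Nodepool uses to spin up test VMs
    flavors:
      - name: nodepool-flavor  # the one standard flavor for test VMs
        ram: 8192              # MB
        vcpus: 8
        disk: 80               # GB
clouds:
  - name: infracloud-vanilla
    profiles:
      - openstackci            # same layout reused across clouds
  - name: infracloud-chocolate
    profiles:
      - openstackci
```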
Obviously I could go into much more detail here; we can also have per-cloud specific resources, but I'm keeping it simple for this talk. And how you run it is super simple: we have this big YAML file modeling all our clouds, and we just feed it into our launcher, run from our puppetmaster, which is really our Ansible control machine, and it sets the state of all the resources in all our clouds. That means we have a repeatable way to set the state of our clouds at any time, and it's a really good way to configure them too. Once we have that, we want to monitor our clouds, and we use Cacti. It's really basic monitoring; we monitor things like network interfaces, swap usage, disk space, and so on. We also have analytics and metrics for our clouds, and InfraCloud is no exception. If you go to grafana.openstack.org you'll see a lot of dashboards, and we have dashboards specific to each provider in our Nodepool. This is how it looks for InfraCloud: we have metrics for how long it takes to create a server, delete a server, list servers, those kinds of operations. Now that we have everything configured, we're monitoring, we have metrics, and we've bootstrapped the resources in our clouds, we need to configure the cloud for Nodepool usage. Nodepool is not black magic: every time you developers push a change, the tests run on VMs that are created by Nodepool. Nodepool is a multi-cloud application that talks to multiple clouds and maintains a pool of defined resources and nodes that it has access to. Obviously Nodepool has access to the cloud providers we mentioned that donated resources, OVH, Internap, Rackspace, and the rest; I don't want to forget any of them, they're all great and we're very grateful. The Nodepool service has the authentication details for those clouds so it can access them and spin up nodes. And how its configuration looks is not really that complex.
I've really simplified how it looks here, but for the basics, Nodepool is just a YAML file made of three main sections. We have a labels section, where we define the nodes we want to test on: we want to test this stuff on Xenial, on CentOS, on Precise or Trusty, whatever. In this section we say: these are our node types, we want a minimum of this many across the whole Nodepool capacity, and we want them bound to these cloud providers. Then we have a diskimages section. The labels define what nodes we want; this is where we define what those nodes should look like. We use diskimage-builder to create the images, and here we define which elements those images are made of. Finally we have a providers section, which is where we glue it all to our clouds: in our case, the InfraCloud vanilla provider, containing the authentication details, the maximum servers it can spin up, what the images are made of, and how Nodepool can connect to those nodes, the private key and that kind of thing. Next is a somewhat historical screenshot; obviously we don't have Jenkins anymore, but back when we were hacking on InfraCloud at the Fort Collins mid-cycle, I think it was February, this was actually the first job run we ever got in production on InfraCloud, which is why we're keeping it. Today we have a telnet interface to show job progress. If you want to know more about OpenStack CI, I really recommend you don't miss these sessions: there's a great keynote tomorrow demoing how Nodepool works, which our colleague Elizabeth is going to present, and on Thursday Jim Blair is going to talk about the roadmap of Zuul and Nodepool, which are the main parts of our CI. I really recommend you attend that, because it's quite a big change and it will greatly simplify the way developers can test their code.
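The three sections just described could be sketched roughly as follows; the label names, counts, paths, and element list here are illustrative, not the project's real nodepool.yaml.

```yaml
# Abbreviated, illustrative sketch of a nodepool.yaml.
labels:
  - name: ubuntu-xenial
    image: ubuntu-xenial
    min-ready: 2                  # keep at least this many booted and ready
    providers:
      - name: infracloud-vanilla  # clouds this label may be launched in
diskimages:
  - name: ubuntu-xenial
    elements:                     # diskimage-builder elements in the image
      - ubuntu-minimal
      - vm
providers:
  - name: infracloud-vanilla
    cloud: infracloud-vanilla     # auth details resolved from clouds.yaml
    max-servers: 100              # quota cap for this provider
    images:
      - name: ubuntu-xenial
        private-key: /var/lib/nodepool/.ssh/id_rsa  # how to reach nodes
```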
And now I'm handing over to Paul, who's going to talk about the current status of InfraCloud. Right, make sure to take a look at those sessions. So, just to wrap things up, this is how it looks today. This is a simple Grafana dashboard representing what InfraCloud did over a period of, I want to say, a bit over a week. On the bottom middle is vanilla, which was the existing cloud in production; it's not really busy right now, I guess because people are taking a break after the release. On the left-hand side, what I wanted to show is the moment we brought chocolate online: you can see, down here on October 15th, when Ricardo flipped the switch to bring it online with ten nodes. About four or five days later we said, okay, we haven't really run anything on it, and we have plenty of cloud resources now, so let's kick it up to a hundred and start building things. You can see we started running some jobs; the top left is all of our launches as things came online, and at one given snapshot we had about a hundred nodes going. Top middle is the representation of errors and failures; these are the things we have to dive into deeper to ask why a node didn't launch, which is where we get into the ops side of things. Finally, top right is time-to-ready, how long it takes us to actually launch a node. What I found interesting here is that you can see we sometimes take three or four minutes to launch a VM, and that leads into the next slide: these are things we're seeing in OpenStack. Because we upload new images every day, when you first launch a VM on a new compute node, that node has to go to Glance to download those images.
That can constrict your bandwidth or cause timeouts in services, and nine times out of ten for us it fails the first time. As I'm talking about this, I'm looking at people in the audience going: yes, I know what you're talking about. Since this is all open, we as operators would love to go back to the community and have you help us, as a team and as a project, fix some of these things. Something we didn't really cover at the beginning about why we do it this way: all of our configuration is open, the OpenStack way. Everything is in Gerrit, everything is under code review, and all of our changes go through the same testing pipelines that OpenStack does. So anybody in this room today could spot a problem that we're probably glazing over and propose a patch to fix it, maybe in Puppet, or in Ansible, or in our Nodepool configuration. As for the future, these are some of the things we want to work on. Obviously we want to upgrade to Newton. Data collection: we really don't do data collection using OpenStack yet. I know Colleen has some things she'd like to see us do with InfraCloud; one of the earlier goals was to start chasing master and take more of a CI/CD approach, and things like that would help bring our use of OpenStack and the developers closer together. And Ricardo? I'm looking forward to dogfooding even more things in our InfraCloud, because that's the good thing about this: we dogfood everything, so whenever we find a problem we can feed it back to the teams, whether it's Nova or Neutron. I would like to maybe check out the monitoring projects and use some OpenStack projects in that area. And I'm sure other Infra team members have areas they want to focus on, but I think the real point is that there's a lot of opportunity for us to do it.
If somebody in the community wants to help in that effort, or thinks we should be doing something else, asking how you can help sets up this slide perfectly. Everything we do is done with an infrastructure-as-code approach, so if you're thinking, well, you should really be using this new feature, or you guys are doing this all wrong, let me fix it for you, or, every time my job runs on InfraCloud it breaks in a weird way, you can go fix that. Everything is up as code and it's all open, except for passwords and private keys. And donate: if you have some cloud, or your employer can donate hardware, please talk to us. We would love to have more resources so we can run even more regions, grow the current regions, and make this an even bigger initiative within the Infra team. I think that's about it, and I think we're right on time, so I don't know how many more minutes we have, but if there are questions... Wow, you shot your hand up really fast. Can you repeat your question? The question was: in your environment, how old is the hardware you receive as a donation? Do you plan to replace the hardware, and after how many years? What's the whole cycle of receiving the hardware, using it, and then getting rid of it, and do you have a plan for disposing of it properly? And how do you tackle variance in the hardware? So, in terms of tackling hardware variance: right now InfraCloud is made up of three different server models, and one of the reasons we're spinning up strawberry is that we want one region per model, so we don't have fluctuations in performance where maybe your job went to a faster server and the next job went to a slower one.
In terms of the life cycle of the hardware, I can't really answer that, because we don't deal with it: we were given it, we're hosting it, and we're happy to have it. We don't have any information about whether we'll get a bump or an upgrade on the hardware, so I would expect we'll keep what we have now, though I would be more than happy to have hardware vendors in the ecosystem donate more hardware to the project. The questioner explained they were asking because they'd like to do something similar and wondered how old a piece of hardware can be and still be worth donating, and also whether there's any link between this working group and the massive OpenStack deployment working group. No, but we would be happy to sync up. A lot of it goes back to Colleen's point about why we chose the tooling we chose: it's how Infra has done things, and Infra has had some unique ways. We're using Ansible to launch Puppet, to then launch Ansible, to then launch Puppet, and I've seen some eyebrows go up, like, why would you do that? But as crazy as it sounds, it works in our process, because all of that change management is open and allows us to stage things properly. And I think that's part of the interesting thing: we now have these clouds, and I'm not a cloud operator by any means; I would love to work with existing cloud operators or installers and say, hey, this is our version 0.1, how does it compare to your version, what should we do? So if you can get us in touch with those folks and people from that group, we'd be happy to talk. Any other questions? Yes, we can get a mic over to you. The next question: I actually expected you guys to use something like Glean or cloud-init to configure networking rather than DHCP.
So why did we go with DHCP for IP addressing? Well, we do use Glean, yes, so we can talk about that. Which part of IP addressing are you referring to? For the VMs picking up their IP addresses, it's not DHCP; that's configured with config drive. If you're talking about the control plane, the computes, when we deploy them we use static IP assignment. If you go to our system-config repo and into our hiera folder, you'll find an InfraCloud group that contains a bare-metal JSON blob, and in there you'll find, for every single compute, the public IP address and the provisioning, or private, IP address that we assign when we provision the machine. So it's a static assignment for the control plane, and for the VMs it's effectively DHCP-style addressing handled by Neutron. Next question: Colleen mentioned you aren't chasing master, you're using a stable deployment; is there any trickiness if you have multiple clouds on different releases, say chocolate on Newton and vanilla on Mitaka, or does Nodepool just not care? I don't think we're using such advanced features of OpenStack that Zuul or Nodepool really cares which release a cloud is running. We have different public clouds too, and they're also running slightly different versions, so as long as we have the basics, we can get a compute node up with the same API and get an IP address, and it doesn't matter that much which version is which. Following on from that, personally one of the cool things I see is that we have logically split the servers into three different regions even though they're in the same data center, so we could keep vanilla in production running in Nodepool.
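The per-host static addressing described above could be sketched as hiera data along these lines; the hostnames and addresses here are hypothetical, and the real data lives as a JSON blob in the system-config hiera tree rather than in exactly this shape.

```yaml
# Hypothetical sketch of static control-plane addressing in hiera.
# Real InfraCloud data is a JSON blob in system-config; values are made up.
compute001.vanilla.ic.openstack.org:
  ipv4_public_address: 203.0.113.11   # routable address
  ipv4_private_address: 10.0.1.11     # provisioning/private network
compute002.vanilla.ic.openstack.org:
  ipv4_public_address: 203.0.113.12
  ipv4_private_address: 10.0.1.12
```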
And maybe the new strawberry region that's coming online could do what Colleen suggested: run a CI/CD pipeline tracking master. We can certainly experiment with a lot of things, because we're managing the hardware. How much longer do we have for questions? Five more minutes? Okay. The next question: obviously, using these scripts you must deploy some sort of reference architecture; do you have any slides or documentation on what that looks like? And what do you use for packaging, is it just pip and the regular PyPI packages, or distribution-specific packages? We use the distribution-specific packages from Ubuntu, and everything in these code repositories is what makes up the reference. Is it a flat L2, one rack? Yes, our implementation is very basic: we have a single controller and the rest are compute nodes. We actually have an InfraCloud spec that we started with to drive all of this; it's at specs.openstack.org under openstack-infra, and it explains how we wanted to do our reference implementation. But like Colleen said, it's very basic, because our approach to redundancy is that we just go to another cloud. And I think your need, specifically for OpenStack Infra, is testing the OpenStack code, not so much a reference architecture. Yeah, exactly, that's secondary. But those are some of the things Colleen and Ricardo and I will be discussing: we're going to have sessions about InfraCloud this week, and the HA part of things is definitely going to be a topic. I know, for example, the NFV folks over there are interested, and I'm happy to talk about HA. So it's not set in stone.
The questioner said they were just trying to see if there's a direction we're heading, and we said: where would you want us to head? That's kind of the approach we're taking. For background, they and their colleague are from AT&T, and they have their own CI/CD infrastructure that they've open sourced, involving bare metal, DPDK, and some of their own SDN technologies, but it's perhaps a little too specific to their NFV cloud use case. In terms of something that runs directly on master, one of the difficulties they have is packaging; it's a little difficult, but they wanted to talk and see what the community had in mind. Definitely for large operators, multi-region, multi-rack, depending on how you do your underlay fabric, it all makes a difference, and with OpenStack it's never really the Python code that kills you, it's everything under that. My advice is that you talk to the folks in the third row, you can say hi, they're the NFV folks and they're using our code for deploying their clouds, and I also encourage you to attend our Infra sessions about InfraCloud this week. I think we'll call this the last question. The question: there are quite a few installer projects inside OpenStack, TripleO, Fuel, and so on; are there any plans to dogfood one of those rather than doing your own unsupportable combination of wizardry? Great question. I would love to. I can tell you my personal opinion, and I think everybody has personal opinions on this, so I'm not speaking for InfraCloud: I think that would be a really cool thing. When we first talked about InfraCloud we were really short on cloud resources, so the mission statement was that we needed more cloud resources for developers. Now we're pretty flush with resources; we were up to, I think, about 2,000 VMs, and now we've doubled if not tripled that.
Those donations help the OpenStack community, but there are also projects like the OSIC cloud, whose purpose is to give you the hardware and let you deploy it as a team. But I think Infra's policy has always been not to bless one project in OpenStack and say this is the way to do it, you know, we should be using TripleO, or we should be using Kolla, I can never say them right. That's my opinion, but I think it's all bound by the resources we have. We have a good number of servers, but it's not massive. If we had more hardware, I would personally love to maintain an InfraCloud for production and maybe have some subset, another InfraCloud region, where we say: this one we deploy with TripleO, this one with OpenStack-Ansible or with Kolla. My main interest is that the cool thing about this is that we are dogfooding, so we could go to the TripleO folks and say, hey, we're hitting this issue, is this a bug or something? So it's all about resources. My feeling is that we have a thing that works, it's done in Puppet and Ansible, that's something the whole Infra team knows, and it's consistent with how we do everything else. If we had a ton of people resources, yes, we could try TripleO and Fuel and all these things, but we don't have that many people resources, so as long as we have a thing that works and everyone is pretty comfortable with it, this is probably what we're going to stick with for a while. And I think that's the most valid answer: we'd love to, but we're resource constrained, let's say.
Yeah, and I think now that we have a cloud working: we've had people in the past say they want to give us hardware for bare metal, and our response has usually been that we were struggling to get this cloud online, so let's have that conversation later. Well, it's later now. I think it's also about history: as Paul said, when we started this we had really limited CI capacity, and all of a sudden we have plenty of nodes, so the question now is more, okay, InfraCloud is cool, but we have this capacity, so where do we go forward with the cloud? Any more questions? I think we're just about out of time. Thank you all.