Okay, we're ready. Thank you everybody for coming. We're going to do the first presentation today. We're going to talk about what Workday is doing on OpenStack, a little bit of our journey of deploying OpenStack in production, and some of the challenges that we went through. I would like to talk a little bit about Workday, but first let me introduce myself and my colleague. My name is Edgar Magana. I'm a cloud operations architect at Workday. I'm also a core developer for Neutron. I've been doing OpenStack since 2011, since the Santa Clara Summit, so I have some track record on this. I want to let my friend Imtiaz introduce himself.

Hi, I'm Imtiaz Choudhury. I'm a senior software engineer at Workday. Right now I'm involved with Edgar in deploying our private cloud.

So, Workday is a SaaS company, a software-as-a-service company. We build everything in our cloud. We provide human resources and finance applications for all our customers. We have a number of customers running their payroll, their recruiting, and any number of other activities and applications that we provide for them in our cloud, every day. And we are taking this journey into OpenStack to have a more elastic system for deploying all these applications on our premises.

So, we actually had a lot of operational challenges, and we're going to talk about them. We're going to show you our architecture. We're going to talk specifically about the CI pipeline and the environments that we had to create for our development and deployment teams. We're going to talk a little bit about the key takeaways, and we're going to have time for questions. We are also going to show you the CI system live, back in our data centers in Portland, if the connectivity is good.

So, let's talk about the operational challenges. Any OpenStack deployment requires some customization. In the case of Workday it's even a little bit harder, because we wanted to deploy multiple clouds: we have multiple data centers, and we wanted to provide clouds to different teams and different areas inside Workday. So it was very important for us to have an automation system to do that deployment, and I think everybody will have that.

Identical configuration: with this number of clouds, we wanted identical configuration across all of them. The hardware configuration was the same, the network configuration was the same, so we wanted the software running in those data centers to be exactly the same in terms of configuration.

Security: that was kind of the nightmare of working on the OpenStack deployment. You may already have security on your endpoints; we have certificates for the communication on all the projects, right? Talking to Keystone, talking to Nova, Neutron, et cetera. That was the first step we took, but our security and compliance teams also wanted SSL enabled on the connections to the RabbitMQ message queue and on the MySQL communication. On top of that, they wanted us to have an iptables configuration on all the bare-metal deployments. For the virtual machines, the overlay network that we're going to talk more about later, we started with Neutron and switched at the end.

Obviously, they wanted a stable production system. For a company like Workday stability is very, very important: we have over 95% customer satisfaction year after year, and we don't want to go lower than that.
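To make the TLS requirement concrete, here is a minimal sketch of the kind of Chef role that could express it. The attribute keys, paths, and role names are illustrative assumptions, not the exact keys of the community cookbooks or of Workday's wrapper cookbooks.

```ruby
# roles/secure-controller.rb -- illustrative only; attribute names are assumptions.
name 'secure-controller'
description 'Controller role with TLS on API endpoints, RabbitMQ and MySQL, plus host iptables'

default_attributes(
  'openstack' => {
    'endpoints' => { 'scheme' => 'https' },                          # certificates on every API endpoint
    'mq'  => { 'rabbit' => { 'use_ssl' => true, 'port' => 5671 } },  # AMQP over TLS instead of 5672
    'db'  => { 'ssl' => { 'enabled' => true,
                          'ca_cert' => '/etc/pki/tls/certs/internal-ca.pem' } }
  }
)

run_list(
  'recipe[iptables]',            # host firewall required on every bare-metal node
  'role[openstack-controller]'   # hypothetical base role
)
```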
We actually wanted to keep the numbers high; if anything, we wanted to increase them.

Production readiness: our operations team, the team that is going to be the first line of defense if something breaks or something is not working, wanted us to have good logging information. So we enabled syslog, we send everything to a syslog server to collect all the logs, and we use Elasticsearch to identify potential bugs. Monitoring: they wanted it to be part of the NOC, so they wanted good dashboards and a set of alarms to detect when something is not going well.

Anyone who actually runs a real-world data center knows that we cannot enable three or four networks for OpenStack. Any documentation you read will ask you to have a management network, an API network, a data network, an out-of-band network; that is not possible in a real data center, because it would require a bunch of top-of-rack switches that you would not be able to maintain, and it would increase the cost of your data center. So we have bonded interfaces: all the communication goes through one interface, the second one is just for out-of-band management, and all of it has HA. Obviously it has to be multi-tenant even though it is a private cloud, and we have very good enforcement of policies. As I was telling you before, our security team asked us to have very specific rules for the communication allowed between VMs, and we were able to get that through SDN.

So with that, just to give you a little overview of the high-level architecture of what we're deploying: as I said, we have two top-of-rack switches in each one of the racks, and each server has a bonded interface with two links, one going to each of those switches, which enables HA. We started using Neutron and then switched to OpenContrail because of the security requirements that we wanted on the overlay network. We also decided to split some of the OpenContrail components across two different servers: we wanted all the analytics information, which is very, very extensive, on one individual server, and all the configuration and control on a different box. The reason behind it is that the analytics is information that we don't want to impact the control plane at all, and having everything on the same server could impact it. So that improved the performance and the reliability. Then we have a set of compute nodes in the production system. We don't have storage nodes yet; we have our team working very hard on getting Ceph implemented, so I'm expecting that in four or five weeks we are going to start adding a bunch of storage servers to this cloud.

So with that, I want to hand over to Imtiaz to explain the journey of the CI/CD system and why we had to use it.

Thanks, Edgar. So, how it all started: our goal was to deploy OpenStack with community cookbooks and community resources. We wanted to deploy OpenStack and make sure we could leverage what's already available from the community. At Workday we use Chef as our configuration management tool. There are other options, but Workday is already invested in Chef, so we wanted to use that. The first deployment we started with the community cookbooks; it's the OpenStack Chef project on GitHub.
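An aside on the production-readiness point above: shipping everything to the central collector can be a Chef recipe as small as the following. This is a minimal sketch assuming rsyslog and a central syslog server that feeds Elasticsearch; the collector hostname is a placeholder.

```ruby
# recipes/log_forwarding.rb -- minimal sketch of forwarding everything to a
# central syslog collector (which in turn feeds Elasticsearch/Kibana).
package 'rsyslog'

file '/etc/rsyslog.d/90-forward.conf' do
  content "*.* @@syslog.example.internal:514\n"   # @@ = forward over TCP
  notifies :restart, 'service[rsyslog]', :immediately
end

service 'rsyslog' do
  action [:enable, :start]
end
```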
We started with that, and we soon realized the project had a very simple implementation. We could bring up an OpenStack cluster with one controller and one compute node on two Vagrant boxes, and since we were only evaluating at this point, we could do this on our laptops with Vagrant and VirtualBox. We soon realized this model has its limitations. For example, it didn't have a separate Chef server. It could provision using Vagrant's Chef provisioning, but there are a lot of features, for example Chef search, that don't work in that option. So we needed a Chef server that replicates the same Chef server that we have in production.

So we realized we needed a separate Chef server, and once we had a Chef server and, let's say, this simple deployment with a single OpenStack controller and compute, we still didn't know after it was set up whether the provisioning was right, whether all the functionality of Keystone, Neutron, Glance was working or not. How do we check that? The best way is Tempest, so we realized Tempest would be nice to have. Then we wanted to build a continuous integration pipeline, and one way to achieve that is to add Jenkins, so we needed another server. Then, as Edgar already mentioned, instead of using just Neutron with the ML2 plugin we wanted to use an SDN solution, which required yet another server. And finally, to benchmark performance, we needed Rally. So there were all these servers that we needed, and at this point we were still evaluating on our laptops; building all these things as separate virtual machines is very computationally intensive and it doesn't quite work. To complicate things further, we also wanted to be able to share the components that we don't need to build over and over again, for example the Chef server, Rally, Tempest, Jenkins. We don't want to bring up a fresh VirtualBox VM and configure and install each of them every time; that takes a lot of time. We wanted a solution that would let us reuse these components, and that led us to containers.

What were the drivers for using containers? First of all, they are lightweight: instead of creating six or seven virtual machines on our laptop, we could bring all these things up as containers on a single VM. Second, they are reusable: the components I showed, the Chef server, Jenkins, Rally, Tempest, people don't have to spend lots of time building them and bringing them up. We create them once, put them in a Docker image repository, and reuse them, and they can be shared with the entire community. That's what we did; this is shareable. And that led us to our Chef development framework.

This was our first take at it, and we were again still evaluating all the community resources at this point. For our virtual machine we used a Fedora 20 Vagrant box (in production we use CentOS); we used Fedora because we needed a newer kernel to get many of the Docker features to work. On that virtual machine we bring up the Docker engine, and on top of Docker we bring up all these containers. So this was our development environment, where we proved that the continuous integration model can work: that we can bring up a Chef server and Jenkins and bring up an entire cloud on our laptops. All the developers who were initially involved in this project could share the same dev environment, and we were getting the same results; everyone got the exact same number of Tempest failures. So it was a very reproducible environment, and it was very useful for us to get to the next phase of our development.
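A rough sketch of what that laptop environment can look like as a Vagrantfile, using Vagrant's built-in Docker provisioner. The box name and image names are examples, not the actual Workday artifacts.

```ruby
# Vagrantfile -- sketch of the single-VM sandbox described above.
Vagrant.configure('2') do |config|
  config.vm.box = 'fedora-20'              # Fedora for a newer kernel; production uses CentOS
  config.vm.provider 'virtualbox' do |vb|
    vb.memory = 4096                       # the whole sandbox fits in a ~4 GB VM
  end

  # Reusable pieces come up as containers instead of six or seven separate VMs.
  config.vm.provision 'docker' do |d|
    d.run 'chef-server', image: 'internal/chef-server'   # pre-built, shared images
    d.run 'jenkins',     image: 'internal/jenkins'
    d.run 'tempest',     image: 'internal/tempest'
    d.run 'rally',       image: 'internal/rally'
  end
end
```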
To make things even better, we added a few other containers. We added DNS: we realized that in Docker, as you delete a container and recreate it, you get a new IP address, and with Chef it gets a little difficult when IP addresses keep changing. So instead of doing that, we used a DNS service and addressed all the containers by their hostnames instead of IP addresses. We also did some LDAP-to-Keystone integration, and again, doing that is a little challenging, but having an LDAP container that mimics our production LDAP was very easy, and we were able to do that in development as well.

If you just let me add something on the previous slide: I really want you to realize the potential of this environment. You have a system running in a 4 GB VM on your laptop that is exactly what you are going to run in production, at minimal scale. At the beginning a lot of people in the company were skeptical about this because they didn't realize the potential of this framework. When we started running Tempest tests across all the developers in our organization, one thing was very consistent, as Imtiaz said: the number of Tempest tests that failed was exactly the same on every single laptop running the system. You might ask, why are you so proud of having Tempest tests fail in your system? It's not about whether they were failing or not; we didn't focus on that part. We were focused on having an automated, repeatable system that was an exact mirror for each one of us, running in different environments, on different laptops, and so on. This is an amazing environment because, A, you are mimicking what you're going to run in your data center in production, and B, think about it: for Workday, this journey started with a bunch of developers who didn't even know what OpenStack was, didn't know what OpenContrail was, didn't know a lot of things, so it involved a lot of learning. This was the perfect sandbox for them to play with it, to experiment, to kill it, destroy it, rebuild it again, and do it in a very safe, quick manner, anywhere. They don't need to be connected to the network; once the system is up and running it is self-contained and isolated, and they can do whatever they want. That was the perfect mechanism to drive our team to learn OpenStack as well.

Yeah, thank you. So, our first iteration. The developers would first start a Vagrant VM, as I said, and bring up Docker. Then we bring up a Jenkins container. Jenkins was configured to bring up the Chef server, and then we bring up the OpenStack controller; well, initially it's just a blank container and Chef provisions it as an OpenStack controller. Next we bring up a compute node, so we have an entire OpenStack, and finally we bring up Tempest, run Tempest, and get the results. All the results you can see from Jenkins, and the good thing is that you have the entire log: if Chef fails, you just go to Jenkins and you have all the logs from the previous runs.

In the next iteration, we started with Neutron with the ML2 plugin, which worked fine, and then we wanted to use the OpenContrail plugin. To do that there were some changes we had to make, and we created new virtual machines. One reason why they could not be containers was that we were actually testing iptables for our SDN controller, and that's not really feasible to do in containers; Docker has its own iptables rules and we didn't want to make it too complicated. That was one of the reasons why we had to split it out.
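Going back to the DNS point: the practical effect is that no Chef data ever refers to a container IP. A sketch of what that can look like in wrapper-cookbook attributes, with hypothetical keys and hostnames:

```ruby
# attributes/endpoints.rb -- illustrative; every endpoint is a DNS name served
# by the DNS container, so containers can be destroyed and recreated freely.
default['openstack']['endpoints']['identity']['host'] = 'controller.ci.local'
default['openstack']['endpoints']['network']['host']  = 'controller.ci.local'
default['openstack']['mq']['host'] = 'controller.ci.local'
default['openstack']['db']['host'] = 'controller.ci.local'
default['ldap']['host']            = 'ldap.ci.local'   # the LDAP container that mimics production
# No hard-coded 172.17.x.x Docker addresses anywhere in roles or environments.
```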
This is the development environment. So, now that we've talked about the development environment, the laptop environment, and proved that the community cookbooks are good enough and we can tweak them for production, we went to the next phase: we built our continuous integration and development using OpenStack and OpenContrail on virtual machines.

So, this is our continuous integration on virtual machines. We took some concepts from TripleO, OpenStack on OpenStack. We have our OpenStack running on bare metal, and we create a bunch of virtual machines that pretty much mimic the development setup that I already showed. The goal was to have a very disposable test environment: developers check something in, you bring up an entire cloud, do the testing, throw it out, and people do it in parallel, so we wanted many of these small cloud instances. We run OpenStack on OpenStack, so our undercloud, our bare-metal environment, is OpenStack, which was also set up in the same manner with Chef, in an automated fashion. We used the Ruby Fog library to interact with the OpenStack controller from Jenkins. I mean, there are other options; it just worked, and Chef is pretty heavy on Ruby, so we continued using that.

And the way it works, our development pipeline, is: developers work on Git, and they check in their code. A patch creates a Gerrit review, Gerrit triggers a Jenkins build, Jenkins then talks to our OpenStack controller and brings up a number of virtual machines, one of which will be a Chef server, and we bring up a few other machines which Chef then provisions as the OpenStack controller, SDN controller, compute, and Tempest. Then again we run the Tempest tests and get the results. So it's the same idea, but this time everything is done on virtual machines.

This is the workflow. As I was saying, when a developer submits a new patch, Jenkins tells the OpenStack controller to launch a Chef server; in this case we actually created a Chef server image so that we don't have to provision the Chef server from scratch. Once the Chef server comes up, it goes to our Git repository, fetches the patch set that the developer submitted, and uploads it to the Chef server. Then it creates the other virtual machines: OpenStack controller, SDN controller, compute, Tempest. All of these things can be done in parallel, but there is a little bit of orchestration required: the OpenStack controller needs to come up before the SDN controller and the compute node. Even between the SDN controller and the compute node there is a slight ordering required, one has to come up before the other. Those are taken care of on the Jenkins side; the script that we use to bring up the entire cloud takes care of that part of the orchestration. We run chef-client, all of these things are provisioned, and then finally we run Tempest and get the results.
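A sketch of the Jenkins-side script that drives this, using the Ruby Fog library against the undercloud API. The image and flavor names, environment variables, and helper comments are illustrative assumptions, not Workday's actual job.

```ruby
# bring_up_cloud.rb -- illustrative sketch, run by the Jenkins job per Gerrit change.
require 'fog'   # or 'fog/openstack' with the split gem

compute = Fog::Compute.new(
  provider:           'OpenStack',
  openstack_auth_url: ENV['OS_AUTH_URL'],      # undercloud Keystone
  openstack_username: ENV['OS_USERNAME'],
  openstack_api_key:  ENV['OS_PASSWORD'],
  openstack_tenant:   ENV['OS_TENANT_NAME']    # one tenant per developer
)

change = ENV['GERRIT_CHANGE_NUMBER'] || 'manual'
image  = compute.images.find  { |i| i.name == 'chef-server-gold' }  # pre-built Chef server image
flavor = compute.flavors.find { |f| f.name == 'm1.large' }

# The Chef server must exist first; the patch set is uploaded to it afterwards.
chef = compute.servers.create(name: "chef-#{change}",
                              image_ref: image.id, flavor_ref: flavor.id)
chef.wait_for { ready? }

base = compute.images.find { |i| i.name == 'centos-base' }
# Controller before SDN controller and compute; the rest of the ordering is
# handled by waiting between creates in the real job.
%w(openstack-controller sdn-controller compute tempest).each do |role|
  compute.servers.create(name: "#{role}-#{change}",
                         image_ref: base.id, flavor_ref: flavor.id)
end
```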
So now I'm going to ask you again to look at this picture and remember the previous one. In the past we had a very good environment for developers to test everything they wanted to test on their laptops. Then, because of the requirements around iptables and the SDN testing, the laptop environment wasn't good enough anymore and it was actually hurting our productivity. So we created this first cloud with that first version of OpenStack that we had, and then we create N tenants, where each one of these tenants belongs to one of our developers. In each tenant there are N VMs that they deploy through the Jenkins servers we have, to deploy all these components: another Chef server, an OpenStack controller, SDN controller, compute, Tempest, Rally, all the tests; and they dispose of that whole environment over and over. So we were moving from that simple environment to something a little more robust, and on top of that we were testing our production system, even before going to production, as if it were a public cloud. That was getting a lot of good feedback inside the company, and it was also giving us a lot of the information that we wanted for benchmarking by running the Rally tests. So if, in your company, you are struggling with how to convince the management team to go to OpenStack, this is a great environment.

Thanks, Edgar. One other thing this model let us do is test how the bare-metal cloud performs, because once we started using it, developers were checking in and we made this a parallel Jenkins build. As people were creating new patches, we had times where there would be twenty of these jobs running at the same time, so all these virtual machine clouds were getting spawned at once. We were actually exercising our undercloud a little bit and seeing how OpenStack performs, and so far we've seen some issues; the philosophy we follow is eat your own dog food, and this let us resolve some of these minor issues at the beginning.

So that brings us to our road to production. First, as I mentioned, we started with our development setup, which was a simple Vagrant VM with Docker containers, where we were building and testing. Developers are still using it, but they're gradually moving more toward the virtual machine setup, because, as Edgar was saying, as we added more and more components it was getting a little difficult to work in that environment. Then we moved everything to virtual machines, and finally we also have a bare-metal continuous integration system. What happens is, once code is submitted and it passes our virtual continuous integration tests, we merge that code, and then it gets promoted after a full systems test. We run everything with all the cookbooks, not just the community cookbooks; we also have Workday cookbooks on top. Once we test everything, the community cookbooks and the Workday cookbooks together, and we don't find any issues, we promote that set of cookbooks; we call that our bill of materials. They get promoted, and we know these are ready to go into production. Then they go onto one Chef server, and that Chef server is used for the promotion and for maintenance; we use that Chef server to deploy our bare-metal OpenStack. On bare metal we are also running continuous integration; it's not triggered by Gerrit like the virtual machine continuous integration, it's scheduled. The difference between the virtual machine and bare-metal integration is that we don't dispose of our bare-metal installation: we have a working cloud which is up, in operation and production, and as we create new patches we just run chef-client again to apply them without tearing anything down. We cannot do it the other way; that would require throwing out everything, so it's not possible on our bare metal. It's an upgrade mode, as opposed to the virtual machines where we do a fresh install every time. That's one of the differences, but the bare-metal continuous integration also runs all the time.
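The bill-of-materials idea can be pictured as a Chef environment that pins only promoted cookbook versions, so nothing outside this list ever reaches the bare-metal Chef server. This is a sketch with example cookbook names and versions, not the real Workday list.

```ruby
# environments/production.rb -- sketch of a promoted "bill of materials".
name 'production'
description 'Only cookbook versions that passed virtual CI and the full systems test'

cookbook_versions(
  'openstack-common'   => '= 1.0.1',
  'openstack-identity' => '= 1.0.1',
  'openstack-network'  => '= 1.0.1',
  'openstack-compute'  => '= 1.0.1',
  'workday-wrapper'    => '= 1.0.1'   # hypothetical wrapper cookbook on top of the community ones
)
```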
So, some of the key takeaways. It took us a lot of iteration to get where we are; it wasn't done in one day. As I said, we started with the development environment, and even there we first started with just Neutron with OVS, then we added more and more components, and then we moved on to virtual machines and eventually to bare metal. Docker and Vagrant proved to be a very powerful Chef development environment, and it let us do rapid prototyping; without it I don't think we could have convinced management to adopt OpenStack with these open community cookbooks. The containers were also a lifesaver: initially, when we were prototyping, we didn't have any machines to even try it on, so we needed something, and Docker on a virtual machine was the way to go. Sharing images was another thing; container images made things much easier for us. And by building a continuous integration framework, we improved our developers' agility quite a bit. Developers can submit their patches, and not everyone now needs to hand-test the code behind a patch: if it passes Jenkins, they know it's good enough. Then it goes through review; if everyone likes the code, it gets approved and merged, and then we have another build that makes sure the full system build passes. This gave us very predictable outcomes: everything we saw in our virtual CI was a very good predictor of the outcome on bare metal. If something didn't pass, we would see the same results on bare metal, so we knew exactly what to expect, and so far we haven't really run into cases where things work in the virtual machines but don't work on bare metal. There were maybe a few minor things, but in most cases the virtual machines were a very good indicator of what we would see on bare metal.

So we're doing very well on time, which is great; I wasn't sure about the time, so we're going to be able to show a little bit of a demo of what we're talking about here. We're going to show you exactly what our overcloud is. As you can see in this dashboard, I'm accessing my overcloud server, my OpenStack controller, and I'm signing in with my own username and password. This is connected to LDAP, so I'm just using my corporate credentials to get into the system. Obviously this deployment takes, I don't know, an hour and forty-five minutes, almost two hours, because you have to deploy all the VMs, then you have to create a Chef server, then you have to upload all the cookbooks, and once all the cookbooks are uploaded you pull all the packages to create another VM, which is probably the OpenStack controller VM, get all the packages onto that VM, start running chef-client, and then start all the services; and you do that for the SDN controller, for the UI part and the analytics part of the SDN controller, et cetera. So that takes a lot of time, and we've simplified a little bit what we want to show you here.

As you can see, this is the overcloud, the 2138 host. We have a bunch of VMs that we already created automatically through a simple CLI command: we have our Chef server, our SDN controller, the OpenStack controller, the OpenContrail controller and the analytics part of the controller; obviously you want at least one compute node for testing, and we can add more VMs and more compute nodes if we want; we have our Tempest node, and we also have the ELK node, which is where we're sending all the syslogs.
So if I show it here, this is our Kibana dashboard: we're collecting all the logs here. I'm intentionally not putting the logs up in a readable fashion, for security concerns, but you can get a set of alarms and dashboards configured here to know what's going on in your cloud. You don't need to SSH into any of the compute nodes; everything comes here. This is what you need in a production system: you don't SSH into the computes, or the SDN controller, or anything else to figure out what's going on; you could actually cause more damage than what you are trying to fix. So we get everything here. We also have a monitoring system, which is a Nagios server.

Coming back to this overcloud, what I'm going to do now is open the IP address 10.96.68.3 for my OpenStack controller. I already have another tab here, so check this out: this is 10.96.68.3, this is my dashboard, this is my OpenStack controller on my overcloud. So I'm running OpenStack on OpenStack. Think about how powerful this is: you can test any hot patch in your system and be sure, by running the OpenStack tests, that everything is working properly. We're not talking about DevStack here, and I'm happy with DevStack, but this is for real; this is for production systems. In this one you can create any crazy patch that you want to explore as a developer or a system integrator and see what is going to happen. For this one I'm going to sign in as admin to show you a little bit more and, obviously, the network is down. What happened to my network? Yeah, the Wi-Fi went down; it's coming back. So, in the meantime, while I'm looking into it, if you have questions you can start going to the mic. What I'm doing right now is creating a tunnel all the way to my data center where my overcloud is running. There you are, I'm connected, so I'm going back here. Some of these network connectivity issues are because we are going all the way to the data center at Workday.

So here we have my overcloud. I just have one compute node, so it's going to show up here, and I've been playing with this server the whole morning, destroying things and creating things. As you can see, we have the CI SDN compute node, exactly the same one that appears on my overcloud here: this is the compute node, which is what I'm seeing in this overcloud. I also have a controller server, so if you want to get familiar with the control policies, this is a good way to test them. You can do whatever you want in this overcloud; it has a lot of potential.

One thing Imtiaz and I didn't mention: we're using the Fedora RDO packages for the OpenStack deployment, so we don't want to mess with the Python code right now. If we identify an issue, we try to backport it directly with the Fedora or RDO packages. But down the road this has a lot of potential, because we want all our developers to be able to make Python changes; to do that we can build our own packages and retest in this overcloud, and we can validate whether we are breaking something, not just on the Tempest side but also on the Rally side. We can compare the benchmarking values of a functional cloud from one patch to the next, and so on. We are shipping patches every two weeks right now, and to be honest, the first time we shipped a patch through the Chef server we were a little bit scared.
Everything went fine: we did not find any issues, and we did not lose connectivity at all on any of the VMs. We had scheduled something like a two-hour maintenance window and we did it in ten minutes or less, so that was very cool. With that, we are done, so we are going to ask if you have any feedback or questions; we have a couple of minutes for that. Thank you.

How do you manage the life cycle of your artifacts within these pipelines? For example, you said you use the RDO packages, and your tests take one hour during the pipeline; how do you ensure that a new release of a package or a Chef cookbook does not end up in production without really being tested in development? How can you freeze the versions?

He's the master of that; I'll take that one. In production we are not taking anything from the internet; our data center has no internet connectivity, for security and everything. The cookbooks that we pick up from upstream are copied over locally, and they only get promoted and go to our production Chef server after they go through this phase: first they have to pass our continuous integration tests, then they go through the end-to-end systems test, and once that passes, their versions are bumped. Let's say we start with cookbook 1.0: everything passes, we bump it to 1.0.1, and then we label that version of the cookbook as promoted. Then we have a list of artifacts; Jenkins keeps the list of artifacts, the set of cookbooks that have already been tested with our systems test. We can go to Jenkins and say, this is the list I want to use, this policy group, which is a JSON file with all the cookbook versions listed, and it just takes that one, uploads it to the Chef server, and we use that to do our maintenance. The same goes for all the libraries that we have: we have internal mirrors for all the RPM packages. If we want to explore a new version of a library, we first put it in a quarantine or testing repo, we create an overcloud pointing to that repo, we verify that everything works, and then we promote that specific library into our production repo.
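The mirror-and-quarantine flow for packages can be sketched the same way, using the standard yum_repository resource (from the yum cookbook, or built into newer Chef). The URLs and the attribute flag are placeholders, not Workday's real configuration.

```ruby
# recipes/internal_mirrors.rb -- illustrative sketch; no node reaches the internet directly.
yum_repository 'rdo-mirror' do
  description 'Internal mirror of the RDO OpenStack packages'
  baseurl 'http://mirrors.example.internal/rdo/el7/'
  gpgcheck true
  action :create
end

yum_repository 'quarantine' do
  description 'Library versions under evaluation; promoted only after an overcloud passes Tempest and Rally'
  baseurl 'http://mirrors.example.internal/quarantine/el7/'
  enabled node['cloud']['use_quarantine_repo']   # hypothetical flag, enabled only on test overclouds
  action :create
end
```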
Thanks guys, that was just fantastic; I think it's really valuable. I just wanted to ask how consumable this is outside of the Workday environment. Is this something you can give back? I see it as something very valuable for us as contributors, to be able to get this complete workflow, from your laptop on the train on the way to work, to be able to get patches in and upstream. So I just wanted to see how consumable this is outside the Workday environment.

So, it is not yet. Recently, as a consequence of the Ops Meetups that we've been having (Workday has been very active in those operations meetups), we created new repos under the GitHub account, called Ops Tooling and Ops Monitoring, and there's another one whose name I don't remember. We are talking to them and to JJ Asghar about pushing all this code once we clean up the URLs a little bit. As you can imagine, today all the URLs in the configuration point to our internal data center, so we want to move them back into the environment files and the policy groups that we are using for this deployment, and then it's going to be consumable for everybody. That's going to be the repo where we push this code, so we're going to do the OpenStack contribution upstream from Workday; we really want to give back to the community what we're deploying here.

Thank you guys. There was a lady waiting at the mic, so do you want to take the mic?

I have a question about whether code that is related to the hardware, for example a driver for some data store, can be tested in the virtual machines or in Docker, or whether you have to leave that test case to the bare-metal side.

It can run in both. In the virtual machines you run all the Tempest and Rally tests, although obviously Rally doesn't really make sense to run on VMs, so we actually created a new cluster that we call the perf cluster; we have our compute nodes in that one, with exactly the same configuration that we're running in this overcloud, for the benchmarking part, so that the data is realistic. It will work, you will get results, but you'll get very low numbers in performance. I don't know if you want to add something on that part. No. Did that answer your question, or did you miss something?

I don't know if the machine in Docker can connect outside, to a storage device; a different kind of storage device that is outside of OpenStack, maybe a storage device like Ceph or something. Yes, a physical storage device.

You can mount storage as Docker permits: you can add storage or volumes to your host first and then mount them into Docker. We expanded our model, which I didn't show; we tested Ceph on Docker as well, and that also works. It's not easy, we had to do some things at the kernel level, some hacks in there to get it to work, but it works and you can actually mount volumes. Again, on your laptop you're limited to doing certain things, but it is possible.

We just have time for one last question, so could you take the mic?

Okay. First, Ceph: we proved that we can bring up Ceph, and it gets provisioned the same way, with a Chef server, in Docker containers; we don't want to bring up an entire cluster there. Yeah, and for the second part, we actually create three new containers and we deploy the chef-client with the role of a Ceph server on them. There is a little bit of extra orchestration needed for Ceph: you need the first Ceph server to be running before running the other ones. We realized that, yes, an orchestration layer is needed if all these components are to come up at the same time. On the other hand, having chef-client automatically running in the background will fix things sooner or later. Let me give you an example: if you run chef-client at the same time for an OpenStack controller and a compute node, the compute node will finish faster than the OpenStack controller, because there is less configuration and fewer packages to deploy. So what is going to happen is that the compute node will look for the OpenStack controller, that server will not be ready in time, and the compute node will fail; it will not be able to subscribe to the RabbitMQ queue on the controller. However, once the OpenStack controller is complete, the chef-client on the compute node will keep running (we have it configured so that within the next 50 minutes it runs again), it will try to subscribe again, and then it will find the controller. So if you deploy everything at the same time, things will not work at first, but chef-client will let things fix themselves automatically in the background. To be honest, we really don't want to do it that way; it seems a little bit messy. We want some kind of orchestration layer on top of that, so we are investigating things like MCollective for these Chef environments.
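The "it fixes itself on the next run" behaviour comes from running chef-client on an interval. A minimal sketch of the client configuration, with an illustrative server URL and timings:

```ruby
# /etc/chef/client.rb -- sketch; the URL and numbers are examples.
chef_server_url 'https://chef.example.internal/organizations/cloud'
node_name       ENV['HOSTNAME']

# Daemonized runs: if the compute node converges before the controller is
# ready and fails to register with RabbitMQ, the next interval run retries
# and succeeds once the controller has finished.
interval 3000   # seconds between runs (roughly the 50 minutes mentioned above)
splay    300    # random delay so every node does not converge at once
```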
I think we are on time, so I really want to thank everybody for coming here. This is the first session of the day after a very good party last night, so it was awesome to see you all here. Thank you so much again, thank you everybody. In case you have any questions, you can come to us and we'll be happy to answer them.