Good morning, everyone. My name is Angel Tomala-Reyes. Today, with my colleagues Esteban and Pablo Barquero, I'm going to talk about the IBM Cloud First Factory, which is a three-zone, OpenStack-based environment that we created at IBM to encourage innovation. I hope you enjoy it.

This is basically a story about a point in time, and some of the things we talk about today may no longer be relevant given the newer versions of OpenStack, especially with Icehouse. But I think our experience will still be valuable to some of the people here.

Our agenda for today: we'll cover a little background and talk about why we ended up creating the Cloud First Factory and how we got to that point. We'll talk about the Cloud First Factory itself, building it, the lessons we learned as we built it, and the upgrade process we went through, the pitfalls, the good things and the bad things, and then we'll wrap it up with a summary.

So first of all, who here has heard of IBM SmartCloud Enterprise? Just a few people, right? Around the 2009-2010 time frame, IBM decided to create its first public cloud offering, and it was called SmartCloud Enterprise. Myself and a few other developers were tasked with creating this offering. It was built on top of IBM infrastructure and IBM technology. But one of the problems with SCE, I'll call it that, was that it never really developed enough of an ecosystem. Another problem was that it was missing APIs for critical resources and features: most things could be done through the UI but not necessarily through an API. And today we know that everything is about automation, automate, automate, automate; you're not going to have a machine automate clicks on a screen, right?

Then in early 2013 IBM announced that we were going to support OpenStack and move to OpenStack. Why? Because of the strong ecosystem; it is essentially the Linux of the data center. At the same time GTS, the organization we work for, was looking to develop skills and experience, and also to provide an environment where IBM and partners could develop solutions, experiment, and refine those solutions through a set of environments we created: the Cloud First Factory.

So the plan, from IBM's perspective, was to take SCE and adopt OpenStack at the IaaS layer. We would adopt the OpenStack APIs, and therefore have the APIs we were missing, and we would enjoy the ecosystem, the community, and so on. That was the goal.

Like I said, why OpenStack? I think everyone here knows why: open source, an Apache license, everything revolves around elasticity and scalability. The main goals are scale everything out, share everything, distribute everything, and enjoy all the goodness of OpenStack. That's pretty well said; I don't think we need to cover too much on this slide.

So from IBM's perspective, we were looking for a very simple three-tier approach.
On the bottom we had all PureFlex systems from IBM. The common OpenStack components would manage all of those resources, and on top of that we would install SmartCloud Entry, SmartCloud Provisioning, and SmartCloud Orchestrator to add value for customers. That's all we were looking for: just added value on top of what OpenStack had to offer.

Again, some of you have seen this before and know it inside and out, but just to cover more ground we'll talk a little about the architecture. As you know, OpenStack has several components that are very loosely coupled; you can pick and choose which components you really want to use in your environment. In our case we wanted to experiment with everything, so we installed compute, block storage, networking, the Horizon dashboard, image management with Glance, Keystone, and also Swift.

Right, so what is the Cloud First Factory? All we wanted to do was harvest the community; our community was IBM internal, third parties, and the open source community. We were looking for services, components, and any asset we could bring into IBM SmartCloud Enterprise. Our strategy was to engage them through the Cloud First Factory projects. Anything public, anything from partners, anything within IBM, essentially, would eventually become some sort of public offering, and we wanted to curate all of that technology and all of those projects through the different environments we were about to create.

Experiment, experiment, experiment. That's really what we were trying to do: allow people to come in, add their projects, and keep experimenting, moving their applications through the different environments depending on their development cycles.

The Cloud First Factory was sort of the guinea pig for SCE. SCE was doing its development in parallel while we were trying to learn a little more about OpenStack. We wanted to experiment with better hardware, upgrade our networking, and also use OpenStack. SCE was literally concentrating on porting all their stuff, but we wanted to learn a little more: how can we make it faster? How can we add direct access to disks, for example, for some of the analytics work we were trying to do? So we were literally enabling innovation that SCE couldn't do at the time.

Like I said before, we were looking to build some experience around OpenStack. Our team had basically zero knowledge of OpenStack, but we had experience with SCE. We had minimal Python experience, and our team was split between the US and Costa Rica. When we began this work we started with Folsom, which was the GA version at the time, and we installed our OpenStack on top of RHEL 6.3. One of the issues we had was that even though we had this cool hardware, we really didn't have a lot of open switch ports in our racks.

So what happened? We started simple: a single network node, with the knowledge that eventually we would start to scale out and move things around. We added services as they were needed, we added compute nodes as they were needed, and we started from a basic install; it was literally just the install script.
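Just to give a feel for that incremental buildout: a quick, illustrative way to see what had registered after the scripted install. These are standard Folsom-era client commands, and the comments describe what we'd expect rather than a real capture.

    # after the scripted install, confirm what has registered with the controller
    nova-manage service list      # each nova service/host should show a ":-)" state
    keystone endpoint-list        # check that glance, nova, cinder, etc. are in the catalog
    glance image-list             # sanity-check the image service
    # as new compute nodes were added, they simply appeared as extra rows here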
So this was our hardware: IBM iDataPlex dx360 servers. We had Mellanox Ethernet ports, 128 gigabytes of RAM, six cores, twelve 3-terabyte disks in each of those machines, and a RAID controller on each one. So it was pretty well stacked up.

Like I said, here we're talking about our community: customers, third parties, the open source community, business partners. The IBM internal community was Research, GBS, GTS, and Software Group, and we were looking for them to be able to start playing with our stuff. Here are the three zones, integration, partner, and client, and we were literally moving from left to right: innovations go in and cloud solutions come out. That was the whole purpose of the Cloud First Factory.

To talk a little more about the process: we had a constant inflow of projects coming into the integration zone. We would test them, install them, and people would make sure they worked. When they were good enough, those applications moved to the partner zone; we would invite partners to work with us and begin to experiment with the applications we had been working on. Eventually the applications would make it into the client zone. So you literally graduate those applications toward possibly, eventually, going into production on IBM SmartCloud Enterprise.

Here are some of the projects we worked on; this is just a short list. About 82% of the projects were created by IBM. Software Group had projects like DBaaS, virtualization as a service, BigInsights, and Cloud OE, which is now called Bluemix (Project Bluemix), and they were basically our biggest customer. As a matter of fact, we learned a lot about OpenStack when Bluemix really began hitting our servers. We learned a lot about networking, how to avoid problems with the network, how to avoid problems with Cinder, and Bluemix pretty much put us through our paces. Like I said, they were our largest customer. Our third-party vendors included Mellanox and a few others, and also Kaltura, so we did some video work as well.

Some of the work we did around video, and even data mining, sorry, BigInsights, or big data, led us to look for ways to make disk access faster. At the time there was no way to have direct access to disk; I believe we actually do have that capability now. But one of the good things we figured out was that some of the changes we made allowed us to get to about 80% of bare-metal performance, which is not bad considering you're working with virtual machines the whole time.

Okay, so our requirements: our primary deliverable was a public cloud using industry-recognized APIs; we needed to reach a big ecosystem. Our hypervisor was KVM.
For storage, we needed the ability to connect to local storage, an object store, and block storage. We leveraged OpenStack to manage all of our infrastructure, and the Cloud First Factory would provide operational insight into SCE. Again, we were essentially the guys who would go and discover what we could do with OpenStack, what was possible, and then share what we learned with the group that was building SCE, or SCE 3.0. So there we go again: hardware, then OpenStack, and then, like I said before, Cloud OE, or Bluemix, some of the projects that we had, and BigInsights on top.

Our philosophy was, again, start simple. We started with a single control node running networking and compute, and a single Cinder node for volumes. The installation was based on a simple install script. We used xCAT to deploy our bare-metal nodes, moved compute to dedicated hosts, disabled compute on the control node, validated, and added additional nodes. With that, I'm going to transition to my partner here, Pablo, and he'll talk about how each one of our zones was created and designed.

Hello everybody, good morning. Well, now you know what the mission of the Cloud First Factory was, right? So, how did we do it? Pretty much, we built three different zones. The zones referred to in the previous slides became physically separated zones, each with its own OpenStack installation and ecosystem.

What's the point here? We had one zone inside our firewall, and that was the innovation zone. That's the place where projects land first, in order to be created and developed. We actually held some hackathons in that zone and it was very productive; that was the incubation zone for the Bluemix project that we now have as an offering. Then there was a yellow zone where we could let partners in. When a project was ready in the innovation zone, it passed through to the integration zone, as we called it, so that partners could see it, touch it, and refine it. After that we had the client zone, where we enabled some clients to see what we had been doing through this process of innovation.

Okay, so the blue zone was the development zone, the integration zone is where we actually tested our innovations with some partners, and the client zone is where clients could play with and test what we had innovated.

So what about the blue zone? Well, this was the first one we built, pretty much our first experience. We built it simple: we had one control-plane node, and pretty much everything ran there except for the storage part of OpenStack; we left Cinder and Swift out at the beginning. Beyond that, the control plane ran all the OpenStack processes. We also relied on xCAT, which is a piece of software that lets you control machines, control hardware, not necessarily through the network connection but through a PXE connection. So we relied on xCAT for things like remotely installing a new node or remotely rebuilding one, and we used xCAT as our main repository server. That way all our nodes were on the same versions, pointing to the same repositories, and we could remotely build a new node from the ground up. This was an internal zone, inside our firewall, so we didn't worry that much about security. That was pretty much our blue zone, as we called it.
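As a rough sketch of that "start on one box, then move compute out" step: the package and service names match the RHEL 6.3 / Folsom setup we described, but the addresses, host names, and passwords here are made up.

    # on the controller, once dedicated compute hosts exist:
    service openstack-nova-compute stop
    chkconfig openstack-nova-compute off

    # on a new compute node (laid down by xCAT), nova.conf points back at the
    # controller; a minimal, hypothetical excerpt:
    #   [DEFAULT]
    #   qpid_hostname = 10.10.0.10                            # message broker on the controller
    #   glance_api_servers = 10.10.0.10:9292
    #   sql_connection = mysql://nova:secret@10.10.0.10/nova  # pre-conductor, compute still hits the DB
    service openstack-nova-compute start

    # back on the controller, the new host should appear with a ":-)" state
    nova-manage service list

The same pattern repeated every time we grew a zone: xCAT lays down the OS and the repositories, the node gets a nova.conf pointing at the controller, and it registers itself.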
What about the other two zones? Well, the other two zones represented a new challenge for us. Why? Because those zones would be hit through the internet, they would be used through the internet, so we needed to harden them.

So we did things differently here, and I'll go through the different points we faced with the public cloud we built. Pretty much, we split some components of the OpenStack control node out onto a separate virtual machine: Horizon and Qpid were moved to a separate VM. The other services, MySQL as well, remained on the single controller node. As you can see, we had no high availability at all; we had a single control node. That control node also ran the network part of OpenStack; we relied on nova-network — remember, we started on Folsom — so we relied on nova-network to do all the VLAN tagging and everything else needed in this sort of installation. That was pretty much our yellow zone.

As I already said, that was challenging, in the sense that we were opening these OpenStack environments up to be accessed over the internet. So what did we do? Using nova-network we did VLAN tagging; we didn't use a plain DHCP configuration, we used one VLAN for each of the tenants, or projects. We pre-built those VLANs on the switches; we pre-configured both the rack switch and the core switch. We narrowed it down: we didn't trunk a whole range of VLANs, we pre-configured just what we needed on our switches. Of course, ports started out closed, so the policies were very restrictive, or had to be. VLAN-to-VLAN communication was also restricted. We used the security policies for the floating IPs as well, and we did not allow a huge number of floating IPs. You know, a customer might say, "I need tons of floating IPs, I need every virtual machine to be reachable from the internet," but that's not the reality of what they need. So we decided to lock it down and have a limited number of IPs. A rough sketch of this nova-network configuration follows the network layout below.

Again, everything went through the controller node, which was also the network node, only one. How did we administer these environments? Well, we had a jump box behind an OpenVPN, so the only way to administer this was to use the VPN to reach the administrative side.

So, again, this is pretty much the picture of how our network layout was built: the public network, then a firewall, and behind that firewall the controller and the network node, which were the very same node. We had the guest, storage, and admin networks. The guest network is the VM-to-VM communication, and the storage network is of course for communicating with Swift and Cinder. Those two networks were on 10-gigabit NICs, so there was plenty of bandwidth for that. The management network was on a one-gigabit NIC. That was pretty much our network configuration.
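Here is that rough sketch of the VLAN-based nova-network setup, from memory of the Folsom-era options; the interface names, ranges, and VLAN numbers are made up, and the CLI flag spellings varied slightly between releases.

    # nova.conf on the controller / network node (excerpt):
    #   network_manager  = nova.network.manager.VlanManager
    #   vlan_interface   = eth1          # the 10 GbE guest-network NIC
    #   public_interface = eth0

    # one pre-agreed VLAN per tenant, matching what was pre-configured on the
    # rack and core switches:
    nova-manage network create --label=tenantA --fixed_range_v4=10.20.1.0/24 \
        --vlan=101 --project_id=<tenant-id> --num_networks=1 --network_size=256

    # and a deliberately small floating-IP pool:
    nova-manage floating create --ip_range=192.0.2.0/28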
So what challenges did we face? We don't want the external user to have admin rights; we don't want them messing up our OpenStack or playing with what they don't know about. So we built two portals, one internal and one external. The internal Horizon was unmodified, out of the box, and it was used for admin purposes only; we needed to open the VPN connection to get there and administer the entire OpenStack cluster. On the external one we hid the admin tab; we did not allow any external traffic to reach any administrative task through the portal.

What else did we do? We re-skinned the portal, as you can see in the picture; we used our own style. We added terms and conditions that external users had to accept to get into the portal. We disabled image uploading, we disabled VNC, and of course we enabled HTTPS for the external portal. We also ran a vulnerability scan on Horizon before we put it in front of the internet, and it passed; it passed pretty well.

What else did we do? Another concern, another thing to harden: we implemented a secure reverse proxy for both Horizon and the APIs, so all the HTTP traffic went through this hardened proxy. We moved the Horizon UI to a VM; we didn't have it on bare metal, we had it virtualized. That allows us in the future, when we want to scale this, to easily deploy a new Horizon VM with a load balancer in front of it. With this reverse proxy, from an administrative point of view we had only one port to care about. You know OpenStack services sit on different ports, but we used this proxy for all the traffic, and it just sends that traffic to the proper port inside our installation. From the internet, there would be only one port. A rough sketch of that proxy appears at the end of this part.

Images on the internet: well, you know IBM, it's about compliance. We have very strict security policies inside our company, so we cannot allow just any image to be placed there. Internally we have a policy called ITCS104, which is pretty much a long list of rules you have to apply to have a computer hardened inside our company. So we needed those images to be ITCS104-compliant. We applied password policies, we preferred SSH keys as the login method, and only SSH was open on those images. We had previously built a semi-automatic patch management system, and we integrated this infrastructure with it. We scanned all the IP addresses periodically, and of course we disabled the image upload functionality so users would not be able to put any image they wanted there.
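To give a feel for what went into those hardened images, here is a tiny, illustrative subset; the real ITCS104 checklist is much longer and these exact values are only examples.

    # /etc/ssh/sshd_config inside the image (illustrative excerpt):
    #   PasswordAuthentication no      # SSH keys preferred over passwords
    #   PermitRootLogin no
    # and a host firewall that leaves only SSH reachable:
    iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
    iptables -A INPUT -p tcp --dport 22 -j ACCEPT
    iptables -P INPUT DROP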
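And going back to the reverse proxy mentioned above: the talk doesn't say which proxy sat there, but since nginx comes up later in front of Horizon, here is a minimal nginx-style sketch of the single-port idea; the host names, addresses, paths, and certificate locations are all hypothetical.

    # /etc/nginx/conf.d/openstack-proxy.conf (sketch)
    server {
        listen 443 ssl;                        # the one port exposed to the internet
        server_name cloud.example.ibm.com;     # hypothetical name
        ssl_certificate     /etc/nginx/ssl/portal.crt;
        ssl_certificate_key /etc/nginx/ssl/portal.key;

        location / {                           # the re-skinned external Horizon VM
            proxy_pass http://10.10.0.21:80;
        }
        location /keystone/ {                  # API traffic forwarded to its internal port
            proxy_pass http://10.10.0.10:5000/;
        }
        # ...one location block per API service we chose to expose
    }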
The other thing we needed to do in order to open this environment to the internet was to know who was using it, right? So we enabled an onboarding process. We used IBM Forms Experience Builder to create this custom workflow, and we enabled single sign-on with the two authentication methods we have; I'll talk a bit more about that later.

What we did in this onboarding process: you come in, you register your project, and we create one special user with a special role that will administer that project. That's a modification we made in Keystone, as well as in Horizon, to enable this functionality.

First of all, Horizon assumes that the only one who can create users is the admin — at least at that point in time; I know it has evolved quite a bit since, but we started on Folsom. To recap: Horizon thinks only the admin can create users, and that's not a workable model for a public cloud. Also, at IBM we have two authentication APIs: ibm.com, which is external, and w3, which is our internal directory. So we needed to integrate Keystone with those two authentication APIs. We took Keystone and extended it to delegate authentication to those two APIs; we wrote a piece of code, added the extension to our installation, and we were able to authenticate properly against both of them.

Then, to change this workflow, or this assumption, from the Horizon perspective, so we could create more users without giving admin privileges to everybody, the onboarding process created one user, the project admin. We added this project admin role in Keystone, and we added a new tab in Horizon, which you can see on the screen, called project admin. This new role was able to create users, but only in its own tenant. So the onboarding process creates a new project for the person coming through, that person is given the project admin role, and they can then add new users to their own project. We hardened user access that way: we did not allow everybody to have admin rights, and we made sure the users they created belonged only to the tenant they were working on, without access to all the tenants. So that was pretty much it.
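As a rough illustration of what the onboarding automation did behind the scenes — this uses the old keystone CLI from the Folsom/Grizzly era, the exact flag spellings varied slightly between releases, and every name and ID here is made up.

    # a tenant for the new project, its project admin user, and the new role
    keystone tenant-create --name projectX
    keystone user-create --name projx-admin --pass '<generated>'
    keystone role-create --name projectadmin
    keystone user-role-add --user-id <projx-admin-id> \
        --role-id <projectadmin-id> --tenant-id <projectX-id>

    # the delegation to the ibm.com / w3 directories was wired in through the
    # identity driver setting in keystone.conf, roughly along these lines
    # (module name hypothetical):
    #   [identity]
    #   driver = cloudfirst.identity.ExternalAuthIdentity

Keeping the extension as a separate module selected through keystone.conf is also what made the later upgrades painless, as comes up in the questions at the end.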
Now we'll go to the lessons learned; Esteban will present that. Thank you.

Hello everyone. I will go over a fairly forensic list of things we learned, and the operations we had to complete, for the platform to run successfully at the very beginning. Angel and Pablo discussed this a little: we were using nova-network. Even after we did the upgrade to Grizzly we didn't have the need to run Neutron, so we stayed with nova-network. One of the first challenges we faced was that we were running on a single floating-IP segment and we needed to extend it, because we had run out of IPs, and we found out we needed an adjacent segment. That needed some tweaking in the routing. Also, getting to the VMs through the floating IPs required some work: the VMs were basically not routing outbound inside the guest network, so we had to get into the code and add iptables rules for the outbound traffic. A rough sketch of the kind of rules involved appears at the end of this part.

Also, one of the decisions we made at the beginning was to use Red Hat all the way, on all the compute nodes and the controller. So we thought, okay, let's go with it on the Qpid side as well. We were running Qpid on Red Hat, and at that moment in time it was not that well tested, so we started to see some issues, some performance issues and some memory-leak issues that I'll discuss later, and that impacted the operation.

As Pablo was saying, at IBM we have a set of rules for security and compliance, so we went through a process of creating our own qcow2 images based on specific needs, disabling certain ports and so on, because the images would be deployed by customers and we needed to make them as secure as possible. Another issue we had, which some of you may have faced as well, was with the Ubuntu 10 images: resizing of the root partition during instance provisioning was not working properly, it didn't resize, due to an issue with the cloud-init package. So we needed to get rid of the Ubuntu 10 images and move to the Ubuntu 12 images.

For modularity tweaks, we went into the discussion of using VMs for services and splitting up the controller node. The advantages of that approach, for those of you who use it: VMs are easier to use and cleaner to upgrade, so you don't burden yourself with bare metal. The moment a VM stops working, for stateless services you just go ahead and start a new one. They are also easier to move between hosts: you can have a VM running on a host, and the moment you need a maintenance window on that host, you just move it and the VM keeps working; that works for resilience as well. If you like that approach, you have to take into account that there will be an impact in terms of performance. Things that are I/O-intensive don't do well in virtual machines; that is our experience. To be more specific: MySQL on a VM runs perfectly. The broker, whether Qpid or RabbitMQ, works well on a VM; you just have to use the distribution that everybody uses. Horizon we ended up putting on a VM behind nginx, and it performed very well; in fact we did that so we wouldn't have the problem of losing the portal. But for Swift, Cinder, and Glance, it depends on the workload you run: if it's I/O-intensive, I wouldn't recommend it, because you'd have a bottleneck there.
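Going back to the floating-IP issue at the top of this part, a rough sketch of the kind of tweaks involved; the addresses are made up, and the iptables rule is only roughly equivalent to what was actually patched into nova-network.

    # 1) the adjacent floating-IP segment needed a route on the network node:
    ip route add 203.0.113.64/26 dev eth0

    # 2) outbound traffic from the guest network had to be NATed out explicitly,
    #    roughly equivalent to adding a rule like this to nova-network's SNAT chain:
    iptables -t nat -A nova-network-snat -s 10.20.0.0/16 \
        -j SNAT --to-source 203.0.113.10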
On resiliency considerations that we applied: for the operating system we just had a simple RAID 1. I've heard at this conference that you shouldn't go for expensive hardware; let's say these racks already had the RAID controllers, so it continued to be commodity hardware. About the operating system: is it that important? You can reload it fairly simply, really easily, but the operation will be impacted, and it doesn't take much. For the compute nodes, when you don't have centralized storage, we do recommend a RAID 10, so you at least have the opportunity to recover from certain losses and the operation won't be impacted — because, as we'll see at the end, a monitoring tool starts telling you in advance, so you don't end up with your boss on the phone telling you that you just lost a compute node. For Glance and Cinder we followed the same approach, so while we were doing the upgrade we re-architected to this RAID 10 configuration; it was kind of a just-in-case architecture.

Now we'll look at upgrading. At that point in time we went from Folsom to Grizzly. The very first thing we found was that all the documentation said upgrading an OpenStack installation in place was a huge amount of work, and the documentation itself was very minimal. So we decided to go ahead with it anyway. It's important to be clear about what you expect to get out of the upgrade, not just doing it because it's the latest and greatest. We had some features we wanted, for example the conductor service, which eliminated the direct access from the compute nodes to the database that was impacting our performance, and the opportunity to re-architect was also a very good reason behind upgrading. We split the control node into a number of VMs, like I already explained, we scaled out the API services, and we scaled out Horizon.

This is a list of important facts you might face doing this upgrade, or going from Grizzly to Havana. The first is the approach: we strongly believe that in order to do an in-place upgrade you have to rehearse. You have to do it over and over. We created a test environment, not exactly the same, of course, and not with the same data, but the main goal of that test environment was to go up, be operational, be functional, and go back down. Of course, that's what you want in production: the moment you start facing strange issues, if everything just fails, you can go back.

Another challenge was the database schema; it changed quite a bit. There were new tables and new columns in existing tables. This was kind of crazy, because we ended up with a particular issue where they introduced a new member role in the code, and the moment you started to make it work, when an admin added the new member role it started deleting the rest of the roles, just because the names were different. That was very interesting. We also faced an issue within the hypervisors: there were some orphaned VMs, and when the new packages came in they tried to start those VMs, but those VMs, present in the hypervisor, were not in the MySQL database. The moment Nova sees that the two lists are different, it starts going crazy. It was a manual process to clean up: you had to actually go into the tables, see which VMs were in the OpenStack database with a valid status, and compare that with the actual inventory on the hypervisors.
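A rough sketch of that cross-check; the host name is made up and the query is simplified, but the instances table really does carry the uuid, vm_state, host, and deleted columns being filtered on.

    # on the suspect hypervisor, what libvirt has defined:
    virsh list --all --uuid

    # in the Nova database, what OpenStack thinks should live on that host:
    mysql -u nova -p nova -e \
      "SELECT uuid, vm_state FROM instances WHERE host='compute05' AND deleted=0;"

    # anything defined in libvirt but missing (or marked deleted) in the table was a
    # leftover to undefine by hand before bringing nova-compute back up.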
Also, the db sync on certain databases failed; you had to update the layout manually. And we had some issues with the VNC services, which we had enabled for certain environments: there was an incompatibility between the OpenStack consoleauth service and the rest of the services.

On this slide, something important: we tried to do too much at the beginning. We tried to do the upgrade, move Glance, and move Cinder, and we found it was too much to chew. So we ended up doing just the upgrade to the platform, which is not just a yum upgrade, you actually have to do a lot of steps, and then we did the Glance and Cinder moves. Another important aspect: we integrated Horizon and Swift from the community version into the version we were running, so that was a good step. Other than that, like it says, the upgrade process was straightforward; you just have to rehearse and prepare yourself for the issues that will come. They will show up no matter how much you practice.

This is how the architecture looks today. I wanted to highlight that we use four NICs for all the operations: we split the management, the images, the communication between VMs, and the outbound routing. It's our reference architecture, and for those of you who have seen the software and the IBM deployment, it's fairly easy to get there.

Lastly, I wanted to mention monitoring. For those who attended the previous conference: you can know in advance what is wrong with your platform just by using regular checks. We use several — SSH, load, MySQL — and there's a market for this, so you can get plugins to check almost everything. This is the way it looks: the moment you see a red sign there, you get an email, all your peers do, and that's it, you will be informed.
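The tool isn't named in the talk, but the description — plugins, a red/green status page, email alerts — matches a Nagios-style setup, so here is a purely illustrative sketch of the checks mentioned; host names and command details are hypothetical, and real definitions would inherit more settings from a template.

    define service {
        use                  generic-service
        host_name            controller01
        service_description  MySQL
        check_command        check_mysql!monitor!<password>
    }
    define service {
        use                  generic-service
        host_name            compute05
        service_description  SSH
        check_command        check_ssh
    }
    define service {
        use                  generic-service
        host_name            controller01
        service_description  Load
        check_command        check_nrpe!check_load
    }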
Okay, so I'll leave it to Angel. Thank you, everyone.

So, in summary: again, OpenStack still remains an emerging technology. One of the things we found throughout the whole experience is that error handling and alerting are not robust enough; sometimes you find yourself receiving either too much information or too little. Also, as you know, the volume of RPC calls has actually increased from release to release; we've kind of lost count already of how many messages we get now. Logging is not really optimal either. One of the issues around scaling out is that each control node has its own logs, so trying to figure out what happened from point A to point B becomes difficult when you have a lot of compute nodes or a lot of services running in your OpenStack. You need to be willing to look at code; that's one of the things we learned the hard way, I guess, because many times we just had to step through the code and try to figure out why things break, or why they behave the way they behave, and that becomes a little challenging.

One more thing we really realized is that the upgrade, as Esteban said, is a straightforward process if everything works well, but it's not a single integrated step. You can do a yum upgrade on some components, but you still have to do something different for the database, for example. We could probably use some improvement in that area, and I think it would be great if we could start contributing to that part. Let's see — we also talked about the iptables configuration and floating IPs.

So I think we're pretty much out of time, but I would like to open it up for questions from anybody.

Yes — did you guys experience any attacks, like a DDoS attack?

Actually, we did experience something, though it wasn't a DDoS or DoS attack per se. I mentioned we were working with Bluemix, right? Bluemix uses a piece of technology called BOSH in the background, and BOSH is pretty chatty; it asks a lot. You're laughing back there because you know how it is, right? We got to a point where I think about three to six nodes were making around ten requests per second each to our system, and it really bogged us down. But outside of IBM, nothing like that at all. It was our own boo-boo, in other words.

Yes — you had mentioned extending Keystone and adding some adapters for authentication. What challenges did you face keeping that integrated code working as you went through upgrades?

Well, it actually wasn't bad. One of the issues we ran into was that some IP addresses were hard-coded in some of the code, but for the most part it worked fine; we didn't really have to make major changes to the code on our end.

So you were able to make it modular and abstracted enough that, when the upgrades came along, you could upgrade the bulk of the code and then bring your adapter back in?

Exactly, right. The changes were not embedded in the actual OpenStack code; instead we brought those adapters in. We developed our code in a modular way: we pretty much told Keystone, in the config file, to use this adapter to do the authentication. So we used the same interface to communicate properly with Keystone, with our module living aside from the main OpenStack code, and that's why the upgrades didn't hit us at that point. — I got you, thanks, appreciate it. — You're welcome.

Yes, sir — how big were your teams, and how long did it take you to get to fully adopting OpenStack?

Say that again — how big were the teams? Well, you're looking at it. It was three people at most; I think at one point we transitioned somebody, so at most four people, but for the most part it was about three and a half.

Anyone else? Well, thank you everyone, and I hope you enjoy the rest of the conference. Have a good day.