Self-service virtual environments based on OpenStack at Deutsche Telekom. For those of you who just attended the keynote from Mark Shuttleworth, to avoid any confusion: we are not the part of Deutsche Telekom that he was mentioning. We're using Ubuntu, but we are only working in a small, internal environment.

What I'm going to talk about is a short history of our project. We're calling it Wolke Sieben. The German word Wolke is the translation of the English cloud, and Wolke Sieben is basically what cloud nine is in English.

A few words about me. I'm Alex Schaverk, 43 years old. I started to work for Deutsche Telekom back in 2001, and for three and a half years now I've been working as a cloud architect. I've been working with OpenStack since the Diablo release.

At the end of 2011 there was an organizational change at Deutsche Telekom, and we were formed as a team called Infrastructure Design. We started to look at automation technologies such as Puppet and Chef, and also at infrastructure tools like OpenStack. About a year later we had found two basic use cases. The first one is plain infrastructure as a service: we wanted to provide virtual machines to internal customers for their testing purposes, so that they can get machines quickly and get things running at a very fast pace. The other use case was managed environments, a kind of platform as a service: we've set up a complete CI/CD environment, which I'm coming to later.

Both use cases are served from one cloud environment. We have one OpenStack installation; part of it is provided as plain infrastructure as a service, and everything that is running in the PaaS environment comes from one specific tenant. As I said, we are serving only internal customers.
We have no public offering that is available to end customers. In the initial design phase we committed to two basic principles. The first was that we wanted to stay open: open source, open standards, no proprietary software. The other was that we wanted to build our own thing and not bring in any partners to help us deploy and set up.

A few words on the infrastructure we're running. We're running on Supermicro servers, basically two models. The first is the 2U Twin, a two-unit machine comprised of four single servers; that's what we use for compute nodes and for basic infrastructure. For storage we use plain 2U 12-disk machines. Everything is running on Ubuntu Precise, because we needed the long-term support version. We're currently looking into migrating to Trusty, but we're not sure when this will happen.

We're using upstream Ubuntu OpenStack packages from the Ubuntu Cloud Archive, so we have no internal development of our own packages. The same goes for our storage: we're using Ceph packages from the Inktank repository. For the bare-metal deployment we're using FAI. I don't know if any of you know it; it was initiated at the University of Cologne. It is basically a PXE boot server that sets up your hardware, up to the point where the machine is able to talk to our Puppet master, and the rest of the deployment is done with Puppet. We're monitoring all this with Nagios and the Check_MK plug-in.

Just to give you an idea of the size of our environment: it's really cute.
We're running 12 compute nodes with two six-core Intel processors each, and three storage nodes that make up about 60 terabytes of net capacity. We use Cisco switches for the 10 GbE network and also for the 1 GbE administration network.

When we came to planning and setting up the environment, the first thing we realized was that setting up OpenStack itself is quite painless. There are very good official docs available, and it's very easy to get OpenStack up and running if you follow them. But we needed a bit more: we needed to set up a complete data center. We had to find a location for it, and we needed connectivity, some basic infrastructure services like DNS, mail and proxy, and of course all the automation around it, which is our install server, Puppet, monitoring and so on. So this was a somewhat complex setup.

Another thing is that we have some interesting security requirements that forced us to deviate from the official docs, which are very basic in terms of network segregation. We had to implement many more networks, many more VLANs, and separate administration from storage, from monitoring and so on. What took us the most time was adapting the Puppet modules that we took from Puppet Labs, to make them install OpenStack the way we needed it.

That gave us the ability to provide infrastructure as a service via a simple Horizon dashboard, which was adapted a little to our needs. The most notable extension was the integration of OpenID for authentication. It is a security requirement at Deutsche Telekom that systems must use two-factor authentication, so we took an internal platform that was able to provide two-factor auth and attached it to Keystone, which was pretty easy.

So we had the ability to give customers self-service virtual machines the way they needed them. The next step was to set up virtual private environments.
That was our idea of giving customers the ability to have not only a virtual machine deployed with a simple mouse click, but a complete LAMP stack, or a web server, or a Tomcat installation, whatever.

The first thing we did was take a few simple cloud-init scripts and publish them on our internal website, for, let's say, Apache, MySQL or Tomcat. If a user starts a machine, he can simply copy that script, paste it into the user data field in the dashboard, and a few minutes later he's up with a LAMP stack or a MySQL install or whatever. This is still full self-service: all the control of the machine, and the responsibility for complying with security restrictions, is in the hands of our user, our customer. But by providing these scripts we can ensure that users who follow them are compliant with our internal security rules.

This is an example of such a script. It sets up, as it says here... oh, sorry, it's an Apache worker. Basically there are two packages installed, then we disable some modules and make some adjustments to the Apache configuration to meet our security compliance.

The next step was that we decided to provide a complete CI/CD environment for Java development, because that's the main language used in our part of Deutsche Telekom. We've set up a mixture of a Git repository, a Jenkins server and a Tomcat application server with an Apache in front. The user simply checks his code into the Git repository; it then goes via a post hook to Jenkins and is checked for errors, and if everything went well, it is simply deployed to the Tomcat as a JAR file. So the users don't have to care about machines or scaling.
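As a rough illustration of the kind of cloud-init script described above (install two packages, disable some modules, adjust the Apache configuration), here is a minimal Python sketch that renders such a user-data script. The package names, the disabled modules, the hardening directives and the config file path are all assumptions chosen for illustration; this is not the script that was shown in the talk.

```python
# Hypothetical sketch: render a user-data shell script for a hardened
# Apache worker. All concrete values below are illustrative assumptions.

PACKAGES = ["apache2", "apache2-mpm-worker"]  # assumed package names
DISABLED_MODULES = ["status", "autoindex"]    # assumed modules to switch off
HARDENING = [                                 # assumed hardening directives
    ("ServerTokens", "Prod"),
    ("ServerSignature", "Off"),
]
# Assumed drop-in path; it depends on the distribution's Apache layout.
CONF_FILE = "/etc/apache2/conf.d/security-hardening"


def build_user_data(packages=PACKAGES, modules=DISABLED_MODULES,
                    settings=HARDENING, conf_file=CONF_FILE):
    """Return a shell script as a string, one command per line."""
    lines = [
        "#!/bin/bash",
        "set -e",
        "export DEBIAN_FRONTEND=noninteractive",
        "apt-get update",
        "apt-get install -y " + " ".join(packages),
    ]
    # Disable modules that are not needed on a compliant web server.
    for module in modules:
        lines.append("a2dismod " + module)
    # Append each hardening directive to a drop-in config file.
    for key, value in settings:
        lines.append("echo '%s %s' >> %s" % (key, value, conf_file))
    lines.append("service apache2 restart")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_user_data())
```

The printed text is what a user would paste into the user data field of the dashboard. Because it starts with a shebang line, cloud-init treats it as a script and runs it on first boot.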
The machines and the scaling are fully managed by our test and development team. The environment is set up with a combination of Bash and Python Boto scripts. This is not yet fully integrated into the dashboard; it's a process where you order it via email, and maybe there has to be a confirmation by a manager. But we are still very fast, because the pure deployment takes only half an hour. Maybe it takes a day if you don't get the approval within an hour or two, but we're still way faster than before, when we had to wait three or four weeks simply for the machines to be deployed, not to mention the software running on them.

That's what we've been running since March of 2013. It is still using nova-network, so the customers had no way to say, we need a segregation between front-end machines and, let's say, back-end application servers and databases. There was a lot of demand for Neutron, so for very fine-grained networking. On the other hand, this environment is only reachable from our corporate intranet. It's very well suited for application development, testing and so on, but there was no way to make those machines publicly reachable from the internet, because the networks are completely separated. So we decided we needed another environment with internet connectivity, one that is then not reachable directly from the corporate intranet, only via proxy and from the outside.

As we had some different requirements there (for example, we did not need these development environments), we started again with a plain infrastructure as a service setup, and we started this in mid-2013. We moved to a new location, because where our first environment was running there was not enough internet bandwidth available, so we needed to move to a new data center provider. Again, that meant we needed to set up a complete data center with all the services around it, because the only things we got from our provider were two racks, power, cooling and two network cables for the internet upstream. This took us around five or six months until we were ready. We were ready for production at the end of April, start of May, just a few weeks ago.

Currently we're looking for a dedicated operations team, because what we've seen is that we are pretty well able to develop the environments, to set them up and get them running, but we are not the guys to do 24/7 operations. There were some negotiations last week with colleagues inside Deutsche Telekom, but I'm not yet informed about the outcome.

As we switched to the newer version of OpenStack, we're running Havana there. We have Heat installed and it's usable by customers, so they can use their own Heat templates to install complex application stacks. But we have not yet been able to port the old Bash and Python scripts that we had for deploying our CI environment over to the new platform.

Going from Folsom to Havana, and from nova-network to Neutron, brought a few infrastructure changes that were sometimes difficult to master. The first one is that there are no more dedicated network nodes. We have all the DHCP and L3 agents distributed amongst our compute nodes, so we're trying to balance system utilization by having them provide compute power on the one side and networking functionality on the other side.
There is one central Neutron server that does all the scheduling, while the packet transport, the packet switching, is done inside the compute nodes. This is not very well supported in Havana yet. So we had to write a few Python scripts that monitor the availability of our agents, that check whether they are responding and whether they're reachable, and if anything goes wrong, they just take the agent and reschedule it to another machine so that it stays up. This works, but we have no long-term experience; as I said, we just started deploying this at the end of April. Hopefully there will be better ways to handle this in Juno, and I'm pretty confident there will be; there are some interesting blueprints.

What we are currently doing is looking into Icehouse. We had the idea to just take the Havana environment and migrate it to Icehouse, because there are no structural changes between the two releases that would make trouble for us. The major issue for us is actually the operating system. As I mentioned, we are running Ubuntu 12.04, and there are no packages available from the Icehouse release for this version. So migrating from Havana to Icehouse, which should be pretty easy, at the same time means for us that we have to migrate from Ubuntu 12.04 to 14.04, which is something we have not yet been able to do, because this has to be tested first. Maybe we're going to do it in the next few weeks.

The new environment brings some new hardware, but there are no real changes. We've changed the switch vendor from Cisco to Alcatel-Lucent, and the only other thing is that we've blown up our storage capacity to more than 300 terabytes net, which is still quite cute, I know.

So, what have we learned from all this in our journey with OpenStack? The first slide lists some technical points.
We've seen that infrastructure deployment is really a tough thing, even if you start on a green field where you simply have enough power, cooling and an internet upstream cable. There are many, many things you have to care about, and there's no really good documentation available for this in its whole complexity. You have to take a piece here and a piece there, which was very hard because we had decided to do all of this ourselves. But in the end it was a lot of fun, and we all learned quite a lot.

The next thing is that if you deploy OpenStack and you are not able to fix the code in case you run into a bug, that makes things very difficult and sometimes really frustrating. You know there is a bug fix; it's available, but it's not available for Folsom, it's available in Grizzly. There is an existing backport, but it never made it into the official packages. We are a team without real development power; we're all infrastructure guys, system guys. We did not want to provide our own packages and simply relied on upstream, and that was sometimes very hard. In the end, with our new environment, we have started to patch the packages where we need to. It's nothing we do with passion, but it simply had to be done.

The old environment has some tiny bugs. We know how to get around them, but sometimes they hit us, and then we were really in trouble. One example: a suspended machine that lives on a hypervisor that is rebooted is gone afterwards. You can't delete it.
You can't unsuspend it either. We found out that it is sometimes possible to recover it by resetting the state to active and doing a hard reboot, which was also mentioned yesterday in one of the talks I was in as something that might fix it, but it didn't work all the time. So we have in fact lost some machines due to unexpected hypervisor reboots, just because there was no way to resurrect them. It has something to do with nova-network and with bridges and VLANs not being recreated properly when you unsuspend the machine.

In the new environment we found out that HA for Neutron is a very complex beast. As I already mentioned, that's not really working in Havana. There are some steps in the right direction, and as I said, I hope that Juno will bring us relief here.

The last thing from the technical point of view: deploying multi-tier applications without Heat is no fun. It is difficult because you have to care very much about synchronization. You start your web server and connect it to the database, and the database is not yet deployed, so it all breaks. Or you start to create your database without the engine running. So you have to take all the dependencies into consideration and take care of them yourself. We've had some tests with Heat that look very promising in exactly that area.

The next slide has some organizational lessons we've learned, where we've seen certain kinds of problems arise. The first thing is that we need really proper internal stakeholder management to ensure proper funding. We've seen the problem that we're providing a platform that enables other units inside Deutsche Telekom to do less cost-intensive production, but the unit that makes the investment for providing the platform is not the unit that finally gets the cost savings.
So we have to get all stakeholders into one room and get them to agree on the project and on the funding, because this sometimes causes interesting effects in large enterprises.

Another fact is that many customers don't know the difference between virtualization and cloud computing. We've seen many cases where customers came over and said, "Hey, you have a platform, it's really cheap, and we would like to use it and get our machines on it. How about running a RAC cluster on it?" So they have their legacy application, or an Oracle cluster or whatever, which relies on the availability of one single database instance all the time, or simply is not able to switch to another application server if one is not responding. It's not stateless; it's not cloud-aware. They come to you and say, "Hey, I have this, we need to run it on your platform, make it go." Then you try to argue that it's not possible, that you must redesign the application, or at least make some assumptions that may not be present in the legacy code. And the only thing you hear is, "We have no time, we have no money for a redesign or for adapting our application. Make it run." Sometimes this is even enforced by management.
That really gets you into trouble. In the end you end up with an infrastructure or operations team that is not happy, because it has to handle lots of pets instead of using a cattle-based approach, and on the other hand you have a customer that is not happy, because it's not running the way he expected.

That's closely tied to the next point: you need a kind of change management, because the switch to cloud computing touches at least 95% of your operations roles. You simply can't take a few guys who have managed classic Unix servers, Linux servers and clusters for, in the case of some of our colleagues, more than 15 years, put them into a cloud environment and say, "Hey, manage this." Everything is completely different. You have no time to run week-long test cases for new releases or new bug fixes. Of course you can do that in a cloud environment, but if you do, you lose all the agility and the speed advantage that you get from such a platform.

And last but not least, we found out that for large and even inhomogeneous teams, Scrum is not the right project management method. We started out as a Scrum project with, I think, 14 people in the core team. Imagine you have 14 guys in a 15-minute stand-up, and everyone just wants to give a short statement on what he has done, what he will do today and what's possibly blocking his work. It's simply not possible. The other point was that the spectrum of tasks we had to do was too wide for Scrum to really work, because Scrum kind of relies on everybody being able to do everything that's coming up in the team. So it's really a useful method for software development in teams of, I don't know, five, six, eight guys, but really not more, and only if every one of those guys is able to say, "Hey, you have problems, you're not getting on very well, let me help you." That is not possible if you have one guy who is responsible for infrastructure services like a mail and a proxy server, and on the other hand you have some Puppet ninjas. There's no way they can help each other. What we ended up with was something I'd call Kanban-like, where we had our daily meetings, but we had to kick Scrum because it didn't work very well for us.

That's it from my side. If you have any questions, I'm happy to take them.

Hello. Hi, my name is Charles King, from [affiliation unclear]. I have two questions. One is: is there any plan to use OpenStack for NFV at Deutsche Telekom? The second one: you mentioned that you decided to change the switch vendor from Cisco to Alcatel-Lucent due to the VXLAN capability. So are you actually using the hardware VXLAN capability of Alcatel-Lucent in your current OpenStack environment?

Regarding the first question: yes, there are plans to use NFV inside Deutsche Telekom, but not within Products and Innovation, where I am working. I have colleagues who are working on NFV projects related to OpenStack.

On the second question: the decision to switch from Cisco, or going even further back, the decision to use Cisco for the first environment, was simply because there is a large company agreement between Cisco and Deutsche Telekom, and our networking folks have been working with Cisco switches for the last 10 to 15 years.
So they are kind of Cisco experts. As we didn't have much time for hardware selection, we just decided to go with Cisco, because we knew it, we knew that it would most probably do what we needed, and we decided to accept the slightly higher price. With the second environment we had a bit more time for hardware selection, so we did a bit of comparison regarding features and price and finally ended up with Alcatel-Lucent, because there's a very attractive company agreement with Deutsche Telekom as well. Feature-wise, we are not using many advanced features of the switches. We simply attach every machine with two links, one to each of two switches; the switches are virtually stacked and speak LACP. And that's it.

So you are not actually using the hardware VXLAN capability in your OpenStack environment, right?

Yeah.

Okay.

Other questions?

Hello, my name is John Brzozowski, from Comcast. We've taken note of, and have some collaborations with, the folks who are doing the TeraStream project at Deutsche Telekom. One of the questions I had for you: I noticed that you're using a lot of the stuff that comes from trunk. In the work that we've done, and that we'll be talking about later, we talk about how we've implemented IPv6, and, to the comment earlier about NFV, we're looking to couple those two spaces together. I'm curious if you have any comments, or anything you can say, about whether you've gone down the path of v6 or not, and if you have not yet, what your thoughts are moving forward.

Well, for our environment IPv6 would be nice, but we can live without it, just for the time being. Regarding the TeraStream project, that is again another unit inside Deutsche Telekom. They made the decision to go IPv6 only, so for them there was no option to say, okay, we just stay with IPv4 because IPv6 in OpenStack will come in one of the next releases.
Hopefully. For us, we're pretty fine with IPv4 right now.

Are they sharing? Since Deutsche Telekom is kind of the overall parent company, are they doing their own kind of OpenStack cloud infrastructure work, or are they also piggybacking on your work?

It's very difficult inside Deutsche Telekom. We're running our own environments, each of the units. There is of course a lot of informal collaboration, but if you go up three or four management layers and then have to go down again to reach the other guys formally, it's pretty difficult.

Yeah, my company's quite large as well.

Please.

Yes, I am Jin Chul Kim from SK Telecom, Korea. I have two questions on your presentation. I'm wondering what the hardest pain points are in the migration from Havana to Icehouse. I think I saw that you're using the Ubuntu operating system for your OpenStack platform, and as far as I know, Ubuntu supports provisioning of nodes with MAAS and Juju. Aren't they enough for your migration use case? Or are there any other hard problems in your migration?

The problem, as I mentioned, is that we need to migrate the operating system first before we can migrate OpenStack from Havana to Icehouse. We simply have not tested it. But our experience with our deployment method is that every time we have a new software component or new hardware components in our environment, it takes us maybe a few weeks, but maybe three or four months, to get it all running again. So it might have been easier if we had chosen MAAS and Juju in the earlier stages. But at the moment there are no plans to change the deployment method itself, because we have some experts who have been working with the tools for three or four years now, and they are not very happy about adopting new tools.
They say, "Well, not again." So actually we're going to stay with FAI, and I think we need to test it. I don't expect great showstoppers simply from migrating from Havana; I think that should be quite seamless. Our problem at the moment is the migration of the base operating system.

Okay, I see. My second question is about your project management method. How many people are involved in your project team for the virtual environments project? You mentioned that 14 people were in your core team. Is that correct?

Yes.

And are any other people involved in the project?

Yeah, there are, I don't know, six, seven, eight guys involved, but not full-time. They are partly involved in certain tasks. For example, the setup of the scripts for providing those virtual environments was done by some people from our test and development department. Currently they're just managing the orders, so they are not involved in further development of our platform.

I think there might be some differences between our case and yours in how Scrum can be applied to project management. I'm leading a team of about 30 people on our OpenStack-based cloud management platform project, and Scrum is working better there, in our case. So I think there might be some other issues in your team with using Scrum. Maybe some of your members do not like Scrum itself; maybe there's some reluctance to adopt the method in some sense.
In some sense you're right that the pure Scrum method is appropriate for a small project team, with the assumption that all of the members know about the nature of the project. However, from my experience, I believe that Scrum might work very well for projects such as adopting open source infrastructure software like OpenStack, because open source software changes very frequently. So in order to adapt to those changes in your software development project, Scrum might be one of the right solutions for project management, from my experience.

I think what we ended up with for our project management was also a kind of agile method, even if I can't name it right now, because it had some parts of Scrum, some parts of Kanban and some parts of classical waterfall management. But I think we've still been working in an agile way. It was the formal Scrum processes that didn't fit very well. Maybe it was issues with our teams, where we had some specialists who had been working for 10 or 20 years with classical methods and were not able to make that transition very well. Maybe we should have a coffee on this.

Yes. Thank you for your answer.

Thank you. Please.

Hi, my name is Aaron [surname unclear], with the National Institutes of Health in Maryland. My question is about developer responsibility for the instances. You mentioned that developers in your organization are responsible for security policy compliance. How does that play out? Because some of our developers could care less about security; their focus is on making great apps.
Yeah, that is an issue. The point is, at the moment when you go with your machine to a greater audience than simply a four- or six-person development team, you have to be fully compliant with the security rules. As long as only a few people have access to the machine, it's not really important. But if you do not comply during the development cycle, it's very difficult to get things right at the moment you go out. So it is an issue, and we sometimes have, let's say, problems with this, but in the end it turns out to work pretty well.

So once an application goes to production and is forward-facing on the internet, who is responsible for making sure those instances are patched and compliant? Is it still the developer?

That's an operations responsibility.

Okay. All right. Thank you.

Thank you. Any more questions? I think we're running out of time. So, well, that's it. Thank you for coming, and enjoy the summit.