Okay, so it's now 20 past four; I guess we can start. Hello everybody, and welcome to my talk about OpenStack at a large retail enterprise: boon or bane. First of all, I would like to give you some information about me and about the company. Who is Metronom? My name is Thomas Longfits, and I'm the product owner of Compute Cloud, a product within Metronom. I have been working at Metronom and its precursor for 16 years. I started my career at Metronom as a network specialist: for roughly twelve years I was responsible for our wide area network and local area network. So I'm one of those guys who is always guilty when the application is not working; the network is guilty. I started with the OpenStack project in 2014. We began it as a project; it has since become a product, and I am the product owner of that OpenStack product, which we call Compute Cloud. In my free time I volunteer at the German Federal Agency for Technical Relief, and I am a Metronomian.

But what does Metronomian mean? Who is Metronom? For those of you from Germany: Metronom is not the railway company. Metronom is the tech unit of Metro, a leading international wholesale company and food specialist. Our slogan is: we set the pace in food and technology. We have roughly 2,000 people around the world, mainly working in Berlin, Düsseldorf, Hanover, Brașov and Bucharest. Of course, we have an ambitious vision: we want to revolutionize the entire industry with our services and digital solutions.

Boon and bane: what do I mean by that?
Bane, because our previous Metro IT structure was totally vendor-driven. Of course we used open source products and projects, but more as niche products; we were not really contributing to any open source community. We had a lot of very good departments and teams, but unfortunately they acted more like silos. Within a team, if you requested a solution or a product, everything worked pretty smoothly; as soon as a request crossed a team's borders, we lost a lot of time. From our point of view, that was quite inefficient. What we see is that the world is changing, and IT especially is changing very fast. From our point of view, IT is becoming consumable, like electricity. What users expect from IT is that it simply works, like electricity: you plug your power supply into the socket and everything works. You don't care what is needed behind the scenes; you just expect it to work. That's why we are changing our culture and mindset, and if you compare that to the history of Metronom and its precursors, you see this is a huge change we are driving.

Boon: cloud is becoming more and more important in the IT market. In the future we will see many more features and solutions that are only available in the cloud. More and more companies are trying to sell their products within their own cloud, not as on-premise installations anymore. From our point of view, the hybrid cloud approach is the future, so as a company we use public cloud environments as well as a private, internal cloud environment. Using a public cloud brings flexibility, because the big public cloud providers develop and invent new features pretty fast; that's something we cannot compete with. We cannot say, okay,
we can do the same as the big public cloud providers. On the other hand, not all data can or should be stored in a public cloud, due to legal requirements, internal policies, data privacy, data protection and so on. So from my point of view, an internal, cloud-based infrastructure-as-a-service provides safety and is a good alternative to the public cloud providers. OpenStack and the cloud business force us to change our culture and mindset, and we, and especially I, really appreciate that, because I believe this is the right way to go. OpenStack is open and provides the open infrastructure we need, and we are pretty happy with that.

But what are the advantages of a private cloud? First of all, everything is consumable as a self-service, so it's easy to start. It is fully integrated into our corporate network, so we do not have any connectivity issues. We take care of data protection and data privacy. And we are cost-effective: we provide a pay-per-use model.

Consumable infrastructure: what does that mean? Everything should be consumable as a self-service. The user should not have to request anything via a form; it should be available through an API or UI. As of today, a tenant still has to be requested via a web form; that is where we stand today. From the point we create the users in OpenStack, everything is in the hands of our customer.
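Self-service provisioning like this could eventually run directly against the Keystone v3 API. The sketch below is purely an illustration, not Metronom's implementation: the endpoint URL and token handling are assumptions.

```python
import json
import urllib.request

KEYSTONE = "https://keystone.example.com:5000/v3"  # hypothetical endpoint


def project_payload(name, domain_id="default", description=""):
    # Keystone v3 wraps project attributes in a top-level "project" key.
    return {"project": {"name": name,
                        "domain_id": domain_id,
                        "description": description,
                        "enabled": True}}


def create_project(token, name):
    """POST /v3/projects with a scoped token; returns the created project."""
    req = urllib.request.Request(
        KEYSTONE + "/projects",
        data=json.dumps(project_payload(name)).encode(),
        headers={"X-Auth-Token": token, "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Once a project and its users exist, every later step (networks, instances, storage) stays in the customer's hands through the same APIs.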
Everything can be controlled by the customer. We are currently working on an internal shop system to provision even the projects in OpenStack in an automated way. We have some ideas on how to implement it, and it will be fully based on open source software. The PoC has not started yet, but we will work on that. The idea is that for a customer, a developer for instance, it should be very easy to go from an idea to testing or developing something in less than five minutes.

Fully integrated into our network: our internal cloud environments are fully integrated into our internal network. How did we do this? When we started with OpenStack we said, okay, we want a greenfield approach, so we did not try to integrate an OpenStack environment into our existing internal system landscape. Instead, we created a dedicated backbone for these cloud environments; we call it the inter-cloud backbone. To this inter-cloud backbone we have connected our internal environments as well as public cloud environments. To be able to reach existing resources in our classical data centers, we have multiple connection points between our inter-cloud backbone and the Metro backbone. Metro operates several data centers across the world; two of them are here in Germany, in Düsseldorf and Frankfurt, so I have drawn Düsseldorf and Frankfurt here as an example. We have two connection points between the inter-cloud backbone and the Metro backbone, so everything is highly redundant.

Data protection and data privacy: our cloud environments, our OpenStack environments, are for Metro only, so we are talking about company-internal applications and company-internal data. All hardware is located in Metronom data centers, or in data centers operated and managed by Metronom, so we have full physical access control over the hardware. We do not have any
issues in terms of GDPR with external service providers, as everything is operated by us and no external partner has access to our hardware. We strongly recommend full data encryption to all our customers, all our tenants, but it is their responsibility to decide whether they need it. We provide block and object storage, and everything is provided in the context of a project, so we do not provide any shared storage. Local storage volumes are overwritten when an instance is deleted. We are also working on full data encryption at the hardware level, for local storage as well as for our block storage.

Cost-effectiveness: Metronom is a non-profit unit of Metro, so we charge only for our costs; we do not make any profit. We reduced hardware costs dramatically by standardizing our hardware flavors; in fact, we as the Compute Cloud team have just five different hardware flavors. We charge our tenants via a pay-per-use model. That is different from what we did in the past, where a customer requested resources and had to pay for them even if he did not need the full amount. The smallest unit in our pay-per-use model is one hour, and we charge for instances, storage and public floating IPs. We provide our tenants two different kinds of floating IPs: one in an internal 10.x network, and one in a publicly routable IP network.

Our OpenStack environments are based in three data centers across the world: Düsseldorf, Frankfurt, and Shanghai in China. We currently have six deployments in production. Two of them are based on SUSE OpenStack Cloud 5, so they are Juno-based; these were our first productive environments.
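The pay-per-use model described a moment ago, with one hour as the smallest billable unit, can be sketched as a tiny calculation. The rates below are made-up placeholders; the talk does not disclose real prices.

```python
import math

# Hypothetical hourly rates per unit; placeholders, not Metronom's prices.
RATES = {"instance": 0.05, "storage_gb": 0.0001, "floating_ip": 0.01}


def charge(hours, instances=0, storage_gb=0, floating_ips=0):
    """Pay-per-use charge; partial hours round up to the one-hour minimum."""
    billable_hours = math.ceil(hours)
    per_hour = (instances * RATES["instance"]
                + storage_gb * RATES["storage_gb"]
                + floating_ips * RATES["floating_ip"])
    return billable_hours * per_hour
```

A tenant is thus billed only for the hours resources actually exist, rather than for a fixed up-front reservation as in the old model.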
We implemented them in 2015. At that point we decided to go with a commercial distribution, to have the opportunity to call a service engineer in case something failed. After a while we figured out that the SUSE distribution is pretty fine, but unfortunately, at least at that point, it did not provide the flexibility we needed. So we decided that our next release would be based on a fully open source version. Unfortunately, we cannot upgrade from the SUSE distribution to our internal OpenStack release, so we decided to decommission these two SUSE environments by the end of this year.

We have four productive environments based on OpenStack-Ansible Newton: two in Düsseldorf, one in Frankfurt and one in Shanghai, and we have options for two more, in Moscow, Russia, and a second one in Frankfurt. All of these environments are fully independent of each other, even in terms of switching hardware; the control plane is fully independent. Why? Because from our perspective, if you want real high availability for an application landscape, it makes sense to spread the application across multiple independent environments and to use either DNS load balancing or classical load balancers to make the application available to the customer. So if we have an issue in one of our OpenStack environments, be it a software issue, a hardware issue or even a human error, we can be sure that the outage will not influence our other environments. That's why we spread them and made them fully independent of each other.

We do have some prerequisites for our customers. We always tell them that they should be prepared for any kind of failure: a VM could fail at any time in our environments.
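That recommendation, spread an application across independent environments and publish only the healthy ones via DNS or a load balancer, can be sketched as a simple health-check loop. The environment names and health URLs below are invented for illustration.

```python
import urllib.request

# Invented endpoints: one application deployment per independent environment.
ENDPOINTS = {
    "duesseldorf-1": "https://app.dus1.example.com/health",
    "frankfurt-1": "https://app.fra1.example.com/health",
    "shanghai-1": "https://app.sha1.example.com/health",
}


def healthy(url, timeout=2):
    """A deployment counts as healthy if its health check answers HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False


def serveable(endpoints):
    """Environments that should currently receive traffic (e.g. via DNS)."""
    return [name for name, url in endpoints.items() if healthy(url)]
```

An outage in one environment, whether software, hardware or human error, then simply drops that environment out of rotation while the others keep serving.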
We do not provide high-availability VMs, and even a compute host could fail at any time. For those applications that cannot deal with that, we as a company still have our classical virtualization platforms, based on closed source products. Again, as I already said, we strongly recommend our customers spread their applications across the environments if they need high availability. But we do not leave them alone: we work with them and educate them on how to deal with infrastructure failures and how to design their application landscape. We offer support and consultancy, but we do not implement anything for them. They will not get a managed solution from us; they will just get an unmanaged VM.

The software we use to provide our services is based on Ubuntu 16.04 LTS. For OpenStack, we decided to go with OpenStack-Ansible, because very early on we decided that Ansible would be our leading automation system, and OpenStack-Ansible provides a large amount of the playbooks we need to deploy an environment. Furthermore, it's very easy to integrate with ceph-ansible to provide the needed storage. Of the OpenStack modules, we use, from our point of view, only the base modules: Keystone, Nova, Neutron, Horizon, Cinder, Swift, Glance and Heat. Ceilometer we used for accounting until a couple of weeks ago. Unfortunately, we had some issues with Ceilometer: it was consuming a lot of CPU and a lot of RAM, and it caused serious issues in one of our environments because it was simply dragging down the control nodes. So we developed a different way to gather the information we need for accounting purposes, one that consumes far less performance.

In more detail, here is what our architecture looks like. For a single environment, we always say the minimum is three racks; we spread everything across three racks.
Why? Because a single rack is the maximum unit of an environment which is allowed to fail while everything else stays up and running. So per rack we have at minimum one controller, one to n compute hosts, a monitor node with a RADOS Gateway for the storage, one to n storage hosts, and some free space plus top-of-rack switches. Currently we are running with an end-of-row switch concept, but I think we will change that in the future to a top-of-rack architecture with a spine-leaf design. This should, and will, give us more redundancy against hardware outages.

On our network design for the compute hosts, or really for every host: as I said, we have five different hardware flavors. We have one flavor for the control and monitor nodes, two different kinds of compute host and two different kinds of storage host. From the network perspective they differ only in the number of 10G network interfaces: the control, monitor and compute hosts have two 10G interfaces, while the storage hosts have four 10G interfaces. We separated the storage traffic coming from the VMs on the compute hosts from the replication traffic within our Ceph cluster. Within the OpenStack network,
we are using a VXLAN- and DVR-based infrastructure. Unfortunately, a week or two ago we discovered a network performance issue caused by this architecture: we are currently only able to reach about 2.7 Gbit/s for VM-to-VM communication when the VMs are spread across multiple hypervisors, and it gets even worse if the communication uses floating IPs. We are currently working on a solution.

To give you some numbers on how big our environments are and how many compute nodes we are talking about: we currently host around two and a half thousand VMs in our environments. That's roughly two-thirds of what we host on our classical virtualization platform for, let's say, enterprise applications.

What are our challenges; what comes next? Something very important for us is that we must be able to install a full environment at any time, even without internet access. That's very important.
Why? When we built up the data center in Shanghai in China: everybody who has been to China knows that the internet in China is totally different from what we know as the internet in Europe, North America or the rest of the world. I always say the Chinese internet is more like a big intranet; you cannot be sure that you can reach the resources you need at any time, or with the performance you may need. Furthermore, we have a dev environment, a hardware-based development environment for our team, and we want to install this dev environment automatically every night, because every change we develop, every bug fix we make, is tested with a fresh installation every night. By the next morning, at the latest, we can see whether it works. That's why we must be sure we can install a full environment at any time. In the past we often hit failures where we could not download resources: module versions had changed, packages were no longer available, and so on. That's why we will mirror all the resources locally, so that we can install everything even without internet access.

The next big topic for us is building a CI/CD pipeline, and here I'm talking about OpenStack itself, not the workload.
We are not talking about building a CI/CD pipeline for our customers; we are talking about a CI/CD pipeline for ourselves. We want to ensure that every one of our environments has the same settings, the same configuration, and of course the same issues. That makes troubleshooting much easier, and bug fixing as well, because when you fix a bug it will be rolled out automatically. Currently we still have to work on that; we are not finished. And of course we have to optimize our monitoring, support and testing processes for it. Also, as we are running on Newton, we have to deal with a major upgrade: we want to upgrade our environments at least to Rocky or newer. The same is valid for Ceph: we are currently running on Jewel, and we want to upgrade Ceph as well to the latest version. That is still in progress.

I mentioned it already: we are changing our culture, we are changing our mindset. From our point of view, it's all about people. Simon Sinek once said, "People don't buy what you do; they buy why you do it," and I really love that sentence, because it says exactly what I think. Success, from my point of view, is not only a question of technology; it's rather a question of passion and people, and that's what we want to achieve, and will achieve, within our company. So we now hire people for their potential and their personality, not for their skills. If people have the right mindset, if they are open-minded and willing to learn, then skills like technology are something they can learn; but if they are not willing, they will not achieve anything. We also work in self-organized teams, and that's also pretty new for us and for our people, because in former days it was more like the department manager entering the office in the morning and telling people what to do that day. Now they have to decide, and they can decide by
themselves, what they should do and how they should do it. We drive innovation; we do not just follow it. In my team especially, I'm very proud, and very lucky, that all the people I have are very open, and we are pretty far ahead compared to other teams. We definitely work as a self-organized team: as product owner I only tell the guys what to do, and they decide together as a team how they do it, and we are pretty successful with that.

Now I come to the end of my talk. Here are my contact details; if you want any further information from me, drop me a line and I will respond within 48 hours. I'm open, I want to be connected to other people, and I want to work with the community. So thanks a lot for your attention, and if you have any questions, feel free to ask now. Please use the mics so that the questions are recorded.

Q: Just one question, something we are also struggling with. You said you encourage your application teams to use multiple availability zones in your regions. How well does this work; how many applications really do this?

A: We currently have a couple of applications; I don't know exactly how many. Yes, it is hard to educate and train the teams. But I make it visible to them that it is in their own interest, because if they fulfil that requirement they are able to deal even with multiple public cloud providers, and that's what I want to achieve in the end. We are the tech company of Metro, and I would like the applications within Metro to be highly available, so that they don't have to rely on any single solution we provide; they should be independent of us.

Q: Thomas, you mentioned that you are doing some bug fixes, and you are on Newton, which is obviously a two-year-old release. How are you handling the backporting?
Are you just running a forked branch, or how do you actually take things that are fixed later and apply them to what you have downstream? Do you have any details on that process?

A: It depends on what kind of bug we hit. For the bugs we have hit up to now, we luckily found a solution ourselves or a backport was available. But yes, we are now coming to a point where we do not get support from the community anymore, because Newton is simply too old. That's why we want to upgrade our environments to a later release. We also want to be part of the community: in future we want to contribute actively, and that's something we can only do if we are, let's say, further ahead in terms of the OpenStack versions we are running.

Q: I have a question regarding your CI/CD for your platform. Does it stop when the platform is installed, or do you have final steps, like running Rally tests, to do a load check on the whole system?

A: Currently we have not implemented that, but the idea and the aim is that we will run tests while deploying, and we will run tests continuously. We want tests running in our environments, let's say, every couple of minutes, and we want baseline data, performance data out of our environments, so that we can see when there is an issue before it becomes visible to our tenants.

Q: I have a question regarding your Keystone setup. Do you have a separate Keystone for each of your regions?

A: Yes, currently we have a separate Keystone per region, per environment, and currently this works with a local database, so we do not have any integration into our centralized identity management system.
That's also something we are working on.

Q: You said you had problems with Ceilometer and gathering usage data from your instances and environments. Can you tell us a bit more about your current approach to getting this accounting data, if you can share it?

A: Sure, no problem. With Ceilometer we hit, a couple of times, the issue that it was running at 100% CPU load, or consuming so much memory that the controllers ran out of memory. We fixed that by developing a script, a Python script, which gathers the needed information from the API. So we just connect to the API and request the data.

Q: A question on that: which API do you mean? We hit the same problem, and the RabbitMQ load from Ceilometer was going haywire, so I chose a completely different tool; we used Gnocchi, and it has worked for a while. More detail would be interesting.

A: Sure, it's the OpenStack API itself, and if you want more details, the people from my team are sitting here, so they can tell you exactly how they implemented it. Come to us afterwards.

Q: You talked about bug fixes. What is your procedure for deploying changes to production? Do you move them through a development environment and onwards to production; how do you upgrade your cloud?

A: We have a development environment based on hardware. We also have what we call a pre-production environment, which is likewise hardware-based.
So it's not a virtual OpenStack environment. We test changes and new features in the dev environment, reinstalling it automatically every night, and in the pre-production environment we roll them out as an upgrade. That way we see both behaviours: does our change work when we reinstall an environment, and does it work as an upgrade? If these tests are successful, we roll it out to the production environments. At least, that is the idea.

Q: How homogeneous is your infrastructure? Regarding the network infrastructure, for example: I've mainly heard about the OpenStack and server installation. Does this CI/CD job also include reconfiguring the whole network? Say you switch to an IP fabric and your switches need to be reconfigured; do you test things like that too, or is the underlay network, the physical network, a static thing that is just there?

A: Currently we do not test this yet, as we do not have the pipeline running, but in the end we want to test more or less everything over time; it will be a continuous development process. We will start, of course, with some basic tests, and then develop new tests over time.
Yes, we will also test the network, but our CI/CD will have no influence on our switch infrastructure. As I said, even for us the network is consumable like electricity: I expect a network port to be consumable like electricity. From the switch perspective, our data center colleagues provide us a simple network port with a base set of VLANs, and everything else is handled by us in an overlay network based on VXLAN and DVR.

Q: One question regarding the networking part. I saw that you are using VXLAN with DVR. Does that mean you have source routing for outgoing traffic, but incoming traffic still comes through the L3 virtual router? Is your ingress traffic still coming through your L3 routers?

A: No, the ingress traffic also goes to the distributed router, as far as I know.

Q: Okay, maybe my information is out of date. Another question: I saw a slide where you had a public cloud and also an integration with the private cloud. Can you share some more detail on how the integration works?

A: With the public cloud provider we are using, we have a leased line, a dedicated connection to that cloud provider, and that's how we connected it to our inter-cloud backbone. The inter-cloud backbone itself has a redundant connection to our Metro backbone. This gives us the flexibility to separate the traffic if needed, and a dedicated point where we can think about IDS/IPS solutions and the like, even if we are not using them as of today.

Q: The question was more about which services you are using from the public cloud. Are you just using it as
infrastructure, spawning some VMs directly on the public cloud?

A: No, we are mainly using VMs, but not only VMs: we also use the additional features which are available at the public cloud provider. From our point of view, that's the big advantage: we can use both.

Q: What do you think about the different deployment methods for OpenStack? Would you choose another method out there, like TripleO or OpenStack-Helm on Kubernetes, or would you do it the same way via OpenStack-Ansible? Any thoughts on this?

A: Currently we are staying on the same path, based on OpenStack-Ansible, but yes, we always keep an eye on these other methods, like TripleO and OpenStack on Kubernetes, to see what the advantages would be. Not now, though: unfortunately the team has not been at a size where we had enough free time to really evaluate them. Luckily we have solved at least that issue; new people have joined the team, they have to be educated, and then we will of course evaluate such solutions as well.

Q: How did you convince upper management to move from a vendor solution to a fully open source solution, even changing the distribution, since you are using Ubuntu now?

A: To be honest, that was easy, because upper management was asking us to look at OpenStack as an open source solution, to be able to implement an internal, private infrastructure-as-a-service. So we did not really have to convince them. But it will be a never-ending process: we have to prove that our solution works, that it is stable and reliable.

Q: Can you share one of the biggest challenges you faced during this entire adventure?

A: Sorry, I didn't get that.

Q: The biggest challenge you faced, whether in networking, storage, compute or orchestration.

A: In the past, getting hands-on with OpenStack: that was the biggest challenge.
It was, yeah, getting to grips with the complexity of OpenStack: getting a feeling for OpenStack, the deployment methods, debugging, of course. That was one of the biggest issues we had.

Q: You mentioned problems with communication bandwidth between virtual machines. Do you use Open vSwitch?

A: Yes.

Q: Okay, we have similar problems with similar bandwidth.

A: What we found is a possible solution which could at least reduce the effect. What we can see is that with Open vSwitch, DVR and VXLAN, there are a lot of soft IRQs which are handled by only a single core on the host machine; that core runs at 100% load, and that's what causes the bandwidth limitation. There is a feature available on some network cards for transmit UDP tunnel segmentation offload. It depends on the network card you are using whether it supports that offload, and what we figured out is that the operating system may also need to support it. So we do not have a solution right now, but that's the direction we are currently looking in.

Any other questions? Okay, then I say thank you for your attention, and thanks a lot.
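As an illustration of that last answer: whether a NIC advertises the transmit UDP tunnel segmentation offload can be checked with `ethtool -k`. The sketch below assumes the usual `feature-name: on/off` output format of `ethtool`, with the parsing separated out so it can be exercised without real hardware.

```python
import subprocess


def parse_features(ethtool_output):
    """Turn `ethtool -k <dev>` output into a {feature_name: enabled} map."""
    features = {}
    for line in ethtool_output.splitlines():
        name, sep, value = line.partition(":")
        if sep:
            # Values look like "on", "off", "on [fixed]", "off [fixed]".
            features[name.strip()] = value.strip().startswith("on")
    return features


def supports_udp_tnl_segmentation(device):
    """True if the NIC reports tx-udp_tnl-segmentation as enabled."""
    out = subprocess.run(["ethtool", "-k", device],
                         capture_output=True, text=True, check=True).stdout
    return parse_features(out).get("tx-udp_tnl-segmentation", False)
```

With VXLAN, the encapsulated guest traffic is UDP, so without this offload the segmentation work lands on the host CPU, which matches the single-core soft-IRQ saturation described in the answer above.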