OK, so welcome to our session. My name is Jakub Pavlík, I am the former CTO of TCP cloud, which actually started the implementation at Tieto. In the meantime Mirantis acquired TCP cloud, so now we are delivering Mirantis OpenStack there, and my position now is director of product engineering for the platform. I am here today with Lukáš Kubín.

Hello, I'm Lukáš Kubín from Tieto. I work as a cloud architect in Tieto for the OpenStack services we are doing. This is a use case story, so we will talk about the experience we had deploying what the guys from TCP cloud actually built, our experience so far, and we will also uncover some plans we have together for the next development.

First, I would like to say something about Tieto, for those who might not be familiar with it. Tieto is a major IT services provider in the North European countries. It delivers services to healthcare, government, financial services and industry, and is one of the biggest, perhaps the biggest, IT services provider in that market. We aim to become the customers' first choice for digitalization and also for all types of cloud services. The turnover of the Tieto corporation is around one and a half billion euro, and the shares are listed on the stock markets in Helsinki and Stockholm.

So where did we start? Going back to last year, Tieto was publicly offering cloud services based on the infrastructure-as-a-service concept, running on the VMware platform, plus typical physical hosting services and some very unique services, for example SAP hosting, dedicated to running SAP applications in a standardized way and delivering them from the cloud. What was missing in this portfolio, however, was a shared hosting service for horizontally scalable applications, something for cloud native applications to be delivered from the Tieto data centers. And I must say that even then, talking about the year 2015, we had already been working internally for some two years on evaluation, as most other companies did their internal evaluation of OpenStack, to recognize what this new kid on the market might be useful for. At that time we were evaluating and actually running internal development systems on the Havana and Icehouse releases, and a couple of those environments are still running, used for some internal development. In 2016 we officially started building a production service for customers, so it was no longer development; it was an end user service offered to customers, running on top of OpenStack, on top of the TCP cloud OpenStack, and we will tell you more about the journey which led to this.

When doing the evaluation, we set some targets to fulfill and some requirements on the OpenStack environment and on the infrastructure, to be able to deliver the services. Of course we wanted to avoid any kind of vendor lock-in. We know that we can't totally avoid vendor lock-in, but by using open source we can at least keep a lot of control in our hands and keep the flexibility to change vendors. We needed seamless on-demand capacity expansion, to be able to independently scale on all the infrastructure levels: the networks, the storage systems, the computes, all that kind of scaling. And infrastructure as code was something we realized later was actually our requirement as well.
So I will start with our beginnings. As I told you, we started some internal activities around OpenStack around 2013, or in OpenStack release terms, the time of Havana and Icehouse. What we were testing then was what we now tend to call installer-type distributions. We were dealing with Packstack, RDO, Fuel and that type of distros, which were quite easy to learn and easy to deploy on the infrastructure. It was easy for us to jump in and start doing something with OpenStack, and we got an OpenStack environment running quite quickly. In that time we were the happy guys seeing the first ping packets coming from one virtual machine to the other via the overlay. We wondered why and how that happened; we were running tcpdump to see how it is possible that the GRE tunnels really carry traffic through all these layers from one server to another.

Then things started to get slightly complicated, when we started to think about how to push this environment into production and how to let it be managed by more people. We realized that the installer-type distributions are focused mainly on deploying the environment, but don't give much support in managing the environment on the other days. They usually use something like a single flat file where we set all the initial parameters, all the IP addresses and configuration options for the initial setup, but they didn't allow us to do multiple levels of configuration, something we might call configuration tiering or layering. Let's say we would like to configure one group of compute nodes for some purpose and another group of compute nodes for another particular reason; that was not feasible using these single configuration files. So that's one experience we had with these distros, and actually any deviation from the prescribed way of configuration was quite hard to manage.

Another thing we realized while working with these kinds of distros was that there is actually no way to compare the configuration running on the servers with the state intended by the administrator from the initial deployment, as defined in the configuration files. Imagine that someone makes a change in the nova-compute config file, for example, and we would like to detect that a change was made which was not documented. There was no way to do this kind of configuration auditing, because there was no version control repository for these configs and no way to compare the configuration running and applied on the servers with what is in the configuration template.

Another complication was that this situation meant we had to do most of the changes manually on the servers. Of course that's easy, because we can use scripts to do some sed replacement on the config files and issue some restart commands, but each such change needs to be documented somewhere over time. So we had to document each of these changes in some wiki page or shared notebooks, to be able to reproduce the configuration changes later: say we decide to extend the system with another compute node, we would have to reapply all the changes we did on the previously deployed compute nodes on the new one as well, and not forget anything.
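To make that missing layering concrete, here is a minimal, purely hypothetical sketch in reclass-style YAML, the kind of structure that appears later in this story; the file paths and parameter names are invented for illustration. Shared defaults live in one class, a group of compute nodes includes it and overrides a single value, and a node only declares which group it belongs to:

```yaml
# classes/service/nova/compute.yml -- shared defaults for every compute node
parameters:
  nova:
    compute:
      cpu_allocation_ratio: 8.0
      reserved_host_memory_mb: 4096

# classes/cluster/zone1/compute_highmem.yml -- one special group of computes
classes:
  - service.nova.compute
parameters:
  nova:
    compute:
      cpu_allocation_ratio: 2.0    # override a single value for this group

# nodes/cmp101.yml -- a node only declares the layer it belongs to
classes:
  - cluster.zone1.compute_highmem
```

With classes merged this way, extending the system by one more compute node is a one-line node definition instead of a replayed history of manual changes.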
So that was a tough requirement, to really be that precise and document all these kinds of changes in this environment. We realized that this was a problem and that we had no tools to manage it. But I should note that I'm still talking about development activities which were not yet focused on delivering a production service at that point; that decision came later. That was part of the evaluation we were doing at the time. This situation led to a mixed configuration approach, where some of the configuration was deployed from the initial deployment and some of the changes were done later by scripts and manual adjustments.

I had been visiting OpenStack Summits since the Paris Summit, I believe that was in 2014. I could hear people talking there about this continuous integration and continuous delivery approach and about reproducing the developers' processes and tool sets on the infrastructure. But I couldn't find how I could do that, because none of the distributions was doing it. The people talking about it at the Summit at that time were mainly from companies that had tens and hundreds of developers, capable of taking vanilla OpenStack and fine tuning it to apply this kind of configuration approach.

So when the decision came and the company decided that we would build such an OpenStack service in production and offer it to customers, we already knew from experience what we did not want to follow; it was just what I have described. We were looking at the OpenStack distributions on the market, still of this installer kind, and because we were also dealing with details like software-defined networking, we realized that Open vSwitch might not be enough for us in a multitenant environment with a lot of external customer-dedicated connections. That was the time, in 2015, when we came across Jakub and his colleagues from TCP cloud, now Mirantis, because they were, and are, very skilled in OpenContrail, and we thought they could help us with the OpenContrail integration into whatever OpenStack distribution we chose. Very soon, however, we realized that what these guys were doing was exactly the fit and the solution for a lot of the problem areas I have mentioned: they were doing this configuration integration approach, and they store all the configuration in git, in structured reclass config files. So we decided to go with TCP cloud as the provider of the distribution for our OpenStack environment. The issues we resolved by that: we had an open source distribution, because the TCP cloud MK20 distribution is open source, and we fulfilled the automation for all the infrastructure management.
What I have not yet mentioned, and what was a trouble for us with the installer-type distributions, is that they usually care only about the OpenStack part and the supporting services which are very close to OpenStack: they sometimes care about Galera, they care about HAProxy, but they don't care about much more. For us as a service provider, to operate the service we need a lot more: we need monitoring for the whole infrastructure, we need log collection, we need metrics collection to support us in troubleshooting and root cause analysis. We needed the SDN integration; we realized that Contrail is the right solution, but it was not easy to find a distribution which works correctly with Contrail, for example. So that's what we got from the boys from TCP cloud. Boys and girls, sorry.

OK, so just a short roadmap of where we started with this production effort. It was 2015, about 12 months ago now, when we did the proof of concept installation with the TCP cloud distribution; now we are running it in production in the first availability zone, and we are deploying the second availability zone.

Talking about the networking, I mentioned it was a challenge for us because we have a lot of customers connected to our data centers. Usually the customers use dedicated lines or VPN connections terminated in the Tieto data centers, and we needed to extend all of these customer networks into the OpenStack environment and connect each of them only to that particular customer. That was a task practically impossible to implement using the standard Open vSwitch based distribution and network model; we would have had to deploy a lot of network nodes dedicated to terminating all these external network connections. That was the case at that time, of course; things might have moved on since, but I'm not following it carefully because I'm satisfied with what OpenContrail is doing for us, so please don't blame me if Open vSwitch can now do some particular thing. At that time, that was the situation.

We needed to ensure high availability of these specific external connections, and we needed the solution to be robust and able to talk to our hardware routers, because we deploy OpenStack in so-called availability zones, which contain all the management and control nodes with all the supporting features, in dedicated pods; each of these dedicated pods is edged by a pair of physical routers, and these routers then connect to the external data center networks. We resolved a lot of these challenges with OpenContrail. We had an SDN solution talking to our physical routers, which is very good for scaling and for interoperability with the surrounding networks; we achieved great performance on the networking level; and we had an open source based solution which is easily extendable. When troubleshooting a particular bug it's now very easy for us to check, for example, where a DNS request might have a problem because it is somehow not passing; we can look into the source code, identify the cause, and ask someone to fix it or do it ourselves. We are not quite there yet, we usually ask Jakub and the team to fix these things. We also chose the OpenContrail based SDN solution because we set interoperability with containers as a requirement for ourselves. We are not offering that kind of service yet, but we will need to come with it very soon, and we wanted to build a solution which would be ready for container support and ready for bare metal deployments, for example by supporting the OVSDB protocol to talk to physical switches as well. We found all of this in OpenContrail.
So, the facts of where we are today. We are running the MK20 distribution with OpenContrail 2.21, and we provide basically all the core OpenStack services, Glance, Heat, Nova compute; we are not running much more in particular, because we have focused on stabilizing the core services first, and then we will come with the additional ones. We use Ceph as the software-defined storage for the Cinder backend, for the block storage level, today, and we are evaluating enterprise all-flash storage systems to be offered as an additional storage tier, running as another Cinder backend.

These are the tools we are using, mainly the products I have been talking about already; it's not important to go into the details now. This is just an overview of the management infrastructure. All the small colored boxes are the management machines: there are OpenStack controller nodes, there are dedicated Galera database nodes serving as a backend for the OpenStack service databases, and there are a lot of other databases, for metrics storage and Ceilometer, for example, and some others. The reason we have this illustration here is to show the complexity of the management environment. On one side it gives us flexibility, because as opposed to some distributions which put all the stuff and all the tools on a single operating system, on a single kernel, running together, we requested these to be isolated into virtual machines running on a set of managed nodes. So all these supporting and control features are encapsulated in virtual machines running on top of the KVM hypervisor on the managed nodes. But there are quite a lot of them to manage, even though we now have a SaltStack-driven deployment procedure which is stored in git, nice evidence of all the changes, and a robust infrastructure based on all this. And we have realized that there are some new challenges.

Actually, I must say that with this distribution we have solved all the troubles we had with the installer-type distributions. We have a very flexible configuration structure: there is no longer a single flat configuration file where we have to pre-fill all the parameters, let it deploy, and have no chance to modify anything at a more detailed level. Now, thanks to the reclass structure, we can split it: we can create levels of configuration for one type of compute nodes and for another type of compute nodes; we can have a single configuration template for the Linux operating system, for example one for Ubuntu, which we use for the majority, and one for Red Hat, which we use for the Ceph backend, identity management and some other supporting features. So we can control how the configuration of the various services running together on these machines is deployed and managed. That's what we wanted: that, and the ability to compare the configuration and actual state on the servers.
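As a hedged sketch of what those levels can look like (invented names again, not the actual Tieto model), one Linux base class is specialized per distribution, and each node pulls in exactly one of the variants. And because the same stored model is the source of truth, a Salt dry run such as `salt '*' state.highstate test=True` can report drift between a node and the model without changing anything:

```yaml
# classes/system/linux/base.yml -- a single template for every Linux host
parameters:
  linux:
    system:
      timezone: Europe/Helsinki
      ntp_server: ntp.example.com

# classes/system/linux/ubuntu.yml -- the majority of the nodes
classes:
  - system.linux.base
parameters:
  linux:
    system:
      distribution: ubuntu

# classes/system/linux/redhat.yml -- Ceph backend, identity management
classes:
  - system.linux.base
parameters:
  linux:
    system:
      distribution: redhat
```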
Being able to do this comparison is actually important, because you can imagine it's not only a change of technology to move from the classical manual administration approach to this continuous integration approach. We are mostly admins; I have spent most of my professional career doing storage management and architecture, my colleagues are experienced Linux administrators, and none of us had ever worked with developer tools like Git and the rest. So we had to embrace and accept this developers' approach and be careful to really apply each configuration change properly: put it into Git, let it be reviewed, then merge it back into our production repository and push it by Salt to the managed nodes, to the managed services, sorry. It's always tempting to avoid this kind of process and do something manually, just to try how it works, but that's not the way, and of course we have set ourselves strict rules for how to work with this. Thanks to this approach we also have a way to check that the systems are really configured as they are intended to be, because we can at any time run a check which compares the configuration on the nodes with the configuration stored in Git.

So I must say there are some challenges, but we are satisfied with what we are running. What we see can and should still be improved is related to the structure of the management and configuration nodes. As I mentioned, it's over 30 management VMs, I counted them this morning, running these control features in each of the availability zones, and you can imagine those are over 30 operating systems which need to be patched and maintained. There are a lot of HAProxy instances to load balance every virtual IP address providing the clustering and availability features for the OpenStack services, so there are a lot of HAProxies and a lot of additional services just to keep all this running. That's a management effort which can today be addressed by a containerized approach, and that's the point where I should pass the talk to Jakub, to tell us something about the architecture we are now planning to implement and where we would like to move.

Yes. So, as Lukáš said, I will probably repeat several of my words from my session today, but the biggest issue here is that although it's working and you can reproduce it, it takes some time to build, because when you have to spawn all those management VMs and run the states, it can take, let's say, two hours, but it can also take one day, because of some mistake somewhere or a fault in a physical server. We have, I think, around six or seven physical nodes just hosting all those VMs for managing the OpenStack, and that's the problem: it's not as flexible as it should be. Therefore, yesterday I introduced the Mirantis Cloud Platform, which should cover three types of workload: it should provide Kubernetes itself for container workloads, it should provide OpenStack on top of Kubernetes for virtual machines, for the standard world, and we also want to provide bare metal provisioning through Ironic. What we are working on right now is building the second data center on the new approach, OpenStack on top of Kubernetes, and we have a couple of challenges, so let me show you how it should look. We will again have four or five dedicated physical nodes for the controllers, but we will no longer have virtual machines with KVM inside, running three virtualized Galera nodes, three Ceilometer nodes and three OpenStack API nodes for separation. Instead we install just three binaries on the host operating system: one is etcd for Kubernetes, one is the hyperkube binary which manages Kubernetes, and then we deliver Calico as a container. We touch the host operating system in only those three ways.
Everything else is delivered as Docker containers and Kubernetes pods. We label these five nodes in Kubernetes as OpenStack controllers, and then we provision in the standard way just the management of the network interfaces and OpenSSH, because we are moving everything else behind the API and Kubernetes. The rest is launched through kubectl: we start three Galera pods, three pods for RabbitMQ, and then the OpenStack and OpenContrail services, which is pretty flexible. Deploying Kubernetes on the physical servers took around five minutes, so within five minutes we can provision and delete the stuff. The biggest difference here is that you don't have to download packages. Even if you have a validated configuration management tool, say Ansible, Salt, whatever, and you have the packages validated, everything together, every time something is broken; I've never seen a deployment where I pressed the button and nothing had changed, some access to a repository, some package, something like that, and you have to rerun the whole state. The difference here is that we push prebuilt Docker images as binaries instead of packages, and we generate just the config files. For nova-compute we deliver libvirt and nova-compute as containers in a single Kubernetes pod, and the vRouter for OpenContrail is installed on the host operating system.

So what's the difference? It's easy to explain. Today we have three VMs with keystone-api, glance-api and the other APIs inside, with HAProxy and keepalived: keepalived switching the virtual IP address and HAProxy balancing. Tomorrow, or what we are in the process of doing right now, we have a Kubernetes service, which is a simple iptables DNAT that balances across three pods. It's much faster and much more efficient, and the complexity and the mistakes are decreased, because you don't have HAProxy and keepalived; even though keepalived is stable, I saw several times that its process was frozen or something like that and we had to restart something. So this decreases the complexity of the balancing.

For the mapping, to give you one example from the environment: in Kubernetes you have pods, a pod shares a namespace, and it can hold one or multiple containers. For Glance or for Nova right now we have a single pod with multiple containers, and you scale this pod. For a really large environment we are of course also able to scale, for example, just nova-scheduler or just nova-conductor, and that's the huge benefit, because today you have three VMs, three services, and you have to scale by adding an extra VM. Here I just change the number from three to four, run the Jenkins job, and it auto-scales in one minute and adds me a new nova-conductor, just the nova-conductor, if I need it; or I can scale it the same way as with VMs, with everything tied together, depending on the use case and the scale of the infrastructure.
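A rough sketch of the pattern being described, assuming standard Kubernetes objects rather than the actual MCP manifests (names, image paths and ports are illustrative): a Deployment runs the replicated control service, and a Service provides the stable virtual IP that kube-proxy implements with iptables DNAT in place of HAProxy and keepalived.

```yaml
# Hypothetical manifests for one control service, not the actual MCP files.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nova-api
spec:
  replicas: 3                # edit 3 -> 4 and let the CI job apply it;
  selector:                  # nova-conductor or nova-scheduler can be
    matchLabels:             # split into their own Deployments and
      app: nova-api          # scaled independently the same way
  template:
    metadata:
      labels:
        app: nova-api
    spec:
      containers:
      - name: nova-api
        image: registry.example.com/nova-api:mitaka   # prebuilt image,
        ports:                                        # shipped instead
        - containerPort: 8774                         # of packages
---
apiVersion: v1
kind: Service              # replaces the HAProxy + keepalived pair:
metadata:                  # kube-proxy programs iptables DNAT rules
  name: nova-api           # that spread traffic over the ready pods
spec:
  selector:
    app: nova-api
  ports:
  - port: 8774
    targetPort: 8774
```

Scaling just one service then really is a one-number edit: change the replica count and let the CI job apply it, instead of installing and wiring up a whole new VM.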
So how does the workflow look? I also showed it yesterday. You have the metadata model; what's important here, I think I have a slide about the metadata later, so I will explain it then. What you are doing is changing the YAML infrastructure model. Let's say you want to upgrade a component, so I replace the version Liberty with Mitaka, I do the git commit and git push, someone from my team or Lukáš's team checks the review and approves it, and Jenkins triggers the model update on the Salt master. The Salt master then just calls Kubernetes to deploy, which triggers the auto-scaling and the rolling update of the containers. In this case we have Artifactory as the Docker registry on this side, so it just downloads the new container from the Mirantis downstream, a validated, approved container, and then provides the output back to the Jenkins job. Whoever wants to see it live: I showed it in my session yesterday, so you can watch how it really works.

But what's most important here, as I have discussed several times: today Tieto uses the MK20 deployment, which is based on the reclass model, the YAML structure. As Lukáš said, they change the YAML and push it into git, and what happens then is that it calls SaltStack, which usually installs a package and changes the configuration, just the typical scenario of your standard world. And what we have here: we have the Nova configuration files, the Cinder files and all those kinds of files validated, working, and in production. So when we looked at how to build MCP and how to generate configuration files for the containers, we wanted to avoid a situation where I would have to maintain MK20 and also maintain a new MCP. Imagine that someone wants to add a new parameter to nova.conf: we would have to go into MK20 and add the parameter, then go into the container, and someone makes a mistake. So we took this model, and we are using exactly the same model for the VM deployment as for the containers. A customer can, for whatever reason, take three bare metal servers and say: I want Galera here, on these physical servers, I don't want to push it into containers, and the rest I want to launch on Kubernetes. That's possible, because we have a single model which consists of four parts: you define the bare metal servers, then you assign roles in the YAML structure, saying what they should be, these are Kubernetes nodes, this is the OpenStack Kubernetes controller which will run these services, and then you define the parameters for your containers. So we have the same model for both worlds, and we can cover the transition between the two worlds very easily, because I can imagine that not every customer can say tomorrow, I will do containers, because containers are cool and someone from a vendor told me it should be containers. That's not our way, because we are trying to push an operations, SLA-supported model, not just the cool stuff where someone says everything must be a container and delivers a job which triggers something that changes a kernel parameter or whatever. So with this approach we are able to manage the host operating system the old way and launch containers the new way.

The next advantage of the metadata model is that we want to support multiple deployments, so not one deployment, as is usual, but let's say 100 cloud deployments for big customers. You can have a single model with branches. Let's imagine you want to change the CPU allocation ratio: you make the change in the YAML, add the new line, test it in the test environment, and then apply it to the production environment just as a pull request or merge request between the branches. You have a node level, a service level and a global level, so you can apply one change to your five clouds and still manage one thing. You are not touching the binaries, you are not touching the Docker containers, because a Docker container in our case is just a binary, delivered instead of a package; you touch only one single repository where everything is, it's auditable, you see all your changes, and you can generate whatever configuration you need. And that's the point: for a customer like Tieto there will be no difference between what they have today, and how they approach it, and what they will have tomorrow.
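To picture the kind of edit this workflow carries, here is an invented fragment of such a metadata model (the real MK20/MCP structure differs; this only shows the shape): both the Liberty-to-Mitaka upgrade and the allocation-ratio tuning are single-line YAML changes that go through git review before anything touches a cluster.

```yaml
# cluster/customer1/openstack/init.yml -- invented model fragment; the real
# MK20/MCP structure differs, this only shows the shape of a change
parameters:
  _param:
    openstack_version: mitaka        # was: liberty; merging this one-line
                                     # edit drives the rolling update
    nova_cpu_allocation_ratio: 8.0   # tested on the test branch first,
                                     # then merged to the production branch
classes:
  - system.nova.control.cluster      # service-level defaults, shared
  - system.galera.server.cluster     # may equally land on bare metal
```

Whether a role defined this way lands on a VM, a bare metal server or a Kubernetes pod is decided by the classes attached to the node, not by a separate tool chain.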
This is one of our last slides. This is a nice job by my team on Horizon: you can see how we customized Horizon for Tieto, adding things like a billing feature and monitoring, everything in a single portal. That's another benefit, that we are able to very simply customize Horizon as needed, to provide, for example, a single console for the system administrators for everything, instead of multiple dashboards, one per purpose. So that's all from our side; if you have any questions, we can answer them.

Yes. So the question is: you are using OpenContrail, but you are also using Calico. I don't have a slide for this here, probably, so let me explain it. You have Kubernetes as the underlay, and Kubernetes works best with Calico, which is just L3 routing: it simply provides the network for the containers which run the OpenStack control services in this case. Then, on top of that, you can containerize OpenContrail on top of Kubernetes with Calico. For the controllers, OpenContrail is just another app: it needs the Cassandra databases, it needs ZooKeeper, it needs the OpenContrail config and control services and some other parts, so that's easy. On the compute node side, you install the compute node as a Kubernetes node for delivering libvirt and nova-compute, and you use host networking. Host networking means that all the pods and all the services listen on the same IP addresses the compute node has. You just put the kernel module for the vRouter there; the VMs are plugged into the vRouter, while libvirt and nova-compute are plugged into the Kubernetes Calico, which is distributed through BIRD. We also want to offer an option for customers who don't need an overlay in the VMs: use a single Calico for Kubernetes as well as for OpenStack, for both solutions.
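A hedged sketch of that compute-side arrangement, assuming plain Kubernetes primitives (image names and paths are ours, not the actual deliverables): libvirt and nova-compute run as two containers in one pod bound to the host network namespace, sharing the compute node's own IP, while the vRouter kernel module stays on the host.

```yaml
# Hypothetical compute-node pod: host networking, vRouter stays on the host
apiVersion: v1
kind: Pod
metadata:
  name: nova-compute
spec:
  hostNetwork: true          # containers listen on the node's own IPs
  containers:
  - name: libvirt
    image: registry.example.com/libvirt:latest
    securityContext:
      privileged: true       # libvirt needs device and namespace access
    volumeMounts:
    - name: libvirt-run
      mountPath: /var/run/libvirt
  - name: nova-compute
    image: registry.example.com/nova-compute:mitaka
    volumeMounts:
    - name: libvirt-run      # nova-compute reaches libvirt over the
      mountPath: /var/run/libvirt   # shared socket directory
  volumes:
  - name: libvirt-run
    hostPath:
      path: /var/run/libvirt
```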
I have mentioned it, but I will repeat the question. The question is how we managed the skills transformation of the administrators to the continuous integration way of working, and how we taught them this way of working. Well, yes, we started with some trainings given by the guys from TCP cloud on managing the platform, and actually it's not that big a problem to learn the tools themselves, like Git and SaltStack; it's a couple of days of training and some self-study, when the guys have some space to practice without actually harming any production. The challenge is mainly in the mindset, in the thinking of the people: to really think about doing the changes correctly, which usually means changing the configuration template. If the particular parameter is not yet parameterized in the reclass structure, for example, they need to start by requesting the template change, or doing the template change themselves and sending it for review, and not doing these changes manually on the servers directly, just believing that it's only a try and I will switch it back later. But we have implemented automatic procedures to check the configuration periodically, so that we know the servers are running the configuration as it is prescribed in the git repository, which is controlled.

No, it's not about the tenant workload. Tenant workload management is, let's say, a separate server management service which we offer via the standard Tieto offerings; we have large teams of people who are skilled in Windows, Linux and a lot of application systems, and they focus on the tenant workload management. My talk was mainly about the infrastructure, the backend of the OpenStack and all the supporting features. As I mentioned, OpenStack is just 50% of all the tools we need to run the whole environment; the rest are the monitoring and logging tools, the software-defined storage systems, the networking and the rest.

OK, so if there are no more questions, thank you for your attention. Thanks!