So I'm Adam Spiers. I'm a senior software engineer at SUSE, working on SUSE OpenStack Cloud. And my colleague will introduce himself. Yeah, hi. So maybe some of you expected Marc Houdara to be on stage today, but unfortunately he couldn't join the OpenStack Summit, so I will replace him. My name is Boris Mac; like Marc, I'm part of the cloud infrastructure engineering and architecture team at SAP. So today we will talk about the customer-vendor relationship between SUSE and SAP, and how we successfully established a DevOps model in this relationship. Yeah, and so this is a quick agenda of what we're going to talk about: first, setting the scene regarding the OpenStack strategies of the two companies. Then we'll look at what a typical vendor-customer relationship looks like, how that caused some issues for us, and why we felt we needed to go beyond that and develop a new model. Then we'll go into how we developed that new solution and what benefits we've got out of it so far, and we'll talk about what we're planning for the future, and then we'll wrap it up. Okay. Yeah, let's talk about OpenStack at SAP as well as on the SUSE side. I will start with a summary of why we chose OpenStack at SAP as our platform. As you can see on the diagram, the SAP IoT platform runs on the so-called SAP Cloud Platform; the SAP Cloud Platform, in turn, uses Cloud Foundry as its platform-as-a-service layer. Also, SAP wanted to be vendor-neutral and avoid any vendor lock-in.
So these two points — the Cloud Foundry platform and the wish to be vendor-neutral — led us to the decision that OpenStack would be the right infrastructure layer as the foundation of the whole stack. A few quick words about the product that I work on, which is SUSE OpenStack Cloud. That is one of the cloud products that SAP has selected as the infrastructure-as-a-service platform they build everything else on top of. It's an enterprise distribution of OpenStack; it's fully free open source software, and in addition to packaging an enterprise-polished version of OpenStack, it includes some additional deployment and management capabilities. So let's talk a bit more about the customer-vendor relationship and how it was applied between SAP and SUSE. As you can see on the diagram, we have an internal customer within SAP which is consuming the infrastructure-as-a-service platform. This customer, or stakeholder, talks to the SAP infrastructure dev team about its requirements; the dev team tries to implement those requirements and establish them in the production environment, so we have a connection between the dev team and the ops team within SAP. The ops team, of course, is responsible for running the whole stack and operating it, and in case of any problems and failures, the operations team can contact the vendor — in this case the SUSE L1 and L2 support layers. If SUSE L1/L2 support is not able to fix the problem, they need to escalate internally to SUSE Cloud engineering, and finally, if it is not a problem within the SUSE Cloud product but maybe a problem in the SUSE Linux Enterprise operating system, the SUSE Cloud engineering team needs another escalation to the SUSE SLES engineering team. But that's how it would be in an ideal world. Let's come to a more real-world example, because we have not only the stakeholder talking to the dev and ops teams; we also have platform operations, and as you can see, the communication gets much more difficult, because platform operations also talks to the dev team, and to the ops team in case of problems and errors. And it's not just the dev team talking to the ops team and the ops team to the vendor: the dev team will speak directly to the vendor, to speed up and simplify communication, and then it's hard to get the big picture and an idea of who is working on which topic. But do you think that's already the full picture? Of course not, because we also have management involved. In case of a real escalation, management always wants to be involved in the communication flows. This means some extra communication: on the one hand it can hinder things, because these flows pile on top of each other and, as you can see, become really confusing; on the other hand, it is sometimes required to get the right priority for the issue and for the resources on the vendor side. So maybe you remember this picture we had before, and you are wondering why the beautiful simplicity of the support diagram feels familiar. Let me remind you where you may have seen this before. In case you haven't seen it: it's the OpenStack architecture diagram, the image that has to go on every OpenStack presentation everywhere — it's like a rule. And it's, yeah, beautifully simple. Okay. But what are the issues with the standard support model, even if everything is working as expected?
Yeah, so the first issue, as we've seen: information can take multiple hops to get from the source to where it needs to go, so you get this inherent latency in the communication. Similarly, that makes it really hard for people to get visibility into what's really going on if they have to ask somebody who needs to ask somebody else, and so on — all the different communication channels and the hops between them really contribute to this lack of visibility. This also has an impact on the ability to respond quickly when things happen, because it can take time for some critical message to reach the right person in the right team. So that's a kind of short-term issue in terms of agility, but there's a long-term issue as well: if the customer wants to feed back to the vendor and give real-world expertise and guidance on the direction the product should take, then the efficiency of communication matters in the long term too. Yeah, so let's look at a real-world example — or should I better say, a catastrophe. So it was not a beautiful Saturday morning. We had a well-prepared maintenance window; everybody was feeling like: let's start early, finish early, and enjoy the weekend. But of course, during the maintenance we ran into some massive network problems: the VMs stopped communicating with each other, and we had no clue why, because we just wanted to apply some OpenStack patches which didn't relate to any networking feature. In the end we were able to solve the problem, but to do so we needed to restart all Nova services and OVS services on all of our hypervisors. So, yeah, I will continue. Let's get into the details of this outage. As I already mentioned, it was Saturday morning.
We were well prepared for the maintenance. Everybody felt comfortable because we also had SUSE on hand: Adam was connected to the maintenance bridge, and we were like, yep, everything will go well. However, Adam had no console access to the environment — there was only a screen-sharing session available. So we started applying our packages and updates through the configuration management. Everything was fine, and then suddenly the VMs started to lose their connectivity. As I mentioned before, we had no clue why, because we hadn't touched anything on the network layer and had just applied some OpenStack patches. So what to do now? Of course we had to troubleshoot the problem, and we had to extend the maintenance window. Luckily, several hours later we were able to stabilize the cloud, and everything was running smoothly again. But due to the limited visibility into what operations did to fix the problem, nobody could really answer why this had happened, and we didn't have a root cause. And of course, as you can imagine, we had to extend the maintenance window by hours, and our customers weren't really happy about that situation. Yeah, so I was one of the lucky people involved in the whole saga from beginning to end. Like Boris said, I was in the maintenance window, and then on Monday morning the root-cause analysis started in earnest, on top of what we'd already gathered on Saturday. We started collecting debug data, and this was a really laborious task, because there were huge numbers of log files and we needed to get them over to the SUSE side so that we could analyze them in depth and run some of our analysis tools on them. At first we weren't sure exactly which logs we needed, so we collected a bunch.
We did some initial analysis; it was clear that the problem was related to Open vSwitch in some way, and we had an initial conclusion that we shared with SAP a couple of days later. And we kept analyzing — we were still trying to get to the bottom of things, had an internal call, did more analysis. Eventually, by day 12, we got to the point where we thought: okay, we've got a pretty good understanding of what's going on. It wasn't a smoking gun and we weren't a hundred percent confident, but it did look like we had found what seemed very likely to be the culprit. But of course we were not satisfied with that, so we kept digging, and eventually, when we'd brought in all the possible experts and spent as much time on it as we could, we finally got to the point where we were happy: okay, we really understand what's going on now. So it was quite a long period. In case you're curious what, briefly, the cause of all this was: a while before, someone in the SAP dev team had needed, for very valid reasons, to test a new feature — not in production, just in another environment. This new feature relied on Neutron's L2 population feature. You don't need to know what that is — it's not really important for our purposes — but it's an option in Neutron that you can enable or disable, and the code change to enable it accidentally got into the PTF package that we built for production, which included a bunch of other fixes that SAP needed for their production cloud. So it got accidentally enabled on some of the nodes that we updated during the maintenance window, but not all of them, and this was really the source of the problem. So why did it take us so long to figure it out?
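As an aside: the root cause just described — a boolean option ending up enabled on some nodes but not others — is exactly the kind of misconfiguration a simple cross-node consistency check can catch before it produces undefined behavior. A minimal, hypothetical sketch (this is not an actual OpenStack or SUSE tool; the function and node names are invented for illustration):

```python
# Hypothetical sketch -- NOT an actual OpenStack or SUSE tool; names
# are invented for illustration. It checks that a boolean driver
# option (here, Neutron's l2_population flag) is set identically on
# every node, instead of silently allowing the mixed enabled/disabled
# state that caused the outage described above.

def check_consistent(option, node_configs):
    """node_configs maps node name -> dict of option values.
    Returns [] when consistent, else human-readable findings."""
    values = {node: cfg.get(option) for node, cfg in node_configs.items()}
    if len(set(values.values())) <= 1:
        return []  # all nodes agree (or the option is absent everywhere)
    findings = [f"inconsistent setting for {option!r}:"]
    for node, value in sorted(values.items()):
        findings.append(f"  {node}: {option} = {value}")
    return findings

if __name__ == "__main__":
    nodes = {
        "compute-1": {"l2_population": True},   # updated in the window
        "compute-2": {"l2_population": False},  # not yet updated
        "compute-3": {"l2_population": True},
    }
    for line in check_consistent("l2_population", nodes):
        print(line)
```

A check like this in the deployment tooling might have flagged the half-enabled l2_population flag during the maintenance window rather than weeks later.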
Well, firstly, Neutron L2, especially with Open vSwitch, is just a really complicated beast, and there's no way around that. There are hundreds, thousands of log messages that you really have to be an expert to understand. And the fact that we had accidentally enabled this feature on some of the nodes but not others meant that we were effectively running an invalid config, and there was no validation in OpenStack to catch it. It was probably a situation where no one else in the OpenStack community had made that same mistake before, so the resulting behavior was not only undefined but undocumented: you couldn't Google it or recognize it as coming from this cause. The other mistake that, with hindsight, I guess we made was not involving SAP enough in the ongoing investigations, because they did have some great knowledge about exactly what happened during the maintenance window that could potentially have led us to figure things out a bit quicker. But that wasn't the main cause of the difficulty. And of course, while this was all going on, we all had other responsibilities competing for our time. So we learned that the organizational structure was not set up for succeeding in the way we wanted, and the communication during all of this was all over the place, like the diagram you saw earlier. We did waste time collecting logs: we collected some early on and then realized we didn't have the right ones; then we collected those, and realized they were still not the right ones, because the logs we actually needed were from an earlier time period and had already been rotated. So we had to collect again, and so on — clearly not the most efficient way of debugging. Remote access helps a lot with that, because you can just get the logs you need straight away. Similarly, if you have remote access, especially in a joint vendor-customer situation, you can use screen-sharing-type tools through the same terminal session, using screen or tmux or something like that. We also learned that the process we had for building these custom packages — for customers who need fixes earlier than they would normally get them through our maintenance update process — was not rigorous enough. They're manually built, it's hard to keep track of all the fixes you're building into that PTF, and we didn't have rigorous enough testing around those particular packages, even though the rest of the product and all our officially released maintenance updates do get very rigorous testing. This was a bit of an Achilles heel. And finally, of course, OpenStack is just a complex thing, as we all know, so there's not much we can do about that, in the short term at least. Yeah, so, as Adam mentioned, we had a lot of lessons learned. But what to do now? We decided to have a two-day joint meeting with SAP and SUSE folks to see how we could improve the process and the situation. So what was the outcome of this?
Of course, it's a new DevOps model approach together with SUSE. So we applied this new DevOps model, and as you can see, it looks quite different from the drawing we had before: we really tried to avoid all the cross-communication flows we had previously. As you can see, we have two main tracks, one still for development and the other for operations, but now we have this DevOps approach in the middle. Let's begin from the top. We still have the stakeholder and the platform operations guys, but they are now on the same level, and they need to sync with each other when they have new requirements. Second, we installed a new team — in our case called the infrastructure architecture team — as a new layer between the stakeholder and dev. The architecture team validates the requirements, and if they are not valid, gives them back to the stakeholder; it also finalizes and prioritizes the requirements coming from the customer. The architecture team then hands these tasks over to the dev team, and as you can see, the dev team now has direct access to some of the SUSE Cloud engineering resources, so they can even work together on some of the topics; for bigger topics, both sides are involved. And for this dedicated SUSE team, it is much easier to involve the right resources within SUSE if other resources from the vendor are required. Rolling out new features happens in this DevOps part.
So not Development is rolling out new features and or handing over to operation It's working again Features big so it's it's a joint approach approach here with deaf and ops people and even if it is bigger rollouts of Of some Features which which are Not it not it not that easy To roll out the dev ops team can can easily involve the suzer team for this rollout in the production environment and so also We have closed With this approach the gap between the deaf and the operation team So if the operation team has some problem during the running the production and the daily work, they can easily Connect to the development colleagues for further help on the right track, we have still this The daily operations flow where the platform operations if they see some incidents or have some incidents They can still open a ticket on with the operation guys and operation guys will will run Yeah, we'll run the production and operate the production environment okay, and on the development side like on the technical side we decided On a new approach that was to do effectively a friendly fork of suzer open stack cloud There was specific for SAP that had the short term fixes and enhancements the SAP needed Urgently so we could deliver them to SAP normally Faster than we normally deliver through our standard update process But we didn't want to maintain a long term Continually diverging fork. 
So we agreed that everything should be fed back into the mainline product, and of course that allows our engineering to scale in the normal way, rather than having yet another product to test and maintain and a whole extra code base. We use agile practices within SUSE as standard, so we wanted to keep that going in this context. That means things like daily stand-ups, obviously testing and CI in general, and a shared product backlog for this new product — new in the sense that it's a new code stream delivered to a single customer, effectively. The way we decided to host that was a new GitHub organization, with shared access from both sides, containing forks of the product repositories that we expect to deviate from our standard product. Each repository has an issue tracker, which is of course shared between both sides, and we also use the GitHub project feature, which supports a single Kanban-board-type view of all the issues from all of these repositories. And we developed a branching and tagging strategy for the git repositories — which, that's not rendering properly. Okay, that's a shame. At the end we'll show the URL for the slides again, so if you follow that and view the slides on your own laptop, hopefully you will get the proper view of this.
I don't know why it's not displaying, but there's a git branching diagram that accompanies these comments. Effectively, we have the mainline stream that we already have for SUSE OpenStack Cloud, and then we branch off that for each cloud that SAP is deploying, whether that's production or test. Each branch effectively represents the queue of changes that we want to deliver to that cloud, and every time we deploy to that cloud, we make a tag on that branch, so that we can go back later and see exactly which code has gone into that cloud. It's typically one tag per maintenance window, but it could be multiple, if we forget something at the beginning of the maintenance window and then make a correction later on. And we merge the changes back into the product, like I said before. We get improved communication from this, because we're using GitHub as the focal point for the communication. Previously we were tracking internally at SUSE: we have a Bugzilla instance where we were tracking the technical side of these issues, but it had limited visibility for SAP — by default, bugs are private, and since we wanted to share them with SAP for visibility on their side, we had to CC all of SAP onto all those bugs. So they have access to the historical information, but for everything going forward we're using GitHub. We have a shared Slack channel that we're on all the time throughout the day, so we can quickly message anyone in either company, and we have the daily stand-ups, like I mentioned before. Yes, and as Adam already mentioned, it was really key to enable remote access for the vendor. So we granted remote access to our development as well as our production systems. Of course, we needed to take care about the right access level for the SUSE engineers — they should only have read access to some resources, not full access — but this improved working together a lot, as the SUSE engineers can now examine the production cloud directly, and we no longer waste time collecting logs, or let's say, the right logs. More in-depth examination is now possible for the engineers, and finally, as I already mentioned, we can work together more efficiently because we can use tools together, like screen sharing, or sharing the console in a screen session. Yeah, from my personal perspective, that was a huge boost in making my life easier when helping out SAP — having direct access. On the CI front, we decided we wanted, essentially, to make sure we could make code changes faster and get them deployed to the SAP clouds quicker, but without increasing any risk. So obviously we wanted to test everything before we deployed it — but not just as a fresh deployment: we test it as an upgrade path from a cloud like the one SAP has already deployed. We're really focusing on the impact of rolling a change into an existing cloud that's already in a certain state. The way we did that: we could reuse a lot of the existing internal CI that we have at SUSE for our products — we have a bunch of build and test tools and methodology — and fortunately a lot of SAP's existing infrastructure was very accommodating for that; there's a lot of commonality between the two environments. Of course, we wanted to focus on SAP's hardware and software stack and their particular OpenStack configuration: SUSE OpenStack Cloud is a very flexible product in terms of how you deploy your OpenStack cloud, but SAP decided to go a particular route with a particular set of configurations.
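The branching-and-tagging scheme described a moment ago (one long-lived branch per SAP cloud, one tag per deployment) can be sketched as a naming convention like the one below. This is illustrative only — the `cloud/` and `deploy/` formats here are invented, not SUSE's actual convention:

```python
# Illustrative sketch of the per-cloud branching/tagging scheme from
# the talk. The branch and tag name formats are hypothetical; only the
# structure (branch per cloud, tag per deploy) comes from the talk.

from datetime import date

def branch_for_cloud(cloud):
    """Each SAP cloud (production, test, ...) gets its own branch,
    representing the queue of changes to deliver to that cloud."""
    return f"cloud/{cloud}"

def tag_for_deploy(cloud, deploy_date, attempt=1):
    """Typically one tag per maintenance window; attempt > 1 covers a
    correction deployed later in the same window."""
    suffix = "" if attempt == 1 else f".{attempt}"
    return f"deploy/{cloud}/{deploy_date.isoformat()}{suffix}"

if __name__ == "__main__":
    print(branch_for_cloud("production"))                       # cloud/production
    print(tag_for_deploy("production", date(2018, 5, 1)))       # deploy/production/2018-05-01
    print(tag_for_deploy("production", date(2018, 5, 1), 2))    # deploy/production/2018-05-01.2
```

With names generated this way, the log between two consecutive deploy tags on a cloud's branch shows exactly which changes went into that cloud during a given maintenance window — which is the traceability goal the speakers describe.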
So obviously that's the configuration this CI needs to focus on. We have daily builds in various forms, and we also have gating for pull requests: every time a pull request is submitted, the CI runs on it and we can see how it performs. The components we use for that: obviously GitHub again; then we have this thing called the Open Build Service, which is a free open source project that anyone can use to build packages for multiple distributions — not just SUSE distributions; many different distributions are supported by it. It's a great place to host your package-building processes, and it's certainly one of the core components that we use at SUSE to build SUSE OpenStack Cloud. It was the ideal component to use in this environment as well, for building custom SAP packages for testing in the CI. And the whole thing is orchestrated using Jenkins, which I'm sure you're all familiar with. So, how have we done? It's important to point out that this is still a relatively young approach to this change in organizational and collaboration model — it was only around the February timeframe that we started — so it'll be interesting to see how it develops over the next six months and beyond. But we've had some positive results already, and here are a couple of examples. There was another recent outage, where some packet loss was reported on network nodes one morning, and SUSE got visibility of that very quickly, because now we have the Slack channel, so SAP people can just message SUSE people straight away. Obviously we took this very seriously and reacted immediately, and a short while later we also discussed it in our daily stand-up. As luck would have it, we had a guy on the stand-up who, before the organizational changes, would not have had any direct involvement with SAP — he's a kind of specialist automation engineer — but due to these organizational changes he attends the stand-ups. He heard the description of the problem from SAP and immediately recognized that the issue was something he had been looking at for the last six months: an ongoing Open vSwitch issue, and he knew a workaround for it straight away. So we went from pretty much a critical situation that we were all very worried about to suddenly feeling — maybe I shouldn't say relaxed, but certainly a lot more comfortable, because we already had a good understanding of the issue and how to mitigate it. The second example: on a Tuesday evening we had an unexpected restart of the message queue, and when we started to look at this in earnest, the four engineers, including myself, were all able to directly access the production cloud that was affected. We took a closer look and produced some statistics from the metrics that get gathered. You probably can't see any of it, but there's one graph in particular, in the first column — the one in the middle, basically — showing a very sharp increase in a red line, which represents the load average of one particular node; each of these five nodes is a control node. We saw that the load average on one node was going sky high, to about 140, 150, at which point the node just completely fell over. That was obviously a very bad thing to happen, so we looked more closely at the statistics, and we found another spike, which again is in the first column.
I'm not sure if you can see it, but it's the fourth from the bottom, and it's a spike in UDPv6 traffic. As soon as we saw that, light bulbs started going off in people's heads, and we realized it was correlated to NFS activity: there was an issue with one of the NFS servers, which explained it. What we were pleased with in that incident is that, with the combined firepower of four engineers all logged on at the same time, we very quickly honed in on the right thing, at a speed we had maybe not managed before. So that was definitely a nice benefit. Beyond outages, there are other benefits too. For example, I've been working on a new feature recently, around API rate limiting, and during the research phase of figuring out the best approach — looking at different available technologies and different configuration options — it was really helpful to be able to message my SAP colleagues, who are the stakeholders for this feature, and say: hey, do you need this particular approach, or do you need this one? To have that kind of agile approach to the design, rather than having SAP write up some lengthy specification of the feature in advance, pass it over to SUSE, and have us analyze it — it just felt like a much more efficient way of collaborating. Similarly with short-term support requests, which traditionally go through a system that I, as an engineer, don't have direct access to, and with long-term support requests for new features, where we have a system that is a more heavyweight process. So yeah, it all just felt more efficient. So, a short outlook on what we are working on in the future, together with SUSE and SAP... Oh, sorry, this was your part still. Oh, okay. Yeah.
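The API rate limiting mentioned above is classically implemented with something like a token bucket. As a sketch of the general technique only — the talk doesn't say which mechanism was actually evaluated or chosen for SUSE OpenStack Cloud:

```python
# Illustrative token-bucket rate limiter. This sketches the general
# technique behind API rate limiting; it is NOT the mechanism the
# speakers actually implemented.

class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate`
    tokens per second. The caller passes the current time explicitly,
    which keeps the logic deterministic and easy to test."""

    def __init__(self, capacity, rate):
        self.capacity = capacity
        self.rate = rate
        self.tokens = float(capacity)
        self.last = 0.0

    def allow(self, now):
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # caller would respond with HTTP 429 or similar

if __name__ == "__main__":
    bucket = TokenBucket(capacity=2, rate=1.0)  # burst of 2, 1 req/s refill
    print([bucket.allow(0.0) for _ in range(3)])  # -> [True, True, False]
    print(bucket.allow(1.0))  # one second later, one token refilled -> True
```

One bucket per tenant (or per API endpoint) is the usual way to stop a single heavy consumer from starving everyone else — which is exactly the "one customer overloading certain API services" concern the speakers describe.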
Well, so yeah, I mentioned that I was working on the rate limiting, and the reason for that is that SAP's customers tend to use the OpenStack APIs very heavily from a cloud-consumer perspective. So we obviously need to mitigate the potential for one customer to have an unwanted impact by overloading certain API services. We're also working on other things, like improving Neutron L3 HA and other bottlenecks in the control plane, and again, it's handy to be able to discuss those directly with the stakeholders on a daily basis. So now, what's planned for the future? Of course, we want to improve the monitoring; at the moment we plan to have an ELK stack available within the SUSE product — so, as a customer, we can now really influence the SUSE guys and point them in the direction we want to see the product go. We will also work together on centralized management, because we have the idea of deploying more OpenStack clouds, in multiple DCs, with multiple availability zones, in a multi-region deployment. For our operations, to have visibility over all our cloud installations in one place, we want to have centralized management. And finally, of course, we want to improve the availability of the cloud overall. So, to wrap it up: the traditional approach to vendors and customers working together works in many situations, but it's not the only way to do things, and it can sometimes be healthy to consider other approaches. The obvious rebuttal to this approach is: well, that's great, but it's expensive for SUSE to dedicate engineers to a customer, and it's expensive for SAP to dedicate teams and set all this up — it does take extra effort. And that is true, but it's not an all-or-nothing thing.
You can steal some of the approaches we've taken — I think they apply in any situation, no matter how big or small the relationship. For example, as we're all working in open source, and OpenStack as a community believes in the Four Opens, having an open issue tracker makes a lot more sense for collaboration than a closed one. With things like CI, any customer can potentially build CI that integrates with the vendor's CI and adds extra voting or non-voting input on changes to the code. And of course anyone can fork a code stream and submit changes back again, because we're working in an open source environment. And last but not least, we think it's a win-win situation for everybody — for SAP, for SUSE, for OpenStack — because SAP, as I mentioned, can influence the roadmap of the product; SUSE gets the ability to work directly with the customer to improve the product around the customer's needs, and also gets access to a production cloud, not only to small labs; and, as we mentioned, all our fixes go back upstream to the OpenStack community. So I think we're at the end, and unfortunately there's no time for questions, I guess. But it's worth mentioning that we are hiring, so if you want to be part of a great SAP or SUSE team, just talk to us after the session.