Welcome to Keeping CF Landscapes Blooming. Yeah, we're from SAP, as you can see there. I'm Andreas Haix. I've been with SAP for 17 years, working on Cloud Foundry for two and a half years.

A warm welcome from my side as well. I'm Martin Schroeder. Besides that, I have the same attributes as Andreas: we have been working on the same project for the same time. I think this is the last talk between you and lunch, so I would say let's go.

Okay, who do we represent today? Mostly, of course, our team. We are spread across four locations, and we also represent this number of virtual machines which are live in our landscapes. Those are numbers from September, so I guess they have gone up again since then.

What do we want to talk about? I heard the term landscape a couple of times today, so maybe it's not that exotic, but not everybody calls a CF installation a landscape, and we want to show a little bit why we talk about landscapes and landscape gardeners. The second part is the working mode: how do we actually run the platform? How do we bring in the new features and versions from Cloud Foundry development? And the third section is probably the most tangible: it's the monitoring and the tool selection we have to really run the stuff.

Okay, landscape gardening and Cloud Foundry. When you think landscape gardening, you probably think greenfield project: you can do what you want. But the moment you really stand there on that area, you realize there are a lot of boundary conditions, like the terroir, the climate, and of course the customer who wants a new landscape or a new garden from you. They have a certain budget, they have certain requirements, they want flowers, they want to grow crops and so on. You somehow have to get all of this together, and you suddenly realize that the greenfield project is not that greenfield after all.
Once you have set up the stuff according to all the requirements and all the needs, you realize that the real work starts from then: to keep it all blooming and keep it running. And of course you know that there are external forces that are not easy to control, so you try to build in some resilience. You think about: okay, how do I water the garden? How can I provide shade for plants? How can I protect against snails and bugs creeping in? So you try to provide the resilience, to plan the resilience, to test it and so on, but in the end, sometimes disaster will strike. Things will get broken, things won't work, and at these points you as a team or as a gardener are also challenged, right? Fixing your garden in a thunderstorm is much worse than fixing it on a lovely Sunday afternoon. So what is important for us in the analogy is that you have to build resilience into your team, your processes and so on. You cannot build a garden on just edge workload; that just doesn't work for yourself.

Looking on the positive side, if you think garden, if you think park, that's hopefully a good thought for everybody, a positive picture, and that is also what the landscapes you provide for your customers, stakeholders and developers should be. People should like to work with the platform; it should be a good place to go. That is the challenge we put to ourselves, that's what we want to provide. And of course we also want to have commercial results, right? A garden, or let's say a farm, of course needs to yield commercial results. So these are the analogies that we see fitting for us quite nicely.

Okay, the working mode we have for our team, and how we started. In January 2015 we started with Cloud Foundry, so that was new to all of us in our team.
We started in two locations, Walldorf and Levallois, close to Paris, and we got a kickstart from the developers who already worked on that within SAP. That worked quite nicely, and already in April 2015 we had the first commercial landscape running, including SLAs, a maintenance calendar, announcement channels, and lots of really, really good discussions with the customer. It's been running ever since.

We heard the term multi-cloud a couple of times already today. The first landscape was on OpenStack within one SAP data center. In October 2015 we had the next landscapes running, and that was on AWS. Now, in 2017, we look at several OpenStack flavors, and we have AWS, Azure and GCP all there.

If you look at the other tags we have there, thinking about the greenfield project which is not so greenfield after all, there are a couple of other qualities we need to put into the landscape. Compliance is a very important topic for the range of business we are in; that needs to be there, and it had to be there right from the start. We decided in the organization on bi-weekly updates, at the latest bi-weekly, with hotfixes, security patches and so on of course coming in even faster. These are things that had to work across all platforms and all landscapes that we run. And looking at the speed we have there, that also sometimes means that we as an operations team have to bite the bullet and do manual processes to keep things coming in quickly. I'll come to that again in a minute.
When we talk about landscapes, Cloud Foundry landscapes, what's the bill of material, what's in there? I think the pictures speak for themselves. You know most of the icons there; the SAP icon is a placeholder for a lot of SAP-specific applications. Basically, we deploy this bill of material to many landscapes, to many installations across the globe, with small variations: sometimes you have more components in there, sometimes fewer, and of course the configuration and the sizing are different per landscape. What we have in addition, and that's more the sole decision of our team, is what we use to run the central tooling, what we do to really monitor and run the landscapes for ourselves. That's going to be in the second part of the presentation.

So how do we organize our working day, or working mode? With the three time zones we are a really distributed team of 25 people, and our most important task is to keep the systems running, so that the live site is up, that the requests from customers are fulfilled, that the updates get in there. All of this needs to be done all the time, and that is the top priority. Even if it's sometimes cumbersome and you have other interesting things, that is the top priority. And that of course scales with the number of landscapes, the number of customers and the sizing you have. So if you don't want to increase the number of operators all the time, you somehow have to balance the cost that you have per landscape.

When we look at how much time and energy we have as a team, we somehow have to balance that. On the one hand you have to keep the systems running, and we counter that with improvements in our process. So we look at the stuff. I mean, we are one team, so we know where it hurts. Basically we see: okay, what are the issues that we need to tackle, where do we need better automation, where do we need better reporting, what are the
alerts that keep us busy but are either false alerts or are flapping or whatever? That's where we try to invest in our team and say: okay, these are things we fix. These improvements are mostly for our own team, so people outside don't see them. They probably see that we are more relaxed or less relaxed, but that's mostly within our own team.

The third thing we can invest in is what we call innovation. That's where we say: okay, what do we change in our processes and tools that has an immediate impact on the people who use our landscapes? For example, for developers, how can we make their life easier with the live landscapes? For the people who do QA or demos or whatever on the landscapes, what do we need to do so that they feel comfortable, so that instead of filing a ticket they can just use a self-service, and instead of asking us for the status they can just check it themselves in the system?

So whenever we have backlog items on our Kanban board, they fall into one of the three categories that we have here, and as product owners we try to balance that. Our goal would be that 40% of our energy goes into keeping the systems running. Honestly, right now it's probably around 60%, and with the fast adoption of the platform and the increased usage, we are just busy keeping these numbers somehow stable. Challenging, but good.

Okay, we are not alone in this world, which is good. We have a lot of other teams that we work together with at SAP. When we look at it, we as a team are certainly the generalists: we know the whole landscape, we know a lot of the processes around it, and so on. Staying with this landscape analogy, when we see that a crop is not looking well, we can water it, we can put in fertilizer and so on, but we might also see that this doesn't help, and then we directly call the specialists: the development teams, who know their components very well but probably not the other components. So that's us who
have to do the first aid on the landscape and then get in the specialists to really do the deep analysis, and then either provide a fix or develop a workaround with us to keep things stable. And there again we see the different focus of us and the development teams. We have the right-now focus: we must keep the landscape running right now. There's no denying that, we have to do it, and if that is painful, if that's manual work, okay, it happens. But it buys the development teams the time to do the real fixing, and that's where we see the separation of concerns, or separation of duties, and that works quite well.

It also means that we have to be open for the new improvements that come. Sometimes there are probably wild ideas, and it needs some time for us to adapt to that, but that is the idea, right? Some of the changes are bigger, are different, really challenge the way we work, and we have to get that. I would need a smager for the fly. We have to adopt these changes, and we have to think together with the development teams about how we best get there. So while we still stabilize the landscape manually, the developers have the time to do something else, roll that out, and try with us to bring that innovation into the landscape. While we share the common goal of having a stable, attractive landscape for the customers, we have this different focus: right now versus, let's say, long term or improvement of features.

So how do you work with that in everyday life? One thing is of course close collaboration, so that we really have the direct means to get the development teams in when problems come up. It also means working together on the backlog, so that we can provide this feedback from outside the lab to the development teams, because we see a lot of what's happening on the live landscapes. That is one thing. The other thing is of course also that we ask the development teams to give us hands-on sessions
before that, that we roll out new features together, that we do critical updates together and so on. I guess that is needed to really build the trust, so that we can just say: yeah, every two weeks we have something new, we put it in, don't worry about it. Thursday morning the stuff will go into the systems, we will do that for you. That is the one thing that the development teams can rely on, and we can rely on them being there to support that.

One more thing: a lot of times you sit together during outages. That's when we provide the cookies and the chocolate. But that should not be the only time that we sit together. There's this notion of celebrating small successes, and that probably doesn't happen often enough; I guess there's always something that falls short there. But it does happen from time to time, and there are a lot of good pieces where we really see the improvements coming in small iterations, step by step.

Okay, that would wrap up my part here on the collaboration and the working mode, and then I hand over to Martin for the right side of the landscapes.

Yeah, with the second part of our presentation I would like to show you how we actually monitor the systems and which tools we use for that. But before that, let's first go back to our analogy of the garden. If you look at the tomato plant on the left-hand side, would you say that one is healthy? It rather looks like yes. And the right one, is it healthy?
Not really, right? So for the tomato plant it's rather easy to see whether it's healthy or not. Unfortunately, for a Cloud Foundry landscape consisting of, let's say, hundreds of applications and components, it's a bit more difficult than just looking at the landscape. Of course we have the BOSH Health Manager. You could ask the Health Manager what the status of the VMs is, and BOSH will tell you the truth, I guess. But unfortunately the BOSH Health Manager and the VM status don't tell you anything about how the customer actually perceives the landscape. Even if all the VMs are up and running and all the jobs are running, it still could be that the scenario for the customer is not working.

So what we did is we focused at the very beginning on black box monitoring from outside. On the left-hand side you see all our landscapes mentioned by Andreas, so we have various landscapes. And we have our Jenkins; hence the name Landscape Gardening, because the host name of that thing is Landscape Gardener. What we do is define a couple of scenarios which simulate end user activities on the platform. We don't look at technical things like the VM status, or whether the app instance is running, or whether two out of three instances or all of them are running. That's interesting, but from the end customer's point of view not really interesting. So we simulate end user scenarios.

One scenario we have is what we call the developer scenario, where we simulate typical activities of a developer. A developer is pushing an application, creating a backing service, binding that one to an application, viewing the logs of the application, starting the application, stopping the application, doing other stuff. That's what a typical developer is doing with Cloud Foundry. So we defined that scenario, and we have a Jenkins which fires that scenario on a regular basis against all the landscapes. With that we are able to see whether at any point in time a
developer is able to work with our Cloud Foundry landscapes, be it on AWS, Google, our own data center, wherever. That's what we do with our monitoring from the end user's point of view, and we do it as a black box. We don't care about the details, whether it's 200 VMs or 300, whether 10 of them are running or not; we try to look from the customer's point of view.

We did that at the very beginning. With growing scale and growing interest in our landscapes, we had the issue that we had only the Jenkins, so the job status, or the availability of the landscape, was visible only in Jenkins itself. Who of you knows Jenkins or has worked with Jenkins? Yeah, quite a lot of you. Doing user management on job level in Jenkins is rather technical. You can do that to some extent, but at some point in time you're just doomed. So what we decided to do was to decouple the actual status visualization from the execution of the monitoring. That was the point in time where we introduced an InfluxDB and a Grafana instance, to tell Jenkins: yes, you do the execution, but the status of your job execution is something you report to InfluxDB, and we have a Grafana on top of it where everybody can look at the status.

And that's another thing which we said is very important for us: the availability of the landscape should not be a secret to anybody in the company. It should be public; everybody should be able to see it. It should be us as an operations team, it should be the developers, it should be the management. Whoever in the company should be able to see the status. That's something you don't want to hide. But of course you want to hide the details in the Jenkins, so that's why we decoupled it.

How does that look in detail? This is the screenshot of our Grafana dashboard. I think I can't go far away from the mic. Every tile is one of our landscapes, so that's the overview showing the status of all of our landscapes. The
good news here is that we have a lot of green tiles, which means many of our landscapes are just running fine. But you see, for example, a yellow one on the top left. The names are a bit anonymized for legal reasons, but that landscape seems to have an issue, because it's yellow, not green. I'm not sure whether you can read the text on it, but yellow for us means that either the landscape is not a real live productive landscape, so it could be an internal staging landscape, or it is a monitoring scenario which we don't rate as prio one. In that case a tile here would turn yellow.

Luckily we don't have a red tile here, but if a landscape turns red, that means it's a real live productive landscape and a prio one scenario is failing. For example, the developer scenario is prio one: if you can't work with Cloud Foundry as a developer, we have a serious issue, so a tile here would turn red.

So that dashboard gives us the overview of all of our landscapes. That's nice to get an overview, but if you see there's a red or a yellow landscape, you would like to see details. So you just click on the tile and you come to the next slide, no, dashboard, where you see all the details. Now each tile is one monitoring scenario, and the whole dashboard is the landscape, so that's one level down. These are all the different monitoring scenarios we have, covering certain scenarios and use cases for our platform. And then you would see the detail: okay, the landscape was red or yellow, and why was it red?
Because the developer scenario wasn't working, or pinging the Cloud Foundry API wasn't working, or, I don't know, the authentication wasn't working. So you would see the details of what's actually not working in the landscape.

For some teams or team members, the next level down would be: if you click on the yellow tile, you would end up in our Jenkins again. This would be a step that we in our operations team could do. Some other people would have permissions as well, but many other people would be blocked here, because they shouldn't access Jenkins and see the details there. People who have the permissions could even go to the Jenkins execution. You would see the overall log, and you could retrigger the monitoring: you say, okay, I fixed the issue, or maybe it was just a temporary hiccup, so I retrigger the monitoring execution and check whether it's green now or whether I still have an issue. That's the division we made: we separate the availability and the transparency of the landscape from the actual monitoring execution, so that ideally everybody is able to see at any point in time whether our landscapes are healthy or not.

Another example. Unfortunately the screenshot isn't that nice, so imagine you would see tiles here. It's rather similar to the previous one, again one tile for each landscape, but this dashboard is showing you the deployment status. We have various landscapes on various infrastructure layers and we have one product version, but keeping all those landscapes in sync is not that easy. The most often asked question in our company would be: okay, there's a product version X, Y, Z, and do I have that product version on all of our landscapes? We use Concourse for automatic deployment, but still it might take some time until all the updates are on all the landscapes. So people want to see which version of the product is running on which landscape. Or there was a critical security hotfix which we need to deploy, so people want to know:
has that hotfix already reached all the landscapes, or do I need to cover some landscape, or do I need to trigger the deployment? That's important information, and again it's the same paradigm: it's open to everybody. Operations people should see it, development people should see it, delivery management, product management, whoever wants to see the information should be able to easily get that status.

We have combined that with some more information. For example, on the bottom left corner there's some red text, which tells you that the status of some deployments is failing. As I said, we are using Concourse for deployment, but you don't want to use the Concourse UI for all of your landscapes; you don't want to browse through different Concourse installations and UIs. This is the overview. It would tell you: okay, some of the deployments in that landscape are failing, so the landscape would turn red.

Some text on that slide is yellow. Yellow in that case means either the landscape is not on the latest product version, so there's no error but it's, let's say, outdated; or there is a change in the configuration of the landscape but that change hasn't been applied to the landscape yet; or the deployment is still running. So you could see: I did the change already, whatever that is, but it's not yet applied to the landscape.

And again the details: if you click on one of the tiles, then you actually see the landscape material in a tabular manner. In that case there was one service in our platform, and the deployment of that service is failing. That's the detail here.

Those are basically the two scenarios I wanted to show you today on how we monitor the landscapes. One is the availability monitoring, sorry for that, and the other one is the deployment status of all our landscapes. At least currently we see a growing number of landscapes, and keeping them in sync, easily seeing the overview, and keeping
track of them is one of the major challenges for us.

So what's next, in the next planting season, coming soon? Currently we rather focus on black box monitoring, which is nice because it simulates end user activities. The drawback is that if we see a landscape is red, then we see it because we already have an issue. So it's rather reactive: we only see it if the issue is already there. What we would like to do is extend that with white box monitoring, so we try to get more insights from the landscape. We have an ELK stack in the landscape, we have a lot of logs of the different components, we have the BOSH Health Manager, we have the Cloud Foundry Health Manager, which we could use as a kind of early warning system. If they report some issue, the scenario for the end customer could still be working, but we get some information upfront so that we can react before some customer scenarios are down. That's one of the next things we want to do: extending black box with white box to get insights earlier and be more proactive, not just reactive when it's already down.

And the other wish and hope: we would really like to share our experiences with you. We have seen that there is a CF operators community, but it's not really active. We would like to share our experiences with you and would hope for ideas and experiences from your side. Who among you is operating Cloud Foundry landscapes? Yeah, many people. So I guess there are many experiences outside SAP on how to operate landscapes, and I guess in most cases it's more than just one Cloud Foundry landscape. We would be happy to get your feedback and ideas on how you operate.

As a small motivation for that, if you remember the planting pot from the beginning of our slides, and the seeds: as some motivation to provide us with ideas and feedback we have brought, of course with some advertisement for SAP as well, a coffee mug and inside some seeds for various plants, so to
keep our energy running. So if you're interested, if you have some ideas, feel free to grab one. Unfortunately I think we won't have enough for all of you, so I would say first come, first served. Feel free to grab one. I think that's it for our talk, and the rest of the time we could use for Q&A. Any questions?

So we have an ELK stack which is currently already collecting the logs of all the other components. I'd say the rough idea would be that we just have to find the pattern in the logs which would indicate a potential error, and then build an alert on that one. It's still pretty rough, and I think if you have thousands and millions of log lines, the interesting and difficult part is: what is the pattern I have to search for? But that would be one of the tasks: to, let's say, hop onto the ELK stack and then put some alerting mechanism on it based on logs, so that certain patterns or messages in the logs would trigger an alert, and a tile here for a landscape would turn red or yellow depending on whether you see a certain pattern in the logs.

We also have, I mean, this is the viewing part. We also have alerts already coming in, based for example on Riemann, but they are not integrated in this view. And this shared view on the landscape status is something that we really find so valuable across the company. So that's the question: how can you get more of these white box things in there? It again depends: is it just a yellow, or is it already a red? When, I don't know, BOSH reports it's in meltdown state, that's probably something you want to see red on that monitor, but that's something that we just can't integrate right now. So we would like to build that, to have a more complete picture of the landscape: checking which of the Riemann stuff could really help color the tiles, and which of these log-based patterns would indicate the status that yes, you have to do something to fix the landscape.

Yes, we do. We quickly
browse back to that slide here. I didn't mention that one while talking. We have a tile per landscape, but on the very bottom we have the infrastructure layer, where you see our OpenStack, where you see AWS, where you see Azure. The slide is not really up to date, so there's GCP as well. We have a couple of monitors, like resource limits and service limits in AWS, quota limits in OpenStack, whatever. So we have a couple of infrastructure monitors which we integrate here as well. For example, we have another one that just downloads some binary from the internet, to have some check of the bandwidth and availability from our own data center to the internet. So we are enhancing the end user scenarios with some more technical checks, to get a closer insight into whether the infrastructure is causing Cloud Foundry not to work or whether it is something inside Cloud Foundry.

Until now it's a manual one. We would have the alert, but it's still a manual scaling. That is, at least to some extent, by intention: if we always autoscaled when we run out of resources, that might be some app going crazy, or some developer setting up a CI build and, I don't know, deploying apps like hell because he's using a GUID and is always deploying a new app. So until now we rather keep the safety that we first check why we have to scale, and then we scale. On the other hand, it would be nice to have some autoscaler, but we haven't found the right balance yet.

Yes, we have. There's, let's say, a distinction. Part of it is of course the platform, which is what we are mostly concerned about, and then you have all the orgs and spaces for customers, and that is managed by the commercial infrastructure at SAP. Customers can just buy and increase what they want, and then we have to hurry up and buy the stuff at Amazon or Azure and so on. And then there's the scaling of the platform itself, and there of course it's whatever comes with the deployment. When the deployment says I now need, I don't know, three
large Postgres instances, okay, we provision that. So there's no quota on the platform itself.

Okay, more questions? More hunger? Okay, then thanks for joining, thanks for taking the time. All right.
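
Editor's illustration: the tile coloring rule described for the availability dashboard can be sketched in a few lines of Python. This is a minimal sketch under assumptions, not the team's actual implementation. The names `ScenarioResult` and `tile_color` and the example scenario names are hypothetical; only the rule itself is taken from the talk: green when nothing fails, red when a prio one scenario fails on a live productive landscape, yellow for any other failure.

```python
from dataclasses import dataclass
from enum import Enum


class Color(Enum):
    GREEN = "green"
    YELLOW = "yellow"
    RED = "red"


@dataclass(frozen=True)
class ScenarioResult:
    """Outcome of one black box monitoring scenario (hypothetical structure)."""
    name: str        # e.g. "developer" or "api-ping"; names are made up for illustration
    prio_one: bool   # True if a failure of this scenario counts as prio one
    passed: bool     # True if the last execution succeeded


def tile_color(productive: bool, results: list[ScenarioResult]) -> Color:
    """Derive the landscape tile color from the scenario results.

    Rule as described in the talk:
    - green:  no scenario is failing
    - red:    live productive landscape AND a prio one scenario is failing
    - yellow: any other failure (non-productive landscape or non-prio-one scenario)
    """
    failing = [r for r in results if not r.passed]
    if not failing:
        return Color.GREEN
    if productive and any(r.prio_one for r in failing):
        return Color.RED
    return Color.YELLOW


if __name__ == "__main__":
    results = [
        ScenarioResult("developer", prio_one=True, passed=False),
        ScenarioResult("api-ping", prio_one=False, passed=True),
    ]
    # The same failure is rated differently on a productive and a staging landscape.
    print(tile_color(productive=True, results=results).value)
    print(tile_color(productive=False, results=results).value)
```

Keeping the rule in one small pure function like this would fit the decoupling described in the talk: the job that executes the scenarios only reports raw results, and the dashboard side decides how to color them.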