Hello and welcome to "You build it, you run it", where we are going to talk about the operational model of Hybris as a Service, or YaaS. My name is Rene Velches, and in the second part Johannes will talk about how we use monitoring and logging to support our operational model. We both work on Hybris as a Service, on the Cloud Foundry side and on the monitoring and logging side.

First a few words about Hybris: we have been an SAP company since 2013, and our main product is the Hybris Commerce Suite, so we are really focused on e-commerce and commerce products. I have to rush a little bit through the slides because we only have about 30 minutes.

I want to give you a brief overview of what YaaS is, or rather what it is not, because there is always a little bit of confusion about what Hybris as a Service is and how it relates to the Commerce Suite. Hybris as a Service, or YaaS, is not the Hybris Commerce Suite. It is not the future of the Commerce Suite we have today; it is something new, and the Commerce Suite does not run on Cloud Foundry. We often get questions like: "We have customers running the Hybris Commerce Suite, does it run on Cloud Foundry? Can we run it on Pivotal Cloud Foundry?" No, we can't. Hybris as a Service is a genuinely new product.

So what is Hybris as a Service? YaaS is a cloud platform that allows everyone to easily develop, extend and sell services and applications. It is not restricted to e-commerce, so it is really an open platform. And of course it is built on microservices. As you can see here, we have core microservices, a storefront, commerce services and mashups, and it is easy to extend.

Let me show you a little bit of the YaaS architecture. At the bottom we have an infrastructure layer from SAP, and we run on Cloud Foundry, which is operated by the HANA Cloud Platform team. On top of that we build core services, which are domain agnostic, like a document service for persistence or an account service. On top of these, teams build their own domain-specific services like product or cart. Then we have a mashup layer, which I will briefly explain on the next slide, and finally the connected application clients. What is really interesting for anybody who wants to use the YaaS platform is that you can build your own domain-agnostic or domain-specific services, like loyalty. If you come by the SAP booth later, we can probably show you a loyalty demo that another SAP team developed on top of Cloud Foundry.

Now to the mashups: why do we use them? Without mashups, every client has to implement the same error handling and orchestration logic, which ends up in a lot of calls to the backend services and creates a lot of network latency. That is why we introduced the concept of mashups: a mashup bundles the calls to the backend on behalf of a client. The mashup is also there to create resilience: if one of the backend services, let's say the review service, is down, the mashup should still send a result back to the client, and the client can then handle that output.
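To make the mashup idea a little more concrete, here is a minimal sketch of what such an aggregating endpoint could look like, written in Go. It is not the actual YaaS mashup layer: the service URLs, the path and the JSON shape are invented for illustration. The point is only that one mashup request fans out to several backend services and degrades gracefully when one of them, like the review service, is down.

```go
package main

import (
	"encoding/json"
	"io"
	"net/http"
	"time"
)

// Hypothetical backend endpoints; the real YaaS service URLs differ.
const (
	productURL = "https://product-service.example.com/products/"
	reviewURL  = "https://review-service.example.com/reviews/"
)

var client = &http.Client{Timeout: 2 * time.Second}

// fetch returns the raw JSON body of a backend call, or nil if the call fails.
func fetch(url string) json.RawMessage {
	resp, err := client.Get(url)
	if err != nil {
		return nil
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil || resp.StatusCode != http.StatusOK {
		return nil
	}
	return body
}

// productDetails bundles two backend calls into one client-facing response.
// If the review service is down, the mashup still answers with the product
// data and an empty review list instead of failing the whole request.
func productDetails(w http.ResponseWriter, r *http.Request) {
	id := r.URL.Query().Get("id")

	product := fetch(productURL + id)
	if product == nil {
		http.Error(w, "product service unavailable", http.StatusBadGateway)
		return
	}
	reviews := fetch(reviewURL + id)
	if reviews == nil {
		reviews = json.RawMessage("[]") // degrade gracefully
	}

	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(map[string]json.RawMessage{
		"product": product,
		"reviews": reviews,
	})
}

func main() {
	http.HandleFunc("/mashup/product-details", productDetails)
	http.ListenAndServe(":8080", nil)
}
```

A client then makes a single call to the mashup instead of several calls to individual services, which is exactly the latency and error-handling argument above.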
So much for the architectural background; now I want to go a little deeper into what we actually came here to talk about: our "you build it, you run it" paradigm. And I want to start a little bit in the past. As the headline says, experience is what you get when you didn't get what you wanted. We started the first iteration of our software-as-a-service approach about four years ago, and that first approach had a couple of flaws; you can see it as a first iteration. We had business service teams and core service teams, as we have now, but we also had a dedicated DevOps team which was responsible for packaging the code, creating the Puppet scripts and MCollective setup, and deploying all of that to a dev environment. As soon as the DevOps team was done with its part, it handed over to a team called delivery, and the delivery team was responsible for rolling that out to stage, test and prod. On the other side we had our infrastructure team, which provided us with virtual machines. Unfortunately that was not real infrastructure as a service; we had to request VMs ticket by ticket. All of this, combined with the architecture we had, led to really long release and deployment cycles.

If I break it down into architecture and organizational setup: on the architecture side we already had a "micro" approach, but it was micro applications. They were connected through an SDK, not through stable APIs as we have now. Independent deployments were technically possible, but you could only release the software as one big package, which ended up in complex and long-running releases. The architecture also had stickiness, meaning sessions were bound to specific services and applications, so we couldn't use zero-downtime patterns like blue-green deployment, and everything was restricted to the Java stack. On the organizational side, I already showed the separation between operations and delivery on the previous slide. It caused such long deployment and release cycles that the development environment, which was handled by the development teams, and the test, stage and production environments were sometimes two or three versions apart. There was clearly a disconnect across the organization, and one of the main factors was that there was no real infrastructure-as-a-service provider we could leverage; we were always depending on tickets, and we couldn't do any kind of zero-downtime deployment as we saw in the previous talk with BOSH.

So what did we change? First of all, we decided to go for infrastructure as a service. At the beginning we did that with AWS, because it was the easiest and fastest way to get started, and later we switched to SAP Monsoon, our SAP-internal infrastructure-as-a-service offering. This is really essential if you want to use a platform as a service like Cloud Foundry, and we also use it for our backing service deployments, which I will show later. Then we introduced a platform as a service. We started with Deis for prototyping, playing around and gaining experience, but we switched to Cloud Foundry pretty soon. This gave our developers the freedom to self-deploy and self-operate their services, and also the freedom to choose the programming language they want to use. And then we made a lot of changes in the architecture.
We introduced the paradigms of the twelve-factor app, one of them being no sticky sessions: services are RESTful and stateless. This way we can do blue-green deployments for zero downtime. We followed the reactive manifesto to get resilient, scalable services. And we introduced a microservice architecture with stable APIs. We really do have stable APIs now, and every breaking change to an API shows up as a bump of the API version number (a small sketch of what I mean follows at the end of this part). All of this enables one of our main principles, the ability to run it: the dev team that develops a service also operates that service through all stages, even in production. They deploy it, they set up their continuous integration, they set up their continuous deployment process. The microservice architecture and the way we use Cloud Foundry give us independent release and deploy cycles, because there are no deployment dependencies between the services. We also extended the whole "you build it, you run it" paradigm to the backing services: if a team needs a database for its microservice, the team has to deploy and operate that database as well.

This leads to roughly this picture of how we structured our teams. At the bottom we have the core service teams working on the core services, and above them the domain-specific teams that build their services on top of those core services. Then we have our UI and UX teams, which are currently separate, but I think we are going to move them partly into the other teams. I mentioned that our teams also manage their backing services on their own. In the previous talk we learned that in order to deploy Cloud Foundry you have to use BOSH, and we extended that concept to our teams: they use BOSH as well to deploy and maintain their backing services. What we did is set up a MicroBOSH per team, so each team has its own MicroBOSH and can deploy independently of the others. That is how they manage their set of backing services.
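As a small illustration of the stable-API point from earlier, here is a minimal sketch of publishing a breaking change under a new version prefix while the old contract stays untouched. The paths and payload shapes are made up for the example; they are not the actual YaaS cart API.

```go
package main

import (
	"encoding/json"
	"net/http"
)

// v1 contract: frozen, so existing clients keep working.
type cartV1 struct {
	Items []string `json:"items"` // plain SKU list
}

// v2 contract: a breaking change (structured line items), so it is published
// under a new /v2 prefix instead of silently changing /v1.
type lineItem struct {
	SKU      string `json:"sku"`
	Quantity int    `json:"quantity"`
}

type cartV2 struct {
	Items []lineItem `json:"items"`
}

func main() {
	http.HandleFunc("/v1/carts/demo", func(w http.ResponseWriter, r *http.Request) {
		json.NewEncoder(w).Encode(cartV1{Items: []string{"sku-123"}})
	})
	http.HandleFunc("/v2/carts/demo", func(w http.ResponseWriter, r *http.Request) {
		json.NewEncoder(w).Encode(cartV2{Items: []lineItem{{SKU: "sku-123", Quantity: 2}}})
	})
	// The service itself is stateless (no sticky sessions), so both versions
	// can be blue-green deployed without breaking consumers.
	http.ListenAndServe(":8080", nil)
}
```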
That's it from my side; Johannes will now tell you a little bit about how we use monitoring and logging to support this model.

Yeah, hello. Is it on? Perfect. I guess I will skip the introduction of myself, because we thought we had 40 minutes but it's 30, so we have to speed up now. I'm basically one of the guys living in the dark and doing computer stuff, so probably much the same as you. If we talk about logging and monitoring in microservice architectures, we face a lot of challenges. Compared to the old days, it is really hard to keep an overview of what is going on in the system. My first employer had just one box: if something was wrong, I looked at it and fixed it. Now, Cloud Foundry alone has something like ten components, so even debugging Cloud Foundry itself is painful if you do it the old-school way, and if we look at our services deployed on top of Cloud Foundry, it gets even worse. On top of that, each team can pick whichever technology they like: we have teams using Go, a lot of them using Java, and I guess somebody is using Akka.

So we have to find a way to monitor and log all of this through one single pipeline and to get a similar view across the different technologies. Then there are different app scopes. Imagine we have a product service: when you click in the shop, you want to see the product details immediately. On the other hand we have slow-responding services like the checkout service, which in the background checks whether your credit card is valid, whether your address is correct, and so on. The checkout service will respond much more slowly than a fast product service, so you can't apply the same rules to every app. And last but not least, there is always some platform influence: if our infrastructure-as-a-service provider is not running well, or if we do a bad deployment, you will see the effect at the app level.

To get all of these different points into one big overview, we sketched out a first architecture. Starting from Cloud Foundry, we send metric and event data over to Riemann. In Riemann we do the alerting and some data combination, checks like "if this service is up, the other one should be running as well", alert the teams through VictorOps, and store all the metric data in Graphite. One downside of Riemann, and I am not really a Clojure person, is that it is hard to configure, and it does not really scale. You can program some scaling into Riemann, but it does not scale well. The logs take a different route: we use Logstash with the syslog endpoint. You can configure a syslog drain URL in Cloud Foundry, send all the data over to a Logstash shipper, cache it in Redis, pass it through a Logstash indexer, and it finally ends up in Elasticsearch. That works, but Logstash is not nearly as comfortable to use as the thing we are doing now.

In October, Cloud Foundry released a great new feature called the firehose. This is the underlying architecture: the sources send logs to the Metron agent installed on each VM, and the Metron agents ship the log messages to the Dopplers, where they are buffered. If you have configured a syslog drain URL, the logs are sent to that syslog endpoint directly. Otherwise the Loggregator traffic controller takes care that when you ask for logs, you get them; that is basically how the CF CLI works when you run cf logs. On a much bigger scale, the firehose does the same job: if you connect to the firehose, you get the logs of the entire system. One of the main benefits of the firehose is that it is also scalable: if you connect several clients with the same subscription ID, the log messages are spread evenly across them. This is a really great thing, because now you can build a small app, deploy it to Cloud Foundry and connect it to the firehose, and if it can't keep up any more, you don't need a whole new BOSH deployment; as Cornelia told us, you just run cf scale and you are on the safe side.
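To give an impression of what such a small firehose consumer looks like, here is a minimal sketch using the NOAA client library (github.com/cloudfoundry/noaa), which is mentioned again below. It assumes the consumer package API where Firehose takes a subscription ID and an OAuth token; the environment variables, the token handling and the skipped TLS verification are simplifications for the example, not the actual YaaS setup.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"os"

	"github.com/cloudfoundry/noaa/consumer"
)

func main() {
	// Assumed to come from the environment, e.g.
	// DOPPLER_ADDR=wss://doppler.example.com:443 and a UAA OAuth token.
	dopplerAddr := os.Getenv("DOPPLER_ADDR")
	authToken := os.Getenv("CF_ACCESS_TOKEN")

	c := consumer.New(dopplerAddr, &tls.Config{InsecureSkipVerify: true}, nil)
	defer c.Close()

	// Every instance that connects with the same subscription ID gets a
	// share of the firehose traffic, so scaling the consumer is just
	// `cf scale` on this app.
	msgs, errs := c.Firehose("yaas-log-parser", authToken)

	go func() {
		for err := range errs {
			fmt.Fprintln(os.Stderr, "firehose error:", err)
		}
	}()

	for envelope := range msgs {
		// Each envelope carries one event: a log message, an HTTP event,
		// a value metric, a counter, and so on.
		fmt.Printf("%s %s\n", envelope.GetOrigin(), envelope.GetEventType())
	}
}
```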
So what we are currently building keeps this Doppler idea in mind. We still have the backing services, the applications and the Cloud Foundry components themselves as log sources, but now we send all of these different sources of log and metric information through Doppler. Just as a note at this point: there is a client available that you can use to also integrate backing services that are not deployed inside the Cloud Foundry cluster; that is something we will try out soon. But at least the application logs and the Cloud Foundry logs themselves already run through Doppler.

Then we implemented a log parser; that is the thing I described before. The NOAA client is the component Pivotal provides to get the logs out of the firehose, and on top of it we built a log parser which basically checks what kind of message it is looking at: a normal log message is parsed with one rule, a router message with a router rule, and so on. On the log level we distinguish between two different log types: a normal log type, where we extract things like a request ID, and a metric log type. So we removed the whole Riemann infrastructure and now send the metric data through logs as well, because in the past we had the problem that if somebody used some unusual programming language or some unknown technology, they would have had to implement the Riemann agent on their own. But every programming language can write to standard out and standard error, and that was the reason we decided to send everything through log messages, so that nobody can complain "I don't have a client for this specific monitoring tool". From there we hand all the data over to a Spark cluster, which is basically our replacement for Riemann: Spark can do the same as Riemann, from my perspective, with the benefit of scaling. Even if there is huge traffic in the logging system, we can just scale out Spark and we are done. On the output side we store all the data in Elasticsearch, where the teams can build nice dashboards on top, and the alerting is done through VictorOps, which is similar to PagerDuty. There will be several more systems consuming the log data, because you can do things like predictions on top of logging data, or fetch some customer-related data from it, and so on.
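To make the parser idea a bit more concrete, here is a rough sketch of the classification step on plain log lines. The "METRIC <name> <value>" convention and the request_id format are assumptions made up for this example, not the actual format the YaaS teams use, and the real parser works on firehose envelopes rather than raw strings.

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// parsed is the document shape that would end up in Elasticsearch.
type parsed struct {
	Type      string  `json:"type"`                 // "log" or "metric"
	RequestID string  `json:"request_id,omitempty"` // for tracing across services
	Name      string  `json:"name,omitempty"`       // metric name
	Value     float64 `json:"value,omitempty"`      // metric value
	Message   string  `json:"message,omitempty"`
}

var requestIDPattern = regexp.MustCompile(`request_id=([0-9a-f-]+)`)

// parseLine classifies one log line. Assumed convention: services print
// metrics to stdout as "METRIC <name> <value>"; everything else is a
// normal log line that may carry a request ID.
func parseLine(line string) parsed {
	if strings.HasPrefix(line, "METRIC ") {
		fields := strings.Fields(line)
		if len(fields) == 3 {
			if v, err := strconv.ParseFloat(fields[2], 64); err == nil {
				return parsed{Type: "metric", Name: fields[1], Value: v}
			}
		}
	}
	p := parsed{Type: "log", Message: line}
	if m := requestIDPattern.FindStringSubmatch(line); m != nil {
		p.RequestID = m[1]
	}
	return p
}

func main() {
	for _, line := range []string{
		"METRIC cart.checkout.duration_ms 512",
		"GET /products/42 200 request_id=a1b2c3d4-0042-4711-9e1f-000000000000",
	} {
		doc, _ := json.Marshal(parseLine(line))
		fmt.Println(string(doc))
	}
}
```

Each resulting JSON document is the kind of record that would then be indexed in Elasticsearch and aggregated in Spark.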
If we look back at the challenges from the beginning of my part: the overview is still a tricky thing, because you don't get a nice UI that shows you what flows from where to where, but I already mentioned the request ID we established in all our services. Whenever somebody hits one of our endpoints, this request ID is carried through the whole system, and at the end of the day you can search Elasticsearch for that single request ID and see every endpoint that this one outside request touched. You could also leverage that data in other ways, for example to build automatic graphs of the call flow. Different technologies: as I mentioned, we ship everything through logs, so if somebody is not able to write logs, they have really not picked the right technology. Different app scopes: with Riemann we had a single BOSH deployment to install and configure Riemann, whereas with Spark we are now able to let teams deploy their own monitoring rules, and hopefully the teams will do that to cover their specific monitoring needs. And the platform influence: since we have all the log messages about response times, error codes, HTTP response codes and so on, we can tell whether the performance of the platform as a whole is going up or down. So in the end we get a good overview not only of specific components but of the entire system. Oh, I was faster than expected. Are there questions?

What about system metrics, CPU, disk space and so on? That is something we would like to integrate there too. At the moment we do this with Riemann, basically with the riemann-health tools. Maybe I have not mentioned that we are in a transition phase at the moment: the app logs and metrics go through the Doppler approach, while the other things partially still go through Riemann. I guess there is already a small example in the NOAA project of how to get system metrics through Doppler, but I am not 100% sure.

Oh, that one is for Rene. There was definitely a learning curve for the teams, but it was interesting. In almost any team you can find somebody who discovers an interest in operational tasks. Because we had the BOSH experience, we started with one-day workshops where we taught the teams the basics of BOSH, and from there on there were one or two people in each team who took over that task, did the BOSH deployments, played around with them and improved them. For example, for MongoDB our team did not have the experience the developer teams have about what to tweak and how to tune it, but those teams had it, and they just had to apply that to their BOSH deployments.

Both yes and no. To sum it up, you have to automate it somehow, and if you are using Puppet you have to learn something, and if you are using BOSH you have to learn something as well. So at the end of the day we could offer them an in-house workshop where they only have to pay some travel costs, instead of hiring somebody from outside to do it. Thank you. Thank you.