Hello everybody, and welcome again to another OpenShift Commons briefing. Today we've got Martin Etmajer from Dynatrace, and he's going to talk about monitoring microservices at scale on OpenShift. We're really happy to have him with us. We've had a few technical bugs, but other than that we're going to get moving, and I'll let Martin introduce himself a little bit more. The format of this is that you can ask questions in the chat while we're going through this, and then there's Q&A afterwards. There are a few people on from Dynatrace who may be able to answer questions in chat, but if there are really good questions, we may pause him so he can take a breath, or just save them and reiterate them in the Q&A. We'll run over a little bit today because we had a late start, but we'll be happy to do so. So go for it, Martin: introduce yourself and show us what you can do with monitoring.

Thank you very much for having me, and welcome everybody to this presentation. My name is Martin Etmajer, and I am a technology lead at the Dynatrace Innovation Lab. If you happen to have any questions regarding our deep integration with OpenShift, or our Red Hat partnership in general, please do not hesitate to reach out directly to me. So, who is Dynatrace?
Dynatrace is the leading digital performance management solution, helping business, development, and operations succeed in optimizing customer experience, modernizing their operations, and accelerating innovation. At Dynatrace we are proud partners with Red Hat, and it is our mission to create the best performance management solution around OpenShift, OpenStack, and Ansible. In this presentation, I want to demonstrate that we are serious about this.

So let me start with a quick introduction to why microservices. When we talk about microservices, we talk about dozens to hundreds, if not thousands, of small interconnected components that each serve a single purpose. The reason why microservices are so appealing is that they have the potential to solve a variety of problems that enterprises are seeing these days. Microservices increase velocity by enabling independent development and deployment of code; as Rich Sharples, the senior director of product management at Red Hat, mentioned at the Red Hat Summit in 2016, today speed equals revenue. Microservices also support growth by allowing services to be scaled independently: depending on the traffic scenario, we can add a few instances here or remove a few instances there. Microservices also foster innovation. They enable experimentation and failing fast to create the right product, and they do so by enforcing agile, product-centered, self-enabled teams. Now, these features make microservices look good on paper, but what do they look like in production?
This screenshot has been taken from one of South America's largest e-commerce production environments, serving more than 3,000 services, as you can see on the left, distributed over more than 10,000 containers in seven data centers. Building a system composed of microservices effectively means building a fault-tolerant, highly dynamic distributed system, and that is really a complex thing to do.

Let's take a look at an important detail here. As you can see from those red-colored nodes, which are faulty services, a total of 30 services have failed in the entire service landscape. But most importantly, and this is what Dynatrace understands like no other tool, none of these faulty services had any impact on the customers' online applications. You see, if you have a distributed, fault-tolerant, highly dynamic system, you do not want to be alerted for each single CPU spike or any problem that you see, unless it affects your users.

So now I would like to present how Dynatrace can help you manage the complexities of designing and running microservices on OpenShift, by discussing three important learnings that we've made along the way.

Learning number one is that microservices are complex. In fact, microservices can themselves be fairly simple on the inside. However, the overall complexity does not vanish; instead, the complexity shifts into the surrounding environment and into the communication between services. We also say that microservices have a lower inner versus a higher outer complexity in the connected parts. So let's talk about environmental complexity first. What is a microservice platform composed of?
We need things like service discovery, configuration management, routing, load balancing, execution environments like Docker, and messaging channels; things like deployment automation and monitoring are also considered key capabilities of a microservices platform. If you're using OpenShift, the complexities of building and running such an architecture are nicely abstracted by convenient tooling. But still, you're going to build, ship, and run your services on a significantly complex distributed software stack. Essentially you use a PaaS because there's no real sense in crafting your own platform, and in a clustered environment you also don't really care about how your containers are scheduled. But that doesn't mean that you should fly blind.

Drilling down here to see some more detail shows the Dynatrace Smartscape topology model of the entire system, which reflects the real-time relations and dynamics of your application, right down to the data centers they run in, including any services, processes or Docker containers (which we use interchangeably in this view), and hosts. Now, this is important for two reasons. Number one: customers have told us that they had no idea what their system actually looked like until we showed them. Number two: having the possibility to introspect a system at its various layers, in detail and in real time, helps people in business, development, and operations make better decisions. And this is what Dynatrace is about. I want to add that all of the information that you see here is discovered for you; you don't have to do anything.
It's kind of zero configuration, out of the box.

So let me talk about inter-service complexity now. One design principle of microservices is to do one thing and do one thing well, which aids separation of concerns and improves composability. To remain composable, microservices demand robust APIs. Now imagine the desired scenario where your teams own separate microservices: how can you be sure that the resulting services interact as designed? Of course, with Dynatrace we cannot validate that your services behave as intended by their designers, but we let you see how your services are composed, and how often and in which ways they interact, in real time.

So what you see here on this screen is what we call the service flow. You see that we have multiple services connected to each other; it's not a simple system. We have back-end services, booking services, journey services, and we see how often and in which ways those services interact, and how those service calls contribute to the overall response time.

Now here is another way to look at the same service. On the left side and in the main part of the screen, we see the median response time and the failure rate of that service. We also see how many services invoke this service (the fan-in), and how many services are invoked by this service, in this case 11 services (the fan-out). We also see that this service is not a single process; it's actually running on four different Tomcat instances. And we see that the service communicates with two queues asynchronously to talk to other services. If you click on any of these parts here in the UI, you will get further information on what these interactions look like on behalf of this very service.

What we also see on the right side, and this is something that affects composability, that is, how well your services can be composed, is that we have a certain REST endpoint that has a hundred percent failure rate.
So certainly there is a bad implementation here. We also see which are the most time-consuming requests; time-consuming requests are not only slow requests, but requests that are also called a large number of times, so they contribute considerably to the overall response time of this service. Then we also see the generally slowest requests on that service; those are your candidates for improvement.

If you take another look at the left side, what you see on this time series chart, where I've selected the response time, is that the response time had a problem: we had a response time degradation, and we had various events over that time frame. We can also drill down into the problem and take a look at the problem details, and I want to do this a little later on during this presentation.

Now, I talked a little bit about the inter-service complexity. What about the intra-service complexity, the things that go on inside the services? As an example, a New Stack article mentioned that synchronous REST communication effectively turns microservices back into a monolith. When you read about performance in microservices, many companies, many enterprises, have actually understood that there is no way around asynchronous communication, because there's so much communication going on over the network, which is considerably slower than communicating in memory as we used to in monolithic applications; synchronous communication is just not the way to go. But enforcing asynchronous communication, queues, and decentralized data management brings another level of complexity: eventual consistency and things like that, which make your systems really hard to handle and oversee.

Another complexity that happens inside your microservices is that your threading model, and the way that your service interacts with other services,
dramatically affect the efficiency and resource consumption of your services. So what can you do in Dynatrace? Here is what we call the response time analysis of a service, and here you see at a glance how the response time is distributed in terms of interaction with other services and queues, database usage, but also service execution, which means the code executed for this particular service at hand. There are a lot of things that you can drill down into here, but what I'd like to show you is that at the very bottom you see what we call the PurePath. Here you can see the actual method invocations which take place inside the service, and let me repeat again: all of this is done automatically for you; you don't have to do this by hand. This is why we've built Dynatrace, and this is the level of convenience that we wanted to create for large-scale distributed microservice environments.

Learning number two is that microservices don't fail independently. In a highly distributed microservices environment, the old saying "the more moving parts, the higher the likelihood of failure" certainly holds true. And since failures can happen anytime, you have to become fault tolerant; that is, you have to ensure that your application can tolerate a potentially high number of simultaneous failures without compromising customer experience. And just because microservices can be deployed independently,
that is, if defined correctly, it doesn't mean that they fail independently. More often than not, they fail in a cascade. What you see here is a feature in Dynatrace that allows you to analyze the failure, or the problem, as we call it, of your services. If you look on the right, this is actually an animation that you can start, which begins right before the problem occurred and then shows you how the problem evolved and how it eventually disappeared over time, because most problems that we deal with in production are of a transient nature; think of a CPU spike. The important thing is that you can replay and see: how did this particular problem affect my service? How did it propagate and bubble up to the user? And did it affect my user at all, or not? This is a very helpful feature and gives a lot of insight into what's actually happening inside a problem.

I would like to add that we correlate the singular incidents that happen in your system. We understand the dynamics of those systems, and we can automatically deduce whether they belong to the same problem or not. So instead of alerting you during the night and saying that the CPU spiked or that Tomcat crashed, we give you the replay functionality so that you can look at it as a single problem, and you don't have to look at all the uncorrelated incidents that you usually find in log-message-based solutions.

So let's assume now that your application has failed. The questions that you would like to have answered immediately are: what's affected, what's the impact, and what's the root cause?
At Dynatrace we have more than 10 years of experience in building performance monitoring solutions, and that's why we give you this information at a glance. This is the analysis of a particular problem, and we see at first sight what's affected: we have three applications which have been affected in a given timeframe; you see the timeframe here. We also see immediately that the problem affected, or affects, real users. If we look at the table below, we see that the problem has already recovered and that it affected three applications, 15 services, and two infrastructure components. It's actually good to know that the problem has already recovered, or that the applications, services, and infrastructure components have recovered.

But let's take a look at the impact. We already know that three applications have been impacted, and in total around 1,500 user actions have been impacted per minute; so this has been a user-visible problem. We also see, if we look further down, that one of the applications had a slowdown to 2.5 minutes as a median value. And we see that all user actions have been affected, on all browsers, on all operating systems, across all geolocations. The nice thing about this view is that if the problem had only existed in an area around Boston, or in an area around Vancouver, we would have told you so here.

We also let you know what's the root cause. In this case, we've analyzed more than 90 million dependencies for this problem, and the root cause has been CPU saturation on these two nodes with the given names. I want to take a step back here and go back to that number of 90 million dependencies which we analyzed. Where does this large number come from? When you install Dynatrace for OpenShift, we automatically detect how your processes communicate; we automatically detect how your services communicate.
We automatically baseline metrics like response time and many other things. Those are the dependencies which we analyze, so that in the end we can come up with a clear image of what has been affected in your application, and whether it affected your users or not.

Okay, so learning number three: the network isn't reliable. Let's quickly talk about the role of the network. When you migrate from a monolithic to a microservice-oriented architecture, you trade fast in-memory communication for much slower inter-service communication across networks. Let's take a look at the service flow again, at those two services, the journey service and the check destination service. If you want to validate whether what we've done here actually makes sense, you can see that in 99 percent of its invocations the journey service invokes the check destination service; so that's about one call per request. If communicating with services contributes considerably to your response time, you might reconsider whether a high level of distribution of your microservices actually makes sense, or whether you might want to merge those services back into a single service. So the network infrastructure itself becomes a limiting factor in your application, and those things indeed happen. What you see here is from our demo application.
I recently had a discussion with the CTO of a company in Europe whose goal is no less than becoming the next-generation eBay in Europe. Their problem was that they had a high number of TCP packet retransmissions, and they didn't know where they came from. As a rule of thumb, if you have a highly networked application that's communicating widely across the network, then a TCP retransmission rate of six percent already renders your application almost unusable. In this scenario, and also in the scenario that the CTO told me about, they had a TCP retransmission percentage of around 10 percent, and they found out that it was due to a defective pin on one of their network cables. How could you possibly figure out this problem if you only look at your log messages? You possibly can't. And this is what we automatically detect for you, and this is why baselining, automated baselining, is so crucial.

Okay, so let me talk about how you can get Dynatrace into OpenShift.
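As a side note on those percentages: the retransmission rate is simply the share of retransmitted TCP segments among segments sent. A minimal sketch of the arithmetic, with hypothetical counter values (on a Linux host, the real RetransSegs and OutSegs counters can be read from /proc/net/snmp):

```shell
# Retransmission rate = retransmitted segments / segments sent * 100.
# The counter values below are hypothetical example values.
retrans_pct() {
  awk -v retrans="$1" -v sent="$2" 'BEGIN { printf "%.1f\n", retrans / sent * 100 }'
}

retrans_pct 600 10000   # 6.0  -- the "almost unusable" threshold mentioned above
retrans_pct 1000 10000  # 10.0 -- roughly the level seen in the customer scenario
```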
So let me introduce you to Dynatrace OneAgent. Dynatrace OneAgent is our single-agent technology that allows you to monitor your applications, your services, your processes, your Docker containers, your hosts, and the underlying infrastructure, without any configuration, out of the box. This is how it works. Basically, there are two major options. Option number one, the one that I'm referring to here, is what we would say is the preferred option, because it gives you the entire picture; this is what we call Dynatrace OneAgent for full-stack monitoring. What you would do is roll out OneAgent on the host machines that make up your OpenShift/Kubernetes clusters, and once you've done that, you will get insight into everything that runs on top of your machines, including the OpenShift and Kubernetes infrastructure and all your containerized processes running on top of the platform. We've also created an Ansible role, which is available on Ansible Galaxy, for the automated deployment of Dynatrace OneAgent for full-stack monitoring on your cluster nodes, which is very convenient.

Another option that you could choose, and this also follows the full-stack monitoring approach, is to run Dynatrace OneAgent inside a Docker container. In this scenario, OneAgent does not actually run in the Docker container; instead, we use the Docker container as a vehicle to roll out OneAgent onto the nodes that make up your cluster, through privileged access.
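A playbook applying the Ansible role just mentioned to the cluster nodes might look roughly like this. Note that the role name and all variable names below are illustrative placeholders, not the role's documented interface; check the Dynatrace role on Ansible Galaxy for the actual parameters.

```yaml
# Illustrative playbook: roll out OneAgent on all OpenShift cluster nodes.
# The role name and variables below are placeholders -- consult the
# Dynatrace role's documentation on Ansible Galaxy for the real interface.
- hosts: openshift_nodes
  become: yes
  roles:
    - role: Dynatrace.OneAgent
      vars:
        dynatrace_environment_id: "YOUR_ENVIRONMENT_ID"   # placeholder
        dynatrace_paas_token: "YOUR_PAAS_TOKEN"           # placeholder
```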
This is what the container-based rollout actually looks like, and we're proud that, as of a few days ago, our Dynatrace OneAgent container has been Red Hat Container Certified, which means that the container is secure and trusted and is ready to use on OpenShift and on Atomic-based hosts. So this is the preferred approach, and you can use it on any OpenShift platform where you have direct access to the host level.

But what happens if you use OpenShift in a managed scenario, like OpenShift Dedicated or the OpenShift Online developer preview? What can you do? What we did is provide Dynatrace OneAgent for PaaS monitoring. If you don't have access to the node level, it's not a blocker: you can use Dynatrace OneAgent for PaaS and conveniently include it into each of the Docker containers that you would like to have monitored by Dynatrace. You can do this by using the OpenShift command-line interface, you can use the nifty s2i (source-to-image) tool that's also provided by Red Hat, or you can fall back to traditional Dockerfiles if that's what you'd like to do. If you're interested in doing so, there are a lot of tutorials which we provide on the OpenShift blog.

What this gives you is that you can use Dynatrace with OpenShift on any OpenShift platform and for any OpenShift offering, whether that's OpenShift Origin, OpenShift Container Platform (as it's called now), OpenShift Dedicated, or OpenShift Online. If you're interested in learning how you can roll out OneAgent, either for full-stack monitoring or for your Docker containers, please take a look at the blog articles the Dynatrace team has provided on blog.openshift.com. You can also refer to our landing page.
You can simply find us by searching for OpenShift monitoring on Google, which leads directly to our landing page. Once you're there, please check out the Dynatrace free trial and please provide feedback.

Martin, there are a number of questions coming in, probably more on the technical side of how all this works, so let's pause. The first one, which you kind of answered a little bit already: from a technical perspective, how does this all work? Is there an agent that runs in each container, and how does it trace a call through many containers? Does it add header information to all the packets? That's a number of questions in one.

Can you repeat the last question, please?

How do the headers work? Does it add header information to all the packets?

Okay, all right. So, how does it work to inject Dynatrace OneAgent into containers? Basically, we've divided the installation of OneAgent for PaaS into three simple steps. The first step is to download OneAgent for PaaS from Dynatrace, which will install Dynatrace OneAgent inside your container. This can be a container that you create using source-to-image, or a container that you create using the oc command-line interface. You can also use this approach to inject Dynatrace OneAgent for PaaS into existing applications that already run on OpenShift, by leveraging oc patch or oc edit. The approach is basically that you install the agent into each of your containers. Once the installation has been done, for Java this is very simple.
It's a single command, and the agent takes approximately 20 megabytes of space in your container. Then we provide additional convenience on top of that, because OneAgent for PaaS has to be injected into your particular application process; in the case of Java, this would be the JVM. To make your life convenient, you just have to source a configuration file that already provides the necessary Java options, so that when the Java virtual machine is started, it automatically picks up the agent. That's step number two. Step number three is to start your Java application, and that's it.

Now, what you will get when using the PaaS approach: you obviously cannot get any information from the hosts that form the foundation of your OpenShift cluster, but you can get all the information about your processes, your services, and your applications, including real user monitoring.

On the other technical question, regarding headers: when you use Dynatrace OneAgent on a web server, for real user monitoring, for example, then it would add header information so that we can keep track of your entire transaction, from the web browser or your mobile clients, across all the back-end tiers and back-end services that are running on OpenShift. Okay, I hope this answers the question; if not, please feel free to reach out to me directly.
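The three installation steps Martin describes can be sketched in a Dockerfile. Everything specific in this sketch, the download URL, the unpack location, and the name of the sourced configuration file, is a hypothetical placeholder; the tutorials on the OpenShift blog show the real commands and artifacts.

```dockerfile
FROM openjdk:8-jre

# Step 1: download and unpack OneAgent for PaaS into the image.
# The URL and paths here are hypothetical placeholders.
RUN curl -sSL "https://YOUR-ENVIRONMENT.example.com/oneagent-paas.tar.gz" \
      -o /tmp/oneagent.tar.gz \
 && mkdir -p /opt/dynatrace \
 && tar -xzf /tmp/oneagent.tar.gz -C /opt/dynatrace \
 && rm /tmp/oneagent.tar.gz

COPY app.jar /app/app.jar

# Step 2: source the configuration file that exports the Java options which
# attach the agent to the JVM (assuming it exports JAVA_OPTS).
# Step 3: start the application.
CMD ["/bin/sh", "-c", ". /opt/dynatrace/oneagent-env.sh && exec java $JAVA_OPTS -jar /app/app.jar"]
```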
In a minute I will show the email address again.

One last question here: what overhead can be expected if we deploy the agent inside the workload containers?

Yeah, so it depends on what you mean by overhead. Let's start with hard disk space. This really depends on the technology, but let's say for Java the agent costs you no more than 20 megabytes of space inside your containers, and I think that's a pretty fair amount of space. In terms of performance, we've taken a lot of caution, because we know that we integrate deeply into your application. For Java, for example, depending on how deeply you instrument your application, we take around 1 to 2 percent of your application's performance, and we believe that, for the amount of information we give you out of the box, that is also pretty fair.

All right, I think that answers this question. Thanks. Why don't you put up that last slide that you had with your email address, in case there are more questions?

Yes, I want to do that; I just want to quickly present an outlook on what's coming up next first. There are a lot of things coming down the pipe, and we want to integrate more deeply with the OpenShift platform. What I would like to tell you about is that we would like to have Dynatrace inside continuous delivery, as we call it, also work on OpenShift. If you look at the ecosystem around OpenShift and the work that has been done by the Developer Experience team of OpenShift, they have set up a continuous delivery project that you can quickly spin up by just using an OpenShift or Kubernetes template. So what has Dynatrace been doing for many years in the past?
We have helped customers understand whether their applications are good enough to be pushed into production or not, by analyzing their automated tests, meaning integration tests and acceptance tests, whether they test REST APIs or whether they test the browser UI. That way we can understand whether a change would have any severe performance implications. We call this architectural validation before you go into production.

So what we say, and I want to keep it short now because we're already over time, is: don't just optimize for speed. We all talk about speed, but it's also about quality. So instead, release fast and with certainty. We want to provide you with a use case where you can use Dynatrace not only in production but also in pre-production, to understand whether the containers that you have on OpenShift are fit for purpose in production or not. In detail, this means that we can help you identify bad code before it gets checked in: we collect performance metrics from automated tests, we feed those metrics back into Jenkins, and we are also able to stop bad builds if you want us to do so, but that's an optional thing.
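The "stop bad builds" idea can be illustrated with a tiny quality-gate check such as a CI job might run. This is purely illustrative, not Dynatrace's actual Jenkins integration; a real gate would pull the measured and baseline values from the monitoring API rather than hard-coding them.

```shell
# Illustrative CI quality gate: fail the build when the measured median
# response time exceeds the baseline by more than an allowed percentage.
# All numbers are hypothetical.
check_gate() {
  baseline_ms="$1"; measured_ms="$2"; max_increase_pct="$3"
  awk -v b="$baseline_ms" -v m="$measured_ms" -v p="$max_increase_pct" \
    'BEGIN { exit !(m <= b * (1 + p / 100)) }'
}

# 230 ms against a 200 ms baseline is within a 20% budget; 300 ms is not.
if check_gate 200 230 20; then echo "gate passed"; else echo "gate failed"; fi
if check_gate 200 300 20; then echo "gate passed"; else echo "gate failed"; fi
```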
This is what we're able to do. So if there's only one thing that you take away from this OpenShift Commons briefing, I would like it to be the triple-A, as we call it. This means that Dynatrace provides discovery out of the box. It comes with automatic baselining of all the important metrics that we determine from your processes, your containers, your services, and your applications. And it comes with automatic problem analysis: you don't have to manually correlate thousands and millions of incidents; we understand how these incidents relate to each other and serve them to you conveniently as single problems that you can replay, to understand how they evolved and how they actually vanished over time, if they were transient.

Okay. So please, if you have any further questions, try out the free trial, give us feedback, and let us know how we can further improve and help you succeed in your journey. Thank you very much.

Martin, there's one more question that just came in, from John. It's sort of a follow-up on that header one you had earlier: how do you ensure the header information persists through various microservices? He gives an example: if a REST call goes to service A, and two seconds later a REST call goes to service B, how do you know they're related?

Yeah, I'm not sure if I can give a technical answer to that, because it goes deeply into how we implement the agent, but John, if you send me an email, I will send you an appropriate answer and can also bring you in contact with a development team member who knows how this has been done.

Okay, so please do reach out. And if you can't get directly to him, just post the question on the mailing list for OpenShift Commons, which most of you should be on; if you're not, sign up soon. It's on commons.openshift.org, and you can just join and we'll add you to the main mailing list. Again, thanks, Martin.
Thanks for suffering through the technical sound check; we're really pleased to have you here. We'll definitely do this again, probably in 2017, because I have a couple of large-scale folks who are using, or could use, this, and I'd love to see some of them working through that and see it live in action. Though I do realize that, in the interest of time, it's probably very hard to demo that live, it could be an interesting thing to see. So once again, thank you. And folks, if you have questions for Martin or about Dynatrace, just reach out directly to them. We'll be doing this again next week, and hopefully you'll join us all for the OpenShift Commons Gathering in Seattle; I think there are a couple of folks from Dynatrace who are coming to that, who have registered. So you can ask them in person in Seattle on November 7th at the OpenShift Commons Gathering. Take care, everybody, and we'll talk to you all soon.