Welcome, everybody. I know this is the last session today, in the last of the tracks. I'm actually very excited to see so many people here. When I saw that I'm slotted so late on the third day of a completely packed program, I thought I'd have maybe ten people here. Then I saw I'm going up against Dr. Nic talking about Concourse, so I thought I'd have three people here. I'm very delighted to see such a big audience.

My name is Johannes. I'm a software engineer at Pivotal. I've been working on Cloud Foundry for the last three years, always on the logging and metrics team, early on, on the Loggregator team. I'm very passionate about metrics and logging. Whenever I say that, it sounds really weird and awkward, and I would not recommend using it as a pickup line, but it's certainly something I grew very accustomed to working on, and it's where I want to advance the product. So naturally, when the call for papers came along, I thought I'd talk about application performance monitoring in Cloud Foundry. To make my title buzzword compliant, I called it "Application Performance Monitoring in the Cloud Native Age."

Which leads me to the first point: what does it actually mean to be cloud native? I took a tiny little survey and asked three people if they could give me a good definition of what it means to be cloud native. One of them was able to talk intelligently about it. So there's not a lot of information out there, and I thought, how about actually defining it? But before I go ahead and define it, I want to share a tweet I saw yesterday from one of my co-workers, Utako: if you are participating in this game, now would be a really good time to have a drink. I just want to make sure, because I'm going to mention "cloud native" a few times now, so keep your hip flask out.

So, cloud native. When I actually looked it up,
I came across this definition: your application deployment should be container packaged, dynamically managed, and microservice oriented. All three of these have implications for monitoring. For example, "dynamically managed" means your app is going to be managed by a platform; if your app is crashing, it might get restarted. So do you want to get alerted in the middle of the night whenever your container restarts? Maybe, maybe not, because the platform is going to take care of that for you. You really want to make sure that you get alerted when there's customer impact, but that does not necessarily mean "when your app is crashing" anymore.

Whenever I approach a topic like application performance monitoring, I try to approach it with three very important questions: why, and what, and how? That's how I structured my talk.

So first: why do we want to monitor? You're all here, so I assume this should be very simple, but I want to share one anecdote that I think is so telling it should be the reason we're all here. Who remembers the Knight Capital Group? It was just four years ago; I think it's known as the software bug that cost the most money. Knight Capital Group in 2012 was really big in the stock trading market. They did really well, and they had a very intelligent system. Then, on August 1st, 2012, they deployed a new version of their software. Forty-five minutes later, they were out of business. This is what happened: for 45 minutes they were running this bad code, and they lost about $440 million. Three months later they were acquired at a bargain price by another company. They didn't recover from this.
The code that was in there made the system buy stock at inflated prices, and they lost so much money that they couldn't recover.

The one thing I find very interesting about this, and how I think it relates to our topic: I mean, as a software engineer, I have pushed out bugs, I've pushed out bad code, so that shouldn't be the interesting part. But they had this code running in a production system for 45 minutes, until it was way too late. If they had good monitoring in place, why didn't they catch it after five or ten minutes? They might have been able to recover, but they let it run until it was too late. And I think that's the one lesson: yes, please, let's make sure that we have good monitoring in place so that our company doesn't go out of business.

The other reason I think monitoring is very important, and here I'm sharing a personal anecdote: fresh out of college I joined a startup where I was one of the two engineers. Our monitoring and alerting system typically looked like this: a customer couldn't reach our system, the customer called the CEO, and the CEO called me. That made for a very unhappy customer and a very unhappy CEO, and my wife wasn't very happy either, because those calls were typically in the middle of the night. So that was me back then: whenever I could sleep, I'd better get sleep in, because I was on call 24/7 and my customers called me directly. Please, let's try not to be those people. I don't want to worry about my sleep; I want monitoring in place that allows me to sleep soundly at night. So these are some of the reasons why we monitor: let's try to get good sleep as engineers, and let's make sure our company stays in business.

Next: what do we monitor? This is a topic where I might have to disappoint you, because unfortunately I do not have a silver bullet for you. What we have to
monitor really depends so much on what kind of application you have. In most cases, yes, it might be that latency is exactly what you want to look at; latency has a huge impact on user satisfaction. In other cases you might want to look at your error rate, or some completely different metric. It really depends on what kind of app you have, and I know "it depends" is such an unsatisfactory answer.

So let's actually look at some of the metrics you might want to watch. For example, there's latency. Mike Villiger from Dynatrace mentioned this in his lightning talk yesterday morning: when Amazon's latency increases by 100 milliseconds, they lose 1% of their conversion rate. So that's probably a metric they should watch. If I had a website that's basically a storefront, I probably would watch my latency. If you have a marketing website and you want to encourage your customers to download marketing material, the bounce rate is very important, so maybe that's what you want to monitor. If you have a batch processing system that's processing satellite images, maybe you just want to look at your error rate or your queue length.

Nevertheless, I think the one thing that is very important: remember the business value of your app. Your apps are out there not for technical reasons. I mean, latency and bounce rate and errors are all really great metrics to watch, but your app should generate business value, and you should keep a handle on understanding what the business value of your app is and watch that very closely, because in the end, that's why you have your app deployed in a production environment.

The biggest chunk of my talk is going to be spent on the third question: how do we monitor?
There are a few topics I want to mention. As I said in the beginning, being cloud native means microservice oriented. Microservice oriented means a request comes to the first server, and the first server fires requests to all your other microservices. Now, if the end user complains that the request is taking too long, you at first don't know which service exactly is taking so long, and it will be your responsibility to figure out which service has to be tuned, which service is faulty, which service is responsible for the negative impact on the user's interaction. This is where request tracing comes in.

When we talk about request tracing, the first thing that always comes to mind is Zipkin. Zipkin is a project that was started by Twitter, built upon a research paper published by Google about a system called Dapper that does exactly that: in a microservice architecture, you look at all your nodes and you see exactly where your latency is generated.

Here's a very simple example of how to do request tracing with a Cloud Foundry app. Imagine you push a Spring Boot app.
With Spring Boot, it's very simple to enable the Spring Cloud Sleuth project, which includes the OpenZipkin library. You just have to put the Sleuth library on your classpath, and you get all of this for free. So imagine we have this microservice architecture with four services. The visualization you get looks like this: your first request goes to service one, service one fires a request to service two, which fires requests to services three and four, and you see exactly where the bulk of the latency is generated.

When we talk about request tracing, there's typically what we call a trace, which is the one overarching request, and then there are spans. A span is the unit of work that a server uses to process a request. In this case we have exactly seven spans for one trace, because the services are generating requests to other services and processing those requests.

I do have a live demo, but honestly, the live demo is not much more interesting than the screenshots. This is an app running on Cloud Foundry, and if you look at the URL, it's actually deployed to Pivotal Web Services. It is the example straight from the Spring Cloud Sleuth documentation. I can just generate a request to my service one here, and then search for it: service one, there's my search button, "find traces", and you'll see traces. So this is the trace we just generated; you can then dial in and look at all the details you see here. If you have a microservice architecture, this is a very powerful tool. It tells you exactly which service has to be tuned, which service has bad performance, so you can really zoom in on the one thing you have to improve.
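For reference, the classpath change is small. Here is a sketch of the Maven dependencies, using the starter artifact names from the Spring Cloud Sleuth documentation of that era; check the current docs for the exact group, artifact, and version details:

```xml
<!-- adds trace and span IDs to your logs and outgoing requests -->
<dependency>
  <groupId>org.springframework.cloud</groupId>
  <artifactId>spring-cloud-starter-sleuth</artifactId>
</dependency>
<!-- additionally reports spans to a Zipkin server -->
<dependency>
  <groupId>org.springframework.cloud</groupId>
  <artifactId>spring-cloud-starter-zipkin</artifactId>
</dependency>
```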
So in your microservice architecture, please use something like this that allows you to do request tracing. And again, Spring Boot and Cloud Foundry are really your friends here, because they make it really, really easy to get all of this up and running.

Next, one of the very obvious choices: you have an application pushed to Cloud Foundry, and you have to monitor it. There are really good monitoring solutions out there. It's very simple to bind one of their services and restage your app, and you get, out of the box, a good interface to monitor your app. The three I want to mention here, because they are actually part of the Cloud Foundry Foundation, are Dynatrace, New Relic, and AppDynamics. I was going to look for representatives from those three companies and see who gives the most swag, and then I would use them as a demo here. But nobody gave me any swag, so I ended up using New Relic, because honestly, that was by far the easiest for me to set up. It took literally a minute. I just went to Pivotal Web Services, I created a New Relic service, I bound the New Relic service to my Spring Music app, and the result looks like this. Just very simple to provision, and out of the box I get this New Relic service that allows me to really drill into what it calls traces and see exactly where in my application source code my application is slow.

Even if you are not on Pivotal Web Services: IBM Bluemix has a very similar service that comes out of the box;
I think it's called something like Monitoring and Analytics, something that allows you to monitor your applications. And at Pivotal, the team that I'm working on is working on a monitoring solution for your app that comes out of the box. So if you are on Pivotal Web Services, you can very simply go from Apps Manager, just click that link, and look at the metrics of your app. You will see very basic information about network traffic to your app, plus CPU, memory, and disk statistics, and that all comes out of the box. So the Cloud Foundry ecosystem is certainly working on making all these things much, much easier for you.

Now, maybe for some reason you cannot use one of these out-of-the-box monitoring services. New Relic and AppDynamics and Dynatrace all come with downsides: maybe you don't want to push your monitoring data through your firewall to a public server, or maybe you don't want to pay for an on-prem installation. There's still so much data generated inside the Cloud Foundry system that you can build a very comprehensive monitoring solution yourself, with, I want to say, very little overhead. So over the next few minutes I want to highlight some of the things you can do right now without paying for any of these monitoring services.

One of the things I worked on with David Sabeti during a hack day a year ago: we were looking at the log output generated by an app. We wanted to see, if you just look at the rate of log lines generated by your app, can you actually get some information out of that? And I want to say: typically, yes. If there's an anomaly going on in your app, you will probably see something happening with your log lines. Maybe it's just the sheer volume of logs that are generated, but also: what is the ratio between logs written to standard out and standard error? What happens if I suddenly see much more standard-error output? Maybe something is wrong with my app.
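The heuristic behind that idea can be sketched in a few lines of Go. This is a hypothetical reconstruction, not the plugin's actual code; the function names and the 25% threshold are mine:

```go
package main

import "fmt"

// StderrRatio returns the fraction of recent log lines that went to
// standard error. It returns 0 when no lines have been seen at all.
func StderrRatio(stdoutLines, stderrLines int) float64 {
	total := stdoutLines + stderrLines
	if total == 0 {
		return 0
	}
	return float64(stderrLines) / float64(total)
}

// Suspicious applies a crude threshold: if more than a quarter of the
// output went to stderr, the app may deserve a closer look.
func Suspicious(stdoutLines, stderrLines int) bool {
	return StderrRatio(stdoutLines, stderrLines) > 0.25
}

func main() {
	fmt.Println(StderrRatio(90, 10)) // 0.1: looks healthy
	fmt.Println(Suspicious(10, 90))  // true: worth a look
}
```

In a real plugin you would feed these counters from the app's log stream and re-evaluate on a sliding window rather than over all time.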
So David and I were working on a CLI plugin that I want to demo here. It is very, very simplistic and doesn't look all that great, but I want to show it anyway, because I want to show you how easy it is to create something that can be of value. We called the CLI plugin Log Eliza. If I start the plugin, I give it my application name, and it starts a web server. Let me go to that web server and see if I can find it here. I'm just refreshing this page, and because my app has no log output at all yet, Log Eliza analyzes it automatically and tries to make a judgment call, saying: something is not right here, there's no log output at all. So let's try to generate some log output, and I'm going to do this by hitting the application that is hosting my slides, because that's my demo. And suddenly I see logs coming in, and all looks good, because all the logs coming in are going to standard out. Down here you would see the ratio between standard out and standard error; my app is not producing any standard error, so all is good.

I wanted to show you this because it's something we hacked together in four hours, just using the CLI plugin architecture. A big shout-out to the CLI team: writing CLI plugins is so easy, so please give it a try. Writing a dashboard like this that you can put up on your build monitor shouldn't take all that long. And down here, as a bonus,
you actually see all the raw log lines. So this is what you can do just to monitor your current log output, and again, all the credit goes to David Sabeti, who did most of the work.

Next I want to talk about latency, and again I want to show you a CLI plugin, one that I actually wrote this morning. It took maybe all of ten minutes to put together, because I mostly copied and pasted code from another CLI plugin. A good developer is a lazy developer, right? When it comes to latency, if you don't want to use one of these very good monitoring solutions like AppDynamics or Dynatrace or New Relic, you can still monitor latency, because you get all the data delivered to you by the Cloud Foundry system.

The thing I want to show here is again my CLI plugin, which I think I have in this window. I called it the app nozzle. What the app nozzle plugin does: it attaches to the firehose, but it only looks for messages coming from a specific app that I pass in. On the first question here, it asks me what kind of messages I'm interested in. I'm interested in latency, so I'm going to look at HttpStartStop, which is number four, is that right?
Let me increase the font, yes. You don't see anything right now because there are no requests hitting my app, so let's again generate some traffic by going to my slides and refreshing them a few times. In the back you see all these HttpStartStop messages coming in. For every request to my app, I now see an HttpStartStop event. Refreshing this page once triggers roughly 20 requests, because of all the images in the talk and the CSS and JavaScript. Most importantly, the data we really need to get out of here: there's a start time and a stop time. Given the difference of those two, you can see how long it took to serve that request.

One thing I want to mention in this regard, as a general tip about latency: a lot of people look at percentiles. They want to see, what's the 95th percentile for my application? Given that number, they try to make a judgment call about what the user experience is like right now. Please be very careful with percentiles. They might not actually mean what you think they mean.
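To make the start/stop arithmetic and the percentile caution concrete, here is a small Go sketch. The numbers are made up and the nearest-rank percentile method is my choice; the only Cloud Foundry fact assumed is that HttpStartStop carries start and stop timestamps in nanoseconds. Nineteen fast cached-asset requests hide one slow content request, and the 95th percentile still looks great:

```go
package main

import (
	"fmt"
	"sort"
)

// LatencyMS computes a request's latency in milliseconds from the
// start and stop timestamps (nanoseconds) of an HttpStartStop event.
func LatencyMS(startNS, stopNS int64) float64 {
	return float64(stopNS-startNS) / 1e6
}

// PercentileMS returns the p-th percentile of the latencies using the
// nearest-rank method (no interpolation), with p given as an integer 1..100.
func PercentileMS(latencies []float64, p int) float64 {
	sorted := append([]float64(nil), latencies...)
	sort.Float64s(sorted)
	rank := (p*len(sorted) + 99) / 100 // integer ceil(p*n/100)
	if rank < 1 {
		rank = 1
	}
	return sorted[rank-1]
}

func main() {
	// One page load fires ~20 requests: 19 cached assets plus the one
	// request that actually renders the content.
	latencies := make([]float64, 0, 20)
	for i := 0; i < 19; i++ {
		latencies = append(latencies, 10) // cached CSS/JS/images: 10 ms
	}
	latencies = append(latencies, 500) // the real content request: 500 ms

	fmt.Println(LatencyMS(0, 2000000))        // 2 (milliseconds)
	fmt.Println(PercentileMS(latencies, 95))  // 10: looks great
	fmt.Println(PercentileMS(latencies, 100)) // 500: the user's actual worst case
}
```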
This very simple example already shows it: one request to my application triggers 20 requests. Now, if I don't really know what I'm doing, I might look at the 95th percentile. But given that 20 requests are fired at my app, it could very well be that the one main request, the one actually loading all the content of the slides, all the text, is the request taking the longest. The 95th percentile might look totally good, because it's mostly measuring cached images and cached JavaScript and cached CSS, while that one request that actually fetches content takes way longer. This is where percentiles can be very tricky. If you load Yahoo's front page, you're triggering, I think, a hundred HTTP requests. At that point, I think any user has a pretty good chance that one of those hundred requests lands among the very slowest, because one of those hundred requests might just be very slow. So percentiles are hard. You really have to be careful about using percentiles to know what the user experience is like.

A few other things I want to mention: I'm a Java developer and a Go developer, so how do you actually monitor and debug a Java app and a Go app running on Cloud Foundry? With Java, the really cool thing that is possible now, and that I don't think is as widely advertised as it should be: now that we can all SSH into containers, you can actually access the JMX endpoints in your Java app running on Cloud Foundry. I don't want to take the credit here: Matthew Sykes wrote this up in a very nice blog post and really dug into how to get this to work.
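As a rough sketch of what that setup looks like; the JMX system properties and the port number are illustrative, and Matthew Sykes' post is the authoritative reference:

```shell
# Expose JMX on a fixed port, without SSL or authentication
# (acceptable through an SSH tunnel, never on a public interface).
cf set-env my-app JAVA_OPTS "\
  -Djava.rmi.server.hostname=localhost \
  -Dcom.sun.management.jmxremote.port=5000 \
  -Dcom.sun.management.jmxremote.rmi.port=5000 \
  -Dcom.sun.management.jmxremote.ssl=false \
  -Dcom.sun.management.jmxremote.authenticate=false"
cf restage my-app

# Forward local port 5000 into the container; -N means no remote shell.
# For a specific instance of a scaled app, add: -i <index>
cf ssh -N -L 5000:localhost:5000 my-app

# Now point jconsole (or your monitoring tool's JMX connector) at localhost:5000
```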
It is now actually quite simple. You just add these environment variables to your Java app running on Cloud Foundry, and once you have that running, you start an SSH tunnel through the cf ssh command. After that, you can access your JMX endpoints on port 5000 on localhost. This is very useful if you want to monitor your Java applications running on Cloud Foundry, because most of those monitoring solutions have a JMX connector, so you can easily forward the JMX information to your monitoring solution. There is one interesting caveat: obviously, if your Java app is scaled across multiple instances, how exactly do you know which instance the data is coming from? Cloud Foundry has you covered: the ssh command allows you to specify which instance you want to SSH to, so you can do exactly what you need here. If you have ten instances, you just have to start ten SSH tunnels. And then, not that there's much value in it, but you get a very nice JConsole picture showing you exactly everything that is happening in your Java container. This is something a lot of people have been asking about for the last few years, and now it's really simple and easy to do.

You can do something very similar for Go apps. I've been using Go for the last three years, so that's certainly my sweet spot. With Go, you can very simply enable the pprof profiling endpoints by importing one package from the Go standard library, and once you have that imported, Go will make all this profiling information available. It actually was a little bit tricky to get this to work on Cloud Foundry.
On my GitHub there's an application called pprof-on-cf; you can just go there and look at the source code to see what you have to do to get it to work. Just to show you what the information looks like, we can look at the example I have up: these are the endpoints you get. You get information about your threads, your goroutines, your garbage collection; all these things are available for any of your Go apps just by enabling the pprof endpoints. Here you have again the interesting problem: what happens if you have multiple instances of your application? There are some creative ways to deal with this; one would be using sticky sessions with a session ID. So there are ways to make sure you can get access to the information for every container.

There are only two more things I want to mention. One is very, very dear to me, and if this is the only thing you retain from this talk, then I'm happy: please make sure that you monitor in all environments, not just in your production environment. Monitor in your dev environment and in your test environment. Again, I have to quote what Mike Villiger from Dynatrace said, because I thought it was a very cool quote, and I don't think I'll get it exactly right, but he said something like: this new world is not about just pushing out bad code faster through automation and pipelines. You really want to make sure that the code you're writing is good, and you can do this by having good monitoring in place for your development and test environments, so that you know, before you push to prod, that your development and test environments look all good. Cloud Foundry is actually doing that
really well, I think, by monitoring all stages of the pipeline. There's good monitoring in place: the same monitoring that's in place for the Pivotal Web Services production environment is in place for the stages in the pipeline before a change hits Pivotal Web Services. I really like the way we do it there, and I can encourage you to do the same.

Then, once you have all this in place, a really good monitoring system, make sure you use it in the right way by setting up alerts correctly. There's a project I'm working on right now that's all about alerting; it'll be available for Pivotal Web Services and Pivotal Cloud Foundry soonish. The very interesting thing about alerting that I want to mention: when you set up your alerts, make sure you look for anomalies. And anomaly detection is not easy; it's actually really difficult to figure out what your anomalies might look like. You might have a spike every Monday morning between 8 and noon because your app gets more requests during that time, or you might have a spike every Christmas because you're in the business of selling Christmas decorations, or you're in the tax industry and you get a big spike the week before tax day. You really want to make sure you anticipate those anomalies so that they don't needlessly trigger your alerting system.

Besides that, I think those are all the things I wanted to share. Just one last favor to ask: if you are at all interested in having your application emit custom metrics through the Loggregator pipeline, please let Jim Campbell, the product owner of the Loggregator team, know; he's around here somewhere. And with that: thanks so much for coming. I really appreciate you all being here. Are there any questions?
[Question from the audience about doing request tracing within the platform itself.]

So you mean the Cloud Foundry platform? That would certainly, absolutely be possible, because there are very nice Go libraries for Zipkin, and most of the components are written in Go. So I don't think it would be a lot of work; it's just that the orchestration effort is going to be huge, because you have to make sure all these teams are working in a streamlined effort to get all their components instrumented so that they emit the logs that Zipkin expects to the Zipkin server. So I don't think it would be all that hard; it's just orchestration. Any other questions? Thank you.