Hi, good morning. Thank you for coming today. I know people start to get sessioned out about this point in a conference like this, so I'm glad to see some faces here. My name is Tony Irwin, and I'm going to talk today about some work we've done with our Node.js microservice system to do monitoring with open source tools. All of our microservices are Node.js, which is why this talk is in this track.

Here's a quick overview of what we'll cover. Some of you may have seen my presentation yesterday as well, but I'll give an introduction to the Bluemix UI, which is the Cloud Foundry-based microservice system I work on, and very quickly review its architecture. Then I'll talk about the importance of monitoring a microservice system. One of the first things we learned as we started migrating to microservices is that monitoring is just super important, and I think we probably underestimated that when we started. I want to give an overview of the monitoring architecture we set up, along with some examples of actually using the monitoring data to do problem analysis. I'll also talk a bit about how you might go about building your own monitoring system, and then say a quick word at the end about synthetic measurements.

The Bluemix UI, as I alluded to, is the front end to IBM's open cloud platform. It's a pretty large UI, made up of about 25 microservices at this point. It lets users view and manage Cloud Foundry resources, apps and services, orgs and spaces, as well as other resource types like containers and virtual servers. It runs within IBM's Cloud Foundry deployment in the Bluemix product; there are a couple of small screenshots at the bottom of the slide.

The Bluemix UI architecture, as I've alluded to, is a microservice architecture. If you were here yesterday, you saw this same diagram. We started off as a monolithic application, a single Java backend serving a single-page app built with the Dojo JavaScript framework. Over the last couple of years we've more or less gotten rid of the Java app and now have, like I said, about 25 to 30 Node.js microservices.

I'll say a few words about the importance of monitoring. I'm sure a lot of you have products in production and may have run into similar issues: a problem occurs in the middle of the night, so what do we do? I think there are at least three big areas where it's important to have monitoring. One is root cause analysis. The Bluemix UI is the most visible part of the Bluemix platform, probably just because it's the front end, and it acts as a canary in the coal mine for the whole platform, because we end up calling APIs across the board; we touch nearly every component. If you've ever worked on UIs, you know you're the first line of defense anyway when a problem occurs. When a critical event or an outage happens, it often starts with "the console is down," "the console is slow," "I can't log in to the console." It may or may not actually be a console code problem. In some cases it has been, but it could also be a networking issue, a firewall issue, or Cloud Foundry could be struggling at any given time. So when it's 2 a.m. and you get a PagerDuty alert, there's a problem, and it's almost a matter of self-preservation to be able to quickly do some root cause analysis; monitoring allows you to do that, or at least helps. Ideally, you'd also like to be able to auto-detect problems with a monitoring system, so you don't have to wait until a user or a support person calls you up.
It's nice if you can, by looking at the data, see that an API, for example, has started returning a bunch of errors, and go ahead and send an alert to the team to investigate before a user hits the problem. Another big reason to have monitoring is that you really can't improve things you can't measure and track over time. If you have certain performance or quality goals and you don't know what you're currently hitting, you can't do much to improve it, and you don't know whether you're improving things or making them worse.

As for the kinds of metrics we were especially interested in when we started working on this: with all of our microservices, we wanted information about every inbound and outbound request, an inbound request being a call into the microservice and an outbound request being a call to an API outside the microservice. We wanted things like response time, error codes, the HTTP method, and so on, the kinds of things you would normally see on a request. We were also interested in lower-level details about memory usage and CPU usage of our apps, as well as app crashes; a crashing app is a signal of a problem. And finally, the general health of our ecosystem and dependencies. For example, we have a shared Redis session store across our microservices. If we can't connect to Redis, we're in trouble, so we do some monitoring to make sure we're able to connect and get what we need out of Redis, and we send alerts if that starts to fail.

This is a diagram of our monitoring architecture. At the top it's similar to the previous diagram in that it shows the Bluemix UI client, which is the web browser, and the proxy layer; all requests come into the proxy and are routed to the green boxes, which represent the microservices that make up the UI. What I've added here are some blue boxes at the bottom, which are additional apps dedicated to monitoring, plus some lines from the proxy and from the microservices out to MQTT. We've added a piece of middleware to all of our microservices, so as requests come in and go out, we publish events to an MQTT message queue, which is backed by the IBM Internet of Things platform. I believe the presentation on cat detection yesterday talked about this as well; I don't have anything as cool as cats in this presentation, but everything goes through MQTT. (There's a small sketch of what that middleware looks like below.) The orange lines coming out of MQTT represent cases where our monitoring apps are subscribing to the events. So all of our microservices publish events to MQTT, and then we have a couple of other apps subscribing to those events. The monitor-storage box down here has the responsibility of receiving those events, doing some massaging of the data, and then storing them in InfluxDB, so that we have a persistent record of what's happening. Over on the far right side we have Grafana, which is connected to our InfluxDB, so we can see all this data in real time. We also have an alerting app, or microservice, that similarly subscribes to MQTT; it does some analysis, tracks some numbers over time, and will post into Slack or send PagerDuty alerts if it starts to detect anomalies.
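To make that middleware idea concrete, here is a minimal sketch of what Express middleware like this could look like. The broker URL, topic layout, and payload fields are placeholders of my own, not the actual Bluemix UI code, which isn't published.

```js
// Minimal sketch: publish an event to MQTT for every inbound request.
// Broker URL, topic names, and payload fields are illustrative placeholders.
const mqtt = require('mqtt');
const client = mqtt.connect('mqtt://broker.example.com'); // hypothetical broker

function requestMetrics(serviceName) {
  return function (req, res, next) {
    const start = Date.now();
    // 'finish' fires after the response has been handed off, so publishing
    // here adds no latency to the request itself.
    res.on('finish', () => {
      const event = {
        service: serviceName,
        method: req.method,
        path: req.path,
        status: res.statusCode,
        responseTime: Date.now() - start,
        timestamp: start
      };
      // Fire and forget: no callback, no retries, so metric collection
      // can't slow the microservice down.
      client.publish(`metrics/${serviceName}/inbound`, JSON.stringify(event));
    });
    next();
  };
}

// Usage in one of the Express-based microservices:
// const app = require('express')();
// app.use(requestMetrics('catalog'));
```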
We also have a third app that I put in the monitoring category, which we call the space scanner. It really just uses Cloud Foundry APIs to keep track of memory and CPU usage across all of our instances, and it looks at the app events to see whether there have been crashes, those sorts of things. The next slide is a textual view of that diagram; I'm not going to read it to you here, but it's there if you want a reference later.

So, on to using the monitoring data. We're putting all this data into InfluxDB, and we've connected Grafana to it so we can look at it. I don't know how many of you are familiar with Grafana, but it's an open source tool that lets you create custom dashboards pulling data from various data sources. We have a lot of dashboards like the one shown here. This happens to be real data from our proxy microservice over about a 12-hour period. It includes things like total requests: that's the top chart, and it's probably a little hard to read, but everything in green is good requests. It's color coded, so you might see a little yellow and potentially some red if we get 400 or 500 responses, but most of it is green, which is good in this case. The second chart is overall response time. We include the mean, the median, and the 90th-percentile time. We think the 90th percentile (some people use the 95th) is important because sometimes your average looks okay, but you're getting spikes along the way, meaning some users are getting slow responses that you'd like to knock out of the system. The third chart is error rate: the number of 400 and 500 errors coming through. We always have a little steady stream of those; sometimes 400 errors are just a bot on the internet or old bookmarks that don't map to anything we currently have, and some of our APIs return 400 errors per their spec, so seeing some yellow there isn't always a bad thing.

I often make the analogy to a cardiologist reading an EKG: they can look at it and see that you've had a heart attack, whereas most of us would pick it up and say, I have no idea what this means. We've gotten used to looking at these charts, and we can usually tell pretty quickly when something doesn't look right. In this case, it's the same chart as on the previous slide but for a different time slice, and you can see a chunk towards the middle where a bunch of red pops up, meaning we're getting a lot of errors. The 90th-percentile response times have spiked; you probably can't see it from where you're sitting, but it's about two minutes, which really means we're getting a bunch of timeouts. And in the error-response chart, the red line has spiked up a bit, so we're getting a lot more errors than we like to see. This again is for the proxy; I should have pointed out on the last dashboard too that all of these dashboards let us look at individual microservices, so if you're interested in the catalog or the dashboard or any of the other apps in our system, we can look at those individually.

It's one thing to see that there's been a problem. As I mentioned before, root cause analysis is pretty important: being able to actually go in, see what the problem was, and fix it, or find an expert on another team who can fix it.
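For anyone wanting to wire up something similar, the storage side of this pipeline could look roughly like the sketch below, assuming the `mqtt` and `influx` (node-influx) npm packages and a made-up measurement schema. The real monitor-storage app does more data massaging than this; the trailing comment shows the kind of query a Grafana response-time panel would run over the stored points.

```js
// Sketch of a monitor-storage style app: subscribe to the metric events
// and persist them to InfluxDB. Database, measurement, and field names
// here are assumptions for illustration.
const mqtt = require('mqtt');
const Influx = require('influx');

const influx = new Influx.InfluxDB({
  host: 'localhost',
  database: 'ui_monitoring' // hypothetical database name
});

const client = mqtt.connect('mqtt://broker.example.com'); // hypothetical broker

client.on('connect', () => {
  client.subscribe('metrics/#'); // all services, inbound and outbound
});

client.on('message', (topic, message) => {
  const event = JSON.parse(message.toString());
  influx.writePoints([{
    measurement: 'requests',
    tags: {
      service: event.service,
      method: event.method,
      status: String(event.status)
    },
    fields: { responseTime: event.responseTime, path: event.path },
    timestamp: new Date(event.timestamp)
  }]).catch(err => console.error('influx write failed', err));
});

// A Grafana response-time panel over this measurement would then be roughly:
//   SELECT MEAN("responseTime"), MEDIAN("responseTime"),
//          PERCENTILE("responseTime", 90)
//   FROM "requests" WHERE time > now() - 12h GROUP BY time(5m)
```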
From these dashboards we can dive into deeper detail and bring up other dashboards that show more information. One of the things the app that stores data into InfluxDB does is try to categorize the URLs it sees: maybe it's a Cloud Foundry call, maybe a call to the container service, maybe UAA. It basically tags each URL, so we can build charts like this one. I've blurred out the legend so as not to implicate any specific components, and you'll probably have a tough time seeing it (I thought it would render a bit bigger), but for the first 15 or so categories in the legend, the max time has been two minutes. So it's a pretty widespread set of components that are timing out in this case, and from that we would tend to think it's a more systemic problem rather than one of our individual microservices. We also categorize calls to our individual microservices, so sometimes when we start to see these problems we might see, for example, the catalog microservice pop up to the top of this category chart, and that gives us somewhere to drill down deeper.

We look at Grafana and see those top-level categories, but sometimes you need an even deeper level of detail, so we have another little app that pulls all of the request data and puts it into a tabular format. These are calls through the proxy over some 24-hour period. You can see the category on the far left-hand side; the target, meaning which endpoint or host the request is going to; the HTTP method; the status; the URL path that was used; and the number of times it was called. Total time here is really the average time, and you also get min and max and those sorts of things. These are all hyperlinks, so if you use our little UI for this you can drill down further and see all the individual requests with timestamps, rather than having them grouped together, with the stats for each individual request. You can also get some additional statistics, like the 95th-percentile time and the standard deviation, because sometimes we'll see requests that look pretty good on average but have spikes, so the 95th-percentile time is higher than we'd like, or the variance is wider than we'd like.

The wall of shame. We want to improve over time, so we've started what I affectionately call walls of shame, where we use that details view from the previous chart to do things like show the 10 slowest APIs across the system. You can set count thresholds, so maybe I only care about APIs that have been called 1,000 times over the last 24 hours: of those, what are the 10 slowest? You can also filter by errors, to see which APIs have returned the most errors in the last however many hours. We send this out to teams regularly to apply a little pressure. I use "shame" in quotes because we're all in it together, but the goal for your team would be to not have your API show up on the wall of shame.
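The report app behind the wall of shame isn't published either, but the aggregation it does is simple enough to sketch. The record field names below are made up; the point is just the grouping, the count threshold, and the slowest-first sort described above.

```js
// Sketch of a "wall of shame" style report: given raw request records
// (field names are illustrative), return the slowest endpoints that were
// called at least `minCount` times.
function wallOfShame(records, { minCount = 1000, top = 10 } = {}) {
  const byEndpoint = new Map();
  for (const r of records) {
    const key = `${r.method} ${r.path}`;
    const entry = byEndpoint.get(key) || { key, count: 0, totalTime: 0, errors: 0 };
    entry.count += 1;
    entry.totalTime += r.responseTime;
    if (r.status >= 400) entry.errors += 1;
    byEndpoint.set(key, entry);
  }
  return [...byEndpoint.values()]
    .filter(e => e.count >= minCount)                 // ignore rarely-called endpoints
    .map(e => ({ ...e, avgTime: e.totalTime / e.count }))
    .sort((a, b) => b.avgTime - a.avgTime)            // slowest first
    .slice(0, top);
}

// Example: wallOfShame(rows, { minCount: 1000, top: 10 })
// returns the ten slowest endpoints by average time, each with its
// call count, average time, and error count.
```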
We're also interested in memory usage, CPU usage, and crashes. This chart shows that. We do all this monitoring for our dev, staging, and production systems, and I guess that's an important lesson to point out too: you want to see how things are doing in dev and test before you actually promote to production. This was actually an example where we found something. The last chart here (and these are Node.js apps, mind you) is CPU usage. Normally in a Node app you expect very low CPU usage, especially one you're using to serve UIs, and you can see the CPU usage just steadily going up until later in this time period, when we put in a fix and it went back down to where it should be. We found this in development, so the bug causing the CPU usage never got out to production.

Let me talk a little about how you might use some of these principles yourself. I'd actually planned on publishing some of the code we use to publish these metrics, but a couple of weeks ago I learned about an open source project developed by IBM called appmetrics. I think Mike mentioned it in his presentation yesterday. It shares a fair amount in common with the middleware I mentioned earlier that we put into all of our microservices, but it goes even deeper to provide additional metrics, so if you wanted to build a system like this, I'd recommend taking a look at that package. There's a link in the slide deck. (Sorry, that was a notification from Lotus Notes.) It just proves to me again that IBM is a big place, so we sometimes have many people working on similar problems in slightly different ways. What I think we'll probably end up doing is using appmetrics for our middleware, because it can also send data to MQTT. The other things we do, where we write to InfluxDB or do some analysis and alerting, we should be able to build on top of the appmetrics module, because we can just subscribe to MQTT and do our own data massaging and storage.

This is a table of some of the additional things appmetrics can look at for you. If you're using Socket.IO, for example, and you want metrics on your WebSockets, that's included, as are a number of databases: if you're interacting with, say, MongoDB (actually, I don't think Cloudant is on the list) and you want to see details about the queries you're running, appmetrics can do that. It can be configured to store data in Elasticsearch or StatsD. We kind of like InfluxDB in what we've been doing, and its connection to Grafana, so I've talked to the development team about possibly adding that adapter; of course it's open source, so I could probably contribute it as well.
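For orientation, the basic appmetrics API looks roughly like the sketch below, going from the project's README; the exact event names and payload fields may differ by appmetrics version, so treat this as a starting point rather than a reference.

```js
// Rough sketch of using the appmetrics module directly. Event names and
// payload fields may vary by appmetrics version.
const appmetrics = require('appmetrics');
const monitoring = appmetrics.monitor();

// Periodic CPU samples for the current process.
monitoring.on('cpu', (cpu) => {
  console.log('process CPU:', cpu.process, 'system CPU:', cpu.system);
});

// One event per inbound HTTP request, with timing information.
monitoring.on('http', (http) => {
  console.log(http.method, http.url, http.duration + 'ms');
});
```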
All the data we've talked about so far is really focused on the server side, so I want to give a quick mention that we also do some synthetic data collection, where we have scripts running outside of the product using sitespeed.io. That's another open source tool that will actually load your web pages and give you all sorts of information: page load time, the time to first render anything on the page, your DOM complete time, the number of requests on the page, very detailed information. We run these scripts, store the data in Graphite, and have built, again, more Grafana dashboards to look at that Graphite data.

We think this is important because the response times I mentioned earlier are, again, server side; there's networking and there are appliances between your server side and the browser, so it's good to get a sense of how long those hops are taking, as well as to test from different geographic locations around the world. We can run these scripts from different VMs, and if you're familiar with WebPagetest, sitespeed.io will also let you invoke WebPagetest, which you can run from different spots in the world. So we have a sense of how long things take to connect to our servers from Australia versus from South America, for example, which you can't really get from just the server-side metrics I mentioned before. And that sort of takes us to the end. Any questions? Yep, go ahead.

[Audience question, partly inaudible, about how the data gets sent to MQTT.] Yes, we've got a piece of middleware in all of our Node apps. These are Express apps, so we've got some Express middleware, and when a request comes into a microservice, the middleware sees that we've got a request and fires the data off so it's published to MQTT. MQTT is a very lightweight protocol; when you install the module, there's actually C code behind the scenes, so you can really do a fire-and-forget, because we didn't want metric collection to slow down our actual microservices. So I guess the firehose is just each individual app sending events to MQTT. Any other questions?

[Audience question about whether other teams have been receptive to the data.] I'd say it's been mixed so far. Especially in a company like IBM, and I'm sure there are other big companies represented here, everyone seems to have their own priorities, so you can approach a group and say, this API isn't performing the way we need it to for the UI to perform, and some groups are more receptive to that than others. I think it's still a work in progress to figure out the best way to drive changes and best practices across a wider organization. We could also do the opposite experiment, where you have a wall of pride, I guess, and put the best-performing APIs up there to give some positive feedback, but we haven't quite done that yet. Okay, any other questions? If not, thanks very much, Tony. Thank you.