Yeah, so this week, Bob is going to take us through Glowroot and Munin. We're going to see what we can get out of the two monitoring tools. And yeah, it's going to be quite interesting, I'm particularly interested in Glowroot. So this is the third time we're having these server admin meetings this year. And yeah, over to you, Bob.

Thanks Tito. Good morning everybody. Let me see if I can share my screen. Okay, yeah, good morning. Can everybody see the screen alright? We should be seeing a browser window with glowroot.org. Yes, we can. Yeah, I can see it. Okay, so today I spent a little bit of time just going through this useful little tool that we've found has solved many problems over the last few years. I discovered it about four or five years ago. We used it to diagnose some issues they were having with a measles immunization campaign in Bangladesh. Anyway, as most of you on this call know, managing DHIS2 can be a tricky business. It's a big and complicated piece of software. Everybody has different kinds of user loads, different sizes of databases, different types of programs that they're running. And so sometimes it can be hard to know what's going on when things are slow or when things are crashing. So it's really important to have some good monitoring and diagnostic tools in place, and certainly one of the most useful ones I've found is this Glowroot. Glowroot is a JVM agent really; it's essentially looking at what's going on inside your Java virtual machine. And it's got some useful built-in graphs and metrics that give you good eyes on what's happening. But it's only one part of the picture, of course. As you know, to make DHIS2 run there are other pieces involved as well: importantly the database, and also the web proxy and the operating system it runs upon. If you've been using the DHIS2 installation tools, which I know some of you have for some years, and even if you've set things up independently, you'll know that it's also important to monitor other aspects of the system beyond the JVM. And by default we install this thing called Munin, which I've got running here. And thanks to Kwame and Oswald, we're actually looking at the live system in Ghana, because looking at these monitoring tools on test databases won't really tell you much; it's important to look at them with some live data. The thing about Munin is that by default you have hundreds and hundreds of graphs. Probably too many. That's a big problem with a lot of monitoring tools: you have lots and lots of information, and it's important to know which parts are useful. When I'm looking at Munin, the kind of thing I typically do is go to the host machine and look at the system. This is one of the first places to start: have a look at what your CPU usage is like. If you've got a lot of excess CPU usage, then you know you've got some kind of problem to investigate. There's a little bit of a problem here, actually, that I hadn't really figured out before. On the CPU graph, this is kind of normal use down at the bottom, and then there's this purple stuff up above. The purple stuff is IO wait. I've not had the opportunity to investigate this, but Oswald and Kwame, this is something to make a note of: the CPU is spending a lot of time in IO wait. I don't know if anybody can give me a suggestion why they think that is. Why would a CPU be spending so much time in IO wait? What does that indicate?
So, I think I had that issue before in Sierra Leone. It's because the RAM is old. With some of these hosting providers, what they do is give you a specification but they don't tell you the type of RAM. And if the RAM is very, very old, it spends a lot of time in IO wait and then you have these spikes and you don't know why they occur. But when we sent them a request, based on the specification of the RAM, asking them to change it, that was fixed. I think we had that problem a couple of times. I think we contacted you and you yourself were not surprised at it. But that's the main reason, I would say.

But you mean the disk? With IO, you're typically talking about the disk. I think it's the technology; it's better to use an SSD disk rather than that kind of disk. I think that can improve things.

IO wait generally means that the CPU is waiting for an IO operation to complete, and that's usually a disk operation. And I know, Gerald, you had that problem with your hosting provider where they gave you very bad disks. This server generally has been performing reasonably well. I'm not sure why we have this here, but it's something we probably should investigate. Anyway, this is one of the first things I look at, right? Look at the CPU usage. We're not here to solve problems today, we're just looking at the diagnostic tools. So yeah, this is something to look at, and Oswald, Kwame, I think we should make a note: let's see what's going on with that disk. We don't expect to see that. This is running on Linode, and Linode disks are usually very good, so there might be something happening. The other thing to look at is the memory usage. What we can see here, and this is on the host system, a 64 gig host, is a lot of this purple stuff at the top. This was a reboot that happened last night; I'll talk a little bit about the reboot later. And this yellow stuff is actually unused RAM. You might think you need to have a lot of unused RAM, but that's not necessarily the case. I mean, you pay good money for RAM, so you do want to use it; you just want to make sure you don't run out. Now, what the operating system will do when it's got free RAM to spare, particularly for the database, is make use of it as cache. So seeing all of this RAM being used as cache is not a bad thing; that's a good thing to see. We can see it's been fairly healthy over a period of time; this is over the past week. So RAM seems to be quite well allocated on this machine. There's quite a lot being used, and whatever's left over is being used as this cache, which is a good thing. We don't like to see too much swapping. There is a bit of swap happening, this red stuff at the top. That's something else I think we want to investigate, because too much swapping means excessive use of the disk, and it's generally not good for a database. I don't want to spend too much time on Munin. That's usually where I start; I look at that. The other thing that's very useful to look at in Munin is your database, your Postgres, the Munin database plugin. But we should look at the disk first as well. Let's just go up here and look at the disk. One of the things that generally seems to make the difference between a well-performing DHIS2 system and database and one that isn't performing so well is disk latency. So I always have a look at this disk latency here.
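For anyone who wants to check the same IO wait and disk latency numbers outside of Munin, a quick shell-level sketch. This assumes the sysstat package is available; exact column names vary a little between versions:

```bash
# Install sysstat if it isn't there already (Debian/Ubuntu)
sudo apt-get install -y sysstat

# Extended device statistics every 5 seconds:
#   - the avg-cpu line includes %iowait (time the CPU sits waiting on IO)
#   - per device, await / r_await / w_await are latencies in milliseconds,
#     and %util shows how saturated the device is
iostat -x 5
```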
The disk latency, you can see, is generally OK. This sdb is getting as high as 27 milliseconds. Generally, figures below 10 milliseconds, or down into the microseconds, are more what you like to see. What I can see here is that those figures are generally good, except for a little bit of extra activity that seems to happen on this /dev/sdb. The /dev/sdb, as far as I recall on this system, is the swap disk. So it could be that some of these high latency figures, and some of that swap memory we saw being used on the other graph, are related to some issue with the swap. So again, this points to something that probably should be investigated to see whether we can improve some performance. It's not good to see maximum latency figures like this. When you've got high latency, that's when your CPU is going to be in IO wait, and you always want to see your latency figures somewhere under 10 milliseconds. For the main disk, sda, which is where the database is sitting, the latency figures are very good; it's never over 1 millisecond. So the actual access to the disk is fine; there's just something going on with the swapping. That's maybe not something we need to dig into right now. But disk usage obviously is useful to keep an eye on as well. You can see here that we're currently using 50%, and you see this little spike that happens; you'd expect that in the middle of the night when the analytics is running. Let's look at it over the month, or over the year. It's funny, we haven't been seeing this; usually I'd expect to see that spike happening every day. We don't need to worry about running out of disk space at the moment. We've seen a kind of growth happening here, and then something happened back in October. We don't know what it was; maybe Oswald and Kwame will know. For some reason our disk usage dropped; it must have cleared out a lot of files all of a sudden. I don't know why that was.

Yeah, I think back in October a staging instance came up on the same machine, one of the things we do before an update. I think that is part of it.

Okay, so whatever happened back in October? It was around the update; the backup script cleared off some backups that were stored that day. Okay. Anyway, whatever you did back in October was a good thing, because now it looks like your usage is quite stable. It's growing a little bit each week, but it's not going crazy. The other thing I look at in Munin is the Postgres plugin, and probably the most important graph to look at on this one is the connections. Again, this is reasonably healthy. You see approximately 80 connections here. I think there are two databases: one is a staging one. Okay, let's look at it on here: there's the DHIMS database, and then this is the staging server here. It's averaging around 80 connections. What that typically means is that the database pool, and we're going to look at this in a bit, has probably just been left at its default value. Interestingly, when you look at the comment in the dhis.conf file where you set the maximum connections, it says the default is 40. I think that documentation needs to be updated, because the default is actually 80, not 40. So what was actually the case here was that the number of database connections was left at the default; it hadn't been set to anything higher. One of the things we did last night, because I thought it was maybe a bit low, is we increased that to 200.
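The roughly 80 connections being read off the Munin graph can also be checked live on the database itself. A small sketch, assuming psql access and a database named dhis (substitute your own database name):

```sql
-- Count current connections per database and state
-- (a live equivalent of the Munin connections graph).
SELECT datname, state, count(*) AS connections
FROM pg_stat_activity
WHERE datname = 'dhis'          -- placeholder database name
GROUP BY datname, state
ORDER BY connections DESC;
```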
I'll tell you a little bit about why we did that when we look at Glowroot. But this generally is a useful graph to look at. The kind of thing to look out for here: you see there's a bit of time where some of these connections are waiting for a lock. You kind of expect that, but if you see too much of it, then you know you're getting into some kind of problem where connections are deadlocking or blocking one another. Generally, though, this pattern seems fairly consistent. It's important to realize, I guess, that Munin is a fairly simple tool; it just takes a sample every five minutes. So you're getting a kind of averaged snapshot of how things look every five minutes. That gives you quite a lot of information, but it also means you're not getting the instantaneous picture of what's happening in between those five-minute intervals. And I think what is happening here, even though you don't see it, is that there are spikes where sometimes quite a lot of connections are asked for in a short time. That's simply something to do with the way DHIS2 works. I've increasingly come to understand that with a lot of the API calls, it may be a single API call, but it can sometimes result in tens or even hundreds of database queries. And that can mean quite a sudden rush of database connections being required, which wouldn't necessarily show up on here. Anyway, we'll look at that more when we get to Glowroot. The other plugins on here: the Nginx plugin is kind of similar. The kind of thing it does is just give you a good idea of what your daily load is like, your accesses per second, typically peaking at around 50. I find these kinds of long-term graphs quite useful for understanding how the system has been changing over time. Not hugely; it's gotten a little bit busier in 2023 than it was in 2022. And the load was getting up to about 70 per second earlier this month. I guess that's probably related to, because this is an aggregate server, this must have been the week when most of the reporting was happening. That's when you'd expect to see a higher load. There's a plugin that's missing on here that hasn't been installed, which is the Tomcat plugin. We need to talk about this: we haven't enabled the Tomcat plugin in Munin. The Tomcat plugin has some quite useful information to look at as well; unfortunately, we won't be able to look at it here now because it's not plugged in. Okay, so that's your very broad overview of what's going on; Munin gives you a reasonable picture of whether your system is functioning healthily. And I suppose my big advice when looking at Munin is that you have this handful of charts you know are useful, amongst quite a lot that are less useful, and it's always good to look at them. If you see something strange happening, then also look at what's been happening over a longer period of time. Has anything drastically changed? So, for example, I went to that disk; we were seeing this problem with swapping onto disk. So I asked, wait, is this something new or has it been happening forever? It looks like this is a problem that has always been there; it's not something that's recently developed. It's something that probably should have been investigated back here, as to why it was happening.
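As an aside on the missing Tomcat plugin mentioned a moment ago: Munin plugins on Debian/Ubuntu are normally enabled by symlinking them into /etc/munin/plugins and restarting munin-node. The Tomcat plugin names below are illustrative, they may ship with your munin-node package or come from the Munin contrib repository, and most of them need access to Tomcat's status page, so check what is actually available first:

```bash
# See which Tomcat-related plugins are available on this node
ls /usr/share/munin/plugins | grep -i tomcat

# Enable the ones you want by symlinking, then restart munin-node
# (plugin name here is an example; use whatever the listing above shows)
sudo ln -s /usr/share/munin/plugins/tomcat_threads /etc/munin/plugins/tomcat_threads
sudo systemctl restart munin-node
```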
It doesn't seem to be causing huge problems at the moment; there's quite a lot of excess CPU capacity on here, as you can see. All right, that's Munin. But the main thing I really wanted to talk about was this Glowroot. If you're looking for Glowroot, it comes from glowroot.org, and you can read a little bit about it there. What it says is that having it on your system causes very, very little overhead, and our experience has been that that is certainly the case. There's no great performance problem with putting the agent on, and as we will see, it gives you lots of information. How do you actually do it? Well, again, if you're using the kind of automated installation tools, it'll do it for you. But I can show you here, I think; let's just have a quick look. It's installed on the Tomcat container itself, and you'll find it at the place where all your JAVA_OPTS and things are set. With the standard Ubuntu or Debian installed Tomcat, that's in this file, /etc/default/tomcat9. And in here we can see that... I hadn't looked at this file before. Okay, I see it now: you've got 18 gigabytes of RAM allocated to the JVM, which seems to be about a good amount; we'll look at that in a bit. And this last line here: essentially all you have to do is put the glowroot.jar file somewhere and then add this line to your JAVA_OPTS, with -javaagent, and that will start up this Glowroot thing. It has a web-based interface; it listens on port 4000. I should be able to see that. Here it is: there's your Tomcat running there, and there's port 4000. You see it's the same PID; it's the same Tomcat, just listening on a different port. That's your Glowroot profiler. It does mean that if you want to look at these Glowroot charts, you do need to enable a location on your proxy server to access that port, and that port should be opened on the firewall. Yeah, we're allowing the proxy server to access port 4000 as well as 8080, simply so that we're able to look at Glowroot. Okay, so let's have a look at it. This is the main graph on the page you come to when you log into Glowroot. I have it showing the web transactions, and you can change the time period here, from the last half an hour or the last four hours up to the last 30 days. Again, like with Munin, it's good to see how we're looking today compared to the way things are generally over the month. And yeah, you can see that things were a bit quiet, then they got busier and busier still, and this week it's starting to get quieter again. I suppose that's because we've come to the end of the reporting period. General load on here: the throughput is getting up to about 3,500 a minute. A busy enough server; Ghana is a big country and I think their DHIMS system is quite big. On the left hand side is a list of all the API endpoints, ordered in terms of how much load they're putting on the server. So one thing we can see from here, over the last 30 days, is that 61% of the load is related to these analytics calls. One of the things you can conclude from that, if you're trying to just generally improve performance: there's a law about this, called Amdahl's law, which says that if you're trying to improve performance, essentially you want to find the thing that's consuming most of the time, because if you make that a little bit better, your whole system is going to be better.
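Going back to how the agent is wired in: for reference, a minimal sketch of what that looks like on a standard Debian/Ubuntu Tomcat. The heap size and the /opt/glowroot path are illustrative, not necessarily what is on the Ghana server:

```bash
# /etc/default/tomcat9 - append the Glowroot agent to JAVA_OPTS
JAVA_OPTS="${JAVA_OPTS} -Xmx18G -javaagent:/opt/glowroot/glowroot.jar"
```

And a hypothetical nginx location so the proxy can reach the Glowroot UI on port 4000. The path is an assumption; depending on the Glowroot version you may also need to tell it about the sub-path in its web settings, and you will want to restrict who can reach it:

```nginx
location /glowroot/ {
    # Only expose this to people who should see profiling data
    proxy_pass http://127.0.0.1:4000/;
}
```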
If you find something that's used very rarely, a very small part of the system, that happens to be very inefficient, and you improve that, then the impact on the whole system is going to be much less. So it's useful to see requests ordered like this, because if you were trying to do a general improvement, you would start here and say, well, analytics is obviously using quite a lot of CPU; anything we can do to improve that, we should, followed by these custom data set reports. That's much less than analytics, but it would be the next target to look at. So as you can see, Glowroot gives you a lot of information just on this front page. Looking at it not in terms of throughput but in terms of response time: you can see that the response times are generally quite flat and quite low, and every now and again we get quite big peaks. What we'd like to do, obviously, is reduce those peaks so that we have fewer of them. As I said, I did a little bit of a tweak on here last night in the hope that it will actually reduce them a bit. Things haven't been as bad the last couple of days as they were the week before. Last week, obviously, was a little bit stressful. We can have a quick look at what was going on last week. One of the things that's a little bit interesting here is that, again, a lot of the time for these long-running requests, they're not actually doing anything, they're waiting, and then there's quite a lot of JDBC activity going on. Usually, when you're seeing very slow response times, what you'd expect to see is very heavy JDBC queries. But as you can see here, the contribution of those JDBC queries is not so much. So if there's any slowness in this system, it's not really about slow queries; whatever's causing it to be occasionally a bit slow is related to something else. And it seemed to me that this waiting it's doing, and the time it's taking to get connections, is potentially a bit of a problem. Let's go back and see what's been happening today. What time is it in Ghana now? I think we're just about over the hump; a little bit of throughput again. It looks like we hit our peak time about an hour ago. Does that sound about right, Oswald? We're still busy, but kind of over the hump.

It's about 12 midday in Ghana now.

Okay. So it looks like things are slowing down a bit; we're just about over the hump. I think that's fairly typical. If I look at it over two days, yeah, a very rapid ramp-up of activity in the morning, and then by 12 o'clock, 12 o'clock my time, things start to ease up again. So the traffic pattern today is very similar to the traffic pattern yesterday. How has the performance been? Well, let's look at the response times today and yesterday. We had a nasty little spike yesterday around about there, then another nasty little spike a bit later in the afternoon. We haven't seen any of that yet today, so that's a good thing. We'll know a little better towards the end of the day how it looks. Okay, the other thing, if you want to get more of a general picture: the other thing that's worth looking at besides the web transactions, and I'm going to come back to the web transactions in a bit, is the JVM itself. By default it shows you this graph with the heap memory usage.
And you can see this heap memory usage is kind of bouncing around nine gigabytes, peaking up to about 11 or 12. We know this machine has 18 gigabytes allocated to it, so this is a very healthy place to be; there's no problem with the RAM allocation. If you do have problems with basically not having allocated enough RAM, then often the symptom you see is actually excess CPU usage. It's a little counterintuitive: you'd expect, if you see too much CPU being used, that you need more CPU. But often that's an indication that you're struggling with RAM. The reason is that when the JVM starts running out of memory, it has to get much more active with its garbage collector. The garbage collector becomes very busy, and sometimes you find that 90% of the CPU is just being used up trying to keep reclaiming memory. Those kinds of problems are quite easy to diagnose with Glowroot if you look at the amount of time the garbage collector is being used. These two graphs, the old generation and the young generation: you can see this thing is peaking at 5.8... this is milliseconds, so it's being used for about 5.8 milliseconds per second. That's not catastrophic at all; this is just the garbage collector doing its normal business. And the old generation garbage collector is hardly being used at all. So this is an example of a system where there are no real issues to be concerned about with memory allocation. You might even argue it's got too much: if you needed that RAM allocated to something else, maybe you could even give a little bit more to the database server and a little less to the Tomcat, and it would still run all right. You can look at the actual CPU usage on these graphs as well. Again, we're not going to see anything very interesting: the green and the purple, generally very low CPU usage. So yeah, the JVM graphs are generally very useful to look at when you've got problems with CPU and RAM; they'd give you an indication that things are not properly configured. In this case, fortunately or unfortunately, depending on which way you look at it, we've got no real traumatic things to look at here. Everything seems to be looking good. So let's go back to the transactions. I'm still half hoping that something bad will happen, but I'm also very happy that nothing bad is happening. What I was suspecting is that because the connection pool hadn't been changed from its defaults, and because this is quite a busy server (there were about 8,000 users the last time I looked, maybe more, maybe less now, but it's quite a lot of users, and we saw from the throughput that it has fairly heavy traffic), it probably needed a higher number of connections than what we were giving it with that default pool size of 80. So I changed that. And the fact that it's spending quite a lot of time waiting, that we're seeing little bits of time where it's taking a long time getting a connection, is sometimes an indication that the pool is not configured quite how it should be. I'll show you the changes we made to the pool last night. Let's first of all just look at the diagnostics. The other thing that's worth looking at is the slow traces. Now, slow traces can be a bit slow to open if there are too many of them, so I find it's sometimes useful to look at a smaller time period.
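Going back to the garbage collection graphs for a moment: if you wanted to sanity-check the same GC picture from the command line without Glowroot, the JDK's jstat tool gives a rough equivalent. A sketch, assuming a full JDK is installed and Tomcat runs as the tomcat user (the process match string is also an assumption):

```bash
# Find the Tomcat PID
PID=$(pgrep -f org.apache.catalina.startup.Bootstrap | head -n1)

# Heap occupancy and cumulative GC counts/times, sampled every 5 seconds:
#   YGC/YGCT = young-generation collections and total time,
#   FGC/FGCT = full (old-generation) collections and total time
sudo -u tomcat jstat -gcutil "$PID" 5000
```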
We're just looking over the last seven hours, so let's see what's been slow in the last seven hours. Here we can see that right through the night everything is fairly quiet. There are a few things here taking quite a long time: that took five minutes, this took five minutes. This must be something related to analytics. What are these things? Let's get into the detail of one. These are some very big analytics requests. I'm not sure of the detail of what it's about, but it's a heavy one; it's taking five minutes to run. And for this one, the reason for the long time is actually just the query itself. So this thing is doing quite a heavy query. I hope this is not a query that's going to happen too many times. If you suspect there's something you can do to improve the query, you can look here at the actual queries themselves, and you can see the culprit here is this one, this query here. For some reason it's slow. It looks like it's going over a number of years, yearly periods in the analytics: 2020 and 2019 and 2018. If we believed this thing was causing us a problem, then it would be a case of going to the database, taking this query as it is, running it through EXPLAIN ANALYZE and finding out exactly what's slow about it, and maybe seeing if there are ways it could be optimized. I can't give you a diagnosis here and now; I'd have to put it through the analysis and see. Again, we can make a note; it might be worth it. I don't know how often this happens, whether it's just a one-off. It happens at an interesting time, very early in the morning; somebody is running this. This thing here took a similar amount of time. It looks like the same query, obviously an analytics query, from somebody different, and it's taking quite a long time to run. So we've got a busy enough server, I guess, with some analytics requests that are very heavy. I made one suggestion last night about something to do to perhaps avoid the problem, if it happens a lot. A useful thing, if you do have a troublesome query and you want to raise it with developers: I think one of the most useful things about Glowroot is that it's a good way of communicating. If you have problems, instead of going back to the developer saying, oh, my system is very slow, you can tell them very specifically: I've got this query here that is very slow, is there anything you can do about it? Glowroot gives you a nice way of doing that. We can take one of these things here and say, well, this query is bothering me, it's very slow, can you analyze it? If you click on this thing here, download the trace, you can download the whole trace as a zip file, and then you can send that trace file for somebody else to look at. This is much better than taking screenshots. People often take screenshots, right? They take a screenshot of this and show it to me and say, Bob, what's going on? It's much easier if they just download the zip file and send it to me, because then I can go into the queries, expand things and see what's going on. Okay. One of the suspicions I had, and I'm going to look at this one now: this is a classic one here that's often very slow, this getMetadata action. Typically this takes about five seconds.
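If you did want to chase one of those slow analytics queries, the EXPLAIN ANALYZE step mentioned above looks roughly like this in psql. The statement below is only a placeholder; you would paste the real query copied out of the Glowroot trace:

```sql
-- Ask Postgres to actually run the statement and report where the time goes
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*) FROM datavalue;  -- placeholder; substitute the slow query from the trace
```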
One of the reasons I pick on this one is because when you look at the request, one of the things you notice is that it's only one request, but it results in 300 queries to the database. So there are a lot of little queries going on as a result of one request. You can imagine, if two or three of these come in at the same time, suddenly there are going to be a lot of attempts to grab database connections and execute them. So it's not so much that particular queries are slow. Well, in fact, one query is quite slow; I've seen this before, this data element select being slow. That's something to maybe look at. But the fact is also that there are many, many different queries, and sometimes with a flurry of queries like that, you may find that your database pool instantaneously becomes a bit clogged up. And that's the reason I made a few changes. I'm going to show you the changes I made in the conf file last night with Oswald, and we'll have a quick look again to see whether it's had much impact, and then we'll open up for questions. So, yeah, your dhis.conf file is in here. I'm going to try to avoid showing the database password, particularly as this is being recorded. Okay, I'm looking at the file from the bottom. One of the things that wasn't configured before was this server-side cache. This is something I'd strongly advise you should always enable, right, the server-side cache. Because basically, every time you make an analytics query, if you cache it on the server side, then the next time somebody makes the same query it can be served from the cache. And this is often the case: you go and open a dashboard and then you log out, and when you go back in again to the same dashboard, you're going to be making the same query again. So dashboards do cause the same queries to be executed many, many times. It's a really good idea to cache them like this; it reduces the strain on the database a bit. Oswald, I don't know, I don't have access to the front end, but we were talking last night. Did you find that your dashboards are loading a little bit faster this morning than yesterday?

Yes, I can attest to that one; it's doing a lot better.

Is it kind of noticeably different? Yeah. Good, because it should be if you enable the cache. It means those dashboards should be coming back now without always having to go back to the database to get the data. So that's one change we made. And definitely, this is an aggregate system; I know a lot of people are more interested in tracker performance problems, but today we've been looking at an aggregate system, and this thing can make a very big difference. As we saw on the Glowroot graphs, most of the heavy traffic on the server is analytics related, so anything we can do to improve that helps. The other change we made on here is just up here somewhere. Yeah, here. The pool size wasn't set, right? It was just commented out. You see the comment says default 40; that's actually wrong, the default is actually 80. I checked that the maximum connections configured on the database is 400, so we can easily handle more than 80 connections, and we increased that from 80 to 200. And then these two settings here: this is something I discovered some months back, I think it was looking at your system in Rwanda.
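Pulling the changes together, a sketch of what they might look like in dhis.conf. The last two lines correspond to the check-out/check-in testing behaviour explained just below; all the property names here are from memory, so check them against the DHIS2 system administration guide for your version before copying anything:

```properties
# Server-side analytics cache (responses expire after this many seconds)
analytics.cache.expiration = 3600

# Database connection pool: raise the maximum from the default of 80,
# staying well below Postgres' max_connections (400 on this server)
connection.pool.max_size = 200

# Don't test connections when they are checked out (cheaper under bursty load);
# test them when they are returned to the pool instead
connection.pool.test_on_checkout = false
connection.pool.test_on_checkin = true
```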
It was a case where you had an external database. One of the things the pool does is maintain this pool of connections, and when it takes out a new connection, it tests it to see whether the connection is good. The problem is, when you have a request like the one I showed you there, it's going to have 300 different queries executed off it. When a request like that suddenly requires a lot of connections, the process of acquiring new connections can be quite heavy. So one of the things you're able to do is say, well, don't test the connection when you check it out; just check it out and try it. If it works, it works; if it doesn't, it doesn't. But when you check it back in, in other words when you make it available for some other transaction to use, test it at that point instead, so that we're not putting bad connections back into the pool. So those are the three changes related to the database pool that we tried out last night, and we want to have a look through the course of today. So far this morning, everything looks okay. It's not dramatically different from yesterday, except that dashboards do seem to be loading a little bit faster, and we haven't seen any spikes yet, but it's too early to say whether we will or not. I need to keep an eye on this. Let's move this out of the way and go back. I mean, we're actually past the big peak at this time. If we look at the response times over the course of the day, an eight hour day, everything is pretty good, right? Averaging 100 milliseconds, no big spikes in the slow traces that we're seeing. The other thing worth looking at, in fact, is the percentiles. The percentiles just show you that 50% of the requests are down here; 50% of requests are getting a response in 7.9 milliseconds. Up to 95% of requests are getting a response time of 200 milliseconds, which is not bad. And we're seeing very few heavy ones; there are a few, but it's only in the 99th percentile, that 1%, where a couple of these go up quite a bit higher. We can see how that compares to yesterday; let's go back two days. And, yeah, so far it looks like we've managed to stay flat, even on those heavy ones. But as you can see from here, yesterday morning it wasn't actually looking that different either. So I'll be interested, when we look at this again in the afternoon, to see whether we've got rid of some of those. Okay, that's enough from me. We could talk about Glowroot and looking at different kinds of problems with it all day long, but I want to stop there and open up for any questions. And thanks to the Ghana Health Service team and HISP Ghana for allowing us to use their server as an example. If anybody else wants to volunteer theirs, in a subsequent week I'm very happy to do a performance review of it. Particularly if you have something which is creating particular problems, we can look at it together, and not necessarily give away any data. All right, any questions?

Hi Bob, this is Mahindra, and I have recently joined HISP India. So actually my question is that I have set up one server, and there is no LXC container; I have set up the DHIS2 application as per the documentation, and I have also set up Glowroot over there as per the instructions.
But as this is a blank database, there is no data right now, and I'm not able to see anything, like /api/37 or analytics, the data values, data elements, those kinds of requests I'm not able to see over there.

Well, if you've just set up a blank database... Yes, yes. Then, yeah, you're not going to be seeing any traffic. Okay, okay. That's why, for this demo today, I had to ask permission to look at the Ghana one, because if I just set it up on my laptop or something, it works fine, but there's nothing really to see. So what I'd suggest is that you put Glowroot on one of your production servers. Okay, fine, thanks. If it's one that you know is having some trouble, then you can share what you see.

A question, a question. Yeah. Well, thank you for your presentation. I have two questions. First, can we set these environments up in a way that they can be centralized? What I mean is, suppose an environment where you have multiple servers, and maybe the system administrator wants to get all the information from multiple servers in just one place. Is this possible? Secondly, what is your take on the difference between this and Prometheus and Grafana? Thank you.

Okay, two good questions. I'm not sure which to take first. Yeah, Glowroot can be set up in a centralized way. You need to look at the documentation; it's in there somewhere, but you can have a centralized collector for Glowroot and have everything centralized to one Glowroot panel. It's not something I've done often, because it only makes sense if you're managing quite a lot of systems; I know in Rwanda you are now managing quite a lot. And it tends to be particularly JVM related; all of these metrics, in fact, are just coming off JMX. On the question of Prometheus and Grafana, what I will say is that they're both very good looking environments, particularly Grafana, which gives you very nice looking dashboards. There's quite a lot of configuration involved, though. Packaging something out of the box that gives you good value is quite hard to do in a general way. I think if you're setting it up in your own environment and you have time to tinker with it and customize it, it's possible to do it quite well. You can get all of these same metrics, for example; all the metrics you're getting here off Glowroot are just coming out of JMX, and you can query your JMX with Prometheus. But it would take a lot of effort and a lot of work to put together views like what you're getting off this one. So, yeah, my view of those monitoring tools: I like them a lot, but it takes a lot of effort to configure the environment properly. One of the reasons we have Munin running, for example, is not because Munin is very beautiful; Munin is quite horrible, I think. But it's really easy to set up, and so almost for free you get lots of useful information. One of the things I know Tito has been looking at is also Zabbix. I quite like Zabbix; it's much better looking than Munin, but there's a little bit more configuration work required to get a good environment going. Prometheus, I think if you've got a very Docker-centric environment particularly, with everything running on Docker, then Prometheus works quite well. It doesn't require Docker, obviously, but it fits into that paradigm quite well.
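On the centralized setup mentioned in the first answer: the usual pattern, as I understand it, is to run the separate glowroot-central collector and point each agent at it from its glowroot.properties file. The property names, the agent id and the port below are from memory and should be verified against the glowroot.org documentation:

```properties
# glowroot.properties, placed next to glowroot.jar on each monitored server
agent.id = ghana-dhis2-prod
collector.address = glowroot-central.example.org:8181
```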
We have some metrics exposed on DHIS2 for Prometheus. I had a look at some of them a few years back; I don't think there's anything particularly there that's going to give you more information than you're getting off this and the proxy log file. So I haven't put too much effort into it yet, but I'm very happy to look at anything people might have done in the community, if you've set up a nice Prometheus and Grafana environment. The other one which is very popular is the ELK stack: Elasticsearch, Logstash and Kibana. I think Danny and the Sierra Leone people are very fond of the ELK stack. Manzi, that's it. I think they're very nice, but they're quite hard to package in a way that comes pre-configured out of the box. Thank you. Let's see who else is on here.

Okay Bob, I've got a couple of queries. Do any of these systems have a notification or alert system, to send something out if anything out of the ordinary has been recorded? Yeah. So who's this? Sorry, it's Damian from MSF. Hi there. Yeah, alerts is a really important thing. I've actually not checked whether you can do alerting off Glowroot; it's more of a diagnostic tool than a monitoring one. Munin, for sure. If you look at Munin, you can set thresholds on all of these metrics. You can see there are six criticals; something has gone over a threshold here. I don't know what it was. Yeah, that's a usage threshold on this thing, which is a virtual disk anyway, so it doesn't make much difference. And this is the same thing; it's just this directory causing an alert. So in Munin you can configure thresholds on all of the metrics, and the default way of doing the alerting is through email, so it's a little bit old fashioned. That's the way I've done it in the past. It would be possible, I suppose, to send those emails to a kind of custom SMTP gateway where you could convert them into webhooks or something else, but yeah, the standard alerting mechanism in Munin is email. Alerting is a general topic, and I think we need to talk some more about it, because in the past we've tended to advise folks to set up email alerting for all kinds of things. Email is a little bit tricky for folks to set up, I find, in terms of making sure their DNS settings are correct, and the DMARC settings, and making sure stuff doesn't end up in spam. I'm quite interested in using things like Telegram for alerting, because Telegram bots have quite a low barrier to entry; they're quite easy to set up. But it would then be a matter of linking these monitoring systems into your alerting system. So yeah, my answer is I'm not sure. I think I read somewhere that you can do alerting off Glowroot; I've never done it, but I think you can. Munin, yeah, certainly alerting is built into it, but it's email alerting. Okay. You highlight an important topic we need to talk some more about, which is setting up alerting systems of various sorts. Yeah, because this requires some manual looking; it's not really an automatic feature. Yeah, we look at it day by day. Yeah, and it's certainly not adequate if you're running a lot of systems. Yeah. And here in OCB we have a lot of Dockerized servers arranged, so we have at least four or five environments for HIV. So yeah, it would be good for us as well. Yeah.
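For the email alerting described here, Munin's thresholds and contacts live in /etc/munin/munin.conf. A rough sketch; the contact command and the exact field names (df._dev_sda1 and so on) are illustrative and have to match what your munin-node actually reports:

```ini
# /etc/munin/munin.conf
# Define a contact that mails alerts out
contact.admins.command mail -s "Munin alert on ${var:host}" alerts@example.org

# Per-host warning/critical thresholds on individual plugin fields
[ghana.example.org]
    address 203.0.113.10
    df._dev_sda1.warning  80
    df._dev_sda1.critical 92
```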
The Glowroot stuff is essentially just using JMX, and there are all kinds of agents you can use to get alerts off JMX metrics. Even Prometheus, for example, will gather your JMX metrics, and you could configure alerting off that. Glowroot is more of a diagnostic tool, I guess, than a monitoring tool. I go to Glowroot when I have problems; I usually look at my Munin to see whether my system is being well behaved or not. Okay, good to know. Yeah, and you can configure Glowroot within your Docker container with your Tomcat as well; it just means you'll probably have to make some adjustments to your Tomcat configuration. Okay. So I see the time is up.

Yeah, can I ask a last question? Yeah. I want to know if there is a best practice to set up a password on the Munin monitor. There are what? Sorry. To set up a password on the Munin monitors. Yeah, you do that through your proxy. It's not there by default; you have to configure basic auth on the proxy. I'll make a note of that afterwards, to make sure there's some documentation on how that should be done. We do have a script that does it automatically; maybe we should integrate that into Tito's Ansible scripts, so that it has a password by default. Yeah, if you just install Munin the basic way, it's going to be open, which is generally not what you want. I can show you how to do that, because I'll show you the little configuration snippet that you need.

Okay, Bob, may I just ask one question? So, who is this? Yeah, this is Gerald. Yes. So the question I want to ask is, in a case where there is a spike and you're probably not aware of it: for instance, it took me quite some time to notice that somebody was actually doing a query of the whole database. Right. I looked through the Tomcat catalina log and I couldn't actually see what was going on. But when I looked through Glowroot, I found out that there was somebody doing a huge query, and that query was actually taking the system down. So in such a situation, is there any way we can configure a notification system? Let's say, for instance, when a request takes some amount of time, say 10 minutes, it sends a notification that there is a query or something you need to look at. And I think there are a couple of systems now that are integrated with Telegram. Is it possible, if it's not only email, for us to integrate it with Telegram, so that it just sends you a notification that something is going on with your server? I don't know if that is possible using Glowroot or Munin.

Yeah, I think there are two things there, Gerald. First of all, the Telegram thing: I think we want to make a more generic mechanism, to make it easy for people to set up Telegram bots for getting all sorts of alerts related to DHIS2. So yeah, that is an active area we're looking at currently. And then, particularly on long-running queries, yeah, that would certainly be something you could get alerts off.
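Coming back to the earlier question about putting a password on Munin: the kind of proxy snippet mentioned above would look roughly like this on nginx. The paths are the Debian defaults and the htpasswd file location is an assumption; Apache has an equivalent with its basic-auth module:

```nginx
# Create the password file once:  sudo htpasswd -c /etc/nginx/.munin_htpasswd admin
location /munin/ {
    auth_basic           "Monitoring";
    auth_basic_user_file /etc/nginx/.munin_htpasswd;
    alias                /var/cache/munin/www/;   # default Munin HTML output on Debian
}
```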
Getting an alert to say you have a long-running web request is sometimes a little bit hard, though, because you only know it ran for a long time after it's finished. Sometimes what's easiest to look at is in fact not so much the web request; usually the kind of scenario you're talking about results in a long database query. One of the useful graphs on the Postgres plugin in Munin is the long-running queries. Where are we? You can see here, you have queries which take quite a long time. What I've seen in some cases, and I think you would have seen it in your case, is a shape like a triangle, like this: a query which just keeps going and going and going. One of the things we could do in Munin is put in an alert here, right, a threshold on your queries. These ones are not bad enough... okay, some of them have taken like five minutes. You could put in an alert for when you've got queries running for more than three minutes or something. You want to make sure you don't end up with too much of that. The important thing, I guess, in your Postgres configuration is to make sure you log those queries, so that if you get queries running over a certain amount of time, they get logged. But yeah, as for the best place to get the data from: if you want it live, you would have to query the JMX, or you can get it from the pg_stat_activity table in Postgres. Postgres will tell you, in a live sense, which queries are currently running and how long they have been running for. Or you can get it after the fact from your proxy logs. The advantage of the proxy logs, and again you could have a script that does this and sends you an alert if any request has run for more than, say, four minutes, is that you also get information about potentially the username, certainly the IP address, things like that. But I think the important thing in your case, Gerald, was that this had been happening and you didn't know it was happening, if I remember correctly. And that was partly because you didn't have any kind of general monitoring solution there at all. We'll continue to look into it and see what we can do.

Yes, but basically, yeah, Glowroot was extremely helpful, because we had this challenge. It seems that somebody had a third-party layer, and the queries were coming from that third-party layer onto our system through the API. So regularly, at a specific period of the day, that query was being executed, and we were struggling to find out what the problem was. The first thing we noticed was that the RAM that was allocated was not sufficient. And then, secondly, we were able to identify the user and eliminate that user. So that was extremely helpful.

Yeah. I mean, often this kind of detective work is going to be multi-layered. Often, in my case, I find Munin gives me a first indication that something doesn't look right, Glowroot gives me a more detailed picture of what exactly is not right, and then you look in the logs from there.
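To make the pg_stat_activity option concrete, a live check for long-running statements might look like this in psql; the three-minute threshold is just an example:

```sql
-- Show statements that have been running for more than three minutes
SELECT pid, usename, state, now() - query_start AS running_for, query
FROM pg_stat_activity
WHERE state <> 'idle'
  AND now() - query_start > interval '3 minutes'
ORDER BY running_for DESC;
```

On the logging side, the standard Postgres setting for "log anything slower than X" is log_min_duration_statement in postgresql.conf, expressed in milliseconds.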
If Glowroot is telling me that a particular request is causing problems, then I need to go and look in the proxy log to find out who initiated these requests, whether they're happening regularly, whether it's a machine, what the user agent is, that kind of thing. We'll spend some time, maybe in a subsequent session, talking about log files and how to make sense of them, how to analyze them. Okay. We're a good bit over time now, so I'm going to have to leave it there. I can see a spike just beginning to happen in Ghana as we talk.

Yes. So thank you Bob, that's been a very informative session. On a lighter note, we are planning for the integration academy and server administration academy in Rwanda in the next, I think, two months, March 22nd to 23rd. So I'll share the links in the chat for those who are going to be interested.

Okay, we're totally interested, but for some of us, like I mentioned the last time in the group, it's going to be a personal improvement. And when you consider the cost, that's almost like $5,000, right, based on all the expenditure that has to be budgeted. I'm looking for a partner who might support it, but is it possible for us to have a discount, for instance, for those who are doing both? If you're going to do both, is it possible to have a discount? I'm just pleading for myself.

I don't know. You do save on one set of flight tickets; you don't have to go to two academies. I don't know, Gerald, I don't do the money, unfortunately. These are the kind of questions we need to raise with Alice. I don't think anybody's raised this before, the idea of a discount for doing the two. I'm not sure; I think these are conference packages, so I'm not sure how much room for manoeuvre there would be. Okay, okay. If there is a discount, I'll let you know. No problem, no problem.

It would be interesting if it were a quasi conference, in person and virtual as well, with different cost structures. Yeah, I think what we might try to do, because we know a lot of people won't be able to get there, is run some sessions virtually as well, and I don't think there's any reason to have any cost associated with those at all. So we'll try to have some free virtual sessions. But I don't think we can do the whole thing virtually, because you kind of lose some of the benefit of being together when you're also trying to manage the Zoom and everything else. Yep. But we can talk more about that in the weeks to come.

Thanks Bob for that, that was really interesting from my perspective. Okay, thanks guys for attending. If you have good ideas or things you want to look at for next week, let us know; we're trying to build out a calendar of things to talk about over the months ahead, and we're always happy to get suggestions. I'm interested in what's going on with this. Okay. Thanks Tito. Thank you, thanks for inviting me. Thank you. Thanks everybody, thanks so much.