Morning everyone. For those of you who don't know me, I've been dabbling with Erlang for well over 20 years now. If you go back to the mid 90s and you wanted to work with Erlang, you had one potential employer you could go to, and that was Ericsson. So there I was, straight out of university, at that stage in life where you think you know absolutely everything, until you learn something new and then you know absolutely everything again. I started off at Ericsson passionate about Erlang, wanting to work with Erlang, and what do they do? Very, very large-scale telecom infrastructure projects. These are projects with such a long life cycle that you basically never get to see your code go into production. In 1996 I was working on Ericsson's ADSL system, its broadband system, and broadband only started rolling out in 2000, 2001. I had colleagues working on GPRS, which eventually became 3G, which eventually became 4G LTE. And even there, once you had committed your code to ClearCase (and yes, believe it or not, there was a time when ClearCase was hip, trendy and cutting edge), that was it. You maybe got a bug report back from your testing department, you fixed it, and, well, you didn't commit your code, you checked it in, and that was it. You never actually got to experience the life cycle of your code running in production.

One of the things I didn't quite understand at Ericsson is that on these large projects, projects with 50 to 100 developers, we were given very strict guidelines we had to follow, and we did as we were told. It wasn't until about a decade later, 2003, 2004, when I was on call 24/7 supporting a system, that I got my aha moment as to why we had been given those guidelines. A quick show of hands here: how many of you have had to wake up in the middle of the night to handle an outage?
So, quite a few. Now, I've been on call 24/7 and I did not raise my hand, and the reason is that in this system, a system I had written myself, I applied what we had learned at Ericsson. It was a large-scale messaging system, which I will refer to as we go along. Now, you all know telecoms: there's one thing you can be sure about, and it's that when you pick up the phone you will hear the tone at the other end. It's almost certain, and if it isn't, back then we knew it would make the front pages of the newspapers. It was the law: the legislation stated that your telecom infrastructure could not go down. Even through a hurricane or a natural catastrophe, you still had to be able to dial the emergency services. And secondly, there were massive penalties inflicted on the infrastructure providers if there was an outage and it could be proved that the outage was the fault of the infrastructure provider. Those two things meant that you shipped code which gave you five nines availability; five nines is about five and a half minutes of downtime per year, and that includes all software maintenance. What applied back then still applies pretty much today, no matter whether you're dealing with online gaming, financial systems, e-commerce sites or messaging solutions. The only difference is that the newspapers have been replaced by angry users and social media.

The key to achieving five nines availability is making sure that you have a system with no single points of failure. Do we have any New Yorkers here? No New Yorkers. Does anyone recognise this building? What is it? It has become a bit better known recently: this is the old AT&T Long Lines building in lower Manhattan, 33 Thomas Street, in Tribeca. It is actually an example of brutalist architecture: it has no windows, it is clad with Swedish granite, and it was built to withstand nuclear fallout. And what did they put in this building? Phone switches. It opened in 1974 to house AT&T's phone switches; at the time switches were mechanical and you needed a lot of space. The basement of this building has a fuel tank which will power the generators for two weeks after an electricity failure, so you can go and cut the power lines to this building and for two weeks it will still continue switching your phone calls. It was actually the only building south of Canal Street which was still fully functional after 9/11: there was no water supply, there was no electricity, but you could still dial the emergency services and get through.

I think the bottom line here is that you need a system which is fully resilient, which never fails and never stops, and for that you need at least two of everything. In this particular example that means redundant power supplies, but you also need redundant hardware, redundant networks, and if you're using cloud providers you basically need to deploy across heterogeneous clouds. Just look at what happened at Amazon earlier this week.
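To put a rough number on that five nines figure: a year is 365.25 x 24 x 60, roughly 525,960 minutes, and 525,960 x (1 - 0.99999) is about 5.3 minutes, which is where the "about five and a half minutes of downtime per year" comes from.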
And even then, having no single points of failure is not always enough. There was a phone switch which had been switching the international trunks for a major telco in a city of eight million inhabitants, and the switch had been running for three years without anyone having to touch it. The software was so solid that in those three years they never did any upgrades and never had to reboot the system. After three years Ericsson went to this customer and said: listen, we're aware that you've not had any issues, but we need to upgrade, because we're no longer maintaining and supporting this version of the code; they were three releases behind. So instead of doing a live upgrade, which is what they would usually do, they said: let's just reboot one node. We'll fail the traffic over to the standby node, reboot one node, bring it up to the latest release, fail the traffic back to the newly upgraded node, and then upgrade the standby node. So they went in, did the upgrade on one node, tried to reboot it, and the system wouldn't start. That particular board would not start at all. All of a sudden they panicked and decided to upgrade the standby node, which had become the primary node: take it down, it takes two minutes to reboot, so for two minutes users won't be able to make any international calls, we can live with that; it's the middle of the night, it's 2 a.m., hopefully no one is going to notice. They went in on the standby node which had become the primary, upgraded the software, tried to reboot, and once again the node wouldn't start up. For one week they tried to figure out what was going wrong. This was a foreign country far, far away, and a colleague of mine ended up flying to this country smuggling boards in his jacket; he actually had them in his jacket walking through customs, because had he FedExed them they would have been stuck in customs for weeks. He put in the new boards and got the switch up and running again. In the post-mortem analysis they took the hardware apart and realised that the boot sector of the hard drive is the outermost sector. These were hermetically sealed hard drives, and the fact that the system had not rebooted for three years meant that a thin layer of dust had collected over the boot sector of the disk, so three years later, when they finally went to access that sector again, the head just got stuck on that thin layer of dust.

This story is there to remind you that it's not always about the software, and that having no single points of failure is often not enough. In this talk, though, I will be focusing on the software itself. If you look at the various studies out there, they pretty much all point to the same thing: somewhere between 60 and 80 percent, depending on who you believe, of the cost of the whole software life cycle is maintenance. My argument here today is that this total can be reduced drastically if you spend more time in your requirements, specification, design and coding phases looking at monitoring, at the actual visibility you have into your system.
And remember, I used to write the code, throw it over the wall, commit it to ClearCase, and not think about monitoring. I think most developers think and reason that way, because they're not the ones who get woken up in the middle of the night to clean up the mess they've created; and often, if you are, you make sure you learn, and you make sure you're not the one who has to deal with it next time. By investing more in monitoring, what you do is get visibility into what's going on in the system: you collect a lot of information, and you get to act on that information. You use it for two reasons. The first is preemptive support: you try to detect issues before they escalate and cause an outage. And secondly, you will have outages, you will have issues, it's software you're dealing with, so the second part is figuring out, when something went wrong, why it went wrong, and then putting in the early warning signs and the fixes to make sure it never happens again: what we tend to call post-mortem debugging. This is the secret sauce for achieving five nines availability: monitoring, combined with no single points of failure.

Now, when I talk about monitoring, there are three things which you need to think about. The first one is metrics. Metrics are usually obtained by polling a particular value at a point in time. Take memory, for example: you might poll how much memory the BEAM virtual machine is using at this point in time, or how many active user sessions we currently have in the system. Another value is how many login attempts failed in the last hour; maybe you're trying to discover someone attempting to hack into your email accounts (it has been known to happen) and you want to see the failed login attempts.

The second kind of data you want to look at is logs. A log is an entry in a file or a database that records an event or a state change in your environment. It could, once again, be a user logging on, where you log who the user was and whether the login succeeded or failed. It could be logging an operator typing something into the console: first or second line support typing something at 2 in the morning which, as a result, causes an outage; when you're investigating that outage you will want to know exactly what happened in the lead-up to it. It could be logging a network partition or other networking issues. So it's a mixture of system events and events in your business logic.

And thirdly, and this is something I've rarely seen used outside of the telco space, alarms. Alarms are a subset of logs which have a state associated with them. That state means you can raise an alarm when you're in danger of something happening, and clear it when you're no longer in danger. The clearing usually happens as the result of a resolution, which can be either manual or automated. So say you've got your system and all of a sudden you get an 80% disk full alarm: you're starting to run the risk of running out of disk space, so by raising an alarm you alert someone in your DevOps team and remind them that it's time to do some housekeeping, to go in and delete the temporary files and the logs you no longer need.
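To make that concrete, here is a minimal sketch of what raising and clearing such an alarm can look like on the BEAM, using OTP's built-in alarm_handler from SASL. The module name, the 80% threshold and the check_disk/1 helper are illustrative assumptions, not something from the system described in the talk:

```erlang
%% Sketch: raise and clear a "disk almost full" alarm via SASL's alarm_handler.
-module(disk_alarm).
-export([check/1]).

-define(THRESHOLD, 80).  %% percent used; arbitrary example value

check(Mount) ->
    Usage = check_disk(Mount),
    case Usage >= ?THRESHOLD of
        true ->
            %% A real implementation would remember that the alarm is already
            %% raised instead of setting it again on every poll.
            alarm_handler:set_alarm({{disk_almost_full, Mount}, Usage});
        false ->
            alarm_handler:clear_alarm({disk_almost_full, Mount})
    end.

%% Hypothetical helper: in practice you might use os_mon's disksup
%% (disksup:get_disk_data/0) or shell out to df to get the real figure.
check_disk(_Mount) ->
    83.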
Another type of alarm is usually triggered by thresholds. Assume your system allows a maximum of 10,000 simultaneously connected users, and you know that if you go beyond those 10,000 users they're going to have a much worse experience: a degradation of their experience, the latency goes up, the whole user experience goes down. So what do you do? Either stop further users from coming into the system, or raise an alarm when you hit 10,000, keep allowing users in, and go and deploy more hardware, deploy new instances which can handle the extra load. You monitor a metric, the number of logged-in users, and if you hit a certain threshold you trigger an action, which could be automated or manual.

Now, when we look at these metrics, logs and alarms, there are two kinds we need to think about. One is system metrics, metrics related to the system itself: once again memory, or operating system and networking specific metrics. The others are business metrics, which relate to your specific application: the number of users logged on, the number of failed credit card transactions, the number of failed login attempts, the length of a particular session, and so on. And there are different kinds of users who will use these two different kinds of metrics.

To give you a few examples: one of the metrics to monitor is memory utilisation. How many of you who have a live system actually monitor your memory utilisation? Quite a few. Do you monitor just the total memory utilisation of the BEAM, or do you monitor the individual memory types? Total memory. Which is great, because it's better than nothing; it's similar to documentation, a little documentation is better than none. But we monitor absolutely everything. In this particular case we monitor how much memory the atom table is using, how much memory Erlang term storage (ETS) is using, how much memory your binary heap is using, how much process memory is being used by all of your processes, how much memory the code is using, and the system memory as well. So if there is a memory leak, we know exactly where it is and what caused it.
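For reference, those individual memory types map directly onto what erlang:memory/0 reports, so a minimal polling sketch, feeding whatever metrics pipeline you happen to use, might look like this:

```erlang
%% Poll the BEAM's memory breakdown. Every key below is one of the
%% categories reported by erlang:memory/0 (figures are in bytes).
memory_metrics() ->
    Mem = erlang:memory(),
    [{Key, proplists:get_value(Key, Mem)}
     || Key <- [total, processes, system, atom, binary, code, ets]].
```

Run something like this at a regular polling interval and ship each key as its own time series, rather than graphing only the total.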
If we look at this graph, you see a little bump right here; you probably can't read it, it's a little bump down here which then gets reflected up at the top. The top line is the total memory usage, which is what you monitor. Now, if you only see that little bump in the total, you won't have a clue what caused it, but by looking down here we see that it's the code memory which bumps up, and the system memory. What that means is that on the night of the 5th of November we did a software upgrade: we loaded new modules into the system, then monitored the system to ensure it was stable, and closer to midnight we made the upgrade permanent, meaning we purged all the old versions of the modules and made the system run on the latest versions.

Now, what was interesting here is that within three or four days the amount of process memory increased by about 50%. So why is that happening? Is this increase going to continue until we run out of memory? Could it be that we need to trigger a forced garbage collection of some processes? Processes will only garbage collect when they need the memory, so there will be cases where they are sitting on a lot of memory they don't actually need, and you might want to go in and force a garbage collection. Or could it be that we've simply got many more processes, since every user session could be a new process, and that's why the memory is increasing? So what you then do is go and look at the number of processes in the system, and look at how often garbage collection is triggered; these are also metrics you can monitor.

And here's another example. What this example measures is the total message queue length of all the processes, taken at regular polls, I think about once a minute. Once a minute we take the sum of the message queues of all of the processes. We found this when we were stress testing a system, trying to reproduce an error which had caused a crash, and we managed to reproduce it: the system had run out of memory. Now, if the system runs out of memory, you tend to get a massive crash dump, which can sometimes be gigabytes in size, so you have a lot of data to look through; not recommended. In other cases you don't get the crash dump at all, because you might have a heart script in the background which tries to restart your BEAM virtual machine, and whilst the BEAM is writing the crash dump the heart script says: no, no, we need to restart the system as quickly as possible, the BEAM is hanging. It unconditionally terminates the BEAM in order to restart it, while what the BEAM is actually doing is trying to flush a massive crash dump to file, and it fails to do so. So you get a crash dump which is zero bytes long, and you don't get the information you're looking for.

In this particular example we ran out of memory, and we saw that the process memory was spiking. So the first thing we did was look at the message queues, and we saw that right in the run-up to the crash we had over 154,000 messages in the process queues. The node crashed, it was automatically restarted by a script, and we had another run-up right there where the queues went back up to about 40,000, probably as a result of the recovery. It took a few hours, and after a few hours the system stabilised itself again and the message queues came back down to the level they were supposed to be at. Are you following me? So these are all things which you need to keep in mind and monitor.
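The poll behind a graph like that can be as simple as summing message_queue_len over every process. A minimal sketch; in a real system you would also record the worst offenders, not just the total:

```erlang
%% Total message queue length across all processes, plus the longest
%% individual queue. Processes that die mid-scan are simply skipped,
%% because process_info/2 returns undefined for them.
message_queue_stats() ->
    Lens = [Len || Pid <- erlang:processes(),
                   {message_queue_len, Len} <-
                       [erlang:process_info(Pid, message_queue_len)]],
    {lists:sum(Lens), lists:max([0 | Lens])}.
```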
Now, there are several different types of counters. You've got counters, which usually allow you to increment and decrement a value. You've got gauges, where you go in and poll a value at a particular point in time. You've got histograms, which are readings over time: what is your latency, what is your throughput in requests per second, per minute, per hour, per day. And finally you've got spirals, which are counts over a sliding time window, such as how many logins I had in the last hour, and which move along as your timeline moves.

Going in and investigating this crash, we started looking at what had caused this massive message queue in the run-up to it. What I'm talking about here is that message queue right there, on one of the four nodes we were monitoring. This example comes from MongooseIM, which is an instant messaging server, and this was some of the live data. We saw that after the crash the message queue on one of the servers was up at around four thousand messages, so it was a really long message queue, and the recovery and restart were really, really slow: it took four hours to recover from the backlog of messages after the crash. That backlog was about half a million users who, because of the node crash, had been kicked off the messaging system; their socket connections had been terminated, so upon setting up the socket connections again we had to go through the whole login procedure for each of them, and it took a long time. It was a really slow recovery after the crash. Just to make sure there wasn't a bottleneck elsewhere in the system, we also checked the transactions: every login usually entails a Mnesia transaction, which is a very expensive operation, and we saw down here that the total number of simultaneous ongoing transactions was not spiking; it was fairly low. We also saw it here, in what we call an incremental counter, a counter which only increments, which only goes up, showing the total number of transactions happening in the system: after about four hours we hit about half a million transactions, right there, and that meant half a million users had logged on again and were active. So this is the level of visibility you usually need when you're trying to figure out what has gone wrong and what caused an issue. I'll come back to counters a little bit later.

Alongside counters, another critical tool for figuring out what has gone wrong is logs. A log will usually reflect a system event in the Erlang virtual machine, in your network, or in your operating system, but it could also be an event which triggers a state change in your business logic: a user, once again, logging on, logging off, or being denied access to the system. You tend to place your logs either in append-only files, in wrap-around logs, or as entries in a time series database. And here is an example which I think you also have available from Elixir: the SASL logs, SASL being the System Architecture Support Libraries. As soon as something goes wrong in a process, an error report is logged, and the error report contains, among other things, what caused the exception. If the exception then results in the process crashing, the process itself (and this comes from the OTP behaviours) will create a crash report, which once again gets written to your SASL logs if you've got your configuration flags set correctly. The supervisor then picks up the fact that the process has terminated abnormally and issues a supervisor report: I've just trapped an exit signal from this process, it has crashed, these are the reasons, this is when I started it, and this is all the data I know about it; after which it goes in and restarts it. So you might not be aware of it, but just by setting a few configuration flags, your system will log every time one of your OTP behaviours terminates abnormally.
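As a sketch of what that can look like in practice, assuming the classic SASL setup (newer OTP releases route these reports through logger instead, but the idea is the same) and with the path and sizes made up, you persist the reports to disk and browse them with rb, the report browser that ships with OTP:

```erlang
%% sys.config: write SASL crash, supervisor and progress reports to a
%% wrap log on disk instead of (or as well as) the shell.
[{sasl, [{sasl_error_logger, false},
         {errlog_type, all},
         {error_logger_mf_dir, "/var/log/myapp/sasl"},  %% example path
         {error_logger_mf_maxbytes, 10485760},           %% 10 MB per file
         {error_logger_mf_maxfiles, 10}]}].
```

```erlang
%% Later, in a shell on the node (or pointed at a copy of the log directory):
1> rb:start([{report_dir, "/var/log/myapp/sasl"}]).
2> rb:list().      % list crash reports, supervisor reports, progress reports
3> rb:show(3).     % show report number 3 in full
```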
Now, it is not good enough to have these logs: you need to be able to react to them. We had a customer, where we went in and did a site audit, who had about 30 or 40 services, all implemented on the BEAM, and a total of about 200 Erlang nodes out there running, 200 VMs. The only way for them to realise that a process had crashed and been restarted was to log on to the machine, log on to the Erlang shell, start the report browser in the shell, and find the crash reports. So obviously, when you ask "have you had any issues, have you had any problems?", the answer is "oh no, everything's working fine", while somewhere, some users were getting back an error when a process crashed and having to retry. It was a very small subset, but it was still happening, and they weren't aware of it. Your logs need to be analysed, so always aggregate them and consolidate them: if you have crash reports, push them off to Loggly or Logstash or whatever system it is you're using to aggregate your logs.

Once again, just as with metrics, there are two different types of logs: you've got your system logs, and you've got logs which pertain to your business. Especially if you've got a mission critical system which deals with money, with financial transactions, or with anything which requires an audit trail or billing, you need to put in place an audit trail describing what happens.
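One way to get that kind of audit trail on the BEAM is to mint a unique identifier the moment a request enters the system and attach it to the log metadata of the process handling it, so that every subsequent log line for that request carries it. A minimal sketch, assuming the OTP 21+ logger API; the function and field names are illustrative, not from the system described in the talk:

```erlang
%% On ingress: stamp the request with a unique id and attach it to this
%% process's logger metadata, so every later log call carries it.
handle_request(Msisdn, Payload) ->
    ReqId = erlang:unique_integer([positive, monotonic]),
    logger:set_process_metadata(#{request_id => ReqId, msisdn => Msisdn}),
    logger:info("request accepted"),
    Result = do_business_logic(Payload),   %% placeholder for the real work
    logger:info("request completed: ~p", [Result]),
    {ok, ReqId}.
```

For the identifier to actually appear in the output, the handler's formatter template has to include the metadata fields, and when the request hops to another process or node the metadata has to be passed along explicitly.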
There was one time when the system I was supporting, which never woke me up, got me an email from support saying: this is a very, very angry user. They had subscribed to a premium service where they would have the football scores (soccer here in America) texted to them, and they claimed they only received the scores after the match. Now, this system I was supporting had about a million SMSes going through it every day. A million SMSes, and this is one user complaining "where is my text?". OK, what's his number? First thing: when a request enters the system, we log it. We logged a timestamp, we logged the phone number, we logged the text which had to be sent to the user (it was encrypted, so we couldn't read it, but it was logged), and we also logged a unique identifier for that particular request, and that unique identifier followed the request throughout the whole system. So I found three texts which had to be sent to this user, and looking at the timestamps you could guess it was first goal, second goal, final score. I took the unique identifier of the first request, the one generated by our system, and used it to look at all the logical checks we did on this user. Was this user prepaid or postpaid, that is, are they paying in advance? Because if so, 90% of all premium rate texts to prepaid users fail, because they don't have enough credit on the account. No: he was postpaid, he was allowed to receive premium rate SMSes, and his account wasn't blocked, it was still active. All of this was logged in the second log. I then took that unique identifier and looked at where we sent the request on to the SMSC; we sent it off to the network, and the network returned a unique identifier of its own: this is the identifier you can use to track this particular text. I took that identifier and looked at the logs for the delivery reports, and in there I immediately found a delivery report which said "handset detached", meaning the phone was either turned off or out of coverage. The retry interval was 30 minutes, so every 30 minutes there was a retry, and every retry failed because the handset was still detached, until after the match, when the handset became reachable again and all three SMSes were delivered in quick succession. It took me a minute, a minute and a half, to go in and prove my innocence, literally. Can you imagine: had we not had these logs, you would have gone through the code, you would have gone through crash reports, and forget accessing the SMSC's logs or any of the network logs, because that's another department, we don't speak to them and they don't speak to us; it wouldn't have happened. But we had that level of granularity and visibility, and we needed to have it, because we charged for these texts, and if a user came in and said "hey, you've charged me for something I never received", you needed to be able to quickly either admit guilt or prove your innocence.

So I ended up (I'm not a big sports fan myself) formulating a reply back to this friend of mine who had sent the support request, saying: the logs show this, so you've got to tell the user to either keep his phone on, or, if his phone was on, move to a network with better coverage, because his handset was detached. And by the way, you could suggest he also gets a life; how can he get so mad about not receiving his football scores? To which I got back the reply: ah, this was the CEO of the company, and he was pissed off. Not only that, a few years later I also realised he was the chairman of that football club. So it's good to have a first line support which filters out your sarcasm and bad comments. But that's the story: it took a minute, a minute and a half, and it allowed me to then focus, once again, on preemptive support, on making sure the system would not go down at night.

And these business-level logs are used for much more than troubleshooting and proving your innocence. They're used for billing; revenue assurance used them; and they're used for audits.
The revenue assurance team checked how much we believed we had generated in revenue against the logs on the SMSC side and made sure there was a match, and if that match deviated by more than half a percent, it triggered a massive investigation. The logs are also used by marketing: marketing need to know how many texts we're sending, how long the duration of a session is, how many users are logging on, what features they're using, what items in the shopping cart they're deleting. So it goes well beyond just ensuring you get five nines. I love Pat Helland's really wise words here: the truth is the log; the database is just a cache of a subset of the log. That's what you need to keep in mind when you start doing your logging at the business level. Also, keep separate files. We've had people (once again, people who've never had to wake up in the middle of the night to deal with it) putting all of the logs in the same file. It will become a bottleneck, and it will become a mess to try and follow, especially if you've got millions of requests coming through.

Oh, and by the way, when it comes to logs: you've got Logger in Elixir now, but I'm not sure Logger can sustain the throughput we need when we look at these high-throughput systems. We often have to write our own bespoke logging applications, because we need to cater for maybe 50,000 or 60,000 log entries per second. Say 10,000 requests are going through your system at any one time, and each request entails maybe seven state changes, so you might want seven log entries per request: you're looking at 70,000 or more log entries per second, and Logger won't handle that level of throughput. So most of the time we've had to do our own and highly optimise it for the operating system and file system we were running on, be it a file or a database or whatnot.
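When we have rolled our own, a common starting point is OTP's disk_log, appending pre-formatted binaries to a wrap log that rotates itself. A rough sketch, where the log name, path and sizes are made-up examples rather than anything from the systems described here:

```erlang
%% Open a wrap log of 10 x 100 MB files and append binaries to it.
open_audit_log() ->
    disk_log:open([{name,   audit_log},
                   {file,   "/var/log/myapp/audit.LOG"},  %% example path
                   {type,   wrap},
                   {format, external},
                   {size,   {104857600, 10}}]).            %% {bytes per file, file count}

log_event(Bin) when is_binary(Bin) ->
    %% blog/2 writes the bytes as-is; balog/2 is the asynchronous variant,
    %% which is usually what you reach for when chasing throughput.
    disk_log:blog(audit_log, Bin).
```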
The third critical thing, which, once again, I mentioned earlier and will say again because it's important and I very rarely see it being used, is alarming. Alarms are raised when certain criteria are met, and they are cleared when there is either automatic or manual intervention. So whenever you start running out of disk space, for example, you raise an alarm. You get an 80% disk full alarm: you could go in and trigger a script which tars up your old logs and brings usage back under 80%, or trigger a script which does some other housekeeping. You also have severities associated with your alarms. An 80% disk full alarm might be considered minor: don't wake anyone up in the middle of the night, don't page anyone, but if it's during office hours, hey, someone should be looking at it. Then you have major alarms, which could hit when, instead of 80% disk full, you reach 90% disk full. A major alarm might warrant getting whoever is on call to look at it before they go to bed, so between, say, 7 a.m. and 10 p.m. you send out your PagerDuty request and make sure they go in, have a look, and make sure you won't run out of disk space. At that point you could, as an example, also automate a script which goes in and actually deletes some of the logs, not just archives them, and reduces their wrap-around time. And then you have critical alarms, which could be 95% disk full; they will have a different SLA and a different set of actions, and that could mean actually waking someone out of bed, because you now run the risk of the system running out of disk space and crashing, and that could be in the middle of the night.

Now, there are two different kinds of alarms. There are state-based alarms, which originate from the VM, from the node itself. Say you've got a pool of five sockets towards a database and three of them go down, so you only have two sockets left: as soon as that happens you might want to raise an alarm to alert the operator that you're having connectivity problems with your database, and as soon as those three sockets are restored, you clear it immediately. The examples always used in telecoms are: you open a cabinet door, you might want to raise a minor alarm; your fan, your ventilation, stops working, and you raise a major alarm, because you run the risk of overheating your switch. The other kind are threshold-based alarms. I have seen threshold-based alarms being used, but they usually originate from something like Nagios, which goes in, collects metrics, and generates alarms if you hit certain upper or lower bounds. Take Amazon: if there have been no purchases within the last minute, there's something wrong; even though the system might still be up and running, you want someone to go in and look at it, because it's not normal, even if it's the middle of the night. Or you might have upper thresholds: you know that a particular node will handle 100 users, so you raise an alarm as soon as you hit that threshold and deploy more hardware.

And here are some examples from the system I was showing you earlier, where we raise alarms whenever process message queues become too large. This came from when we were trying to reproduce, in a controlled environment, the error I was showing you earlier, and by stress testing the logins we realised that one process actually ended up with over a million messages in its message queue. What was causing it? Well, one of the items was Mnesia transactions failing. And another alarm which was raised told us that we had hit a system limit in the Erlang VM. There is a limited number of ports, processes, ETS tables and sockets you can have; they have default values, but you can override them when you start the BEAM. In this particular case we had hit the ETS table count limit, at 2,053 tables.
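That kind of system limit is easy to turn into a threshold alarm of your own. A sketch; the 90% threshold is arbitrary, and on older releases you may not have all of these system_info keys:

```erlang
%% Raise an alarm when the number of ETS tables approaches the VM limit.
%% The limit itself can be raised at startup, for example via the
%% ERL_MAX_ETS_TABLES environment variable.
check_ets_tables() ->
    Limit = erlang:system_info(ets_limit),
    Count = length(ets:all()),   %% erlang:system_info(ets_count) on newer OTP
    case Count > 0.9 * Limit of
        true  -> alarm_handler:set_alarm({ets_tables_near_limit, {Count, Limit}});
        false -> alarm_handler:clear_alarm(ets_tables_near_limit)
    end.
```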
Every Mnesia transaction goes in and creates a new ETS table, which it then deletes afterwards, so what that limit meant is that at any one time we could have a maximum of about 2,000 users logging on to the system. And logging on, in instant messaging, is a really expensive operation: you need to authenticate the user, you need to get their roster, you need to tell everyone that they're online or offline, and you need to propagate the data to the other nodes. Looking further, guess what: there was another alarm, also major, which was basically telling us that we were running different versions of the MongoDB driver on separate nodes. MongoDB wasn't our choice (the customer is always right), and our mess-up here was that we had done a manual deployment of the system, and on one of the four nodes we had not upgraded the MongoDB driver. That was literally what was slowing down the one node, the node which was causing problems and which then crashed: it was a buggy driver. We went in, we fixed it, and all of a sudden (this is the arithmetic mean of the number of requests coming in) we see the throughput of that particular driver coming back up to normal. Problem solved.

So the moral of the story is that the route to five nines availability, about five and a half minutes of downtime per year including software upgrades, is to gather and analyse the information you collect, and to use it for two purposes. The first is support automation. Support automation allows you to do something called preemptive support: as soon as you get early warnings about things which might happen, you take action, and you automate that action through scripts. You also want end-to-end monitoring of your system, so you want probes outside of your system raising alarms as well. We got called out once and went in and looked at a system: hey, it's fully functional, it's working; and then we realised that there was no traffic going through it at all. A system administrator had messed up a firewall configuration, and once again we had missed it; it was the customers who were calling in and complaining, or complaining on social media. Your preemptive support automation is an attempt to predict disruptions and act on them before they become outages.
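A sketch of the kind of external end-to-end probe I mean: something running outside the system that exercises a real path and raises an alarm when it cannot. Here it is a hypothetical HTTP health endpoint checked with inets' httpc, purely for illustration; for the messaging systems described above you would push a real message through the stack instead:

```erlang
%% Hypothetical probe, run outside the system being monitored: exercise a
%% real request path and alarm when it fails or times out.
probe(Url) ->
    {ok, _Apps} = application:ensure_all_started(inets),
    case httpc:request(get, {Url, []}, [{timeout, 5000}], []) of
        {ok, {{_Version, 200, _Reason}, _Headers, _Body}} ->
            alarm_handler:clear_alarm(end_to_end_probe_failed);
        Other ->
            alarm_handler:set_alarm({end_to_end_probe_failed, Other})
    end.
```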
The second reason to collect all of this is that things will go wrong. You're dealing with software: you will get a crash. You want to reduce the time you spend troubleshooting what went wrong, and you want to make sure those crashes never happen again, that you put in the early warning signs and the fixes. And doing post-mortem debugging without a snapshot of what the system looked like is like looking for a needle in a haystack; it can be even with a snapshot. I was lucky with my SMS example: it took me one to one and a half minutes to figure it out. Had I not had that visibility, I would have had to admit: yes, there's a problem here, we have no idea what it is, we'll raise a ticket and look into it. And that would have wasted hours and hours, if not days.

So remember: pretend this is your colleague who gets woken up in the middle of the night to address an issue you have caused. Not only is he your colleague, he knows where you live. Don't give him an excuse to visit: provide him with the necessary visibility which allows him to isolate and address the issue. And this visibility needs to be planned in as part of your requirements; you need to start thinking about it in the design and development phases of the system. It is not something you can add in as an afterthought.

Any questions?

[Audience question.]

Yes, it happens all the time. I think one of the most recent ones was an instant messaging system for a huge social network. They upgraded the security on the apps, on the mobile devices; they basically enabled encryption, and all of a sudden, as users upgraded their apps, we started seeing a huge degradation of performance, a massive degradation. We eventually realised it was because they had enabled the security, but it was really hard to figure out what was causing the lock contention in the VM, and you learn that it usually is lock contention when you can't find anything in your regular business metrics. There's actually a flag you can set when you compile the Erlang VM which hits performance, but which, every time there's a lock, gives you all of that information. It took us about two weeks to figure that one out: to recompile, to deploy in the live scenario (it was really hard to recreate in a test lab), and to redeploy one node with the Erlang VM compiled with this flag on. Immediately we started seeing that the decryption module we were using put a global lock in place: it had been compiled with a flag which only allows one decryption at a time. These are the types of challenges we see on a day-to-day basis; others include multicore, non-uniform memory access, and bugs in the VM related to those.
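For reference, the flag being described sounds like the lock-counting build of the VM and the accompanying lcnt tool. Roughly, and from memory, the workflow is something like the following; check the lcnt documentation for your OTP release before relying on the details:

```erlang
%% Requires a lock-counting emulator: built with `./configure --enable-lock-counter`,
%% or selected with `erl -emu_type lcnt` on releases that ship it. It costs
%% performance, so typically you redeploy just one node with it while
%% reproducing the problem.
1> lcnt:start().
2> lcnt:clear().       % reset the counters, then drive load at the node for a while
3> lcnt:collect().     % snapshot the lock statistics from the running emulator
4> lcnt:conflicts().   % list the locks with the most collisions and wait time
```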
[Audience question.]

Yes, there's no one-size-fits-all. I used Nagios as an example for alarming and threshold-based alarms, but our customers use absolutely everything; it depends.

[Audience question.]

No. For the log throughput I was talking about, all we did was create files which we rotated every hour and then stored offline. That's usually the easiest way: you tend to use drivers, you write that code in C, and you end up getting that level of throughput. Beyond that it completely depends; what works in one case might not work in another. In the particular example I gave you I used grep, but you have to be able to navigate through the logs quickly, and in some cases you might want to push them into a database as well.

Any other questions? No? Then there is a bit of suggested reading. If you go to the Erlang Solutions blog we have a few blog posts on RabbitMQ and operations, so if you do use RabbitMQ I really warmly recommend them. And there was one blog post I wrote out of pure frustration when I was trying to explain how to do monitoring to someone straight out of university, who replied "oh, I can do it, it's only Erlang code". I ended up writing an email back saying "yes, it is only Erlang code, but...", and we turned it into a blog post; it's called DevOps from the Trenches, which I also recommend. And for those of you who have Designing for Scalability, the last book I wrote for O'Reilly, the last chapter of that book covers all of this in detail.

All right, well, thank you very much. I'll be around, so feel free to come and ask.