Hello, everyone. So today, I'll be presenting our self-analyzing and auto-remediating cloud platform. My name's Ronnie Fields, and I'm with PayPal. Originally, we had planned for Lax Sharma to also be presenting with me, but unfortunately, due to certain circumstances, he couldn't come today.

First, I'd like to start with a little bit about our company, PayPal. As everyone knows, it's a massive fintech company, and there is no shortage of companies under the PayPal umbrella, such as Braintree, Venmo, and Xoom. Naturally, a company this large has a lot of developers. We have, in fact, 3,000 engineers, 10 million lines of code, 1,000 releases a year. The statistics go on and on. And a development organization that size needs equally large infrastructure behind it; for us, that is a very large OpenStack cloud. You can compare it to Atlas holding up the world: the OpenStack cloud holds up the PayPal ecosystem. In fact, we span three regions and 12 availability zones. Naturally, with such a large footprint, our infrastructure needs to be robust as well. So the number one point here is that the health of our infrastructure is critical to both our business and our financial models. Not only that, we have a global presence; in fact, we are in most markets, with a hugely distributed customer base, and we are one of the largest fintech companies around.

Next, I'd like to present a number: 70%. 70% of all issues in our environment are actually recurring issues. So the question is, what's stopping us from building a system that looks at any problem in our environment, figures out whether it's part of that 70%, and then, with high accuracy, goes in and fixes the issue? That is exactly what health checks and auto remediation are.

It splits into four fundamental components. First, we have health check execution, which you can think of as the data-gathering organ. Then, based on the information we collect from the health checks, we perform monitoring and learning; nothing new there. Where we deviate from the standard is that we go one step further and do test analytics and auto debugging. This is simply where we take all the information we have, along with historical trends, and ask: have we seen this before? If we have, we can go on to step four and ask: since we've seen this before, what can we do to fix it? Is this a situation where a simple restart of a service is required? Can we remove this component from our environment to fix the issue? What can we do? So those are the fundamentals of health checks and auto remediation.

Next, the pillar of automated health checks and remediation, which is the automated health checks themselves. These are essentially tests, but they are carefully tuned tests designed to run safely in a production environment. Furthermore, they are designed to gather enormous amounts of data. In fact, we run these tests 24/7 in all availability zones. And in order to manage such a massive system, we use a combination of GitHub and Jenkins, store the results in a database, and display them on dashboards. In essence, through these automated health checks, we create what can best be described as a snapshot of the health of our entire ecosystem.
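To make that concrete, here is a minimal sketch of what one such health check could look like. This is an illustration, not PayPal's actual harness: it assumes the openstacksdk client and a hypothetical cloud name, and it simply times a small volume create the way a production probe might.

```python
# A minimal sketch of one automated health check, assuming openstacksdk
# and a hypothetical cloud name; the real harness (GitHub + Jenkins +
# a results database + dashboards) is far larger.
import time
import openstack

def check_volume_create(conn, timeout=300):
    """Create a tiny volume, wait for 'available', then clean up."""
    start = time.time()
    vol = conn.block_storage.create_volume(size=1, name="healthcheck-vol")
    try:
        while time.time() - start < timeout:
            vol = conn.block_storage.get_volume(vol.id)
            if vol.status == "available":
                return {"check": "volume_create", "ok": True,
                        "latency_s": round(time.time() - start, 1)}
            if vol.status == "error":
                break
            time.sleep(5)
        return {"check": "volume_create", "ok": False, "status": vol.status}
    finally:
        # Best-effort cleanup so the probe leaves no residue behind.
        conn.block_storage.delete_volume(vol, ignore_missing=True)

if __name__ == "__main__":
    conn = openstack.connect(cloud="prod-az1")  # hypothetical AZ name
    print(check_volume_create(conn))  # this row would land in the results DB
```

In a pipeline like the one described, Jenkins would run a battery of checks along these lines on a schedule in every availability zone, writing each result row to the database that feeds the dashboards.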
And using this data, because we're dashboarding it, we can also provide supplementary monitoring, which handles certain niches that traditional monitoring cannot. And for any issue we find in the environment, we always make sure to analyze it, debug it, and track it.

So let's go on to the actual architecture. Number one, the health checks, as mentioned. When we run the health checks, we not only collect the test data itself, we also collect and store log data for everything that can be collected. We store the log data in a log database, and the health-check-specific data goes into a secondary set of databases, where we do the more human-oriented things, such as alerting and monitoring. Then we merge our data with the logs/alerts analyzer.

Now let's move on to the really interesting part: the brain. Actually, we have two brains. The first brain is the pattern search engine. This one is pretty simple. We take a look at what's going on in the environment: is there something anomalous, anything we can act on? If there is, we try to identify it via the pattern search engine. And if we cannot identify it, if it's new to our environment, we make a note of it via the issue logger and storage system. Depending on the results of the pattern search engine, if we do find a known issue, we continue to the next step: we look at our historical trends and ask, can we actually fix this? Have we seen this before? If we have, fantastic. If we haven't, well, again, that's an issue we hand to the issue logger and storage system. But assuming it is there, the system can just go in and remediate whatever is going on.

Now I'd like to go through a few examples. The first would be hot hypervisors. Imagine, if you would, a theoretical situation where you have a multitude of VMs failing, but they all have some commonality: there is one hypervisor that is the common link between all of them. A human can go in and see that we have an issue with that hypervisor. And even though the raw metrics, the traditional monitoring data about that hypervisor, might show it as healthy, we don't actually know if that's the truth. But using this system, we can see: look at that, all of these VMs that are failing to boot properly are on a single hypervisor. So using the log data, the hypervisor distribution, and any secondary data that might be available, the system can take a look and conclude: okay, this is a problem specific to one hypervisor, and per the historical trends, it should either restart the hypervisor or simply take it out of traffic. Depending on which it judges more appropriate for that specific situation, it goes ahead and does that. And if required, it also alerts for RCA. In fact, you'll find alerting for root cause analysis is a trend across these examples, because even with a fully automated system, you can never have enough human oversight.
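To give a flavor of how the two brains fit together, here is a heavily simplified sketch. It is illustrative only: the signature names, thresholds, and playbook entries are assumptions of mine rather than the production rules, the stubs at the bottom stand in for the real issue logger and action runner, and the playbook also anticipates the two examples coming up next.

```python
# A sketch (not the production engine) of the two "brains": group
# failing health checks by hypervisor, then look the signature up in
# a table of known remediations. All names here are illustrative.
from collections import Counter

# Hypothetical remediation playbook distilled from historical trends.
PLAYBOOK = {
    "hot_hypervisor":        ["disable_host", "alert_rca"],
    "port_binding_ovs_down": ["restart_ovs", "alert_rca"],
    "volume_stuck_creating": ["restart_storage_backend", "alert_rca"],
}

def diagnose(failed_boots, logs):
    """Brain 1 (pattern search): return (signature, target), or
    (None, None) for a novel issue."""
    # If most failed boots share one hypervisor that raw metrics
    # still call healthy, flag it as a hot hypervisor.
    hosts = Counter(f["hypervisor"] for f in failed_boots)
    if hosts:
        host, count = hosts.most_common(1)[0]
        if count >= 3 and count / len(failed_boots) > 0.8:
            return "hot_hypervisor", host
    # Other signatures come from log patterns, e.g. OVS binding errors.
    if any("binding_failed" in line and "ovs" in line.lower() for line in logs):
        return "port_binding_ovs_down", None
    return None, None  # novel -> issue logger and storage system

def remediate(signature, target):
    """Brain 2 (remediation): replay the known fix, or log a new issue."""
    steps = PLAYBOOK.get(signature)
    if steps is None:
        log_new_issue(signature, target)  # the 30% we haven't seen before
        return
    for step in steps:
        run_action(step, target)  # restart / pull from traffic / page

def log_new_issue(signature, target):  # stub for the issue logger
    print("novel issue:", signature, target)

def run_action(step, target):  # stub for the action runner
    print("action:", step, "on", target)

if __name__ == "__main__":
    failed = [{"server": f"vm{i}", "hypervisor": "hv-042"} for i in range(5)]
    sig, target = diagnose(failed, logs=[])
    remediate(sig, target)  # -> disable hv-042 and page for RCA
```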
The next example would be port binding errors. Imagine, if you would, a situation where OVS (Open vSwitch) is down. Oh, that's bad. Based on the failure patterns and whatever log data is available, the system can say: looking at historical trends, OVS is down, let's go fix it. So in this situation, depending on the details of that specific occurrence, it might decide that simply restarting OVS is what's required. And of course, we alert for further RCA.

The final example would be stuck volumes. And you know, it's a typical situation: volumes are stuck in the "creating" state, so what do you do? You look at the logs, you look at the failure patterns, and depending on the results, you might find that the storage backend is offline for whatever reason. So you restart the backend, and you alert for further RCA if that's required.

So what's next? We plan to take the system we have and expand it to include predictive analysis: basically, can we figure out what's going to happen before it actually happens? We also plan to open source the two major brains of our system, the pattern search engine and the remediation suggestion engine. And we plan to expand the system to cover more low-level components of the cloud, such as networking, as well as the more nuanced components, such as configuration management.

Now, ideally I would like to take some questions, but I don't believe I have the time for that, because I've been given the two-minute signal. But I just want to say thank you, everyone, and I hope you have a nice afternoon.