All right, we're going to get started with the next talk.

So I'm Matt Treinish. The title is Better Testing Through Statistics, but this is really about the data-driven approach we took to analyzing CI test results in the OpenStack project.

In the OpenStack project we've got a pretty involved CI pipeline. Whenever a developer pushes a change to any OpenStack project, it automatically kicks off a lot of work, and each one of those jobs runs in a VM in a public cloud provider. So when you push a patch, it spins up all of these test jobs on donated cloud resources, and most of those clouds actually run OpenStack, so it's kind of dog-fooding. Some of the jobs are pretty simple: style checkers, Python unit tests. But some of them, DevStack/Tempest and multinode Grenade, are actually running real clouds. Each one spins up a VM, deploys an OpenStack cloud inside that VM, and then we hit that cloud with a test suite that makes API requests against it, which spins up second-level guests. It does a lot of work. I'm glossing over some of the details of how the CI works for brevity's sake.

What this means in aggregate is that when you push a change, you're running between 5 and 25 clouds, depending on the project you're pushing the patch to. (DevStack is the tool we use for dev and test deployments of an OpenStack cloud.) As part of that, we're running about 10,000 integration tests, roughly 1,500 tests against each cloud that we spin up. For each cloud's test run we're launching 151 guests; that number has actually gone up more recently. And we're generating a gigabyte of logs, uncompressed, for each run. That's actually tripled since I generated these numbers; I should update them. In aggregate, we run about 12,500 jobs a day.

Our failure rates are actually really low. The failure rate of an individual Tempest test, which is an integration test, is about 0.01%. That's not an entirely fair metric, because some tests are more likely to fail than others, but if you weight them all equally, that's the failure rate. An individual run has about a 0.77% chance of failing, which is still less than 1%. But when you're running 12,500 jobs a day, that's something developers are going to notice. And if you push a patch that works fine but it fails as part of that 0.77%, you'll notice, because you'll be blocked by the CI system.

This is just a graph I generated to show the scale: the number of tests we're running per day, from the end of 2016. You can see we run a lot of tests, in peaks and valleys based on developer awake time and who's on vacation. There's a nice gap in December right here, when everyone takes the end of the year off. It's just to show how much throughput goes through the system.

So we have some tools to deal with this. The first is that we save all of the logs from all of those test runs. Every test run generates a bunch of artifacts, whether that's logs from the OpenStack services that are running, the test results, or other data like syslog, the kernel log, and service logs such as Apache logs. All of that is stored on disk.
We store about 10 terabytes of logs at any given point in time, which ends up being about four months' worth of test runs. That's as far back as we can go with the amount of free resources donated by corporate sponsors.

The problem is that dealing with all of this ad hoc doesn't give you a good view. If you're an individual developer, you push a patch and you see it failed. If there's nondeterminism in the system, and there clearly is if we have a 0.77% failure rate, looking at one result at a time is not going to help you track it down, because it's impossible to find something that fails so infrequently by looking at a single run. Also, because of the nature of how the CI system works, finding performance regressions is almost impossible: we're running on heterogeneous environments across different public cloud providers, so we have noisy-neighbor problems, differing hardware, and a lot of different systems all running the identical tests. Figuring out when something performed poorly by looking at one result doesn't actually tell you anything. And an interesting question I often thought about: figuring out how often an individual test passes or fails is almost impossible when you're only looking at one result at a time.

So we had to come up with an approach for dealing with all of this data. It starts by looking at everything, at a larger scale, instead of just the one test result you're dealing with. Start using statistics and data-mining techniques to find trends in OpenStack itself, not in the tests or the CI system, but in the thing we're actually testing. We have all of this data, so we can leverage other techniques to figure out what's going on in the system under test, even if the tests aren't purpose-built for performance or whatever other trend we're analyzing. We also have to make sure all of the data is open and accessible to everyone, and that there are APIs for accessing it. If the data is just a 10-terabyte blob of log files you can download all at once, that's not helpful; if you have a way to query it, that is useful.

So we started introducing systems on top of the CI system to analyze the data. The first one was Graphite, which was introduced long before I thought about any of this. All of the infrastructure services that run the CI system push data into a StatsD daemon, and that data, including job results, gets rendered in Graphite. You can use this to see how often an individual test job fails, one of those boxes on the first slide, but you can't dive in any deeper than that. It's also time-based, because StatsD feeds a time-series database: it's counter-based, so you can see how many things happened in a certain bucket of time, but you can't see which individual job a data point came from. And linking results back is important if you're trying to find what changed, what broke, what made something worse.

We also have Grafana, which uses the same daemon but provides an easier-to-use interface for generating graphs. A lot of OpenStack projects use it to maintain dashboards of test failure rates. Same data set, same limitations; it's just prettier graphs and easier to use. We also have an ELK stack: Elasticsearch, Logstash, and Kibana.
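To make the Graphite piece concrete, here's a rough sketch of the kind of thing that feeds a StatsD daemon, using the Python statsd client. The metric names, job name, and daemon address are made up for illustration; they're not the actual metrics the infra services emit.

```python
# Rough sketch of emitting job-result counters to a StatsD daemon.
# The daemon address and metric names are made up for illustration;
# they are not the real metrics emitted by the OpenStack CI services.
import statsd

client = statsd.StatsClient('graphite.example.org', 8125)

def record_job_result(job_name, success, duration_seconds):
    """Count a finished CI job and record how long it took."""
    outcome = 'success' if success else 'failure'
    client.incr('ci.jobs.%s.%s' % (job_name, outcome))
    # StatsD timers are expressed in milliseconds.
    client.timing('ci.jobs.%s.duration' % job_name, duration_seconds * 1000)

record_job_result('gate-tempest-dsvm-full', success=False, duration_seconds=3720)
```

Counters like these are what make the job-level failure graphs possible, but once a counter is incremented there's no link back to the individual run that produced it, which is exactly the limitation described above.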
All of the large artifacts we generate get pumped into the ELK stack. We have a cluster running with, if I remember correctly, at least a dozen Elasticsearch nodes. We've hit some scaling limitations with it because we're running on donated public cloud resources: a lot of the standard advice for scaling Elasticsearch is just to throw more memory at it, and we can't, because we're on a public cloud with limited resources. But it gives us ten days of results and a search engine on top of the job artifacts, so we can start looking for trends in logs and searching for certain patterns, which is very useful.

We also have a tool called StackViz, which lets you analyze an individual test result at a high level and visualize what's going on. In this particular case it's the integration test suite, Tempest, that I mentioned earlier. It runs with four parallel workers making API requests, and you can see which tests are running at the same time. It also highlights which one failed, shows why and where it failed, how long it took, and other information about it. In a different way, it helps people understand what's going on in the system under test.

And this brings us to the big thing we constructed relatively recently, only about a year or so ago, called OpenStack Health, which is basically a visualization dashboard for what is going on in the OpenStack that we're testing. I'm sorry the picture is so small, but it's an interactive website and I didn't want to do a live demo. It renders what's going on in the test system at any given point in time, lets you dive down really deep into individual test results, and lets you extract trends from all of the data graphically through a web interface, which is very useful for a lot of people. That's the website if people want to go play with it. It doesn't mean a lot on its own if you don't know what's being tested, but it's cool and has lots of pretty graphs and lines. It's designed to be a single access point for the developer community to all of the data from the CI system, which we call the gate. Right now it leverages two things I haven't talked about yet, subunit2sql and Elastic Recheck, to source all of the data for generating those visualizations.

The architecture is really simple. There's a REST API server written in Python with Flask, which queries the multiple backends. It's designed to be modular, so as we add more backends or other data we want to integrate for visualization, we can plug them in pretty easily. Then there's a JavaScript frontend, which runs client-side and is written in Angular with D3 to generate all of the graphs.

Now, about the data sources for OpenStack Health, because this is also one of the places where we've started doing things we haven't seen elsewhere. subunit2sql is a project I started. It's designed to store test result data in a SQL database. It's pretty simple: it stores the overall run, then each test, whether it passed or failed, how long it took or the timestamps around it, and then a bunch of metadata. subunit2sql is a library that provides a Python API and a database schema for interacting with that database, defining how to store data and how to access it.
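To give a feel for that modular design, here's a minimal sketch of a Flask API server that fans out to pluggable backends. The endpoint, the backend interface, and the SubunitSqlBackend class are hypothetical illustrations of the idea, not the actual OpenStack Health code.

```python
# Minimal sketch of a Flask REST API with pluggable data backends, in the
# spirit of the OpenStack Health architecture. The endpoint and backend
# interface are hypothetical, not the real openstack-health code.
from flask import Flask, jsonify

app = Flask(__name__)

class SubunitSqlBackend(object):
    """Hypothetical backend that would query a subunit2sql database."""
    def recent_runs(self, limit=10):
        # A real implementation would issue a SQL query here.
        return [{'run_id': 'example-run', 'passes': 1498, 'failures': 2}][:limit]

# New data sources get integrated by registering another backend here.
BACKENDS = {'subunit2sql': SubunitSqlBackend()}

@app.route('/runs/recent')
def recent_runs():
    # Ask every registered backend the same question and merge the answers.
    return jsonify({name: b.recent_runs() for name, b in BACKENDS.items()})

if __name__ == '__main__':
    app.run(port=5000)
```

The JavaScript frontend then just consumes JSON from endpoints like this and renders it with D3, which is roughly the split between the Python API server and the Angular client.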
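And here's a rough sketch of the kind of question you can answer once the results are in SQL, for example ranking tests by failure rate. The connection details are placeholders, and the table and column names are from memory of the subunit2sql schema, so treat them as illustrative rather than authoritative.

```python
# Rough sketch: ranking tests by failure rate from a subunit2sql-style
# database. Connection details are placeholders, and the table/column
# names are recalled from memory, so treat this as illustrative only.
import pymysql

conn = pymysql.connect(host='subunit2sql.example.org', user='query',
                       password='query', database='subunit2sql')

QUERY = """
SELECT test_id,
       success,
       failure,
       failure / (success + failure) AS failure_rate
  FROM tests
 WHERE success + failure > 0
 ORDER BY failure_rate DESC
 LIMIT 10
"""

with conn.cursor() as cursor:
    cursor.execute(QUERY)
    for test_id, success, failure, rate in cursor.fetchall():
        print('%s: %d passes, %d failures (%.4f%% failure rate)'
              % (test_id, success, failure, rate * 100))
```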
There are also some CLI utilities to store data into the database and retrieve it as subunit v2, which is a protocol for test results used by some Python test runners. The OpenStack Infra project, which runs all of the CI infrastructure, actually runs a public SQL server that you can log into; the credentials are documented. It's a terrible, terrible idea to run SQL on an open port on the Internet and tell people how to log in, but it's something we do because we wanted open data access for everyone. I've given this presentation a few times, and I'm still waiting for the denial-of-service attack where someone just loops SHOW PROCESSLIST and crashes MySQL, but it hasn't happened yet. Because of the resource constraints on that SQL server, we only keep six months' worth of results, which is more than enough for most purposes. The architecture by which we get results into the database is a bit convoluted and involves asynchronous message passing, so I'm not going to go into the details. I thought I had actually removed that slide, sorry.

The other tool I want to mention is Elastic Recheck, which is the other source of data for OpenStack Health. It's designed to answer one question: have you seen this recently? Part of the problem we were having with the nondeterministic failures, with those small failure rates, is: how do we know when they're happening more than once? We look at one test result, we see the failure pattern, and that's it. There are some people, like myself and a few others, who look at these failures pretty frequently and have a mental model of common failure patterns and the bug that goes with each one. But we don't really scale, and people have a hard time keeping track of more than a dozen or so specific failure cases. So we wrote Elastic Recheck to leverage the ELK stack we run: it defines queries that fingerprint failure conditions, each tied to a bug. When a failure occurs, we can look at it and report back to the user: this test failed, probably because of this bug. We report that back to Gerrit and to IRC. We also have a dashboard view, independent of OpenStack Health, which shows failure trends across all of the different conditions.

When we started Elastic Recheck, the people working in this space thought, okay, maybe we've got 10 to 20 different failure conditions that we know about in the CI system. It turns out that at any given point we track about 150 to 200 race conditions going on in the test system. This is just a snapshot of the graphs we generate for the dashboard view. Here you can see a bug where libvirt block migration stalls during the test job. It's a confirmed bug; we hit 14 failures in the past 24 hours and 50 failures in the past 10 days, and you can see its frequency over time. It's been a very useful tool. If we catch a failure condition soon enough, it also lets us identify when it started, so we can try to track down what introduced the breakage or the race condition. The other thing it lets us do, and you can see it up in the other corner where it says uncategorized, is track failure conditions in the CI system that we have no idea about, which is very useful for identifying new race conditions and sorting things out.
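Just to give a feel for what a fingerprint is, here's a rough sketch of checking one of those queries against the Elasticsearch index. The host, index pattern, log message, and bug link are all invented for illustration; the real fingerprints are query files kept in the elastic-recheck repository, each tied to an actual Launchpad bug.

```python
# Rough sketch of an Elastic Recheck style check: count recent failures
# matching a known log signature. The host, index, message, and bug link
# are all invented for illustration.
import requests

ES_URL = 'http://logstash.example.org:9200/logstash-*/_search'

fingerprint = {
    'bug': 'https://bugs.launchpad.net/example/+bug/1234567',
    'query': ('message:"Timed out waiting for block device mapping"'
              ' AND build_status:"FAILURE"'),
}

body = {
    'query': {'query_string': {'query': fingerprint['query']}},
    'size': 0,  # we only need the hit count, not the documents themselves
}

resp = requests.post(ES_URL, json=body, timeout=30)
hits = resp.json()['hits']['total']
print('Fingerprint for %s matched %s recent failures' % (fingerprint['bug'], hits))
```

When a new failure matches a fingerprint like this, the bug link can be reported back on the review and in IRC, which is the "this test failed, probably because of this bug" behavior I mentioned.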
So what all of this lets us do is make decisions based on data. For a while I was the maintainer of the Tempest test suite, and a common question we had was: should we skip this test? Is this test buggy? We had no data on that. It also lets us identify tests that are actually useful. We have 1,500 tests in that suite; how do we know all of them are actually useful? Are some of them just hitting an API surface that only queries the database, so we're essentially testing a database query? We don't really need to be testing that, and we can look at the test results to see how frequently a test fails and whether it's actually catching bugs. And when things do go bad, it lets us find trends in the failures and isolate them to a specific configuration, cloud region, or other environmental property of the test run. It's been very useful just to have data to point to when making a decision, because otherwise you're just guessing; you have no idea whether the decision you're making is the right one.

The other thing, which I think is pretty cool, is that it lets us find trends among the noise. Performance regressions are the obvious example. As I said before, we run in a heterogeneous environment on multiple cloud providers, but we're running real things. We can't do real performance testing in the CI system because of the nature of public clouds and differing providers, but if we collect all of the data, we can start seeing trends. This is an example from, I think, January, where someone introduced a change to the volume system in OpenStack and it caused a performance regression. You can see the average moved up, though it's still really noisy data. The scale makes it look small, but one horizontal gridline is 50 seconds, so that's a really wide distribution. Still, looking at it at a high level and just graphing an average let us see that there was a performance regression, and we caught it and fixed it within about a week because we had this system in place. We can use the same techniques for finding race conditions and analyzing all sorts of other things, just by looking at the data at all. It's the same standard big-data approach everyone likes talking about; applied to test results from a real system under test, it finds real value.

There are some issues with the system as we've constructed it. Right now we've got too many varied data sources; half the presentation is just going through all the different things we have. The solution was supposed to be OpenStack Health, but contribution has been pretty limited, so we haven't grown OpenStack Health enough to solve that varied-data-source limitation. And even if we integrate all of the data, you still have to keep all of the limitations of the different data sources in your head. OpenStack Health uses subunit2sql, so it only covers two different types of test data, and if the infrastructure itself falls over we don't capture that: because of the ingestion pipeline in that graph I skipped over, if the infrastructure falls apart we don't actually collect results into the database. With Elasticsearch we're limited to document searching, where in our case a document is a single log line; that's a limitation you have to know about when constructing queries. Trying to figure out how to make all of this simpler for people to use has been a big challenge, and I think that contributes to the fact that people don't contribute.
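As a sketch of how that kind of trend can be pulled out of noisy data, here's a rolling-average check over per-run durations. The CSV file, its columns, and the 1.5x threshold are all made up; in practice the data comes out of the subunit2sql database, and the graphing happens in OpenStack Health rather than in a one-off script like this.

```python
# Rough sketch: spotting a duration regression in noisy per-run timings by
# smoothing with a rolling average. The input file, its columns, and the
# threshold are hypothetical.
import pandas as pd

runs = pd.read_csv('volume_attach_timings.csv', parse_dates=['start_time'])
runs = runs.sort_values('start_time')

# Smooth the noisy per-run durations with a 100-run rolling mean.
runs['rolling_avg'] = runs['duration'].rolling(window=100).mean()

baseline = runs['rolling_avg'].iloc[100:200].mean()
latest = runs['rolling_avg'].iloc[-1]

# Flag it if the smoothed average has drifted well above the baseline.
if latest > baseline * 1.5:
    print('Possible regression: average moved from %.1fs to %.1fs'
          % (baseline, latest))
else:
    print('No obvious trend: %.1fs now vs %.1fs baseline' % (baseline, latest))
```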
Moving forward, some of the things we want to do: integrate all of those data sources into OpenStack Health, so we have a more complete picture of what's going on that developers can interact with without having to know how to talk to each API directly, if they're not interested in that. We also want to use the data to optimize how we run the tests. All of the tests, at all of the levels, run in parallel, and there's scheduling involved in figuring out how we run things. We have all of this data, and figuring out a way to pipe it back in, so the scheduler can make smarter decisions about how to utilize our limited resources, would be very valuable. We'd also like to start playing with some machine-learning techniques and look at automating failure detection. In a lot of cases, writing an Elastic Recheck query to fingerprint a failure condition means looking through a log, finding the stack trace, and seeing whether it corresponds to the API request that caused the test to fail. That's something a computer can do really easily. We rely on intuition for certain pieces of it, but there's a lot we could do to automate that and be smarter about how we detect failures. That's something I think is really interesting, and I wish I had more time to work on it.

So here are some links to more information and the repositories for some of the projects I mentioned. All of the people who work on this are active on the OpenStack developer mailing list and in #openstack-qa on Freenode IRC. And with that, I think that's the last slide; the next one just says "Questions?" I normally give this talk in 40 or 50 minutes, so I cut things down, and I hope it was clear for everyone. Are there any questions?

Great. You can also play around with the dashboard if people want, if I have time. I don't actually know if I have time. You have like five-ish minutes? Okay, well, it's good I cut the slides in half. If people want to see the dashboard, I can try to bring it up at 1024 by 768 and see if that's actually usable. Let me make sure I'm on the wi-fi too, which I'm not, and all the time goes to this now. Yeah, it's not getting a lease; we can watch a bar move. That's a good way to end it. Let me make sure it's trying to connect to the right network and not the one in the previous room. I don't think a live demo is going to work today, but I can show people afterward once I get on the Internet, assuming I can. If there aren't any other questions, that's all I had. I'll sit here and keep trying until he says get out. I won't tell you to get out. Yeah, this is just going to make a great YouTube video. Oh yeah, it's not getting a lease. You said try this one? The wi-fi was supposed to be fixed on Linux; we're going to have to roll to the next fix. Okay, that's fine. And, finally connected, but okay. Thanks, everyone.