All right, I guess it's time to get started. Hi, everyone, it's good to see you. My name is Greg, and I've been working on the Ceph distributed storage system since 2009, so it's been a long time. Last time I gave this talk it took about 37 minutes, and I've tried to cut it down, but we'll see how it goes. That means if a question comes up that you want to ask, raise your hand or yell out while I'm talking, so that if we run out of time you still get your questions answered.

Who here came because they think they want to learn something about Ceph? Good. OK, who here mostly does testing during their day job? Who mostly writes software during their day job? All right, good mix. Excellent.

OK, so I've got three slides that I'm going to go through as fast as I can about Ceph, just so we know what problem we're dealing with. Ceph is a distributed storage system that provides object, block, and file interfaces to people who want to store hundreds of terabytes or petabytes of data. A Ceph cluster is built on top of RADOS, the Reliable Autonomic Distributed Object Store, which is responsible for actually storing data and serving IO. A RADOS cluster is composed mostly of object storage daemons (OSDs), which are involved in serving every IO, doing data replication, and recovering from failures, plus a very small number of monitors, which we don't really need to worry about right now; they just keep track of who's in the cluster but aren't involved in data movement. So if you write an application that runs against a Ceph RADOS cluster, you've got a big cluster — even a small one is probably several dozen daemons running all together — and the application talks to it via a set of library bindings.

So if you want to test a Ceph cluster, it's a little complicated. If you want to do power-off testing, you can't just flip one switch and turn it off; or maybe you have a big giant data center switch you can turn off, but then you don't have a test system running either. We needed some way to test Ceph, and so we came up with Teuthology. Teuthology is the academic study of cephalopods, so it's just a pun on the name.

It started about seven years ago. In 2011 I had been working as a very wet-behind-the-ears software developer for a couple of years, and it was mostly me and a couple other people sitting in a room together; I was the new guy, and someone else had started the project several years earlier. But we really needed to formalize Ceph testing. At least once, we had a situation where we ran a test, like FIO or something, and said, hmm, there's a bug. We tracked it down and said, oh, this if-condition is backwards, and we swapped it. A few weeks later, one of the other three people in the room ran a test, found a bug, tracked it down, and said, huh, this if-condition is backwards, and swapped it. We eventually realized they were the same if-condition; it just didn't take account of enough state. But because we were running tests ad hoc out of shell scripts on our laptops whenever we felt like it, we didn't have a good way of identifying that problem. So we really needed a solution. Luckily, we hired a guy named TV who was much smarter about testing and administration than I was, and he set out to build us a test system. The first attempt involved using Autotest, but it just didn't work great.
And we couldn't find anything accessible to us that would work for distributed systems, because the existing tools were mostly set up around single nodes, and we really, really needed to be able to say, hey, that machine needs to turn off, or we need to kill this process running on that machine over there.

So TV sat down and wrote some Python code he called the orchestra module. The purpose of orchestra was to let us SSH into remote machines and do things to them, and it presented those machines as Python objects. One of them was a cluster object: you could run commands on every node in the cluster, or on a filtered set of nodes, or get a single node out of the cluster, save that, and run a bunch of commands on that one node. And that worked; it let us control clusters.

So then he created the Teuthology test-running system. Teuthology was built up around orchestra, and its job was to take a list of tasks that you gave it — like set up a Ceph cluster, then mount a kernel client, then run some tests on it — and run those tasks on target nodes in ways defined by roles, which I'll illustrate here real quickly. A set of targets is just a list of users and machines that you can log into with SSH without passwords, so the system can log in automatically. Roles are a mapping from kinds of daemons onto nodes in the cluster. Now, in our examples these are going to be Ceph-based, because we're testing Ceph, but the roles that are available are not defined by the Teuthology framework; they're interpreted by the tasks. So in this role listing, we have a monitor on each of three nodes, a couple of OSDs, a metadata server, and a client role. And then the tasks are a list that points at other pieces of Python code.

We got a question: this sounds a lot like we invented Ansible before anyone knew about it. I actually know very little about Ansible as a front-end technology, but yes — Ansible was not really available at the time, or maybe it was available but no one had heard of it.

So each of these tasks is actually a separate Python code file. The ceph one says install and set up Ceph; this is the one for mounting the kernel client; and then this is a workunit task, which is a generic one that lets us invoke shell-script tests, and it's going to invoke them on all of the clients in the system. This particular one is the dbench.sh script. The tasks are kind of interesting: they can be context managers, which basically means they don't need to execute in one go. The ceph task actually sets up a cluster and then hits a yield, and another task can run after it while the Ceph cluster is running; when execution flow comes back to the ceph task, it does the teardown of the cluster in its cleanup.

And Teuthology can automatically combine individual YAML files into a single test. So you can have a file that's your targets to run on, because these are your machines, and then you just keep changing the particular test you want to run by swapping out the kclient-dbench YAML file for something else. It automatically combines those YAML fragments into a single test job, which I think is familiar in many test systems now, but it was the first time I'd seen it. And when it's done with a job, Teuthology logs all the output into an archive directory you specify.
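To make that concrete, here's a rough sketch of the kind of YAML fragments I'm describing. This is written from memory rather than copied out of a real run, so the hostnames, keys, and exact spellings are placeholders:

```yaml
# targets fragment: machines Teuthology may SSH into without a password
targets:
  ubuntu@testnode01.example.com: ssh-rsa AAAA...   # placeholder host keys
  ubuntu@testnode02.example.com: ssh-rsa AAAA...
  ubuntu@testnode03.example.com: ssh-rsa AAAA...

# roles fragment: map daemon roles onto those nodes, one list per node
roles:
- [mon.a, osd.0, osd.1, mds.a]
- [mon.b, osd.2, osd.3]
- [mon.c, client.0]

# task fragment (the piece you swap out per test): what to run, in order
tasks:
- ceph:       # set up a Ceph cluster; tears it down again once the later tasks finish
- kclient:    # mount the kernel client on the client roles
- workunit:   # run a shell-script test on every client
    clients:
      all:
        - suites/dbench.sh
```

The ceph and kclient entries are the context-manager tasks mentioned above, and when the job finishes, everything it produced ends up in that archive directory.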
So you have — this is probably pretty small on the slide — a remote folder that contains all the logs from all the nodes that were in the system. You have a teuthology.log file that's the actual test-runner output. You have a summary.yaml file that specifies whether the test failed or succeeded; if it failed, it has a reason, whatever failure condition triggered it, the original config that was used, the version of Ceph it was run on, stuff like that.

Now, that was seven years ago, and TV was smart, but it was still only a couple of weeks or maybe a couple of months of his time, so it's a little bigger and different today. Teuthology today is mostly, but not exclusively, used inside the upstream Ceph community lab. That's about 250 machines available to the Ceph community; I think they're physically hosted by the Red Hat Open Source and Standards group, but it's not controlled by anyone except the Ceph community itself. It's devoted to running Ceph tests full time, and we grant SSH access to any engaged developer. So it's a big system, and it takes a little more than just sitting at your desktop and saying, oh, run this test now.

So first of all, we have a locking server, so that we can say, hey, I need three nodes — give me three nodes that aren't going to get trampled on by someone else. And you can use this to grab a targets file in addition to actually locking them. Second of all, maybe all the servers in the system are busy right now so you can't lock any, so you can schedule a job to run at some later date instead of running it immediately from your desktop. When you do that, it just assembles the YAML files, puts them into a beanstalk queue, and then we have a Teuthology server running a bunch of worker processes that sit there and go: hey, is there a job I can run off the beanstalk queue? Oh, there is. OK, this job takes three nodes, let me lock three nodes. Hey, I locked three nodes, let me execute this job. Hey, I'm done with the job, let me store the log files.

So we have the lock server and we have scheduling, and instead of actually scheduling individual jobs, we tend to schedule suites. Suites test a specific category of the Ceph interface. We have suites for the file system, suites for the underlying RADOS object store, suites for our S3-compatible gateway — suites for a lot of different things. And in particular, instead of being a full listing of all the tests we want to run, suites are actually directories of YAML fragments that get combined. This is really useful because it means that if we want to add a new test for a new API we built, then instead of having to spell out all the different combinations of things we can test against that API, we just add a new "test this API" YAML fragment and the system picks it up against all the other stuff. So in this example, which is a trimmed-down version of our rados verify suite, we've got a couple of different YAML configs that are going to apply to everything; we've got a description of the cluster; we can choose whether we want to do thrashing against the cluster or not, which we'll talk about later; we have a couple of different local object stores that run on a single hard drive, so we can configure different ones of those; and then we have different tasks.
We've got one for testing whether the monitors recover, one for whether the APIs actually work, and one for something we call classes. So when we do a teuthology-suite run, it first says: OK, I include the YAML files in the top-level directory; then I go into the clusters directory and pick out some files — oh, this directory is tagged so we actually use both files in it for all the tests — but in the thrash directory we pick one YAML file, so I'm going to pick the "no, we're not going to thrash" YAML file; in the objectstore directory, we're going to run on XFS with our FileStore backend; and I'm going to run the class tests. Then the next test is all the same things except it runs the API tests, the next is all the same things except it runs the monitor tests, and then the next test switches to BlueStore and iterates through all the tests again. So it's a combinatorial explosion, but it's very useful: when I add a new thing — say I need to verify that my new feature works, or I found a bug where mounting and unmounting 30 times in a row doesn't work, so I add a new task that does that — it still gets tested against all the backends and everything else automatically. We have a whole bunch of suites, so coverage is pretty good.

I mentioned the thrash directory, and this is a very important part of testing Ceph. Ceph's purpose in life is to deal with failures, so we actually need to test it against failures. You may have heard of the Netflix Chaos Monkey — that's a very popular concept today, and we'd probably read the news articles before we wrote ours, but it's just the sort of thing you have to do. So it's one of those tasks that runs in the background, and it runs around and does things to the cluster. We have thrashers that you can configure to turn OSDs off and on randomly, but only so many at a time so the cluster should still work — or maybe enough that it doesn't still work; you can configure it. We have thrashers that let you change the way data is sharded across the cluster, or force a reshuffle of the data so it has to move while IO is still happening, things of that nature. These are very important for testing the recovery state space of Ceph.

So that's an overview of the Teuthology user interface, but it's also important to talk about how we use it. First of all, we test development branches. I can write some code and say, all right, I think this might be ready, or I think this one thing works and I want to make sure it works, but I'm not ready for a pull request yet. So I can push that code to our ceph-ci.git repository, and we have a system that automatically builds packages out of anything that gets pushed there. Then I can schedule a job in the lab to test that code, and I can get it to email me, or I or anyone else can just go look at our Pulpito website, which reports all the jobs, passes and failures, and lets you look at logs. Second of all, after branches are tested by the individual developer, we test every pull request that comes into the system. We use GitHub; we get a lot of pull requests from people we don't know, and a lot of pull requests from people we do know but who don't have quite as good testing as we do, so we have to test all of that before we merge it.
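Going back to that thrash directory for a second: the pick-one fragments are tiny. Here's a rough sketch of the two choices; the exact option names here are from memory and may not match the current suite, so treat them as illustrative only.

```yaml
# thrash/none.yaml is essentially an empty fragment: no thrashing at all.

# thrash/default.yaml (sketch): run a background thrasher that keeps marking
# OSDs down and out, then brings them back, while the foreground tasks do IO.
tasks:
- thrashosds:
    timeout: 1200          # how long we give the cluster to recover after each event
    chance_down: 0.5       # how aggressively to take OSDs down
    chance_pgnum_grow: 1   # occasionally grow placement-group counts to force data movement
```

Because the suite machinery picks exactly one file out of that directory per job, everything else in the matrix gets exercised both with and without the thrasher. OK, back to the pull-request workflow.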
Tech leads and people who do a lot of reviews have some tools to help them, but basically we go through and make sure a branch is in good enough state that we think it's worth actually testing. We merge it into an integration branch with three or four or five other pull requests, run them through whatever suite or sets of suites are appropriate, and we look for problems. And again, those results are publicly available. If there are no problems, then hooray, we can merge that code, assuming everyone agrees. If there are problems, we figure out why, go back and report to that pull request, and hopefully get an update that goes through it again. But it's very important: nothing merges into the Ceph project without passing the appropriate set of tests. So our master branch — things get in sometimes, but it's pretty stable.

And finally, we actually do nightly testing. Ceph's a big, complicated distributed system; sometimes there are races that will only turn up one in 10 runs, or one in 100, or even one in several thousand. So we run nightly tests all the time against the in-development code, and against the stable branches that we're supporting upstream — long-term stable releases, things of that nature. Depending on the suite, they run somewhere between every other week and every night, and those results are also publicly available.

So those are all the ways that Teuthology is great. There are some problems with it, some gaps, that I'm going to tell you about. But before I tell you the awful things about my project, I want to say: I think we're pretty good at testing. We do a lot of functional coverage. I've just run through the framework we use for it, but we have a lot of specific stuff inside that framework to make it work well. We have this ceph_test_rados tool that can issue arbitrary numbers of RADOS operations — we run it for hours at a time — reads and writes and more complicated things against the cluster, and verify that the results are exactly what we expect to get back. And we run that against all kinds of thrashers doing all kinds of terrible things to the system. We deliberately inject failures into daemons: we have code in the Ceph codebase that we can set to trigger an assert at important places and make sure the system recovers correctly, and we have code that can inject delays in delivering messages or in granting locks to try to expose those races. We go out and fiddle with the raw objects or the raw disk state, and make sure the cluster detects that and then recovers from it, or at least responds to it, in the ways we expect.

And Teuthology is not the only way we do testing. We have a make check build target in our repository — it's not as good as we'd like it to be, but we do have unit tests that exercise actual code modules, built around the GTest framework, all the way up to user-interface tests that turn on a very small local Ceph daemon cluster and issue all the commands you can issue to the system, making sure it reacts correctly whether you issue the command correctly or incorrectly. Some of the other components in the system, such as ceph-ansible, which we use for deployment in most cases, or ceph-volume, which we use for provisioning, have their own test frameworks that they get run through.
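To give a flavor of those injection knobs, a lot of our suite fragments just flip Ceph configuration options through an overrides section. Here's a rough sketch with made-up values, assuming the usual overrides/conf layout of our fragments:

```yaml
# Sketch of a fragment that injects messenger-level faults while the rest of
# the job runs its normal workload, and tells the log checker which cluster
# warnings are expected while we're deliberately breaking things.
overrides:
  ceph:
    conf:
      global:
        ms inject socket failures: 2500   # randomly fail connections roughly this often
    log-whitelist:
      - wrongly marked me down            # expected noise while daemons are being failed
```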
But with that said, there are some pretty important gaps that we've discovered over the years. First of all, Teuthology handles daemons by SSHing into the test node and directly invoking them. That gives us a lot of nice features very cheaply, like the ability to feed standard input, receive standard output, and send signals without jumping through hoops, but it means we don't test well with the init system — systemd now, or in the past upstart or SysV init scripts. And sometimes that can be a problem, like when you expect your unit to only allow three restarts in ten minutes but actually it's just restarting infinitely, so when there's a failure that triggers an assert you expect to kill the daemon, it just never goes away.

Second of all, Teuthology does its own package installs and cluster configuration. Now, when Teuthology was created in 2011, we really expected that most of our users were going to write their own Puppet or Chef scripts, so this wasn't a big deal. But as we moved forward, we had ceph-deploy, we now use a lot of ceph-ansible, we may in the future start using a lot of Rook, which is the Kubernetes operator for storage from the CNCF and other partners, and SUSE uses a thing called DeepSea — and none of these package installation and cluster configuration systems get tested in our nightlies, which means they might be doing something wrong and we'd never notice it.

Then — not finally — performance testing is very important. We're a distributed storage system, so no one expects us to deliver 100% of every IOP visible to every client, but they do get angry if we degrade performance by 15% from release to release, and we really don't have any way to catch that in Teuthology right now. And finally, scale testing is not really feasible. Teuthology is an integration testing system: we have the expectation that jobs finish within hours or else the scheduling doesn't really work right, and we want them to take as few nodes as possible because we always want more capacity.

There are some proposed solutions, and some things we've actually done. We aren't doing it yet, but this coming week we're going to talk about starting to do things like expanding the Teuthology framework API so we can restart and signal daemons using the init system, instead of issuing commands like "killall -9 on this particular node over there" or "run ceph-osd with this set of arguments on that node over there and make sure it doesn't die". We're talking on Wednesday morning — although we haven't announced it yet — about writing a new API within the Teuthology framework so that tasks can request installs, and those requests can be satisfied by a backend that installs using ceph-ansible, or one that installs using Rook and Kubernetes, or one that installs using DeepSea, and the rest of the tasks don't care about that — they just get given an installed cluster — and we can switch to new backends as they become important and actually test that these tools behave correctly.

We don't have a great solution to performance testing right now. In the short term we have a new performance suite, which runs the performance tests we do have from another project, but it runs on random nodes in the system, so we're gathering data right now and seeing: hey, can we have this performance test fail if it's 10% slower than we expect, or is that going to turn up a false alarm every third run? Things of that nature.
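To show what that pluggable-install idea might look like, here's a purely hypothetical sketch — the provider key and backend names are invented for illustration and don't exist in Teuthology today; the point is just that the later tasks wouldn't change when the install backend does.

```yaml
# Hypothetical only: request an install and let some backend satisfy it.
tasks:
- install:
    provider: rook    # hypothetical key; could equally be ceph-ansible, deepsea, or plain packages
- workunit:           # the rest of the job neither knows nor cares how the cluster got there
    clients:
      all:
        - suites/dbench.sh
```

Anyway, back to performance.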
In the long term, we really need some kind of analysis solution based on the performance numbers, the nodes they ran on, and the variation we've seen, and do alerting on that, but that'll be some kind of machine learning thing which we just don't have the expertise or knowledge to do right now.

Finally, for scale testing, we made a decision that an integration test framework is actually not a good place for long-running scale tests, because it's an integration system. So both in the community and within companies that build products around Ceph, we've got groups designing tests to run longer term in their own lab allocations. We might slice out part of the Ceph lab for some of these from time to time; we might get hardware or lab space from partners or from downstreams; but that's the immediate solution. We'd also like, in the medium term, to figure out how to mock up tests that simulate longer-running systems. For instance, in Ceph the amount of data in the system is usually not where our problems come from. Problems tend to come from things like our S3-compatible object solution having too much metadata for the system to handle correctly, while the actual data objects are fine. So we could start injecting very large metadata indices and then make sure those behave correctly without having the data to back them up — but none of that exists yet.

Finally, there are some weaknesses in Teuthology itself — the framework, not just how we use it. Right now, Teuthology has become strongly tied to Ceph. Orchestra should be easy to use elsewhere — it's just a control system, and when TV built it he really did design it to support any other multi-node system — but nobody else uses it, so it's gotten a little more stuck together. When I say "should have", what I mean is: avoiding this problem may or may not have been worth the expense, but it really would have been good to find at least one other community to use Teuthology with us when we built it. If you have this problem today, you should find something that exists instead of rolling your own. At this point we're not really worrying about it too much — it's a stable code base that fits our needs, and switching to something else would be monstrously expensive and get us very little — but we do occasionally keep our eye out and say, hey, is there some other distributed storage system that could use a test framework? Maybe you'd like to work with us? Although we haven't gotten any takers yet.

Another problem with Teuthology is that it's strongly tied to the actual Ceph lab. It's supposed to deploy anywhere, and there are other people running it, but sometimes hard-coded values sneak in, like an expectation that you can pull packages from a ceph.com domain — and if you have your own development system, because you have an old fork or something, then you can't pull from ceph.com; you need to pull from your own local service. The documentation is limited: we have a group of people who maintain and run the service, and every several years they have a frenzy of updating some portion of the documentation, but there's a lot to work through if you actually want to set up a system.

And it's more than just other groups. If you need to write a test and debug it, then you have to actually write the Python on your local machine, push it to a repository that the Teuthology framework can pull from, and you need to have machines locked.
You need to invoke it, and you need to wait for the machines to get imaged and for the test to install and run before you can get any feedback. You also need built Ceph packages now. When TV initially wrote this, we all just standardized on a particular version of a distribution, so we could build Ceph on our own machine with make and it would SCP the binaries over to the remote nodes and invoke them directly. But that became infeasible as we got more people and as we needed to test more distributions and more combinations of things, so we need packages now — which means you need to wait for Ceph packages to appear. All this stuff makes it hard for third parties to contribute and for new Ceph developers to do anything with their patches, because they write a patch and they're like, I don't know what I can do; I ran make check, but I don't think that really tested it.

There are some solutions. If we had to do this again and this was important to us, we should have made regular teardown and setup part of what we do. We have a lot of infrastructure and configuration as code, but we don't invoke most of it very often. We have in the past found groups that came to work with us and had trouble with setup and install, and we did a one-time walkthrough with them to help them get going, but we didn't take much of that back and make changes to the system to prevent it being a problem in the future. We made some changes, like requiring packages, deliberately, but maybe we should have considered how people who didn't have lab access would respond to that and done some kind of workaround or bifurcation or whatever.

And in the past we actually had someone write something called teuthology-openstack, and that was pretty cool. It was a simple script you could invoke from wherever, and if you had access to an OpenStack cluster, including a public one, it would set up a Teuthology instance, run a suite, and then shut down and archive the logs to some storage system, so you could do one-shot tests. But it had a few problems. The person who was most interested in it moved on to other things. The OpenStack APIs at the time were ludicrously unstable and unworkable, so it used the OpenStack CLI commands directly, but those have also changed since then, so it no longer really functions very well. There is at least one group still using it, but they forked Teuthology to use it, and it's diverged pretty wildly at this point.

We have some better news: there's a thing called vstart_runner. In the Ceph repository there's a script called vstart.sh, and it turns on a local Ceph cluster just on your laptop or your development box or whatever. The vstart_runner provides a restricted API — a subset of the whole Teuthology framework API — and if you stick to that API, then it can run tests against those local daemons. There are a lot of tests in the CephFS file system suite that use this, and some tests in the RADOS Gateway suite — our S3- and Swift-compatible object store — that use it, and it means that as we hopefully transition more of the tests to being built against this API, external first-time contributors can start running tests on the thing they most care about.

And we are now investing a lot more work to foster a community. I just started a Ceph testing weekly meeting a couple of months ago to discuss these things.
If you're interested, it's Wednesday mornings, my time on the Pacific coast, at that URL, and I'm going to start sending out reminder emails. We had a Cephalocon conference several months ago in Beijing that sort of sparked this. We met a bunch of groups we had never heard of before that were using Ceph, and several of them were like, oh yeah, we have a development team and we run Teuthology, and it was kind of a pain to get going, but now we've got it hard-coded to our needs. There were like four of them, and they had all done their own localization port, so we're trying to work with some of those groups to get those patches back — maybe not directly, because they took their own shortcuts too, but to figure out what those patches were and what we need to change in the upstream system to make it work better. And eventually — we haven't started yet — there are now stable OpenStack APIs; I think libcloud is the one that's come up, and that actually works now, and we'd like to rebuild teuthology-openstack so that it keeps working and it's easier for people to deploy systems. That'll have other benefits for us too, like the ability to test the Teuthology framework itself in an OpenStack cloud instead of rolling it out to our infrastructure and hoping that nothing breaks.

This is sort of the performance thing that I talked about before: we don't have good trend-analysis tools in Teuthology. We can robustly fail individual tests by looking for core dumps, by seeing if there are errors or warnings in the logs, or by having a task run a test condition and fail on whatever arbitrary thing you cared about in Python. It's easy to look at individual runs and see whether they passed or failed; suites show up as green or yellow or red, and individual jobs do the same, but there's not a lot of granularity there to say, oh, this particular job has failed the last 10 times it's run, or hey, this branch is broken, or whatever. In the future, hopefully we get to fixing this, but so far it just hasn't hurt enough, though a lot of people would find it to be a problem.

The suites explode in size. Like I said, it's combinatorial, so as you keep adding YAML fragments it gets really, really large; the main rados suite is up over 124,000 jobs now. And there's no way to prioritize those tests within the framework. There's no way of saying, oh, this is a very important failure to analyze; new users, or resource-constrained users, don't know what they should run, and you just sort of have to watch it and know whether a test matters if you're trying to get faster feedback.

We do have some solutions. It's a couple of years old now, but there's a subset functionality you can use so that instead of running all the combinatorial tests, it makes sure every fragment gets used, but maybe not every combination of fragments. So you can specify, hey, I want to run subset one of 200, and as long as you then go on to run subset two of 200, and three of 200, and four of 200, all the way up, you will eventually get the entire combinatorial set, and all of the individual pieces get run in every suite. But it lets you run something much smaller, so you can get a reasonable test run out of 397 jobs instead of a couple hundred thousand when we use this. All right, so that's that.
We also have other things: you can filter. When Teuthology builds a test, it gets a name or description that's just the names of all the YAML files, so you can filter and only take the tests that match a particular YAML fragment, or exclude the tests that match a particular YAML fragment. And if you have a failure in a suite run that you decide is a real problem, then you can go fix that code and rerun only the tests that failed, because you think those are going to be the ones that are most useful for you. In the future we may do more here, but again, it hasn't become a critical issue yet.

What is a critical issue for us is that scheduling is very primitive. Like I said, it's a beanstalk queue and jobs just sort of get picked off. On the main test-running system we've got — I don't know the exact number, but something in the neighborhood of 50 — Teuthology worker daemons that just sit there and go: hey, do you have a job for me? Oh, you've got a job for me; let me lock two nodes, or five nodes, or ten nodes. And if you want to lock five or ten nodes, you've got 49 other workers competing with you that only want to lock two nodes, and they always win. And sometimes, because it's just a queue, we'll have nightlies get scheduled, but you know, I get impatient and jump the nightly queue with my test run, and then a week later the scheduled nightly job actually starts getting picked off the queue — but at that point it's the Ceph sha1 from the master branch of a week ago that we're running through now, while, oh, by the way, 20 minutes ago we scheduled a new run on this week's code. That's just not very useful, because eventually the queue gets backed up enough that we start killing nightlies, and we've tested our old code and not our new code.

This is sort of illustrative of an issue: we've got several hacks, and we talk every few months about how this is a problem, but if you ask someone what problems they have with Teuthology, they forget about this one, because our minds have been molded to it. So this is something you want to be aware of when you're building test systems. Right now all of our jobs run on two nodes, because we can mostly squeeze our jobs down into two nodes, and then they actually get executed. One of our queue-watching people looks at the beanstalk queue and says, oh, it's like 7,000 jobs long now and we have three rados nightly runs in there, so I'll just kill the two oldest nightly runs.

Soon — it's not done yet — a really easy fix is to do something a little more intelligent with the locking, where the workers can queue up in order for locking and say, hey, I need five nodes, and then we wait for five nodes and get those locked before trying to execute any other tests. Further out, we're not quite sure how it's going to happen yet. We've talked about just switching to Kubernetes, and it can solve some of these problems, but we really need to write a more robust scheduler, or install one, so that we can do more intelligent things. Oh yeah, I've only got about two minutes left — good thing I'm almost done. And then we want to be more intelligent about the way our nightly runs work.

So, are we going to stick with it? Well, yes, we're keeping our tests. It's proven effective over seven years. It's incredibly stable for the things we're doing, and we've got it running right now.
It doesn't take a lot of maintenance to keep working the way it is working. But we are actively exploring ways to make it better for new users and developers. Should you do automated testing? Yes, you should do automated testing — but let me tell you, you should probably not write your own framework. There are a lot more of them available today than there were seven years ago, and a lot of projects have written custom frameworks over the last decade. So if you want to build your own framework, you should think real hard about it. Are you really sure nothing works for you? If you're really sure, then: isn't there something that does mostly what you do? What does it do? Why is what it does insufficient? And what makes your needs different from other people's? You need to answer all these questions before you build your own testing framework, or you're doing the wrong thing. If you answer those questions and you're really sure you need to, do the simplest thing you can. Small frameworks can always be combined into larger ones, and try to keep your components discrete so they can be replaced later.

That is the end of my talk. For more information, here are a lot of URLs. And we have about 30 seconds for questions. Do we have a microphone?

Hi, so I mainly work on Fedora QA, so I'm generally interested in integration testing. Something I notice often comes up with systems like this: I'm guessing the system is mainly built to test a potentially broken Ceph in a known-good environment — you have known-good Fedora environments, known-good Ubuntu environments, and then you run an unknown Ceph code base on those. Is it possible to sort of invert the flow, so you run a known-good Ceph in a possibly broken environment? And if not, have you thought about that?

We haven't thought about it. At this point we freshly image all our machines for every test, because when you're trying to test your own system, a broken environment is really obnoxious. But you could build broken images, run the test against that, and see what happens.

Right, so you could at least swap in those images?

I mean, it's not something we've ever discussed or actually tried, but yes, we could drop in a broken image and see what happened. Yes.

How difficult is it to deploy Teuthology to test a Ceph cluster?

The question is how difficult it is to deploy Teuthology to test a Ceph cluster. Yeah, it's pretty difficult. I haven't ever done it personally. If you go way back in time, using teuthology-openstack made it pretty easy, and I think maybe if you go grab SUSE's fork it still is, but I haven't actually done that myself. And there are a whole lot of services running here that we've only tangentially touched on, like the service that builds Ceph packages, the service that hosts and serves them up, the Pulpito service that stores job results. It's not as easy as we'd like it to be. But if you're interested, we would love input on what makes it hard.

Do you get 100% test passes on commits, or do you aim for something lower?

Do we get 100% test passes on commits? Not quite. We just run enough jobs that, honestly, we notice whenever a package repository goes away, so it's not unusual to have one or two jobs out of a couple hundred fail just because of some network blip or connectivity issue. When there are failures, we audit them, and we don't merge anything with new failures. But sometimes we do have known failures in the system that are pending on some other PR, and we merge in that state.
That has punished us in the past, sometimes pretty badly, so there's a reason people say you should be at 100%. We're a lot closer to it than most systems like ours that I know of, but we don't have a 100% programmatic requirement; it's all human-driven.

All right, well, I'll be around wearing a Ceph shirt for the rest of the day and also tomorrow, so come grab me if you have any questions, and thanks for your time.