This is an air-your-dirty-laundry, talk-about-your-mistakes kind of talk. If there isn't a place in it where I'm embarrassed, then I should be ashamed that I didn't tell enough of the places where I should be embarrassed. It's full of those moments of "wow, how did I ever make that mistake?" and "why didn't I think of that before?"

I am, in fact, a manager by profession, so I can spin whatever phrasing we'd like to use. It's not a mistake, it's an opportunity, right? Or it's just a mistake and we'll call it an opportunity, because as a manager that fits. And yes, I also maintain software, so please don't blacklist me, don't cancel me, because I said the word manager. As a person, as a hobby, I just enjoy it. I maintain software, and I maintain that software by writing tests. And I make mistakes when I write tests: sometimes those tests are flawed, or they're at the wrong level, or all sorts of things. You're going to hear more about that kind of thing today than you probably wanted to.

Oh, that's right, actually. Although that is a level of testing I dream of achieving. I like that level of testing, the whole "hey, let's intentionally break things and see if we survive." I think that is so cool. But no, I'm not there yet, although I do a little of that, I guess.

Okay. Oh no, we're still one minute away. Welcome, everybody. I'm Mark Waite, and this is Intentional and Unintentional Compromises in Test Automation. I write software, I test software, and I use software, and the software that I use tends to have problems sometimes, or to get enhancements that need tests. So I want to talk today about some of the intentional compromises that I've made over the course of many years. I've been the maintainer of the Jenkins Git plugin for six or seven years. In that time I've made a bunch of mistakes, and I've learned a bunch of things from those mistakes. Part of my goal here is to talk about what that taught me, so that maybe you can learn from my mistakes instead of making your own. You get to make different mistakes. I'll talk about some anticipated costs, things that I predicted, and some of the things that were unintentional, the sort of accidents that happened, and then invite you to make your own choices.

First, there are intentional choices, intentional compromises, that we make when we choose what we test, when we test, and how we test. They are very much intentional choices. We might choose to test things at a unit level, where we test an individual thing with no external connections. We might choose to test at a level where objects are interacting with each other. We might choose to test things at massive scale. We might choose to test all sorts of different ways, but we are making a choice about what we test. We then make the additional choice of when we run those tests: which tests do we run, at what point, and how do we test each of them?

When we're in the "what we test" phase, we're asking questions like: what things should we check in our automation, and why? Now, the Jenkins Git plugin is installed on about 300,000 Jenkins controllers worldwide.
When I was doing commercial software, we were proud if we got to a thousand installations. With 300,000 installations, when I make a mistake in the Jenkins Git plugin, a lot of people feel it.

So at this level, "what we test," the first part of the story goes like this. The Jenkins Git plugin, with 300,000 installations, you'd think: wow, it deserves an awful lot of tests. We want to be sure that it works well, that it works right. And yet during the first 15 months of the life of that plugin, there was not a single test anywhere in the code base. It was tested interactively and delivered to users. So the first question on "what we test" was: shall we write a test at all? Now, isn't that terrifying, code without tests? And yet the maintainers, the creators of the thing, were not confident back then that Git was going to be a relevant source control system at all, that it was going to matter to anybody. Their choice was: why write a test for code that I may throw away anyway? So the first choice we have to make is, should I write a test at all? Then: which methods should I test? Which conditionals? What shall I assert? Those are all good questions, but first ask whether you should even write a test.

Now, the next question, and this one highlights cost: when should we test? I ask because it is not economically free to execute a test. There is a cost. You can talk about the electrical cost, the computational cost, the cost on AWS, the environmental cost, you pick it. There is a cost to the execution of a test. The Jenkins Git plugin has lived for many years now, and those original automated tests have been executed tens of thousands of times. You can guess that most of the time those tests passed, and they didn't tell anybody anything terribly useful. A passing test is a good comfort factor, but really, there might have been better ways to get that. So, should we test every commit? Should we test every release? Should we test whenever we feel like it? All valid questions to ask as you are considering test automation. Now, this is me being hypocritical, because that is not what I did. I know what my answer was: I ran the tests on every commit. Every single commit got every single test run. And you know what? That was an expensive choice. So maybe it's worth asking yourself: do I really have to run every test every time?

Now, how do we test? This is another opportunity to get expensive very rapidly. Shall I test at the unit level? Those are pretty fast, typically, if I'm just talking to a single object with no external interactions. But if you're an integration system, like Jenkins is, or if you're calling a command-line tool like Git, like Jenkins does, it's hard to believe a test of a tool that uses Git if the test never calls command-line Git. It's hard to be credible. And therefore it's going to get expensive, because you're actually going to do integration-level exercises. So, component-level testing. But wait a second: if I call Bitbucket, I may get a different result than if I call GitHub. Now I'm not just testing local command-line Git, I'm going out to actual live production servers to ask them questions. Again, I'm adding cost because I want to know something from that test.
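One way to keep that choice visible is to mark the expensive tests so the build can decide when to pay for them. Here's a minimal sketch of the idea using JUnit 5 tags; the class names and the tag are invented for illustration, not taken from the plugin's code base:

```java
import org.junit.jupiter.api.Tag;
import org.junit.jupiter.api.Test;
import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertTrue;

// Cheap, isolated check: no processes, no network. Reasonable to run on every commit.
class RemoteUrlTest {
    @Test
    void recognizesGitUrls() {
        assertTrue("https://github.com/jenkinsci/git-plugin.git".endsWith(".git"));
    }
}

// Expensive check: shells out to command-line git, and a variant of it might talk to
// GitHub or Bitbucket. The tag lets the build include or exclude it on purpose,
// for example with Maven Surefire: mvn test -DexcludedGroups=integration
@Tag("integration")
class CommandLineGitTest {
    @Test
    void commandLineGitIsAvailable() throws Exception {
        Process git = new ProcessBuilder("git", "--version").start();
        assertEquals(0, git.waitFor(), "expected command-line git on the PATH");
    }
}
```

The mechanism matters less than the fact that skipping a test becomes a visible, intentional decision rather than something only the cloud bill reveals.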
Now, tests change value over time. The most valuable time, at least for me personally, is during initial development. While I'm creating the code for the very first time, tests are intensely valuable, because they tell me something about the code I'm working on right then. I see them, I understand them, and they surface real problems immediately. So they are high value during initial development. But that value tapers dramatically, because once they're passing, it's pretty rare that they fail. I have to make some change that would cause them to fail, and I'm not always making changes. So another question to ask is: is it worth keeping tests around long term, given that their biggest value is short term, in the initial development? Now again, I'm being hypocritical. The hypocrisy here is that I don't remember the last time I deleted any significant volume of tests. And that hypocrisy is led by fear. I want the tests to tell me that I'm still okay, and I'm willing to spend to have them tell me that. But it's a choice. So admit that initial development is high value; in maintenance, you have to wonder whether it's really worth it.

Okay. In the world of software testing, there are people who have given us a nice simplification: the software testing pyramid. They draw a pyramid with three layers, typically. The layers have some common labels that are trying to tell us: do more at the bottom layer, which is typically unit tests, do less as you go up to integration tests, and even less when you get to end-to-end tests. And it's a good model; it's not a terrible model. But it is, like most models, a simplification. It says: write a bunch of unit tests, some integration tests, and a few end-to-end tests. But it hides a bunch of the real things that are very much part of software. It ignores performance. It doesn't talk about complexity. It doesn't talk about the interactions of human beings through the user interface. Stephen Fishman offers a different view. He suggests we should think of it as a honeycomb, with things that are interconnected to each other, and that we need to think along more axes than a simple straight up and down. You could almost think of it as a physical diamond, an actual carbon object. When you look at the problem of software testing, think of all the facets on a diamond: I shine light through it and get a different pattern depending on the orientation of the thing. I get a different result depending on the wavelength of the light I'm shining through it. There are all sorts of attributes associated with the physical object. The thing I'm trying to test is much more complicated than a pyramid, or really even a honeycomb, illustrates. I've got to ask myself: what about security testing? What about compliance? What about performance issues? What about scalability issues? Any one of those things might justify an automated test. And yet, when I wrote the code initially, I probably didn't write any of those, because my initial mental model was a pyramid, and I thought: oh, I'd better write mostly unit tests, a few integration tests, and an occasional end-to-end test. It's much richer than that as we consider what software testing needs to be.

So now, thinking about the anticipated costs associated with software tests and test automation: initial development, later extensions, and ongoing care each have their own cost profiles. During initial development, I'm focused more on the value I get from the tests, because they're telling me about the code I'm creating. It's an exploration. It's a learning thing.
At that point, I'm actually pretty comfortable discarding a test if it's not really well suited to the thing I'm testing. I'm less attached to the code at that point, oddly enough. Things get worse, though. As time goes forward, the test's value drops. It's a little helpful as history, and I use it as a safety net, but its actual value is diminishing. As we get into ongoing care, the long tail of the software development cycle, it's mostly about run-time execution: I want the safety-net feeling, and it'll give me some diagnosis if something surprises me.

An example in the Git plugin: command-line Git releases a new version with a security fix, and the security fix changes the behavior of the tool that I'm calling. That security fix, we hope, will be flagged by my tests: a test fails because it depended on behavior that the security fix changed. So I hope. What was the actual reality? Most of the tests in that code base kept right on passing, even with a significant change that locked down command-line Git better than it was locked down before. Why? Because the tests that had been written did not exercise that problem. So there's more to it than just "what should we test?" It's also "what do we hope to detect in the future?"

Now, there's one more cost hiding here. Sometimes I write tests that are non-deterministic. I didn't mean to do it. I did not mean to write a flaky test, I really didn't. I think my tests are all perfectly reliable, but I sometimes create flakes. And the cost to diagnose those flakes gets scary expensive, because it's random when they fail. There's something in some condition that causes a test to fail and then succeed in a later run. And my behavior of "oh, just try it again" doubles the cost of running that test. I ran it, it failed. I ran it again, it passed. I ran it twice and got no more useful information out of it. So there are all sorts of costs hiding in this ongoing care. Now, just so you're aware, I like Kent Beck's answer on flaky tests. One of the things he suggested was: delete them. Just delete them. That is an emotionally fraught statement for me, because I love that test. I wrote it. That is my friend. That's like my child. You're telling me to delete my work of art? But my work of art is flawed and flaky, and if it is, it's probably not worth keeping.

So now, what about the unintentional compromises? We've talked about the things where maybe I made real choices. What about the things where I didn't make a choice, but I got blindsided? I was surprised. Users bring some of those. Environments bring some. And in long-lived projects, behavioral inertia brings some. So let's talk about each of those.

On the user side, my tests don't always consider the fact that users, when they start using the product, are untrained. They just don't know what they're doing. They're inexperienced, but they have high expectations for what they should experience, how it should feel. And they're probably running in a different context: I have my little world where I work, and their world is where they work. So be aware.

Different environments. I like my operating systems. I develop on Linux or BSD or macOS or whatever, and my users may make this terrible choice of thinking that Windows is the best choice of operating system, because that's where they have to deliver value. And then I get bitten by the file system problem, or by networking differences, or by the fact that they're running some outdated version of this or that piece of software.
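The file system problem is the one that has bitten me most often. As a hedged illustration, not code from the plugin, here's the shape of a test that passes everywhere I develop and fails the first time someone runs it on Windows, purely because of the path separator:

```java
import java.nio.file.Path;

import org.junit.jupiter.api.Test;
import static org.junit.jupiter.api.Assertions.assertEquals;

class WorkspaceLayoutTest {

    // Fragile version: the expected string hard-codes '/', so the test describes
    // the machines the developer happens to use (Linux, BSD, macOS).
    @Test
    void buildsCheckoutPathTheFragileWay() {
        Path checkout = Path.of("workspace", "repo", ".git");
        assertEquals("workspace/repo/.git", checkout.toString()); // fails on Windows: "workspace\repo\.git"
    }

    // Portable version: compare Path objects (or build the expectation with
    // File.separator) so the assertion states the intent, not one OS's spelling.
    @Test
    void buildsCheckoutPathPortably() {
        Path checkout = Path.of("workspace", "repo", ".git");
        assertEquals(Path.of("workspace", "repo", ".git"), checkout);
    }
}
```

Nothing about the fragile version looks wrong on my machine; the compromise only becomes visible in an environment I never tested.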
Any one of those can highlight unintended compromises in something I didn't test, something I didn't even consider as I was creating those initial tests.

Now, in long-lived projects, there's also the user's expectation that, good or bad, things will continue to behave in the future as they did in the past. Behavioral inertia is the assumption from the user that things will keep doing what they did before. That means sometimes they become addicted to bugs that they really love, and we end up writing tests to check that the bug is still there. Or worse, we fix the bug, the users complain bitterly that they liked the behavior before the fix was made, and we have to put it back. That has been my experience with a number of cases in the Git plugin: I fixed a bug, I was confident it was a bug fix and that it was valuable, and the users said, thank you, but we like the old behavior better; we don't care that you called it a bug. So be aware that real users can want something different than what you think they should want.

So, in terms of your choices, I'd suggest: choose sometimes to skip tests, intentionally deciding that we're not going to run this test. Choose sometimes to throw tests away. And choose willingly, on occasion, to test very differently. And that's really the talk. Thanks very much for coming to cdCon; much appreciated for your time. If you do have questions, you're welcome to ask, and if you want to hear more horror stories, I am happy to tell horror stories.

How do you make that choice? What would be the thought process to decide to do some of those?

Good question. Skip tests: one of the things the Jenkins project is confronting right now is that we have a test suite that runs 200 parallel containers, each of them running tests for 5 to 30 minutes, and the cost is higher than we like. Our sponsoring organizations flinch when they see the cloud bill for that. So one of our questions is: how valuable, actually, are these hundreds of parallel containers that are running these jobs, in terms of answering real questions for us? Running them all in parallel, hundreds at a time, was a nice, elegant test of Jenkins scalability, and we're really quite proud of it. But we've asked ourselves: what do we get out of an incremental run of that? And many times the answer is: not enough value to spend that money. So the skip question, for me, is usually cost driven. Now, it might also be "hey, let's save the planet, let's not waste energy that somebody else could have used instead," but for us it was very much cash. It's cost. Does that give you one example?

Yes, thank you. Perfect, thank you.

Now, on the discard side, the cases where we might choose to discard... sorry, Steve, go ahead.

On that topic, what do you think about just testing the incremental changes? So in your world, let's say you're responsible for the Git plugin, and somebody makes a change to the Subversion plugin, and that's the only thing that changed, but you're still going to run the 200 tests. Does that make sense?

It does. We've already got a partitioning that avoids that case in the general case, so I only see changes to the Git plugin. But there are already enough of those for me to worry that I'm wasting energy. I would take your question even a little narrower.
Isn't there some way that we could conceptually identify the impacted code from the latest change and only run the tests that touch that impacted code? Exactly. And oddly enough, Atlassian's Clover product had this kind of feature some years ago. It would say: I see that your tests have this coverage layout and they touch these lines of code, therefore this delta should only run these tests. It's a really cool idea. Unfortunately, and Clover is now open source, it's OpenClover, I've not seen anybody use it at a scale where I thought they gained any real benefit from it. So it's a valid question, and in fact there are even companies now that gather the data from tests and test reports and feed that data into an AI model to tell you which tests they think you should run to get the best coverage. It's an attractive field right now, and they're working that field to give us some hints that maybe we can still get almost all of the benefit. Yeah, very good. Any other questions? Yes. Oh, microphone please; this is being recorded, so it's important that you're on the mic.

I just wanted your thoughts on TDD versus BDD. Which do you really follow, and do you think it is possible to achieve 100% TDD, like people are talking about today?

Oh, I love that question, thank you very much. So the question was: what's your opinion of test-driven development versus behavior-driven development, if I expand the acronyms? Test-driven development is a methodology in which, as Kent Beck and others shared with us, the first thing you do is write a failing test, then you write one or more lines of production code to pass that test, then you write another failing test, and then you write one or more lines of production code to pass that failing test. Have I given a fair description of TDD? Yes, that's correct. Okay, and I am TDD-addicted, in the sense that I transitioned a team in 2003 from waterfall to extreme programming using TDD, and my team loved it. It was the best thing we ever did. However, recent experience has reminded me that, as much as I love TDD, when I'm exploring an area that I just don't understand, it can get in the way. Recently I was adding an end-of-life monitor to Jenkins, to tell people that their operating system is going to reach end of life. Why would I do that? Because we're not going to support them after their OS is end-of-life, so I'm trying to tell the Jenkins admin: your OS is dead, don't expect Jenkins to help you anymore. While doing that, I was having to learn some things about how to do UI definition in Jenkins and how to write that code. TDD was absolutely my enemy in that case; it was throw-away code. I was working on something that was teaching me more than it was implementing anything for production. So for me, at that point, TDD was a bad thing.

Now, BDD, behavior-driven development, I think, is the idea that I can describe in a high-level language an action that I expect to occur: given this, when I do this, expect this result. That high-level language is then interpreted by something and used to execute the code that will perform those steps. And my challenge with BDD, and my personal experience is limited there, is that our attempts to use it did not succeed nearly as well as our attempts with TDD. Because, inevitably, that high-level expression I was using, the realistic, human-readable, high-level expression, was inadequate to assert enough about what was actually happening. We missed many things that were broken, because the "assert this" step, instead of being the several pages of assertions it should have been, was three or five assertions, or whatever.
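For context on what that kind of high-level expression looks like, here's a hedged sketch in the Cucumber style; the scenario and the step definitions are invented for illustration and are not from any real Jenkins test suite:

```java
// A scenario like this is the "given / when / then" language:
//
//   Scenario: Clone a repository
//     Given a remote repository with one commit
//     When the plugin checks out the default branch
//     Then the workspace contains that commit
//
// Each line is bound to a step definition such as the ones below (Cucumber-JVM style).

import io.cucumber.java.en.Given;
import io.cucumber.java.en.Then;
import io.cucumber.java.en.When;
import static org.junit.jupiter.api.Assertions.assertEquals;

public class CheckoutSteps {
    private int remoteCommits;
    private int workspaceCommits;

    @Given("a remote repository with one commit")
    public void aRemoteRepositoryWithOneCommit() {
        remoteCommits = 1; // stand-in for creating a real repository
    }

    @When("the plugin checks out the default branch")
    public void thePluginChecksOutTheDefaultBranch() {
        workspaceCommits = remoteCommits; // stand-in for the real checkout
    }

    @Then("the workspace contains that commit")
    public void theWorkspaceContainsThatCommit() {
        // One line of "assert this" hides everything it does not say about
        // branches, tags, file contents, permissions, and all the rest.
        assertEquals(1, workspaceCommits);
    }
}
```

The scenario reads beautifully; the compromise hides in that single Then step, which stands in for all the assertions it never makes.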
Now, this is just me; I can't claim this for anybody else. For me, my experience with the higher levels of testing has been that the pyramid is sort of right: I get much more value out of lower-level testing. I want to find things as low as I can, and BDD, for me, is up at that very top, in the end-to-end kind of testing. Did that answer your question? That makes sense. Well, I don't know that it makes sense. Again, it's a shameful, terrible thing to say it that way, right? Because I know what I've been taught. I know what people have said about behavior-driven development, that it really is powerful, and I'm sure there are people for whom it works very, very well. I, unfortunately, have not been one of those people. Thank you. Any other questions? Mic to Steve, please.

We kind of touched on this one while we were chatting before you started. Is it actually worthwhile to spend time writing tests, instead of working on your ability to recover quickly?

Oh, that is a fun lead question. I like that one.

It goes into the whole chaos engineering theory that it doesn't matter if you actually have problems in your code: as long as you can recover very quickly, you're actually better off, because even if you went through and wrote all these tests and you missed something, you still have the same recovery problem. Does that make sense, Mark?

It does, and I think that's a brilliant concept, and one that I've not done an adequate job of considering. The idea that I should be able to deploy code, maybe deploy fractionally if I'm operating at scale, to 10% of my users or something, but deploy code and confidently roll it back if there's a problem, may be the best of all possible worlds. Now, this probably assumes I'm running software as a service, because I've got to be able to control the rollback. Twenty years ago, when I was delivering code to a customer site that was managing their databases, there was no way for me to roll back their code in less than a multi-month process. But if I control the deployment and I own the deployment, why not consider that maybe it's more valuable, instead of investing in a test, to invest in confident rollback, or confident roll-forward to a previous state, so that I don't have to do anything more than say, hey, I made a mistake, roll it back, and make it that easy? Good suggestion, thank you. And that aligns with what I'd call Kohsuke's experimental mode, where the Git plugin went 15 months with no tests. Why? Because he tested interactively and could roll out changes really quickly, and the small user base meant everybody who cared about it was willing to roll with him.

Yeah, good. Any other questions? Have we reached time? Sorry, I don't mean to go beyond your time. We have reached time; you can't ask any more questions. Thank you.