Hello, my name is Paul Albertella. I'm a consultant working for Codethink in the UK, and I'm going to talk to you today about automating system-level safety testing using open source tools. I've been working with complex software systems since 1990, ranging in scale from mainframes to mobile phones, and most recently Linux-based systems in automotive. I'm a long-standing user and advocate of open source software, which I originally encountered when developing for mobile platforms like Android and MeeGo. Working with automotive manufacturers fuelled my interest in software engineering processes, and in particular in how open source tools can be used to improve them. This led me to join Codethink in 2019, where my work has mainly focused on safety. I'm part of the ELISA project, which is about enabling Linux in safety applications, and my work there and at Codethink has been developing a new approach to safety with open source software in mind. I've written and talked about this, most recently at OSS North America, and today my talk turns to testing.

Testing obviously plays an essential role in safety. When our software is part of a safety-critical system, we not only need to verify its functionality, but also to show that it can continue to function, or transition safely to a degraded level of functionality, when things go wrong. Validating software as part of its final system is clearly essential, but performing all software testing in a complete system context can be a challenging and expensive proposition. This inevitably means that software testing happens in a variety of contexts and at various levels of integration, whether we are confirming that an individual component has been implemented according to its specification, integrating a set of components to make sure they all play nicely together, or stress testing a fully integrated system on the final target hardware to make sure that it continues to achieve its safety goals even under duress.

Safety standards dealing with software tend to place a strong emphasis on verifying software components against detailed, formal specifications. This relies on system design verification to ensure that the specified behaviour, as part of the system, will achieve the wider system's safety goals. But there are aspects of system behaviour and misbehaviour that may be difficult to cover in sufficient detail in the design, especially for more complex systems. Safety is a system property rather than a property of individual components. While we can test components in isolation to ensure that their logic and implementation are sound, some properties may depend on the final system context, whether that is the underlying processor architecture or interactions with hardware or other software components, and we can't always account for these or make them visible in the design. Furthermore, the factors that can lead to a lack of safety may be emergent properties of the system, which means that they can only be properly understood by considering components together, in context, and with the system's overall safety goals in mind. System-level testing is therefore always regarded as a vital step in validating a design, but if it is left too late in the development process, we may discover unforeseen design problems or issues with component specifications at a point where they are difficult or expensive to fix.
But when we're thinking about testing in the context of safety, and of compliance with safety standards in particular, there are a number of other things that we need to consider. Although we will want to use tests to confirm that our software does what it is supposed to do, we are less interested in the happy paths and more interested in what happens when things go wrong, either with our software or with the other hardware and software components that it depends on. In particular, we want to focus on things that can actually impact the system's safety goals. This means that we need to understand what those safety goals are, and how our component or our software is responsible for achieving or contributing to them. This should be documented in our safety requirements, but these might not always be provided at an appropriate level of detail, which means that we may have to break them down to the level of detail we need, either for our software requirements or for our test cases. And even when we have requirements in sufficient detail, we still need to show how our tests relate to them: which requirements they are intended to verify, and for what set of preconditions.

We also need to show how we can be confident that a set of tests actually achieves its objectives. Some level of human review is necessary here, but it is not sufficient; it needs to be supported by review criteria and metrics to help us quantify our confidence. Do the tests cover all of the required behaviour, conditions and interfaces? Does the test implementation actually verify the criteria that are expressed in the requirement? If the original requirement is at a high level, how has it been broken down into more detailed criteria that we can verify? Accomplishing all of these things is hard enough, but we also need our test results to be consistent and reproducible, which can be particularly hard for system-level tests. Automation can obviously help here by ensuring that the test steps, at least, are reproducible, but we also need a way to reliably reproduce the system that we're testing, as well as our test implementations and the environment within which they execute.

An approach to these problems that I've been exploring in my work at Codethink and as part of ELISA is based on a methodology called STPA, which stands for System-Theoretic Process Analysis. This technique was developed at MIT based on the work of Professor Nancy Leveson, which originally focused on accident investigation. STPA provides a way to analyse complex critical systems using control theory, which allows us to model them using control structures that we can use to eliminate or reduce adverse events. In addition to the automated systems that are our principal focus in functional safety, it allows us to model human and environmental interactions and organisational, or even governmental, control processes. By focusing on the specific outcomes that we want to avoid, as dictated by our safety goals, and on the system conditions that can lead to these, it lets us concentrate our analysis on those aspects of system behaviour that are most important to us from a safety perspective. But we can also model the system's other goals, relating to performance, customer satisfaction or simple economics, to ensure that these can be balanced against the imperatives of safety.

The outputs of this analysis provide a rich basis for deriving test cases, as well as a consistent way to derive and document detailed requirements from system-level safety goals and component-level safety requirements. By identifying the losses that we wish to avoid and the system conditions that can lead to them, we're able to define constraints: system- or component-level criteria that must always be satisfied, which must inform the system design and which our tests need to verify. We can also identify triggering events, called unsafe control actions, and specific worst-case system conditions, or loss scenarios, that we can use as fault injection scenarios to validate the effectiveness of both our tests and the system's safety mechanisms.
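To make that a little more concrete, here is a purely hypothetical sketch of how such STPA outputs might be recorded for an imaginary monitoring component; the identifiers, wording and YAML structure are invented for illustration, and are not taken from the talk or from any prescribed STPA format.

```yaml
# Hypothetical STPA-derived artefacts for an imaginary monitoring component.
# Identifiers and structure are illustrative only.
losses:
  L1: Occupants are harmed because the vehicle fails to stop when required
hazards:
  H1: The braking function is unavailable while the vehicle is in motion  # leads to L1
safety_constraints:
  SC1: If the braking service fails, the system must enter a safe state within 100 ms  # addresses H1
unsafe_control_actions:
  UCA1: The monitor does not signal a failure when the braking service has stopped responding  # violates SC1
loss_scenarios:
  LS1: The braking service hangs but keeps emitting stale heartbeats, so the monitor never detects the failure  # candidate fault injection for UCA1
```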
STPA goes some way towards addressing the challenges that I described, by helping us derive sufficiently detailed requirements and by helping us understand how those requirements relate both to tests and to the safety goals of the system. But to ensure that we have a common understanding of our tests, we need them to be meaningful to their creators, the analysts who define the system and safety requirements, and to those responsible for implementing, reviewing and executing the tests; and in the last case, since we want to automate our tests, that means a machine rather than a human.

A test scenario language is a structured and restricted form of natural language that we can use to document a test. It provides a logical syntax that can be parsed by a computer program, but uses language constructs that are also meaningful to humans. There are various models of this type of language, some of them popularised by the agile and behaviour-driven development movements, but the underlying concept has a long history, and it is enjoying a bit of a renaissance thanks to its increasing adoption by the developers of self-driving cars.
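As a purely illustrative sketch of the style (the wording is invented for this example, and the exact syntax will depend on the tool you use), a system-level safety scenario derived from the earlier hypothetical analysis might read something like this:

```
scenario: monitor handles a hung braking service
given a system image built with the braking-service fault injection patch
and the image deployed to a suitable test device
when the braking service is forced to hang
then the monitor reports a failure within 100 ms
and the system enters its degraded safe state
```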
So now we have a way to derive and then describe suitable system-level tests. But there are still lots of questions for us to answer if we're going to automate them, run them in a reproducible way, and produce the kind of evidence we need to review them and to satisfy the requirements of safety standards. What is the system or the component that we're actually testing? What version of it have we built? Where did we get the source code that we used to build that version? How is it built? Can we build it reproducibly? What else is part of the system under test, and how do we integrate the system that we're testing into that context? And what is the wider context in which it's being tested? What hardware are we testing it on? Or, if we're testing it in a virtual environment such as QEMU, which instance of QEMU do we use, and how is it configured? What actions does the test involve? We can get this from our test scenario language. What observations need to be made; in other words, what information do we need to gather as a result of the test in order to evaluate it against our criteria for success or failure? How are these results collected, if the test is actually happening on a system other than the one where we're running through our scenarios? And how are the results of those tests going to be published, in a form that we can review later and share with the safety assessors who are going to give us a certificate for our system?

So now I'm going to talk about the open source tools that we're using to address these challenges. The first of these is Subplot, which is a set of tools for specifying, documenting and implementing automated tests for systems and software. Its focus is on producing a human-readable document of acceptance criteria and a program that automatically tests a system against those criteria. More specifically, it aims to help developers specify automated tests so that those tests can be executed, but also understood without requiring programming knowledge. While the examples provided by the project, which include the acceptance tests for the tools themselves, are application-level rather than system-level, the framework is designed to be extensible to any kind of scenario or test implementation language; there is currently support for Python and Rust. Scenarios are written in Markdown, with some Subplot-specific annotations in fenced code blocks to assist with parsing. Tools are included for producing human-readable documents from these using Pandoc, and the input documents can also include PlantUML or Graphviz code blocks to add images to accompany the text.

Binding files, written in YAML, are then used to link the syntax elements of the scenario language to a test implementation. The best way to explain what that means is to show you an example. As you can see in this example from the Subplot documentation, the structure of the file is reasonably simple. It begins with type definitions for two pieces of data that are used in the tests, one of them a string and the other an integer. It then identifies a pattern, that is, a sequence of text to be matched in a scenario definition for a given clause, together with the corresponding function that is to be called in the test implementation. The first example includes a cleanup function, which is called at the end of the test, for example to release a resource that was assigned. The second example has a produces clause, which is how an item of test data is identified from the scenario text using the definitions from the start of the file. The third example illustrates how this data can then be used in a subsequent test step, and the final example shows how the result of a when step can be checked in a then clause. Because these files are independent of the actual test implementation, the same set of bindings could be used to drive implementations in different languages. And just so you can see how this works with the scenario language file, here is the other half of the Subplot example.

Now, while this approach might seem over-elaborate for a relatively simple example like this, its advantages quickly become apparent when defining more complicated scenarios. In system-level scenarios, for example, we may have a wide range of given clauses that are required to put the system under test into the correct state before actual testing can begin. This might include a step to build the system under test, another to build a dependency with a patch that enables a software fault injection, and further steps to integrate the results as part of a system image and then to deploy it to a suitable test machine.
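To give a rough idea of what such bindings look like, here is a sketch matching the earlier hypothetical scenario. It is illustrative only: the exact Subplot bindings schema has changed over time and may differ from what is shown here, and the patterns and function names are invented for this example.

```yaml
# Illustrative sketch only; not guaranteed to match the current Subplot schema.
# Each entry links a scenario pattern to a function in the test implementation.
- given: a system image built with the {component} fault injection patch
  impl:
    python:
      function: build_image_with_fault_injection
- given: the image deployed to a suitable test device
  impl:
    python:
      function: deploy_image_to_test_device
- when: the {service} service is forced to hang
  impl:
    python:
      function: inject_service_hang
- then: the monitor reports a failure within {limit} ms
  impl:
    python:
      function: check_monitor_reported_failure
```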
Managing such build and integration steps effectively is where we turn to our second open source tool, BuildStream. This allows us to build our systems and tests deterministically, so that we can be confident that we'll get identical results from the same set of inputs. BuildStream's definition files, which are again written in YAML, allow us to do this very precisely, using a declarative syntax that can specify unambiguous revisions for all build inputs. This includes both build-time and runtime dependencies, which can be very important, as the behaviour of a system may be affected by the toolchains used in its construction. Builds are also performed in a sandboxed environment, which can eliminate the possibility of untracked dependencies being incorporated as part of the construction process.

Doing all of this building, especially if you're building from source and having to pull all of that source into your sandboxed environment, could be a huge overhead, but BuildStream has been designed to do this at scale. It can be configured to parallelise build actions across multiple build instances using the remote execution API originally developed for Bazel, and it uses remote artifact caching to avoid rebuilding artifacts that have not changed since the last time they were built. We can be confident that they haven't changed because we control all of the inputs.

And we can be certain that we have our inputs under control by using BuildStream as part of a continuous integration workflow following the Deterministic Construction Service design pattern. This pattern, which you can see here in the form of an STPA control structure diagram, was developed for the construction of complex safety-critical software and was certified to the ISO 26262 safety standard. It allows us to prove that all the inputs to our construction process are under control by verifying the binary reproducibility of its outputs: if we can get identical binary results from the same set of inputs, then we know that those inputs can't possibly have changed, provided, of course, that we can also make the construction process itself binary-reproducible. These inputs include the build and test environments within which construction and verification take place, including the container images used by CI jobs, which in our setup are managed by GitLab.
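To give a flavour of what a BuildStream definition might look like, here is a minimal sketch of an element for a component under test; the element names, URL and commit hash are placeholders, and a real project would organise its elements, options and junctions rather differently.

```yaml
# component-under-test.bst (illustrative sketch; names and URL are placeholders)
kind: autotools
description: Component under test, pinned to an exact source revision

sources:
- kind: git
  url: https://gitlab.example.com/project/component.git
  track: main
  ref: 0123456789abcdef0123456789abcdef01234567  # exact commit to build

build-depends:
- toolchain.bst       # the toolchain is itself a tracked, versioned input

depends:
- base-runtime.bst    # runtime dependencies are declared and pinned too
```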
Having built all the software that we need for our tests, however, we still need one more open source tool to complete our setup. This is LAVA, which lets us manage the hardware and virtual devices that are used when deploying our systems for testing. We can use LAVA to define pools of devices with different characteristics, using tags; these might be used to distinguish between devices with different peripherals, for example, or different processor architectures. LAVA has a web-based GUI that gives us a useful window onto our devices, but most importantly it allows us to define the connections that we use to interact with our managed devices from within a test. Hence we can have a scenario-driven test running in a CI job context, for example on a runner provided by GitLab, ask LAVA for an available device with the required characteristics, deploy a system image to it, boot the resulting system, and then interact with it via a system console or a suitable system service API to run a series of test steps and collect the results.

Here you can see an example of a LAVA deployment which Codethink is using for long-term testing of upstream Linux kernel changes. It has only a few devices: a QEMU device, which is an emulated hardware device, and two hardware devices, one of which you can see is offline at the moment because it is failing its health checks. If we have a look at this one, which is a Raspberry Pi, we can see that it has a health job to make sure that it's up and running, so that we know whether jobs can be scheduled to it; we can also see the jobs that have been scheduled to it, their status, and the actions we can take on them to look at their results.
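To show roughly what driving LAVA looks like, here is a heavily simplified sketch of a LAVA job definition for a QEMU device. The URLs, image name and test definition are placeholders, and a real job would need more deployment, boot and context detail for the specific device type.

```yaml
# Illustrative sketch of a LAVA job definition; URLs and names are placeholders.
job_name: monitor-fault-injection-smoke
device_type: qemu
visibility: public
timeouts:
  job:
    minutes: 30

actions:
- deploy:
    to: tmpfs
    images:
      rootfs:
        image_arg: '-drive format=raw,file={rootfs}'
        url: https://artifacts.example.com/images/system-under-test.img

- boot:
    method: qemu
    media: tmpfs
    prompts:
    - 'root@'

- test:
    definitions:
    - from: git
      repository: https://gitlab.example.com/project/tests.git
      path: lava/monitor-smoke.yaml
      name: monitor-smoke
```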
So let's put all of these open source tools together and see how they help us deal with the challenges of system-level testing. We start by using Subplot to help us define our tests using a scenario language. These scenarios are derived from the safety analysis that we've performed on our system using STPA, which provides us with models of the system that focus on its safety goals and allows us to define meaningful safety requirements for the components that make up that system. By using a language that everyone can understand to define these tests, our safety analysts can review the scenarios to confirm that they meet the safety requirements, and our developers can implement the test scripts or programs that turn the descriptions into automated tests. The relationship between the scenario descriptions and these test implementations is captured in bindings files, which describe how the syntax patterns in the scenario text correspond to the test implementations.

Subplot uses the scenarios and bindings to build our actual tests, but because we want to do all of our construction in a controlled and reproducible manner, we have BuildStream to manage the process, defining all the required inputs and leaving it to control the build environment. This control extends to obtaining the required inputs from their source repositories, typically using a precise and persistent revision reference, such as the SHA-1 commit hashes used by Git. Assembling all the elements we need for a test may involve building the tests and their dependencies, the system components that we want to test, and the other elements of a system image that we would deploy to a device for testing, as well as the toolchains and other build-time dependencies that we might need during construction. By declaring all of these inputs in our build definitions, BuildStream is able to construct everything deterministically, allowing all of the elements required for a test to be exactly and consistently reproduced. While in some cases this may require building or pulling in a lot of binaries from scratch, BuildStream uses remote artifact caching to ensure that we only need to build things when their inputs have actually changed, and we can parallelise independent build steps for a large or complex system using a remote execution service. A common cache can be shared by all of these build steps, as well as by those for other software components or those triggered by other CI jobs, which means that the artifacts for most inputs will very likely already have been built.

Depending on the nature of the scenarios that we've defined, the tests produced by Subplot might be incorporated into a system image that we're going to deploy to a test device, or they might be used in a CI job context. In the latter case, this means that we have test scenarios that define a complex set of preconditions for executing a system-level test, which may include the construction of a specific system element and its deployment to a specific type of device. This is important because some of our safety-related scenarios are going to want to include fault injections, which can be implemented using patches to a software component that deliberately cause a malfunction, so that we can verify the behaviour of a safety mechanism that is intended to deal with such failures.

We then use LAVA to manage the devices that we're going to deploy our software to in order to run our tests. These may be hardware or virtual devices, and they can be grouped by common characteristics using tags. LAVA not only provides tests with the means to request a suitable device, but also with the means to interact with it. This means that a scenario-driven test can deploy a system image to a specific type of device, or even to two interacting devices, and then execute test steps on the resulting system, as well as collecting the results of the test.

By bringing all of these elements together in a coordinated workflow, we can automate system-level tests as part of a continuous integration workflow. When this forms the basis of a compliant development process for a safety-critical system, system-level testing can inform and complement the system design process, especially when we're using STPA to provide an overarching model of the system and component safety requirements. Instead of discovering issues with hardware integration or with inadequate component specifications late in a product's development, we can start defining our system-level test scenarios as early as possible and begin by verifying them on virtual hardware. And because this setup gives us fine-grained control over all of our inputs and our build and verification processes, we can confidently meet the requirements of safety standards and produce the evidence to support this as a direct output of our CI jobs.

As Codethink was able to demonstrate with the Deterministic Construction Service, it is possible to leverage the strengths and flexibility of open source tools and working practices to address the challenges of developing complex safety-critical systems, without compromising on the clearly specified and closely controlled development processes that such systems necessarily require. I hope this talk has inspired you to learn more, and I'll be very happy to answer your questions.