Hello everyone, my name is Paolo Bonzini, I'm a Distinguished Engineer at Red Hat, and this is the question I'm going to try to answer today: is QEMU too complex, and what can we do about it? Now, if you've ever heard about Betteridge's law of headlines, you may wonder if it applies in this case. The law states that any headline that ends with a question mark can be answered by the word no. So is QEMU really not too complex? Unfortunately, the law doesn't apply in this case. QEMU is indeed a complex program, and it's definitely too complex. On the other hand, what isn't? Bear with me, as throughout the rest of the presentation I will hopefully make it clear what I mean.
So why is complexity a problem? Why should we even ask ourselves if QEMU is too complex? First, complexity generates bugs. Some of them may be stupid bugs, some of them may be security sensitive, but all of them are certainly annoying. Complexity makes code harder to review. It makes it harder to contribute to a project. We want none of this, but to what extent can we eliminate complexity?
Complexity can be divided into essential complexity and accidental complexity. A lot of QEMU's complexity is essential: it's a property of the problem that QEMU is trying to solve. For example, QEMU has to do all of these things. It has to emulate and migrate guest devices. It provides a management interface, usually called the monitor. It deals with storage, including migrating and streaming it. It embeds a few network servers, such as a VNC server. It's also a portable program and a highly configurable one. In order to satisfy its users' performance needs, QEMU has to support concurrent asynchronous I/O. It supports TLS to guarantee secure communication. TLS on a serial port, for example, may be something that most people don't use, but it solves real problems. It's there, and it's part of the essential complexity of QEMU. QEMU has to provide hotplug capabilities.
It has to provide stable CPU models after you upgrade the hardware and stable device models after you upgrade QEMU itself. I have already mentioned live migration, and for many users it's also important to use QEMU with a distribution kernel rather than a custom-built kernel. Even being able to boot non-Linux operating systems is a necessary feature for many QEMU users, and it counts as essential complexity. It's also essential for many users to have an easy way to interact with the program, which is why QEMU doesn't just have the JSON-based QMP interface, but also what's called the human monitor interface. For the same reason, QEMU provides easy options for command-line use. Even if my customers are enterprise customers and they use libvirt and KubeVirt and all that, I myself prefer using these easy options when developing. They are just so much handier for quick testing of new features, for example, and for debugging. Also for debugging, QEMU provides a disassembler. And because of how portable it is, it supports multiple accelerators: KVM on Linux, Hypervisor Framework on macOS, the Windows Hypervisor Platform, and even a new accelerator for NetBSD. QEMU contains an object model with the ability to marshal and unmarshal those objects and the values of their properties between C structs, JSON objects, strings for the command line, and strings for the human monitor. Even the graphical user interface of QEMU is part of essential complexity. Nowadays, it's pretty common to use VNC with interfaces such as virt-manager or virt-viewer, but having a simple user interface is still useful for prototyping.
As a developer, you may also be interested in the complexity that tools bring in. These could be both internal and external tools. Why do we use tools as part of building QEMU? Basically, it's a trade-off between making common tasks easier and occasionally, unfortunately, making debugging harder when those tools break.
For example, in the build system we used to use only shell scripts and make, but right now the build rules are generated by Meson, and Meson also handles a lot of the work that was previously performed by shell scripts. We used to have a handwritten configuration listing all the devices in a board, but now we have a dependency system that automatically enables the devices that are required or recommended by a board. The configuration management also ensures that impossible configurations don't build, which is easier for developers. Another example is code generation for the marshalling and unmarshalling of C structs, which is taken care of by the QAPI code generator.
Now, this is all about the essential complexity of QEMU. Accidental complexity, instead, is what makes QEMU too complex. It's just a property of the program that solves QEMU's problems, but it's not intrinsic to the problem that QEMU solves. How do you fight accidental complexity? First and foremost, you need to understand the domain. Know what the essential part of the complexity is, so that you don't mistake it for something that is just accidental. You should also know what it is that ends up generating accidental complexity. When making changes to the program, you should always keep in mind your knowledge of the domain, your knowledge of why some things are intrinsically hard. And not just that: essential complexity can be used to your advantage. After all, it's not going to go away, so you might as well use it to solve the problem in the best possible way. At the same time, watch out for accidental complexity as it arises. Catch it before it takes over the project. This, in my opinion, is a primary part of a reviewer's job. Sure, reviewers have to make sure that the code works and doesn't have any security issues, but a huge part of being a maintainer is also keeping code complexity at bay. What are the sources of accidental complexity?
How do you notice that accidental complexity is increasing in your project? For example, you could start noticing multiple incomplete transitions. Incomplete transitions in many cases arise when you find a new and better way to do the same thing, but you don't apply it everywhere in the codebase. Incomplete transitions also arise when features are introduced but are only supported by a few select targets and devices. So you end up with two different ways to do the same thing, or with things that may or may not work depending on what exactly the user is doing. Another sign of accidental complexity is duplicated logic. For example, duplicated logic arises when the code is missing some useful abstractions that could be applied throughout it, or when too much code is ad hoc and forgoes a more general approach to the task.
So let's start with incomplete transitions. In QEMU, there are many cases in which we started with a new way of doing something, but we haven't applied it to the whole codebase. For example, there's error reporting, where we have a propagation-based API and some functions that just write errors to the standard output. The propagation-based API was introduced in order to have errors in the QMP monitor, and to separate the point where errors happen from the point where, and the way in which, errors are reported. It also allows for error recovery because, for example, a recoverable error can be caught using the API and not reported to the user at all.
Another one is modeling the boards. Newer boards tend to use QOM more. They tend to attach child objects to the board in some kind of composition tree that matches the composition tree of the C structs. Older boards just create devices and don't give them an explicit relationship other than pins or interrupts, or maybe not even that. Some older boards do not even have devices as separate QOM objects, and those devices are just random malloc'ed blocks of memory connected by function calls.
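To make the idea of a composition tree a little more concrete, here is a minimal sketch in Python. It is not QEMU's object model or its API: the Node class, the add_child, path and resolve methods, and the "example-*" type names are all made up for illustration. It only shows what it means for a board to own named child objects that can be found by path.

```python
class Node:
    """A hypothetical object in a composition tree (not QEMU's QOM API)."""

    def __init__(self, type_name):
        self.type_name = type_name
        self.parent = None
        self.name = None
        self.children = {}

    def add_child(self, name, child):
        """Attach child under this node with an explicit, unique name."""
        if name in self.children:
            raise ValueError(f"duplicate child name: {name}")
        if child.parent is not None:
            raise ValueError("child already has a parent")
        child.parent = self
        child.name = name
        self.children[name] = child
        return child

    def path(self):
        """Return the path from the root, for example "/soc/uart0"."""
        parts = []
        node = self
        while node.parent is not None:
            parts.append(node.name)
            node = node.parent
        return "/" + "/".join(reversed(parts))

    def resolve(self, path):
        """Walk down from this node following a slash-separated path."""
        node = self
        for part in path.strip("/").split("/"):
            if part:
                node = node.children[part]
        return node


def build_board():
    # Illustrative board: every device has an explicit place in the tree,
    # instead of being an anonymous block of memory that only the board
    # creation function knows about.
    board = Node("example-board")
    soc = board.add_child("soc", Node("example-soc"))
    soc.add_child("uart0", Node("example-uart"))
    soc.add_child("timer", Node("example-timer"))
    return board
```

With a tree like this, any part of the program can look up a device by path and see who owns it, rather than relying on pointers passed around by hand.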
This new way of modeling composition produces better modularity and makes the code easier to understand. Another incomplete transition has to do with live migration. Here, sometimes, the older VMState register API cannot be removed at all, because switching to the new API might break the migration format across versions of QEMU. And there are also, of course, simpler examples. Most timers are allocated dynamically and used through a pointer. But really, there's no reason for that, and we could just embed them in a struct and save a couple of lines of code here and there. And of course, the newest incomplete transition in QEMU is moving configuration tests from shell scripts to Meson.
That said, we also have a decent track record of completing transitions. In many cases, these automated transitions were done with Coccinelle. If you don't know it, that's a tool that takes so-called semantic patches and applies them uniformly to a large codebase. The four examples in the slides were all done in an automated fashion. The first one simply removes useless function calls. Well, of course, those function calls were not useless originally; they became useless after the API was improved. In the second and third cases, the semantic patch looks at code that is jumping through unnecessary hoops and turns it into something simpler. In the final case, however, a whole new API for creating and realizing new devices was introduced almost exclusively with Coccinelle. This was a very large change in the API, and it was actually realized as a single series of patches thanks to the power of this tool and, of course, the persistence of the submitter.
Looking instead at features with incomplete support, a lot of these are in block devices. That's because many boards have block devices with no real need for sophisticated features, and no one has added the required support code to, for example, the venerable floppy device.
For example, there is error reporting, where you can ask QEMU to stop the VM when there is an I/O error on a disk. This is only supported by a few devices, and so is accounting of I/O operations, where you can ask QEMU how many bytes have been read from or written to disk and how long ago. Some devices even still have blocking I/O, and that goes for both block devices, that is storage, and character devices such as serial ports. Unfortunately, in these cases you mostly just have to put in the work. Sometimes it may be possible to do this in a more automated way, maybe at a lower level in the API so that it comes for free in the device model, but those are really just the lucky cases.
Anyway, incomplete transitions are not always bad, because transitioning from an old API to a new one is really just part of how QEMU is improved. You shouldn't feel bad about introducing new APIs, and sometimes the new feature may require a transition period anyway. In that case, an incomplete transition becomes a fact of life, for example if there could be effects on the command line or on the management tools, which require a deprecation period. In that case, just use the transition period to your advantage: work in phases, commit the smallest amount of work that already constitutes an improvement, and make a plan for what comes later. Again, an incomplete transition or a piecewise transition should not deter you from improving QEMU. So, to summarize, do not be afraid of transitions; just make a plan and ensure that you have good test coverage, which will eliminate some sources of bugs in the old APIs. And do learn Coccinelle, because it's a really powerful tool.
Now let's move on to duplicated logic, and specifically missing abstractions. A simple example is the handling of dirty pages in display emulation code. You can see here a simple graph, and you can see how many device models use the same two functions, snapshot-and-clear-dirty and snapshot-get-dirty. They use them in very, very similar ways.
All of them except TCX and VGA at the bottom have exactly one call to snapshot-get-dirty and one call to snapshot-and-clear-dirty. These functions are already a pretty high-level abstraction over dirty page handling, but you may wonder if there could be more duplicated code around the function calls, perhaps around handling a framebuffer, and whether it would be possible to abstract it. However, be careful, because a new abstraction may also become an incomplete transition.
Another source of duplicated logic is ad hoc code. There is a trade-off between writing code that is ad hoc and designing data structures that are more reusable or more wide-ranging. For example, command-line arguments and other inputs are sometimes parsed manually with functions that are not even that easy to use, such as sscanf. Other times we use QemuOpts or keyval, which take care of parsing or even printing the messages consistently. Another recent example: a new mechanism replaced the many scattered tables and functions related to module functionality and module dependencies. This new mechanism is called modinfo, and it places the information directly in the module. It doesn't scatter it around the whole source code, which is nicer for modularity and makes it more obvious what to do in a new module. So as soon as you notice excessive duplication, or code that is too dispersed, you should think of a transition plan to eliminate that, and possibly make that transition plan work piecewise, where the first step only needs to be a measurable but small improvement over what exists already.
So let's now look at a case study. The QEMU command line has about 120 different options, and it's implemented in about 3,000 lines of code. There is certainly some essential complexity in these 3,000 lines of code, but there's also way, way too much accidental complexity. So if one has to work on this command-line parsing code, how can one avoid making it worse, and what can be done to simplify it?
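As an aside on the trade-off between ad hoc parsing and a reusable parser mentioned above, here is a small sketch of what "one shared, schema-driven parser" can look like. It is written in Python purely for illustration and is not QemuOpts or keyval: parse_keyval, parse_bool, DISK_SCHEMA and the parameter names are all hypothetical. The point is that each option only declares its keys and types, while splitting, validation, and error messages live in one place.

```python
def parse_bool(value):
    """Convert "on"/"off" to a boolean; anything else is an error."""
    if value in ("on", "off"):
        return value == "on"
    raise ValueError(value)


def parse_keyval(text, schema):
    """Parse "key=value,key=value" according to a declared schema.

    schema maps each accepted key to a converter such as int or str.
    Hypothetical helper: it only illustrates the idea of a shared parser
    and does not handle escaping of commas inside values.
    """
    result = {}
    for item in text.split(","):
        key, sep, value = item.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {item!r}")
        if key not in schema:
            raise ValueError(f"unknown parameter {key!r}")
        if key in result:
            raise ValueError(f"duplicate parameter {key!r}")
        try:
            result[key] = schema[key](value)
        except ValueError:
            # Every option gets the same, consistent error message.
            raise ValueError(f"invalid value for {key!r}: {value!r}") from None
    return result


# Illustrative schema for a made-up option; the keys are assumptions.
DISK_SCHEMA = {"file": str, "cache-size": int, "readonly": parse_bool}
```

For instance, parse_keyval("file=disk.img,cache-size=64,readonly=on", DISK_SCHEMA) yields a dictionary with a string, an integer, and a boolean, and a misspelled key is rejected with the same kind of message for every option that uses the helper.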
First of all, we should look at what causes accidental complexity in the command-line parsing code. To do this, I have divided those 120 command-line options into six groups: flexible options, command options, combo options, shortcut options, one-off options, and legacy options. These groups are sorted from the ones that are more essential to the ones that really need refactoring, deprecation, or removal.
Flexible options are the more complicated ones. I counted really only 10 of them, and they generally create either backend objects or frontend objects. Frontend objects are those that are visible to the user or to the machine, while backend objects interact with the host operating system or with other processes. Usually, these options are not implemented in the 3,000 lines of command-line parsing code. They delegate as much as possible to QAPI or QOM, or at least to some external function through function pointers. This is good, because often there's no need at all to touch the core command-line parsing code in order to add new features. Whenever a new feature is added to QEMU, instead, we try to fit it into one of these options, for example using -object for all the new backends. On the other hand, there's a little bit of accidental complexity in flexible options as well, because there are at least four parsers. There's QemuOpts; there's keyval, which is a cleaner but more limited version of QemuOpts; there's JSON parsing; and sometimes there are even bespoke parsers, such as the one used by -cpu. Four parsers are at least two more than there should be. In addition, some of these options could be merged into others. For example, -chardev, -display, and -netdev might be turned into shortcuts for -object. Another one could also be turned into a shortcut for -object. And the CPU model is really a part of the machine configuration, so maybe it could be included in the -machine command-line option.
Next in our classification, there are command options.
Those are options that perform an action. Typically each corresponds to a QMP command, but the action is specified on the command line. The reason for these options is that these commands have to be executed already at the time of machine creation, for example enabling a trace point, or even telling QEMU not to start the VM right away, as is the case for -S (uppercase S). These command options put a relatively small burden on the maintainer. Many of them are a one-to-one mapping to QMP, for example -action, -plugin, -loadvm. Nevertheless, we should keep a high bar for adding new ones, because it's usually easier to just invoke the same commands from the monitor instead of the command line.
Combo options are where we start the descent into accidental complexity hell. These options create both the backend and the frontend in a single command-line option. They are essential because they are really useful for users, but they place a very high burden on maintainers. The parsing code is complex, and they also tend to have ramifications in the rest of the code, both the backend code and the board creation code. In this sense, they are also the worst for modularity.
Then there are shortcut options. There are many of these. There are shortcuts for -accel, such as -enable-kvm. There are dozens of shortcuts for each of -display, -drive, -machine. For many of these options, you may not even know that they are shortcuts; you have just used them every day. You just use -kernel; you've never heard of -machine kernel. And this is actually good. In fact, some of these options have only very recently become shortcuts, and when they did, there was no user-visible change, just improved modularity inside QEMU, because these options are only known to the command-line parsing code. The command-line parsing code knows that -kernel is equal to -machine kernel. The QEMU machine object doesn't even care that -kernel exists.
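The way a shortcut stays confined to the parsing layer can be sketched in a few lines. This is an illustration in Python, not QEMU's actual implementation: the SHORTCUTS table and the expand_shortcuts function are hypothetical, and the table only contains the two equivalences mentioned in the talk. It also ignores details such as escaping commas in values, or options whose argument happens to look like a shortcut.

```python
# Illustrative table: shortcut -> (canonical option, argument template).
# A "{}" in the template means the shortcut consumes one argument.
SHORTCUTS = {
    "-kernel": ("-machine", "kernel={}"),
    "-enable-kvm": ("-accel", "kvm"),
}


def expand_shortcuts(argv):
    """Rewrite shortcut options into their canonical form.

    Everything after this step only ever sees the canonical options,
    so the rest of the program does not need to know shortcuts exist.
    """
    out = []
    args = iter(argv)
    for arg in args:
        if arg not in SHORTCUTS:
            out.append(arg)
            continue
        option, template = SHORTCUTS[arg]
        if "{}" in template:
            try:
                value = next(args)
            except StopIteration:
                raise ValueError(f"{arg} requires an argument") from None
            out += [option, template.format(value)]
        else:
            out += [option, template]
    return out
```

Given ["-enable-kvm", "-kernel", "vmlinuz", "-m", "2G"], the function returns ["-accel", "kvm", "-machine", "kernel=vmlinuz", "-m", "2G"]. Adding another shortcut in a design like this means adding a table entry, with no change to the code that configures the machine.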
So the burden for these is pretty small, but don't add more, because QEMU already has 120 command-line options. That's enough.
And now there are the one-off and the legacy options. Both typically just set a global variable, even though they contribute to configuring a QOM object or a backend. For one-off options, at least their function is still somehow essential, though maybe sometimes only really useful to developers. One-off options have other warts: they might be a mix of frontend and backend configuration. But overall, we tolerate them. If possible, they should be transformed into proper shortcut options. For example, -smp was transformed into a shortcut for -machine smp as recently as QEMU 6.1. Again, please avoid creating new ones. Consider using command options. Consider alternative ways to achieve the same effect using QMP and HMP. And do not add more global variables to the command-line parsing code.
And now, with legacy options, we hit rock bottom. Most of them you wouldn't even know exist. Maybe they configure functionality that is only supported by one or two devices in the whole of QEMU. Some of them may seem useful from the command line, but there are much better alternatives. If you use -daemonize or -runas, which does not work with hotplug, for example, you really are much better served by a full-blown management tool such as virt-manager, libvirt, and so on. And some of them are just failed experiments that we couldn't remove yet. For these, the best way forward is to deprecate them and ultimately remove them, or perhaps, for some of them, transition them to shortcuts as well. Sometimes options were actually considered legacy, but then we found a decent implementation in terms of other options, and they were kept; maybe just the worst parts of them were removed. This was the case, for example, for -usbdevice, which is now a shortcut option, more or less.
By the way, all this is based on many discussions with other people, such as John Snow or Thomas Huth, and all I did here was just systematize these discussions.
So what if you do end up working on the command-line options? How do you apply the lessons that I mentioned in the first part of the talk? First of all, exploit the existing essential complexity, because command-line parsing is integrated with QOM objects, with properties, with QAPI structs, with QMP commands. Ask yourself if you really need a new flag. Maybe you can instead use one of these existing integrations, with the additional benefit of improved modularity and keeping all your code in one place. Ideally, there should be only one obvious way to do a task in QEMU, but if that's not the case, there should be one documented way to do it. For example, we don't document how to add new tests to the configure script, but we have good documentation on how to do that with Meson.
So, to complete our summary of suggestions: evaluate the trade-offs between duplicated code and excessive abstraction. Sometimes some duplication is okay, but when things are getting bad, please do not make them worse. Know your essential complexity and exploit it to your advantage. And once you've done this, document the best practices for the other developers to refer to. Thank you.