Hi, I'm Matthew Treinish. I'm a software engineer at IBM Research who works on the quantum computing project there, and I'm here today to talk about building a compiler for quantum computers. Before we get into this, though, I wanted to preface that this won't be a general talk about quantum computers or quantum information. I actually gave a talk on that two years ago at FOSSASIA; you can look up the recording if you'd like to go into more detail on how quantum computers work and how they can be used. This talk will concentrate primarily on how we can build compilers, looking specifically at an open source package that is a compiler for quantum computers.

Before we can talk about compilers, though, we need to talk about how people program quantum computers, and that brings us to the quantum circuit. A quantum circuit is a visual representation used to describe a series of operations that are run on a quantum computer. Each row in the circuit represents either a classical bit (the two on the bottom) or a quantum bit, a qubit (the two on the top). You read them from left to right, and they show operations that operate on either one or two qubits. They're used to show the dependencies between operations, and there's no implicit timing information, although they can be annotated to add timing information. One thing worth pointing out here is the measurement gate. When you measure a qubit, you collapse its state and read out a binary zero or one, which is what the classical bit represents. So in this circuit example, when we measure qubit zero into classical bit zero, we'll get either a zero or a one out, and once we measure it, whatever quantum state we had before is lost.

Quantum gates are the fundamental operations that we run on quantum computers. Unlike classical operations, they are reversible, which means they can go in either direction. Here we see a Bloch sphere, which is a geometric representation of a single qubit. The model doesn't really work so well once you start talking about multi-qubit operations, but for this example it works pretty well. In this case, we're going to run a Hadamard gate, or H gate. We start at the zero position, which represents the zero state, and we rotate along the Bloch sphere to the +x direction. You can think of all the one-qubit gates as rotations along the sphere. One thing to note, though, is that when we measure a qubit, it's measured along the z axis here, which is called the computational basis: we can measure either a zero or a one. So when the Hadamard leaves us pointing in the +x direction, we have about a fifty-fifty chance of getting a zero or a one when we measure. That's important to keep in mind, because while quantum computers can be deterministically programmed, they can still behave randomly in their results, which is a fundamental difference from classical computers.

Here I've just put up a bunch of quantum gates. Some we'll be dealing with in this presentation; others are just common ones you might see, to give you a feel for them. The other important thing here is that I've included the unitary matrix for each one. Every quantum gate can be represented as a unitary matrix, which you can then multiply against the state vector for the quantum state. I'm not going to go into the math here, but it's important to realize that it's basically just matrix multiplication, and all the visualizations before are just a geometric representation of that.
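To make that concrete, here's a minimal NumPy sketch (my own illustration, not something from the slides) of the Hadamard example: the gate is a unitary matrix, applying it is just matrix multiplication, and the fifty-fifty measurement odds fall out of the resulting amplitudes.

```python
import numpy as np

# The Hadamard gate as a unitary matrix.
H = np.array([[1,  1],
              [1, -1]]) / np.sqrt(2)

zero = np.array([1, 0])        # the |0> state as a state vector

state = H @ zero               # applying a gate is matrix multiplication

# Measuring in the computational (z) basis: the probabilities are the
# squared magnitudes of the amplitudes -- about 50/50 here.
print(np.abs(state) ** 2)      # [0.5 0.5]
```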
So how are people actually programming quantum circuits in practice? It's all great when we can draw pictures, but we're programmers; we like code, so we need languages. One way people program them is the OpenQASM language. OpenQASM stands for Open Quantum Assembly Language, and it can be used to write circuits. The circuit I was showing on the first slide is represented here in QASM: we define a quantum register and a classical register, apply our gates, and measure. In practice, though, most people don't seem to write OpenQASM by hand, in my experience. They tend to use other frameworks like the Qiskit project, which I'll be talking about later, that offer Python APIs or more convenient ways to program. From a compiler standpoint, OpenQASM also isn't super interesting: it just ends up being a parser, and parsers, while there's work that can go into them, aren't as interesting as some of the other things I'll be talking about.
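For a feel of both entry points, here's a hedged sketch: a representative OpenQASM 2.0 program (a two-qubit example in the same spirit as the slide, not the exact circuit) parsed by Qiskit, and the same circuit built through the Python API.

```python
from qiskit import QuantumCircuit

# A representative OpenQASM 2.0 program: declare registers, apply gates, measure.
qasm = """
OPENQASM 2.0;
include "qelib1.inc";
qreg q[2];
creg c[2];
h q[0];
cx q[0], q[1];
measure q[0] -> c[0];
measure q[1] -> c[1];
"""
qc_from_qasm = QuantumCircuit.from_qasm_str(qasm)

# The same circuit through the Python API most people use in practice.
qc = QuantumCircuit(2, 2)
qc.h(0)
qc.cx(0, 1)
qc.measure([0, 1], [0, 1])
```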
The other level people program quantum computers at is the pulse level. All of those gates I showed before can be represented as microwave pulses that are applied to the qubits; that's how you transform the quantum state on the qubit. So if you're working at a really low level, you can define your own pulses, apply them to the qubits, and measure them. You can start tweaking things to try to get better performance out of your circuit. This is mostly used by people doing physics research, hardware characterization, that kind of thing, trying to get the most out of the limited devices that are available today. We will not be talking about this, although there are some interesting scheduling problems around how you go from a circuit representation to a pulse-level representation.

So if we're programming things at a really low level, either at basically the equivalent of assembly or even lower, why do we need compilers? When you think about classical computers and how compilers work, they're basically going from a higher-level language down to assembly, which is a one-to-one mapping with instructions. The reason we need compilers for current quantum computers is that current devices have a lot of limitations, and when we draw a circuit, we're abstracting away a lot of details that we have to keep in mind when actually running on hardware. In this example, which is a circuit I actually used in my presentation two years ago, we have a bunch of Hadamard gates, an X gate, and some CNOT gates. If we run it through a compiler to target an existing quantum chip, we can get two different answers. With bad compiler settings or a bad compiler, we get the circuit on the top; with a better compiler, we get the circuit on the bottom. Some things you'll notice are that all of those H's and X's have changed to U2's and U3's, and that the CX gates have changed a lot: there are a lot more of them in the bad compilation case, while in the good case the count is about the same. What does this mean in practice, though? When we run these two circuits, after a bad compilation or a good compilation, we get very different results. That circuit had a known answer, and that known answer was all ones, basically this bar at the end. When we run the circuit with the bad compilation, we get all ones 2.8% of the time; with the good compilation, we get all ones 57% or 56% of the time. So you can see that the difference between a bad compiler and a good compiler can be the difference between getting the right answer or just random noise, which is what happened here.

So what are these limitations of current quantum computers? The first one to keep in mind is basis gates. Each quantum computer out there only supports a subset of all the quantum gates, for a lot of different reasons having to do with how the devices are constructed and how much effort is put into tuning the gates. Remember when I said that gates are mapped to pulse sequences? Those pulse sequences have to be defined by hand at some point, and you're not going to do that for every single gate out there. You do it for a subset where you know you can get good results, and you can use that subset to implement all the other gates; it's just a little bit of linear algebra. On the right I've put some examples. Superconducting qubits are what IBM Quantum devices use, and these are the basis sets the IBM Quantum computers use. Trapped-ion qubits are another technology for building qubits, and they use some different gates. And then there's also a simulator basis set; a simulator can simulate pretty much any basis gate.

Another limitation to keep in mind is qubit connectivity. Some quantum computers have limited connectivity between the qubits, which means you can only run multi-qubit gates between certain pairs of qubits. Here I have a die photo of one of IBM Quantum's five-qubit devices, and you can see the wires that run between some of the qubits; this is what I mean by connectivity. You can only run a two-qubit gate between Q2 and Q0, between Q4 and Q3, and so on. Say you wanted to run a two-qubit gate between Q0 and Q3: you couldn't, because there isn't a wire between them. This is easy to reason about when we're looking at a simple five-qubit device, but let's look at a 53-qubit device. This is the coupling map for all of the qubits on IBM Quantum's 53-qubit processor, and limited connectivity is a lot harder to deal with when you have this many qubits. It's also worth pointing out that some other types of quantum computers, trapped ions in particular, don't have this connectivity limitation: all qubits can run gates against all other qubits. But there are additional trade-offs with that technology.

So what happens if you don't have enough connectivity? In both of those examples, there is no path from certain qubits to other qubits, and if the connectivity isn't sufficient to fully map the logical circuit you outlined, you can use swap gates to move state around. In this example, there are two qubits, qubit 0 and qubit 1, that are connected to all the others, and on none of the coupling maps I showed before would you be able to run this. So you use a swap gate, represented by this symbol, to move the state around between qubits: if you have one quantum state on one qubit and another state on a second qubit, a swap gate will switch the states between them. But this can be expensive, because when you run a swap gate on a computer that uses CNOT gates, you end up with three CNOT gates to represent one swap gate. If you have to swap state a lot to fit the circuit on your device, you're going to end up adding a lot of extra operations.
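Here's a hedged sketch of that three-for-one cost, using Qiskit's transpile() (which I'll come back to later) to decompose a single swap into a CNOT-based basis:

```python
from qiskit import QuantumCircuit, transpile

# One swap gate...
qc = QuantumCircuit(2)
qc.swap(0, 1)

# ...decomposed into a CNOT-based basis set becomes three CX gates.
decomposed = transpile(qc, basis_gates=['cx', 'u3'])
print(decomposed.count_ops())   # expect {'cx': 3}
```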
That brings us to the next limitation, which is noise. Everything in a quantum computer has noise and certain errors associated with it. You have individual gate errors, both single-qubit and multi-qubit, and also readout errors, which happen when you measure. You can see here the coupling map for that same Yorktown device I showed the die photo of earlier. The error rates for the one-qubit gates are represented on the nodes with the color map, and there are also error rates, which are a bit higher, somewhere between 1 and 2% I believe, for when you run CNOT gates between those qubits. And then there's also a readout error rate of around 2%. So basically every time you do something on a quantum computer, you're introducing a potential source of error where you'll get the wrong answer. That's why, when I showed the graph of results before, we didn't get the right answer 100% of the time. We got it fifty-something percent of the time, because even when we did everything correctly, we still had errors and unexpected results.

The other thing to keep in mind is decoherence. Decoherence limits the amount of time we're able to preserve the quantum state in the qubits, and it's typically measured with two times, T1 and T2. T1 is the energy relaxation time: if you had a qubit in the excited state, the 1 pointing down on the Bloch sphere from that earlier slide, it's how long it takes for that 1 to decay back to the ground state, back to 0. The other time is T2, which is a little more confusing: it's the dephasing time. If you have a qubit in superposition, somewhere between that 0 and 1, it's how long it takes for that position to dephase, or change direction. This basically puts a limit on how much time we have to run our operations. If we hit the decoherence time, we've basically lost whatever quantum state we had and can't rely on it. The more operations we add to a circuit to make it run on a device, the closer we get to that decoherence time, which is not good if you're trying to run a computation.

Those are the primary constraints with quantum computers today, and why we need compilers. That's where the project I primarily work on comes in, which is Qiskit Terra. Qiskit Terra is the base layer of the Qiskit project for working with quantum computers; it provides an interface between hardware and software. It provides an SDK for people to build quantum circuits, the Python API I alluded to earlier, so you can write Python code that will generate quantum circuits for you. But it also includes a compiler to take those higher-level circuits and map them to specific hardware backends. It's designed to be backend agnostic, so it can work with any hardware or simulator out there. Out of the box we ship extensions that let you run circuits from Qiskit Terra on IBM Q devices, Honeywell devices, and devices from startups like AQT, a trapped-ion startup out of Europe, and I believe some other people have written backends for Google and Rigetti, which are some competitors. Terra is written primarily in Python, it's Apache licensed, and you can find it on GitHub at the link there.

So how does the compiler in Terra work? The compiler in Terra represents the quantum circuit as a directed acyclic graph, a DAG. Here we can see a quantum circuit represented as a DAG in Qiskit Terra. We have input nodes, represented in green; operation nodes, represented in blue (you can see an H gate, a CNOT gate, and measurements); and output nodes, represented in red. We use a directed acyclic graph because it makes the flow of data through the quantum circuit much more explicit and easier to track. You can see here that we track which bits each operation touches, and we track that through the DAG, which makes things easier when we start looking at more complicated examples and at how we optimize circuits later.
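As a quick hedged sketch, you can get at this DAG representation yourself through Terra's converters module; the circuit here is just a stand-in for the one on the slide:

```python
from qiskit import QuantumCircuit
from qiskit.converters import circuit_to_dag, dag_to_circuit

qc = QuantumCircuit(2, 2)
qc.h(0)
qc.cx(0, 1)
qc.measure([0, 1], [0, 1])

dag = circuit_to_dag(qc)
for node in dag.op_nodes():      # the blue operation nodes from the slide
    print(node.name, node.qargs)

qc_back = dag_to_circuit(dag)    # and you can round-trip back to a circuit
```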
So here's an example that shows how this makes things more explicit. We have a circuit that's basically the same as before, but it adds an RZ gate with a classical condition on it. Basically, this means we only run the RZ gate if the bit string on the classical bits is 010. If you look at the circuit drawing, it's hard to tell that there's any relationship between this H gate and this RZ gate. But when you look at the DAG, that relationship is pretty obvious, because we have the H gate here, and if we just follow Q0, we end up with an arrow along this path to here. The same goes for qubit 1, although that one is more obvious because the two operations are back to back. So the DAG lets us see this data flow much more easily than we could on the circuit.

So what do we do with this DAG in Terra? We have what we call the transpiler, which I personally don't think is the best name; I just call it the compiler, and I'll use the terms interchangeably. The transpiler is built around a pass manager that is used to execute passes on the DAG, to convert it from that virtual, generic form to something that's specifically designed to run on a particular computer. These passes are defined as small, well-defined tasks that either do a transform or do some analysis. We keep the passes simple, and we can then use the pass manager to handle scheduling them, manage dependencies between them, and basically build a pipeline for transforming the input circuit into something we can run on a device. The passes come in two types. We have transformation passes, which are designed to transform the DAG: you have a DAG input, you run the transformation pass on it, and you get a new DAG as output. Then we have analysis passes, which read the DAG, pull some information off of it, and put it in the property set, which is basically just a Python dictionary that keeps track of certain attributes of the circuit as we go. One example is a pass that analyzes the circuit for some property (on the slide I used commutation relationships), where a later pass will use that information to do a transform.
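Here's a minimal sketch of those two pass types working together; these toy passes are my own illustrations for this talk, not passes that ship with Terra.

```python
from qiskit import QuantumCircuit
from qiskit.transpiler import PassManager
from qiskit.transpiler.basepasses import AnalysisPass, TransformationPass

class CountCX(AnalysisPass):
    """Analysis pass: read the DAG and record a result in the property set."""
    def run(self, dag):
        self.property_set['cx_count'] = sum(
            1 for node in dag.op_nodes() if node.name == 'cx')

class StripBarriersIfCXHeavy(TransformationPass):
    """Transformation pass: take a DAG in, return a (possibly modified) DAG.

    Uses the analysis result an earlier pass left in the shared property set.
    """
    def run(self, dag):
        if self.property_set['cx_count'] > 1:
            for node in dag.op_nodes():
                if node.name == 'barrier':
                    dag.remove_op_node(node)
        return dag

qc = QuantumCircuit(2)
qc.h(0)
qc.cx(0, 1)
qc.barrier()
qc.cx(0, 1)

pm = PassManager([CountCX(), StripBarriersIfCXHeavy()])
print(pm.run(qc).count_ops())   # the barrier is gone
```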
The pass manager is broken up into a series of stages. We start with the user circuit, which can have arbitrary numbers of qubits, arbitrary gates, and arbitrary connectivity, and we start by doing logical reductions: we look at the circuit and see whether there are simple logical reductions we can do to reduce its complexity. From there we go to embedding, where we map the virtual circuit to the physical device constraints. That includes picking the mapping between virtual qubits and physical qubits, as well as unrolling to the basis set for the specific device. From there we go to physical optimizations, which are any additional optimizations we can run after we've embedded the circuit. And after that's all done, we have an output circuit that can be run on a device. The thing to remember is that while we have these basic stages, the entire pass manager is designed to be pluggable and extensible. We ship some preset pass managers, but you can customize them to do any kind of optimization, transform, or analysis you'd like. And this is an area with a lot of research into how we can come up with better compilation steps, so having this pluggability is very useful.

Out of the box, Qiskit Terra ships four optimization levels, much like GCC or Clang, where the different levels put different amounts of effort into optimizing a circuit. At one end we have optimization level zero, where there's basically no optimization: it just does the embedding step, so that your input circuit will run on the specified device, and nothing else. At the other end of the spectrum we have optimization level three, which puts in all available effort to optimize the circuit. So looking at each level: optimization level zero runs no optimization. It just unrolls the circuit, applies a layout, and does swap mapping, which means that if there's limited connectivity, it inserts swap gates where they're needed. And I apologize that this drawing is so small; I could not come up with a simpler visualization to show all of the steps, but it's just a flow chart of each of the passes in the pass manager. Optimization level one is basically the same as level zero except we add optimization passes at the end: one that does one-qubit gate optimization and another that does CNOT gate cancellation. And you can see here, though I don't think anyone will be able to read it, that it says "do while" right there: we actually run this in a loop until we get a fixed-depth output. When the size of the circuit doesn't change between two runs, we say, okay, we've optimized it as much as we can, and we stop there. Optimization level two just adds an additional optimization pass on top of level one, for commutative cancellation, and it changes some of the passes it runs for embedding to try to do a better job, even though it's a little slower. And optimization level three takes it up another notch: we run a lot more optimization passes, which you can see here, and we also run yet another different layout pass.
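As a hedged sketch, this is roughly how you'd compare the levels yourself, assuming a made-up linear coupling map and the older U1/U2/U3-plus-CX basis set I mentioned:

```python
from qiskit import QuantumCircuit, transpile

qc = QuantumCircuit(4)
qc.h(range(4))
qc.cx(0, 3)
qc.cx(1, 2)
qc.measure_all()

# A made-up linear device, with edges listed in both directions.
coupling = [[0, 1], [1, 0], [1, 2], [2, 1], [2, 3], [3, 2]]

for level in range(4):
    out = transpile(qc, basis_gates=['u1', 'u2', 'u3', 'cx'],
                    coupling_map=coupling, optimization_level=level,
                    seed_transpiler=42)
    print(level, out.depth(), out.count_ops())
```

Higher levels generally spend more compile time to produce a shallower circuit with fewer gates.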
So now let's talk about some individual passes. The first one I'm going to talk about is the unroller. When we have to map from one basis set to another, we need a way to translate, and the unroller is the simplest method for doing this. It basically looks at every gate's definition and descends through those definitions until it reaches a gate in the basis set. In this example, I put the definitions on a graph and said, okay, our basis is going to be CNOT and U3, and we start with a swap gate and a CZ gate. We just descend through the definitions until we get to something that matches our basis. In Qiskit this only really works well for superconducting qubits, and honestly for the superconducting qubit devices from six months ago, because IBM actually changed the basis gates they use not that long ago.

The limitation with this is that you can only have one definition per gate, so if you don't have a path from one gate to another, you won't be able to unroll it. We have an additional pass in Qiskit called the basis translator, which takes this same graph but adds additional definitions to it, and then does an A* search to traverse the graph and find the fastest path from each gate to something in the basis. That was a lot more complicated to put on a slide, so I stuck with this example. We do still use the unroller for complicated custom gates, because we can use it to unroll one level into something the basis translator can then work with.

Moving on, let's talk about layout. Layout is actually deceptively important. When we have a virtual circuit with our qubits Q1, Q2, Q3, Q4, we have to figure out how to map those onto the physical device, and that initial qubit selection, from what's in our circuit to what's on the device, can really matter, because if we pick poorly, we have to start adding a lot of swap gates to make the circuit runnable. Out of the box we have three passes in Qiskit that do layout, plus a fourth one, SabreLayout, that I forgot to update on the slide; that's not important here, because we're only going to be looking at these three.
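If you're driving this through transpile(), you can select one of these passes by name or hand over the mapping yourself; here's a hedged sketch with a made-up T-shaped device:

```python
from qiskit import QuantumCircuit, transpile

# A circuit with two-qubit gates between qubit 0 and everything else.
qc = QuantumCircuit(4)
qc.cz(0, 1)
qc.cz(0, 2)
qc.cz(0, 3)

# A made-up 5-qubit device where physical qubit 1 touches 0, 2, and 3.
coupling = [[0, 1], [1, 0], [1, 2], [2, 1], [1, 3], [3, 1], [3, 4], [4, 3]]

# Let a built-in pass choose the layout...
automatic = transpile(qc, coupling_map=coupling, layout_method='dense')

# ...or hand-pick it: virtual qubit 0 onto physical qubit 1, so every
# two-qubit gate lands on an edge and no swaps are needed.
by_hand = transpile(qc, coupling_map=coupling, initial_layout=[1, 0, 2, 3])

print(automatic.count_ops(), by_hand.count_ops())   # look for 'swap' entries
```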
So let's look at an example of how layout can be important. Here I have an example circuit. You don't have to worry too much about what it does; it's actually from Grover's algorithm, which is used for doing database searches, basically. But we have some two-qubit gates, CZ gates between qubit zero and all of the other qubits, then some more gates, and again CZ gates between qubit zero and all of the other qubits. And we want to run it on a device with this coupling map. If we think about it for a second, the obvious answer is: we just map circuit qubit zero onto physical qubit one, because that qubit has connectivity to three other qubits, just like qubit zero does in the circuit, and for circuit qubits one, two, and three we pick physical qubits zero, two, and three. Then we don't have to use any swap gates at all. We can specifically tell the transpiler to use that as our layout, but we can also run the layout passes and see how they do. I apologize for how small this is, but when you compile the circuit it gets a little lengthy. I ran the three layout passes, trivial layout, dense layout, and noise adaptive layout, plus the custom layout, the perfect one we know about where we don't need any swap gates. You can see that trivial layout just maps virtual qubit zero to physical qubit zero, one to one, and so on, and it results in two swaps: if you remember the three-CNOT pattern for a swap I showed before, we have one swap gate there and one swap gate there. That's actually pretty good; I was kind of surprised that just going zero to zero, one to one did that good a job. Dense layout, which is a more complicated pass, ended up inserting a lot more swap gates: one, two, three, four, five swap gates, so that one is probably going to be pretty noisy. Noise adaptive layout is similar to dense layout, but it also tries to pick the qubits with the lowest noise, so it sometimes does a worse job than dense layout at minimizing swaps; in this case, though, it didn't seem to. You can see it ended up with three swap gates, but it picked different qubits than the other two. And with our custom layout, we have no swaps; we just mapped it perfectly.

When we look at the results, this is where things get interesting. There are four expected answers from that circuit: 0011, 0101, 1001, and 1111. Ideally, we'd want 25% on each of them, or close to it. With trivial layout, it's kind of hard to tell which results are which (that's actually why I got tripped up for a second; I should have looked at one of the other ones), but you can see we got a pretty even distribution across those four answers. Dense layout, despite having all of those swaps, did a better job; I think that's because dense layout factors noise into its selection, just like noise adaptive layout does. But you can see we still had a pretty high error rate on the other outcomes. Noise adaptive layout is honestly pretty similar to dense layout in performance. And custom layout, because we had no swaps, did an excellent job: every result we weren't expecting is below 5%, with the exception of 0001. So you can see that layout ends up having a surprising effect. You wouldn't think the initial mapping would be as important as it is, but when you factor in that swap gates are kind of expensive, it makes a pretty big difference.

After layout, we also have the swap mappers. These algorithms are a little more complicated, so I'm not going to go into too much detail on how they work, but in Qiskit we now actually have four algorithms for doing swap mapping, or routing: BasicSwap, LookaheadSwap, StochasticSwap, and now a fourth, SabreSwap. We use StochasticSwap by default in all of the preset pass managers, and that's primarily a function of speed. StochasticSwap was rewritten in Cython about a year ago, and because it ends up being compiled to native code through Cython, it's a lot faster than the other passes, so we just use it by default. Its performance is very good. It's based on a heuristic algorithm: it basically does a bunch of random trials of different ways to swap the circuit and picks the best one. There are a lot of different approaches here; this is actually an area of a lot of active research, trying to come up with better algorithms, because some of these don't scale very well to large qubit counts or have different performance trade-offs. Routing is actually a very interesting problem to think about.
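These are also selectable by name through transpile(); a hedged sketch, with a made-up three-qubit line where qubits 0 and 2 aren't directly connected:

```python
from qiskit import QuantumCircuit, transpile

qc = QuantumCircuit(3)
qc.cx(0, 2)           # no direct edge on a 0-1-2 line, so routing must swap
qc.measure_all()

coupling = [[0, 1], [1, 0], [1, 2], [2, 1]]

for method in ['basic', 'lookahead', 'stochastic', 'sabre']:
    out = transpile(qc, coupling_map=coupling,
                    routing_method=method, seed_transpiler=42)
    print(method, out.count_ops())   # compare the inserted swaps
```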
So those are the three main embedding passes, the steps of the embedding phase of the transpiler. Now let's look at some optimization passes; this is where things can get kind of fun and interesting. The first one we're going to talk about is two-qubit block collection, which is an analysis pass. What this pass does is collect all blocks of operations that act solely on the same two qubits: we look at the circuit, figure out which operations are isolated to just two qubits, and collect them into a block. If you look at the circuit drawing, this is actually pretty hard to do, because of this T and H gate over here. But if I map the circuit directly onto a DAG, you can see it's a lot easier to find the blocks: we just follow the data flow, and when it goes off those two qubits, the block is no longer isolated. That's pretty simple.

But what do we do with these blocks? That's where things get interesting, and where the consolidate blocks pass comes in. After we take the blocks we collected on the previous slide, we run a simulation of each two-qubit block to find the unitary matrix, the matrix representation of that block, and insert that on the circuit in place of the block. Once we have that unitary matrix, we can do some more math and convert it back into a circuit. And you can see that this circuit is a lot simpler than the one we started with. There are a lot fewer operations, because when you go through this simulation to the unitary, and then do an ideal synthesis of a circuit that implements the unitary, it often ends up with fewer operations, especially if you imagine a very large two-qubit block. In fact, I was once benchmarking a circuit that operated solely on two qubits, and while testing I saw that at optimization levels zero through two, the compiler was outputting several thousand gates, but when I ran level three, it ended up with seven. I was lost for a while, until I remembered that optimization level three uses consolidate blocks; for a while I thought we had a really terrible bug in the compiler, because I didn't think it could optimize that much.

Another interesting optimization pass to look at is the optimize-one-qubit-operations pass. This is similar to the consolidate blocks pass, except all it does is look for runs of one-qubit operations on a single qubit in the DAG: it traverses the DAG, finds every series of operations isolated to a single qubit, and pulls out that run. Here's an example with four one-qubit operations in a row. The pass works out the net rotation the whole run implements, from its start state to its end state, and replaces it with a single gate that goes from the start to the end, whatever that is; in this case it would be a single U rotation with angles along each of the axes. You can calculate this using two different methods. If you're working with angles, you can use quaternions, just like video games do to track rotations in three-dimensional space, or you can do the linear algebra, the matrix multiplication, and then do circuit synthesis like we do for the collected blocks. In Qiskit we used to use the quaternion approach, but we moved to the more general linear algebra approach as IBM changed its basis gates; as IBM employees, we're kind of incentivized to make things perform as well as we can on the IBM devices.

So that was all I had on the passes; I was just trying to give a feel for some of the passes that are out there.
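Here's a hedged sketch of the block collection and consolidation passes chained in a pass manager: the analysis pass records the blocks in the property set, and the consolidation pass replaces each block with its unitary (force_consolidate is set so this toy example always collapses):

```python
from qiskit import QuantumCircuit
from qiskit.transpiler import PassManager
from qiskit.transpiler.passes import Collect2qBlocks, ConsolidateBlocks

# A long run of operations confined to the same two qubits.
qc = QuantumCircuit(2)
for _ in range(10):
    qc.cx(0, 1)
    qc.rz(0.1, 1)
    qc.cx(0, 1)

pm = PassManager([Collect2qBlocks(), ConsolidateBlocks(force_consolidate=True)])
consolidated = pm.run(qc)
print(qc.size(), '->', consolidated.size())   # 30 gates collapse to 1 unitary
```

A later synthesis step (or a full transpile) would then re-express that single unitary in the device's basis gates, often with far fewer operations than the original run.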
But one thing I wanted to end the talk with is why this is important. Here I want to talk about something called quantum volume, which is the metric that IBM and most other companies out there use to measure the performance of various quantum computers. All quantum volume does is say: you generate a series of random two-qubit unitary matrices and apply them across every qubit multiple times. It's designed to be very difficult for a quantum computer, especially current ones with all the limitations we talked about, particularly limited connectivity. And that's where a compiler comes in: you run the same circuit, and with a better compiler you can get much better results. It's actually really rewarding for me personally to look at something like this. I work on DAGs and other computer science concepts; I'm not an expert in quantum information theory, but I can see that the work I do on compiler optimization has a direct impact on how the circuits we run on these devices perform, and if we make a good improvement to the compiler, we can use the devices that exist today for more.

So with that, that's all of the prepared material I had. I'm open to taking questions now, although I'm not sure how that's going to work with a pre-recorded talk. Hopefully I'm awake to answer questions at least; 1 AM is a bit difficult with my normal schedule. I have some links here for more information, including the link to the slides, the GitHub for Qiskit Terra, and the Qiskit website. One thing you might be interested in is signing up for access to the quantum computers. I forget the exact number, it's like 5 to 10, but a large subset of the quantum computers IBM has available are open to the general public if you sign up for an account, so you can submit jobs to the quantum computers. There's also some more information on the transpiler, and if you're interested in quantum information, I have a link to a talk I gave on open source quantum computing a couple of years ago, and also the Qiskit textbook, which is an open source textbook for learning about quantum computing. It starts assuming no knowledge; it begins with the linear algebra you need to understand and works its way up through some very complicated examples. So thank you for listening, and I hope the talk was enjoyable. Thanks.