Only for the past several years, but we've recently made some definite progress on that, which is the synthesis of complex biomolecules. By complex I mean they are structurally quite complicated: they have lots of bits and pieces that can be connected together in many different ways. And I'm going to explain how a cell manages to do this quite reproducibly, even though the underlying chemical mechanisms by which they're built are subject to the kind of chemical randomness I've been talking about in the spring college. Before I start: this is the PhD work of Anjali Jaiman, who is sitting over there, done in close contact with the people who've taught me about these systems. Ajit Varki is a glycobiologist at UCSD who introduced me to these molecules. Arnab Bhattacharya is a computer scientist at the Indian Institute of Science, which is close to where I work. You can write to me at these email addresses or these handles, and you can find more information about where I work. Okay, so the molecules I'm going to be talking about are called glycans. Glycans are, in principle, polymers made of sugars. This is an actual EM image of a cell, and that light bit right there is the cell membrane; you can think of it as about five nanometers in thickness. And every single cell on the planet, if you try to land on it, like a spacecraft landing on the moon, you cannot just access the cell membrane, because you'll be confronted with this very dense forest of sugar polymers. The proteins are just down there, also about 5 to 10 nanometers in scale, and these tendrils are several tens of nanometers long. Many of you will have heard about some of these sugar polymers. For example, your blood groups, A, B, A positive, all these kinds of things: your blood groups are defined by these chemicals, right?
That's the type B antigen, that's the type A antigen. They're just different combinations of these sugar building blocks. Okay, so what's the point of having all these sugars on the surface? It turns out that these sugars are extremely information rich, and this information is actually actionable by the cell; cells use this information. So imagine two cells that are apposed to each other, confronting each other. One of these cells has a bunch of proteins, drawn here as cartoons, and these proteins have binding sites for the sugars on the other cell's surface. So for the moment, this cell is displaying a bunch of different sugars attached to the various proteins on its surface, and that cell is trying to read which kinds of sugars are on the cell surface. This kind of interaction between two different cells happens in many, many contexts in biology. For example, it is how pathogens find your cells in order to invade them. It's also how the speciation barrier is formed. So here's an example: a sponge. The sponge is a very simple animal. You can take the sponge, pass it through a sieve and break it up into single cells, and if you leave them in a buffer, the cells will just come back together and make the sponge again. Now it turns out that if you mix sponges of two different species, the cells actually segregate and remake the original sponges of the correct species. How do they know what species they are? Exactly by these kinds of interactions. And here is a very, very important example. The reason a rat sperm doesn't bind to a mouse egg is that the sugars on the surface of the mouse egg are of the mouse type and not the rat type. In fact, if you made those sugars the rat type, then the rat sperm would bind, right? So literally, these sugars are the place where the social life of the cell is determined, yeah? This is how a cell tells the rest of the world who it is. Every cell on the planet.
Okay, so I'm just going to explain a little about the molecular basis of this. Unlike proteins or RNA, glycans are not linear, not single strands. You all know the way mRNA is converted to protein: that's an mRNA strand, and ribosomes move along it and make these linear proteins. So a protein is completely encoded by its linear sequence; somehow the linear sequence of the protein is enough to determine the final three-dimensional folded structure, and that's called the protein folding problem. Now, the reason RNA and proteins are linear is very simple: amino acids can only bind to two other amino acids, and nucleotides can only bind to two other nucleotides. Sugars, on the other hand, can bind to three or four other sugars through all their different carbons. And because of that, you can make polymers of sugar that are branched, not linear. That's the simple reason why these molecules contain more structural variety than RNA or DNA, okay? So here are examples of these branched polymers. In many of the polymers I'm going to be talking about, the sugars are in fact attached covalently to proteins; they are grown on top of proteins at very specific amino acids. Here's an example of a protein, and here's an example of the sugars stuck to it. This is actually a structural model, and here are the schematic models of the same sugars. In the schematic models, these little symbols represent the different monosaccharides. So if you're not into chemistry and you don't like reading the chemical structures of these things, you can just look at it as a purely abstract representation. Here are the different monosaccharides, mannose, galactose, glucose and so on and so forth, and here are the bonds that connect them. And if you can't quite see it, there are numbers on the bonds, and the numbers say which carbon the extending sugars are attached to.
So you can attach to the three carbon or the four carbon or the two carbon, whatever it is, okay? That's the representation of the sugar polymer. The other thing to say is that this particular protein is a famous one: it's called erythropoietin, and I'll mention it in a later slide. Everything here is a covalent bond, okay? These are all covalent bonds, and they're all bonded covalently to either asparagine or serine or threonine on the underlying protein. They're made by enzymes, and that's basically what I'm going to discuss with you. Now, what's the underlying conceptual issue? It's the following. Suppose I pull out any individual type of protein in a cell, which I can do by purification. I can then cut off the sugars and send them to a facility where, using mass spectrometry or NMR, you can figure out what kinds of sugar polymers are attached to these proteins. And it turns out that there's a huge amount of heterogeneity. So this is related to the idea of randomness in biology. Here are the three types of randomness that occur in these sugar polymers. I don't know if you can quite see it, but take one type of protein, okay? These are mucins, the proteins that line your lungs; they make mucus. If I take a particular type of protein and look at what sugars are attached to it, the same protein can have many, many different sugars attached to it; each individual protein molecule has a different type of sugar attached to it. So there's diversity at a specific site, and that's called micro-heterogeneity in the glycan literature. Second, in humans, and this is a human protein, two different proteins, even two different proteins made by the same cell, have two different kinds of sugars attached to them. So here's the spectrum of sugars attached to this protein, which is called hCG. This is a famous protein: it's the pregnancy hormone.
That's how you do a pregnancy test. And here is the set of sugars attached to mucins. You can see that, although they're made of the same building blocks, if you stare at it for a second you'll see differences between what's down here and what's over there. So between two different proteins, you have two different sugars. The third thing is the same protein in two different species: for example, the horse version and the human version have different sugars attached. So there are three layers of diversity: the same molecule, two different molecules in the same species, and two different species. Okay. So now, this guy was caught because he was doping using erythropoietin, which is that protein I showed you. The reason he escaped for so long is that he was using human erythropoietin. There was a company making it for him, and nobody could prove he was doping because it looked just like the human protein. The way he was caught is that the company making it for him was making it in yeast. So the human protein had yeast sugars, and they caught him because the sugars were wrong. This is a fairly high-tech way in which athletes now try to circumvent the system and are caught. This is Lance Armstrong, if you didn't recognize him, the Tour de France champion. So this is just to say there's a lot of information in these systems, right? Okay. So now the cell biology part of it. These glycans are actually built in a very intricate way in a very famous organelle of the eukaryotic cell, called the Golgi apparatus. It was originally identified by Camillo Golgi, here in Italy. And the Golgi apparatus has this very remarkable morphology. This is a side section; this is the view from the top. It basically looks like a series of plates or pancakes stacked on top of each other.
And these sugar polymers are made inside this stack of pancakes, which forms a sort of factory production line, or at least that's the idea. So how does it work? Here's one plate of the stack, and here's the next plate; they're called cisternae. The building blocks, which are the individual sugar monomers, are passed into the plates. Then something happens inside to put these sugars together and attach them to a protein, and then the whole thing is packaged into a little vesicle and sent on to the next plate, and so on. Once it goes through one, two, three, four plates, it comes out the other side as the final product. So this literally looks like a factory: some steps happen here, some steps happen there, some in a third or fourth step; here I've only shown two. And then you get the final product. So the question is: how does the system work? What can we say about it? What can it do, and what can't it do? How much can we learn about this system? So here's a little bit of philosophy. Imagine that you want to make something, a product. Biology has many, many different ways of making specific products. One way, the way we all learn in school, is: if I give you an input chemical and you have a so-called enzyme, the enzyme can carry out chemistry on that input and make a product. And enzymes are very, very specific: if you give one some other input, it's not going to do anything to it. You have to give it the correct input, in which case it'll make the correct product. Here's a slightly more complicated way that biology makes things. In this case, the enzyme is the ribosome. But the ribosome doesn't always make one product. In fact, the ribosome can take any number of basic amino acids, and it can take a different template sequence from the RNA.
Depending on what template you feed in, it makes a different product from the same substrates. So these are two slightly different things: this is a more sophisticated way of making something, based on a template. Now, you can think of the Golgi apparatus as making glycans from building blocks in the same way, except that the Golgi doesn't have a template. It doesn't have a copy of the final product to copy. So unlike, let's say, a mold into which you pour plaster to make a statue, where all the information is in the mold, which has the same physical size and shape as what you want to make, here there is no mold. Instead of making a statue from a mold, it's more like making a cake from a recipe: you have a series of steps, you're given a bunch of inputs, and you hope you make the final product. Some recipe books actually contain images of the final product, and if you carry out those recipes you can see how closely you approximate it; if you're good at cooking, you might be able to do it. So the question is: how are these sugars cooked by the Golgi apparatus to make the final product? And the pitch I want to make to you, philosophically, is that this diagram is rather reminiscent of this one, which is the picture people have in mind of development. How does an egg, which contains the genome, convert a bunch of nutrients from the environment into an adult? And where is the recipe? The recipe is in the genome. Our cells do not contain a copy of the final product; you don't contain a little human inside you that gets copied to make the final product. You have a recipe to make a final product. And for the purposes of this talk, I'm going to call such a recipe an algorithm. Here are the general constraints I'm going to put on algorithms. If an algorithm is non-trivial, it must be able to accept many different inputs, for example many different genomes. This kind of thing, a single-input enzyme, is trivial.
That enzyme can only accept one input; this thing can accept many inputs. Now, I'm going to call that second example also not so interesting, because it has a template of the output. If you have a template of the output, it's sort of easy to imagine how the thing would work. Instead, I want it to encode a recipe, to make it truly non-trivial. And the central point, the definition of an algorithm in computer science, is this: it converts every input to a corresponding unique or specific output through a sequence of steps, which could be stochastic. That's what an algorithm is. I'll stop here if anybody has any questions about this. Yes. So then these glycans are pulled out to the outside of the cell; in fact, they get packaged in vesicles, the vesicles fuse with the membrane, and the glycans are displayed. So it's very important that the glycans are made in the Golgi over a fixed residence time, or some residence time, and then they're pulled out and no longer grow. I'll give you the rules about how they grow. Yes. Yeah. So it's the difference between making, let's say, a statue when you have a mold, and making a statue when you have a series of instructions about what to chip and what not to chip. These are fundamentally different things. When you have the mold, the mold contains the physical size and shape of the statue you want to make; it is a template. Otherwise, you can imagine making a statue by verbal instructions, and you can imagine that's much, much harder. So in a sense, and these are subtle distinctions, the Golgi apparatus's way of functioning is to have a recipe, and it's not obvious just by looking at the recipe what product it's going to make. Whereas here, it's totally obvious: a protein is literally just a copy of the RNA, modulo converting three nucleotides to one amino acid. Question? No, no, no. Every protein is different, and even every individual protein molecule is different.
Different types of proteins are different, as I mentioned, even in the same cell. And this is interesting: every time a protein goes through the Golgi apparatus, other parts of the protein cause it to see some subset of the enzymes that are in the Golgi. So every protein that goes through sees a different subset of the enzymes, and I'll mention exactly how this happens in a second. No. So, bacteria don't have intracellular organelles; the way bacteria do this is they build their sugars on the outer membrane, outside. And in fact, bacteria make random heteropolymers with very little information content. Interesting question. So there's been a lot of work recently, and I direct you to these papers, on the idea of algorithmic assembly: the idea that at the molecular level, at finite temperature, with the discrete stochasticity of chemical reactions, how do you reproducibly build certain microscopic structures? It's a very important problem to solve, because if we're able to get control of nanoscale assembly, we'll be able to build machines as complicated as biology's. So here are a few papers that have come out over the past several years. Erik Winfree has an entire series of papers where he discusses the fundamental limits of algorithmic assembly; Mike Brenner, Arvind Murugan, and so on; and Paul Rothemund, who is the person who invented the idea of DNA origami, which some of you might have heard of. So here's an example of the Mona Lisa made by DNA origami. In all these cases, the rough idea of how you build a specific final product reproducibly is the following. You need to take a system in some input state, and this input state is a bunch of building blocks floating around in the fluid. Somehow you want to reach a target state where the same building blocks are now assembled in a desired structure. Now, how do you go from the input, which is disordered, to the target, which is ordered?
There is a series of pathways by which these input building blocks can come together; the process could be stochastic, and there could be many pathways. So, first, if we want the target in high yield, we need to make sure that all the paths leading from the input lead to the target, and that there are no paths that go away from the target. That's one important piece. Secondly, there should be no paths that take you beyond the target: once you get to the target, if you can keep going even further, that's not a good recipe for high yield. Thirdly, very importantly, even once you set up this topology, you must allow enough time so that everything flows from the input to the target instead of being pulled out halfway. These are the ingredients you need to make the target in high yield. And this is, in some sense, a proof that Winfree and Rothemund have put forward: these are the basic ingredients you need to make a target structure at high yield. So the question is: how do you ensure that all the growth pathways have exactly this feature? I'm going to show you how this very abstract stuff works in the case of the actual Golgi apparatus and glycosylation. Based on what we know of the real molecular and cell biology of glycosylation, with all the stochasticity, all the chemical randomness, can the system be algorithmic in this sense? Can we achieve a certain target at high yield from a certain set of ingredients? If so, can we say what its limits are? Can I say what structures it can or cannot make in principle? And I'd like to say this rigorously, rather than by simulation or exhaustive sampling. So here's a bit of cell biology, and like anything in biology, the details are important. I'm going to compress a lot of stuff about glycans into a single slide; even this slide is reasonably complicated, so just pay attention for the next 30 seconds or so. The way glycans grow: some part is attached at the root to a protein.
Or this could be the rest of the tree itself, or another part of the tree connected to the base. There's some monomer, called the acceptor, which is eventually going to be attached to some other monomer, called the donor. And the acceptor itself may already have a branch on one of its carbons; remember, every sugar has many carbons. Now, what does the enzyme do? The enzyme carries out this reaction: it takes that floating donor and attaches it to that acceptor at a specific carbon through a covalent bond. That's what the enzyme does. Now, for the purposes of this talk, I'm going to use a cartoon representation of the enzyme: I draw the enzyme by its action. What does this enzyme do? It takes the yellow square attached to the blue square and connects it to the yellow circle. So you can read off the property of the enzyme just from its graphical representation. Are there any questions about that? And by the way, look at the scale of these enzymes. This is the sugar polymer; that's the size of the enzyme; one of those little things is the size of the monomer. So the enzyme is in principle much bigger than the piece of sugar it acts on. The enzyme is a protein. So how do enzymes work? These enzymes are chemically, exquisitely specific, in the following sense. Look at this enzyme. I've shown that what it does, at least graphically, is attach this yellow square and blue square combination to the yellow circle on the left carbon. Suppose I give it as input just a yellow square: nothing happens. Suppose I give it as input the yellow square with the blue square, but the blue square also has something else on it: nothing happens. This enzyme only works if you have the yellow square with the blue square and nothing else. That's the level of so-called specificity of the enzyme.
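The cartoon representation of an enzyme can be made concrete in code. Here is a minimal sketch, my own, not from any real glycobiology library, with all names (`Node`, `Enzyme`, the colour strings) hypothetical: an enzyme is a rewrite rule that fires only when the acceptor monomer's outward branches match its required pattern exactly, capturing the "nothing else allowed" specificity just described.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One monosaccharide in a growing glycan tree; children keyed by carbon number."""
    kind: str
    children: dict = field(default_factory=dict)  # carbon number -> Node

@dataclass(frozen=True)
class Enzyme:
    """A glycosyltransferase described purely by its action, as in the cartoons:
    it binds a specific acceptor monomer whose outward branches match `required`
    exactly, then attaches one donor monomer at a specific carbon."""
    acceptor: str        # monomer kind it binds
    required: frozenset  # {(carbon, kind)} branches the acceptor must already have
    donor: str           # monomer kind it attaches
    carbon: int          # carbon on the acceptor where the donor goes

    def can_act(self, node: Node) -> bool:
        if node.kind != self.acceptor or self.carbon in node.children:
            return False
        # "exquisite specificity": the required branches, and nothing else
        have = frozenset((c, ch.kind) for c, ch in node.children.items())
        return have == self.required

    def act(self, node: Node) -> None:
        assert self.can_act(node)
        node.children[self.carbon] = Node(self.donor)
```

Note that `can_act` inspects only the acceptor and its outward branches, never the rest of the tree, which mirrors the locality of the real enzymes.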
Now, one important thing: the enzymes in glycosylation, for this large family of glycosyltransferases, don't care what's below, closer to the root of the tree. They only care what happens in branches further away from the root than the monomer of interest. This is a feature of these enzymes, and it's a feature we're going to use very specifically in our analysis. The rough idea is that the enzyme can't read the whole tree and then decide what to do; it can only read the monomers, see if they have the right size and shape and the right branches, and then act. By the way, for glycosylation of the type we are considering, technically called O-linked glycans, the trees only grow. They never get pruned; this is a fact. Secondly, there's no proofreading. It's not that the sugar waits and is taken out of the oven only when the right structure has been made; it just comes out at a certain time, whether it has reached the final structure or not. And thirdly, which I haven't written here explicitly, they only make trees; trees are never joined into cycles. The reason is that one monomer is added at a time, and only free monomers can be added to the tree; the enzyme cannot link two monomers that are already in the tree. So you only make trees. Mathematically, there are a lot of regular features of this system which make it amenable to analysis; it's a very interesting system. Okay. Yes. No, no, no, it's real time. All chemical reactions happen with some probability per unit time. So every enzymatic reaction happens at some rate alpha: the probability of it happening in a time dt is alpha dt. And the transfer of the protein from one plate, one cisterna, to the next is also a stochastic process, which happens at some rate beta per unit time.
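The competition between the enzymatic rate alpha and the exit rate beta can be put into a toy simulation. This is my own sketch under simplifying assumptions (a single enzyme extending a linear chain toward a terminal length; function names and rates are made up), not the actual model in the talk:

```python
import random

def residence(alpha: float, beta: float, max_len: int, rng: random.Random) -> int:
    """One molecule in one cisterna: an enzyme extends the chain at rate alpha,
    and the molecule exits in a vesicle at rate beta.
    Returns the chain length at the stochastic moment of exit."""
    length = 0
    while length < max_len:
        # two exponential clocks race; only WHICH one fires first matters
        if rng.random() < beta / (alpha + beta):
            break            # exit happens first: growth stops forever
        length += 1          # enzyme acts first: add one monomer
    return length

def length_distribution(alpha: float, beta: float, max_len: int = 5,
                        n: int = 100_000, seed: int = 0) -> list:
    """Empirical distribution of exit lengths over n molecules."""
    rng = random.Random(seed)
    counts = [0] * (max_len + 1)
    for _ in range(n):
        counts[residence(alpha, beta, max_len, rng)] += 1
    return [c / n for c in counts]
```

With alpha much larger than beta, nearly every molecule reaches the terminal length; with alpha comparable to beta, you get a geometric spread of incomplete structures, which is exactly the first kind of micro-heterogeneity discussed earlier.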
So it's a big Markov chain; that's what it is. Yeah, so the specificity of the enzymes is always of this type: every enzyme has as its substrate a very specific acceptor monomer with a very specific side branch or branches. There are actually hundreds of types of enzymes in our Golgi apparatus, specific to each of these possible little substrates. But the important thing is that the substrate is not the whole tree; it's just the root monomer and its branches. Now, this is another complicated slide, but let me walk you through it. In my figures, these dark boxes represent one Golgi cisterna: the reaction compartment. If you want, you can think of it as a tube in a lab in which the reactions are going on. On the left, in these little lozenges, I describe graphically which enzymes are contained in that compartment. So that's the enzyme. Here, look at this: that enzyme is one that adds a yellow circle to a blue square. That's what it does. So there are two different objects here: one is the enzyme, and the other is the actual reaction compartment. Now, what happens if I feed into the reaction compartment a tree that happens to have two blue squares? Since the enzyme doesn't care about the rest of the tree, it cannot distinguish, in principle, the left square from the right square. Therefore, stochastically, depending on which one the enzyme bumps into first, it's going to extend either the left one or the right one, with some probability per unit time. So you can think of this reaction diagram as representing the flow network of a Markov process: you transition from one state of this network to another with some probability per unit time, encoded in the chemical kinetic rates. These rates could be different; they could be slow or fast, whatever. So here, you might extend left, or you might extend right.
Once you extend left, you might extend right again. But you could extend left and then leave, and once you leave, you can never grow again. So in this reaction compartment, where things can enter and leave at arbitrary times, you get a lot of so-called incomplete structures: they don't reach their final possible state. Now, how do you make sure that this reaction actually reaches its unique final state? Very simple: just wait a long time, which I label infinity. Wait long enough so that, whatever the rates of your reactions, all the probability flows to the last state, and then you pull it out. Structures like that, which I indicate with this little notch, are called terminal structures: a terminal structure is the last structure in one of these little reaction networks, a structure from which you cannot go further. So this seems very simple: if you wait long enough, you're going to get a unique structure. Well, it's not so simple, because this is a system with only one enzyme; in fact, every plate of the Golgi has hundreds of enzymes. Suppose there are two enzymes. One enzyme has the same job as before: it adds a yellow circle to a blue square. The other enzyme adds a blue square to a yellow circle. This is really bad news, because now the longer you wait, the more you polymerize. And because this is a stochastic process, the distribution of states over all possible polymer varieties just becomes broader and broader. In other words, the longer you wait, the more diverse the products you get; you can't afford to wait long. So the only way to make a unique, well-defined final product in this system is somehow to mess with the properties of the enzymes, so that this enzyme can act only once, and then only one other enzyme can act, for example by requiring that it needs a side branch, whatever it is. You have to break this loop. If you have a loop, you're going to get diversity.
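The runaway-loop problem can be seen in a few lines of simulation. This is a sketch under toy assumptions (a single extension rate, a deterministic residence time; all names hypothetical): with a loop, the chain length is roughly Poisson distributed with mean alpha times T, so the distribution broadens as you wait, whereas without a loop it collapses onto the terminal structure.

```python
import random
import statistics

def grow_for(T: float, alpha: float, cap, rng: random.Random) -> int:
    """Grow a chain for residence time T at extension rate alpha.
    cap=None models a runaway loop (two enzymes forever re-enabling each
    other, so growth never terminates); an integer cap models a loop-free
    system that stalls at a terminal structure of that size."""
    t, length = 0.0, 0
    while True:
        t += rng.expovariate(alpha)  # exponential waiting time to next action
        if t > T or (cap is not None and length >= cap):
            return length
        length += 1

rng = random.Random(1)
# no loop: long residence time drives everything to the terminal structure
capped = [grow_for(50.0, 1.0, 5, rng) for _ in range(2000)]
# with a loop: the longer you wait, the broader the product distribution
loop_short = [grow_for(5.0, 1.0, None, rng) for _ in range(2000)]
loop_long = [grow_for(50.0, 1.0, None, rng) for _ in range(2000)]
```

Comparing the spreads of `loop_short` and `loop_long` shows the broadening: waiting longer only helps when the reaction network has no cycle.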
So you would like to wait long enough, but if you have these kinds of reactions, you cannot wait long enough. That's one point. There's another important source of diversity in the system, which is competing fates, and this happens at long times or short times; it doesn't matter. Here's what happens. You see these two enzymes? They both attack the same substrate, and in fact they both attack the same substrate at the same carbon. So if one enzyme acts, the other one can no longer act. No matter what happens, this system has a branching reaction network with two terminal states, and you're going to get both of those states coming up. The only way you can influence the system is somehow to encode the flow so that one particular enzyme always acts first and the other enzyme can never act, or some such thing. And here's a final version, similar to this one, of competing fates; it's more subtle. In this case, if one enzyme acts, the other is not attacking the same carbon. But because one enzyme acts and adds a branch, and the other one doesn't like that branch, it stops acting somewhere else. So it's a sort of regulated barrier to something happening elsewhere. Again, you get a reaction network which branches, and again the termini of the reaction network are not unique: there are many terminal structures. So this is, in some sense, the graphical abstract of the entire talk. There are three sources of randomness in the building of glycans. One is that you could exit at a short time, and because reactions are stochastic, they may not complete. The second is that if reactions are runaway, then no matter when you exit, you're going to have diverse polymeric structures. The third is that the reaction network might be branching, and that causes multiple terminal structures no matter what you do.
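The competing-fates point, that a branching network yields a mixture no matter how long you wait, can also be sketched numerically. This is my illustration under assumed rates k1 and k2 (hypothetical names, not from the talk): two enzymes race for the same carbon, whichever binds first wins, and the loser can never act afterwards.

```python
import random

def terminal_fraction(k1: float, k2: float, n: int = 100_000, seed: int = 0) -> float:
    """Two enzymes race for the same carbon of the same substrate, with
    exponential firing times at rates k1 and k2. The winner commits the
    molecule to one of two terminal products. Returns the fraction of
    molecules that end up as product 1."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n):
        # product 1 forms iff enzyme 1's clock fires first
        if rng.expovariate(k1) < rng.expovariate(k2):
            wins += 1
    return wins / n
```

The analytic answer is k1 / (k1 + k2): tuning the rates shifts the mixture, but both terminal structures always appear with nonzero probability, so you cannot wait your way out of this kind of diversity.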
So I'm going to show you, step by step, how to remove these separate sources of noise in a guaranteed way. Here's how you do it. It turns out that it's very easy to diagnose whether you're going to have a runaway reaction, and this is not as obvious as it seems. You might imagine that because the enzymes require side branches, you could use side branches in a clever way to make the system terminate at a particular size even though there's a polymerizing reaction. That's actually not true, because every time you add a new monomer, it's a bare monomer with no branches, so in some sense it erases the memory of where it came from. So it's very easy to diagnose these loops. A loop means there are two enzymes that can extend a branch indefinitely if you give them infinite time, and the only way to remove the loops is to actually remove those enzymes. Then you'll never have these long polymers. The other important thing, and I'll spend a little time on this, is these competing reactions. Let me explain in detail. Here are two enzymes. I want to show that these two enzymes are competing, in the sense that if one acts, the other cannot. Why should that be? Imagine you have a very large tree, with some monomer here and some monomer there. Look: an enzyme can act over here without influencing that one, and an enzyme can act over there without influencing this one. They could act in either order, and the reaction network would reconverge. So why would the reaction network not reconverge? There are only two reasons. Here's reason number one: the actions of this enzyme and that enzyme are literally on the same substrate, and once this enzyme has acted and added the blue branch, the other is prevented from acting, because the enzyme that would add the green doesn't like the blue branch.
It's not that it's on the same carbon; it just doesn't like the blue branch. So they inhibit each other, simply because they attack the same substrate. That's called a type one conflict. Here's the other reason, which took us a long time to identify; it's called a type two conflict, and it's interesting. Imagine an enzyme, for example this one here, which likes to add a blue branch when the right side has a green dot. So here's this enzyme acting: it adds the blue branch, and then another enzyme can act to add an orange branch to the green. No problem. But suppose the orange enzyme acts first, somewhere deep in the tree. Because it acted first, it changes the nature of the branch, and once it changes the nature of the branch, that information is read by the enzyme all the way back here, and it can no longer act. So in a sense, enzyme E3 has a very short window of time before it misses the boat: if the other enzyme acts first, it can no longer act; if it acts first, then the other one can act, no problem. That's called a type two conflict. Yeah, so the way it works is that the enzyme has some sort of lock-and-key mechanism, some thermodynamic affinity to exactly this branch shape, and the addition of that particular monomer prevents the enzyme from even binding to the substrate. It's the usual way in which enzymes detect their substrates; every enzyme in the world works just like that. That's how they find their specific substrates. So the question you're asking is a good one, and it applies equally to literally every enzymatic reaction in the world, okay? Now, having told you these are the sources of diversity, I'll tell you what happens if you remove loops. If you remove loops, something very interesting happens.
So this little graph I've drawn over here is just a graph of which monomers are allowed to follow which monomers as you walk down a branch of the final structure. What if these graphs have no loops? A graph without loops is a directed acyclic graph, also known as a partial order. If the graph has no loops, then no matter which branch of the tree you go down, all of them will respect the partial order. That's one important feature. If you remove these kinds of sources of diversity, then two things happen. One is that you always get depth-first growth. If an enzyme would be inhibited when another enzyme acts and finishes growing a branch, you cannot have such an enzyme in the system, because it may or may not act, and that causes stochasticity. The only kinds of enzymes you're allowed to have in the system are those which can act when everything else has already grown from the same substrate. So the tree grows in a depth-first manner when you remove type two conflicts. And when you remove type one conflicts, a very simple thing happens. Any substrate, and a substrate is a monomer already attached to the tree, with some branches and some empty carbons, will always have its carbons filled in the same order for the same kind of monomer, okay? So let me step back and explain very broadly what's going on. The growth order when these conditions hold is something we call uniform depth-first growth, and I'll explain that in more detail in the next 15 minutes, okay? But the rough idea is the following. We've made a dictionary between three different kinds of entities. One entity is the kind of diversity that you get in the collection of products made in the tube, okay? One is you get incomplete oligomers. The second is you get tandem repeats. The third is you get competing oligomers, competing fates. So this is just the chemistry of the output.
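The loop diagnosis described here is just a cycle check on that monomer-succession graph. As a minimal sketch (the monomer names and succession rules below are invented for illustration, not taken from the talk's data):

```python
# Sketch: diagnose a runaway reaction by checking whether the
# "which monomer may follow which" graph contains a directed cycle.
# Acyclic graph = partial order = every branch terminates at finite size.

def has_runaway(edges):
    """edges: dict mapping each monomer to the monomers allowed to follow it.
    Returns True if a directed cycle exists, i.e. some branch could grow forever."""
    WHITE, GRAY, BLACK = 0, 1, 2          # unvisited / on current path / done
    color = {u: WHITE for u in edges}

    def visit(u):
        color[u] = GRAY
        for v in edges.get(u, []):
            if color.get(v, WHITE) == GRAY:   # back edge: a loop exists
                return True
            if color.get(v, WHITE) == WHITE and visit(v):
                return True
        color[u] = BLACK
        return False

    return any(color[u] == WHITE and visit(u) for u in list(edges))

# Acyclic succession rules: green after red, orange after green -> finite trees.
safe = {"red": ["green"], "green": ["orange"], "orange": []}
# One extra enzyme putting a red back after an orange closes a loop.
loopy = {"red": ["green"], "green": ["orange"], "orange": ["red"]}

print(has_runaway(safe))   # False
print(has_runaway(loopy))  # True
```

Removing the loop means removing whichever enzyme contributes the back edge, exactly as the talk says.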
These three types of micro-heterogeneity are related to three properties of the reaction network. The reaction network either has early exit from intermediate nodes, or it has an infinite runaway reaction, or it has a divergent reaction which never reconverges, right? So these are equivalent statements, in fact provably equivalent statements. Now what does the reaction network have to do with the enzymes that are actually in the compartment? That's what took us a long time to understand, and here's the answer. The first one is easy, right? If you have early exit, that means the entire compartment has a finite residence time, or a short residence time compared to other timescales in the system. That's very simple to understand. What about the runaway reaction? If you have a runaway reaction, it means you have collections of enzymes that could, in principle, act in a cycle. A divergent reaction means you have a type one or type two conflict. And these are essentially equivalence conditions. This is a one-way implication, but it turns out to be roughly the same as an equivalence, as far as our proofs go. So we have a dictionary that says: if you get this type of diversity, it's because of these properties of the enzymes. And this is our first contribution to understanding this problem. These results are rigorous. They apply to literally any class of these Markov chains, based on any kind of enzyme specificity of the type I mentioned, for any set of reaction rates and specificities that you want to write down. In particular, they apply to literally every single simulation anybody's ever done about these systems, right? So we've taken everything everybody's ever done and put it into this page, rigorously. Okay, so let me bring it back, bring it home, right? So you have a reaction compartment. The reaction compartment contains a bunch of enzymes.
I've drawn a sort of projection of what those enzymes do. Every time there's an arrow, it means an enzyme exists that adds a green to a red. Maybe an enzyme exists that adds a red to a green, an enzyme exists that adds an orange to a green, and so on. These arrows represent which sugar the enzyme adds to which sugar. Why is it a projection? This little diagram doesn't include the side specificities of the enzymes. It doesn't include how the enzyme behaves with all its branches. All that is underlying there, obviously. This is just a guide to the eye, right? Now this particular reaction compartment has a runaway reaction. And because of that, you see there's a runaway polymer. That's why I put subscript N. This thing can get as long as you want. Not only that, there are two enzymes that inhibit each other, so this thing can be added either on the left or on the right. So both sources of diversity are there. Not to mention it could be incomplete. If you don't want the system to be incomplete, you have to remove the polymerizing reaction, so I've removed that, and you have to make the residence time infinite, and I've done that. Now once I've removed the polymerization and made the residence time infinite, two things happen. First of all, the reaction networks are not infinite. They always have finite terminal structures. That's the first important thing. And secondly, because the reaction time is infinite, you always reach the finite terminal structures. However, there won't be a unique terminal structure. For each input, you might get many products. So this is not a function. Each input gets stochastically mapped to multiple products. It is a generalized function, but if you think of a function as a one-to-one thing in the space of structures, this doesn't map every input to a unique output. In particular, do you know why this is happening?
It's because these two reactions can happen or not happen. So if you add a branch to the left side, then you can never add one to the right; and if you add one to the right side, then you can never add one to the left. That's what this symbol means. So there are divergent reactions. So now I need to remove all enzyme conflicts. I remove those conflicts. Once I do that, this compartment, this little reaction chamber, becomes what we call an algorithmic compartment. It means that literally every possible input I can give it will eventually reach a single possible output. The path is stochastic, but the output is unique, and it is guaranteed to be reached at long times. So the content of everything I've shown you so far is that we can give conditions for how to add and remove enzymes from these compartments that are necessary and sufficient to make compartments of this type. That's the important thing. The second important thing: if I could microscopically look in that compartment and see how these structures are growing, every one of these structures would grow according to this very important pattern, which I call uniform depth-first growth. Uniform depth-first growth is still stochastic. There could be many ways in which the tree grows over time. This side of the tree can go fast, then that side, or they could alternate. All that doesn't matter; the time order doesn't matter. What's important is that it is depth-first in any piece. Secondly, on any given monomer, the branches always open in a fixed order. That's what makes it uniform. These two conditions are equivalent to having algorithmic compartments for this class of Markov processes. Why is this useful? It's useful because I can work out uniform depth-first growth on the back of a napkin. So here's theorem one. So this is the whole idea of proving these theorems.
You take some complicated set of conditions, which you show apply to a large class of models, and then you reduce it to a certain testable certificate. And the testable certificate can typically be checked on the back of a napkin. The real work happens in proving that that certificate is necessary and sufficient for the result of interest. So here's a question that one can ask. Suppose I'm interested in building this tree: the bottom is attached to the root, each color is a different type of monomer, and there are branches. The tree I want to build, I don't know if you can quite see it, includes all these hollow nodes and all the dark nodes. The tree I give you is the tree that just includes the filled-in nodes and the thick branches. So I want to take the tree with the thick branches and the filled-in nodes, I want to drop it into the Golgi, and on the other side of the Golgi, I want that tree. And I want that tree as the unique final structure. Can it be done? Which enzymes does it take? What properties should the enzymes have? And so on and so forth. So here's the theorem. An input-target oligomer pair like that, where the dark bits are the input and the light bits are the target, is algorithmically achievable, meaning that if I put the right enzymes in there and put the input into the compartment, at sufficiently long times the output will come out, if and only if there is a uniform depth-first growth order from the input to the target. The last condition is easily verifiable. I just draw the input, for example, that bit of it. I draw the target it's supposed to become. And I just work out whether there is a single opening order for the branches and a single depth-first growth order which allows you to go from here to there. I'll give you an example of this in just a second. Are there any questions about this? So this is the result. Here's how the result is used.
Again, it's the same example. The dark bit is the tree I'm given. The light bit is the tree I want to make. This tree is actually a large collection of desired growth patterns. Every single monomer on this tree that grows has to grow to some sort of final state. So I break up the tree into monomers, and I write down all the ways. For example, look at that one. It's two oranges attached to a red. I want that to become like this. In fact, I also want a red with a green to become the same thing. So these are all my desired input-to-output maps that a single Golgi compartment has to satisfy. And they all have to be consistent with each other. In particular, trivially, if the same initial state had to go to two different final states here, I wouldn't be able to do it, because that's not algorithmic. But in this particular case, all the inputs seem to go to a fixed target, so the trivial counterexample is not there. Instead of drawing it graphically, I can draw it using this slightly more compressed notation. My orange monomer cannot be extended, so I don't even keep track of it. My red monomer can be attached to three different things, my green monomer to two different things, my blue monomer to two different things. This is the entire list of desired input-output maps that allow this tree to work. So for example, the red monomer that's attached to a B on the left, which I suppose would be this one, is eventually left alone. That one doesn't have any new branches attached to it. So that's that condition. On the other hand, a red monomer with a D on the right, which I guess is that one, must be extended by a C in the middle, which is that one. What happens to the C? That's also written over here. A C with nothing on it becomes a C with a D attached to it. So this is a hierarchical, self-consistent series of things. So this is a highly compressed notation. And why am I allowed to do it hierarchically?
It's because it's depth-first growth. So I can self-consistently assume that once I open a branch, it gets finished. I don't have too many things to check. That's why I'm saying back of a napkin. If I write all these conditions down on a sort of graph, showing how empty monomer A evolves, you see why the arrows keep going down: I have empty monomer A, and eventually it goes into various final states. It can be filled with this in the middle branch, this in the side branch, that in some other branch. Let's take a simple case. Empty monomer C of that type must become monomer C with monomer D on the right, like that. So that's why I have an arrow there. However, monomer C with the D on the left side actually becomes a monomer C with the D on the left side and an A on the right side. That's why I have that arrow over there. So these conditions get mapped onto this graph. Similarly for B, these conditions get mapped onto this graph. But what about A? For A, it turns out that all these conditions cannot live at the same time on a graph of this size. They conflict with each other. So if you manage to achieve all these conditions, that condition cannot be achieved. The reason is that from every one of these internal states, only one output state is allowed. That's what uniform depth-first growth means. If I have a certain set of open branches, only one next open branch is allowed. I'm not allowed to bifurcate my reactions. So, as you may not have been able to tell right at the beginning, this tree cannot be made in one compartment. I've given you an example of a tree that no Golgi apparatus with any amount of fancy enzymes can ever make in a single compartment. But the fact is the Golgi has multiple compartments. So you might want to ask: can a tree be made in multiple compartments? This tree, if you sit and work it all out, can trivially be made in two compartments. And now I'm giving you the punch line. This is the reason why the Golgi has multiple compartments.
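The back-of-a-napkin certificate just described amounts to checking that the desired map "partial branch state, next branch to open" is a function, never demanding two different next branches from the same state. A minimal sketch (the monomer and branch labels are invented stand-ins, not the talk's actual A/B/C/D tree):

```python
# Sketch: collect every desired step (monomer type, branches already open,
# next branch to open) and flag states that are asked to bifurcate.
# Any such bifurcation violates uniform depth-first growth, so the tree
# cannot be made in a single algorithmic compartment.

def uniform_dfg_certificate(steps):
    """steps: list of (monomer, current_branches, next_branch).
    Returns (ok, conflicts); conflicts lists states with two demanded outputs."""
    wanted, conflicts = {}, []
    for monomer, state, nxt in steps:
        key = (monomer, frozenset(state))
        if key in wanted and wanted[key] != nxt:
            conflicts.append((key, wanted[key], nxt))
        wanted[key] = nxt
    return (not conflicts, conflicts)

# Monomer "A" is asked to open branch "C" first in one place on the tree,
# but branch "B" first in another place: the same state bifurcates.
steps = [
    ("A", (), "C"),
    ("A", ("C",), "B"),
    ("A", (), "B"),     # conflicts with the first condition
    ("C", (), "D"),
]
ok, conflicts = uniform_dfg_certificate(steps)
print(ok)               # False -> needs more than one compartment
```

When `ok` is False, the fix is exactly the one in the talk: split the conflicting conditions across successive compartments.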
There are classes of structures that cannot be made in a single reaction compartment, and the class of structures that can be made in multiple compartments is much, much larger. That's what the eukaryotic cell makes use of, whereas bacteria have a single reaction compartment. OK. So here's the theorem for multiple compartments. It's a simple extension of the previous theorem, and I won't explain how the proof actually goes. But basically, if I have an input and a target, here's the theorem. An input-target pair is algorithmically achievable in a series of n compartments if and only if there is a growth order from the input to the target that can be fully decomposed into n stretches of the type I mentioned. If you can work out whether that's true, then you can make it in n compartments. In fact, there's a much easier way to test this theorem. If you don't care how many compartments you make something in, then there's a very easy check. An input-target pair is algorithmically achievable in some number of compartments if and only if there is a series of single-enzyme compartments that, in the infinite-time limit, can convert the input to the target. And this is actually a trivial thing to check. Is there a series of single-enzyme, infinite-time compartments between an input and a target that can make the structure I want? If I find such a thing, then I can always make it in some number of compartments. How many compartments? Well, that depends on how large this n is. So I'll give you some examples. Here's an example of a system where this is the target I want to make, and that's the input. A green edge means that there is a single-enzyme compartment that converts this input to this target. For example, this system converts that input to that target at infinite time, obviously. So that's what these green edges mean. Each of those green edges is a compartment.
So I can take this input and convert it to these outputs, these outputs, these outputs, and so on. I could write down the entire set of intermediate states that go from this input to this output, which is what I've shown here. And I can connect any pair that can be done by a single enzyme. Look at this long arrow. That step actually adds two monomers. It's not that a single enzyme adds a single monomer: if there are two places it can act, it will act in both places, because it's the infinite-time limit. And then all I have to do is to find out whether there's a connected path going from the input to the target. If so, it can be made. If not, like that guy, it cannot be made. It looks very simple, but it cannot be made, uniquely, in any number of Golgi compartments, by any enzyme of the type the Golgi is known to contain. That thing cannot be made; that thing can be made. So it's counterintuitive. The reason is that this guy has enough bells and whistles to regulate the growth, whereas this guy does not. Why can't it be made? Because imagine going from these structures to this structure. Either this side will be extended or that side will be extended. Either you have to extend both the blues or you have to extend both the yellows; there's no way to extend just one of them. So the enzyme doesn't have enough information, and since I haven't added enough side branches to provide some control, you can't make that. Here, that red branch is crucially important, so you can actually regulate. By the way, how many compartments does it take to make this? Well, you can take these growth orders and find the minimal number of depth-first stretches. This turns out to be a structure of complexity three. You need three compartments to make that structure; you can't make it in fewer. So that's the idea. For each oligomer, you find the minimum number of distinct compartments needed for its synthesis.
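The connected-path test just described is ordinary graph reachability. As a minimal sketch, with invented state names ("mid1", "mid2", and so on) standing in for intermediate structures, and each edge meaning "one single-enzyme, infinite-time compartment converts this structure into that one":

```python
# Sketch: breadth-first search over the graph of intermediate structures.
# A path from input to target means the target is achievable in some
# series of compartments; no path means no Golgi can make it uniquely.
# Note: the true minimal compartment count is the minimal number of
# depth-first stretches, which a plain shortest path only approximates.

from collections import deque

def path_length(edges, src, dst):
    """Shortest number of single-enzyme compartments; None if unreachable."""
    dist, q = {src: 0}, deque([src])
    while q:
        u = q.popleft()
        if u == dst:
            return dist[u]
        for v in edges.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return None

edges = {
    "input": ["mid1", "mid2"],
    "mid1": ["target"],   # reachable: two compartments suffice
    "mid2": [],           # dead end: no enzyme extends this correctly
}
print(path_length(edges, "input", "target"))       # 2
print(path_length(edges, "input", "unreachable"))  # None
```

The "simple-looking but unmakeable" structure in the talk is exactly the case where every path dead-ends before the target.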
Why is this structure made and no other? It's because the reaction network is not infinitely branching. Imagine for a minute that this is the real Golgi apparatus of a real cell. In fact, in human cells, three Golgi compartments is the norm. They're called the cis, medial, and trans Golgi. So it's maybe not a coincidence that this human glycan is an n = 3 complexity structure. Imagine that these three Golgi compartments somehow collapsed into one compartment. If they collapse into one compartment, the same reaction is still possible, obviously, because the same enzymes are there. But many, many more reactions are also possible, and all those other reactions cause all kinds of other structures to be made. So you can't collapse this any further, because it would cause other structures to be made, and you don't want that. That being said, most of the precision comes from the way the enzymes must act one after the other. So most of the control in the Golgi is coming from the enzymes and not from the compartments. It's a subtle matter. This is also a minimal requirement. If my enzymes are sloppy, if my enzymes cannot read up to infinite depth, which I've assumed they can, you need even more compartments. So this is just a minimal number. So let me summarize. The reason you have compartments in the Golgi apparatus is to separate the enzymes, so you don't have these runaway reactions; to order the reactions, so you have a few terminal structures; and to build these controlling side branches, so that the next compartment can use the side branch to differentially treat two monomers that otherwise might have been identical. And of course, the reaction kinetics can be used for independent tuning of the different compartments. So finally, here's real data. This is the last slide, and it's real data. The real data involves exactly the same three structures I used. There are three compositionally distinct regions of the Golgi.
But there are other species, for example some plants, where there are many, many more, and there are some species where there's only one. So here we go. These are the structures I showed you in one of the earliest slides. These are the mucin glycans in your lungs. This happens to be from a data set we found for a patient who had cystic fibrosis, but I don't think that's relevant for this example. These are the glycans attached to HCG, the pregnancy hormone. And these are the glycans attached to horse CG, the horse pregnancy hormone. You can see that they're all related, but distinct. Now, these are not exactly the same as the test case that I've been using for my whole talk. My whole talk has been about how you make a single unique structure. The reason I went down that path is that by forcing myself to make a single unique structure, I discovered what the sources of diversity are. If I hadn't thought about that single algorithmic goal, I wouldn't have understood rigorously what the sources of diversity are. But in real cells, you don't make a single structure. You make a few. To be honest, there are astronomically many that are possible, so even the fact that you make only a few is quite a task. But you make a few. My theorem gives you unique answers for a single final structure, or it gives you the entire set of answers. In this case, where I want to make a collection of structures, I haven't yet figured out, and it's an open research problem, how many ways I can arrange enzymes to make exactly that set of structures. But here are some non-unique solutions to make this collection of structures. These are three Golgi compartments with the flow happening downwards. The protein goes here first, glycans are grown, then it goes here, then here, then it comes out. Here are the enzymes in each of the compartments.
Here are the enzymes I'm predicting, non-uniquely, will make this collection of CF mucin glycans. Here are the enzymes I'm predicting make human chorionic gonadotropin, and here are the enzymes I'm predicting make horse chorionic gonadotropin. These enzymes cause these reactions, so these enzyme sets correspond to these reaction networks. You'll notice that they all start with the identical root, because when you first ship something to the Golgi, you ship it as a monomer. Then you grow it, and then you pull it out the other end. And just to focus the mind here, I'm going to assume that it spends a lot of time in the earlier compartments. That's why I've labeled them infinite. But I'm going to intervene and pull out the protein at some intermediate time t from the final compartment, just as a way to understand what growth is going on. So I'm going to let the reaction continue, but I'm going to pull it out prematurely and see what I get. And here's what I find. This is, of course, a Markov chain. Every one of those edges has some sort of rate. Let's assume, for the purposes of this simple case, that all those rates are identical. If they're all identical, I can predict the probability distribution over all these possible states, starting from an initial condition with just the root. So I get the whole probability distribution, which is the kind of thing we've been discussing in the class. So this is interesting, right? These CF mucins are not the terminal structures of the system. In fact, they are incomplete structures. They include structures that would otherwise be extended. So the only way I can get these is to pull them out at a short time, before they all get completed. And the way I see this is, well, I can actually find the time point where I get this distribution as close as possible in the Kullback-Leibler sense.
So there's a way I can actually fine-tune the time to make this the final structure. Or, if you want to think of it differently, by looking at this structure, I now have a way to figure out what the residence time in the Golgi was, relative to some standard reaction rate. So it's interesting: it's an inverse question. The human chorionic gonadotropin is different, because the structures that you see are the terminal structures of the system. So in fact, you get them as long as you wait infinite time, and there's no need for any fine-tuning. As long as you wait long enough, these are the structures you get. And finally, the horse case, which I'm going to discuss for a little while. Incidentally, I can always find the entropy over the collection of possible states, the Shannon entropy, the sum of p log 1 over p. That's what I've plotted on the y-axis. It's in bits, log base 2, as a function of the residence time. All these systems, well, at least these two systems, start off with a non-zero entropy. That's because the earlier reactions were divergent. These reactions have branches, and therefore you already enter with an entropy of 1.5. Then the reactions in here, because there are further branches, can create more and more states. It's like a factory where cars go into the factory incomplete. Some people are adding a door; somebody else is adding a mirror to another car. The diversity of the types of cars increases. But eventually, of course, all cars come out complete. Somebody adds a door to the one with the mirror, and so on. So because there's branching here, the entropy always increases at short times for these two conditions. The mucin is actually pulled out at the highest-entropy state. You pull it out when it's incomplete. The HCG, you actually leave long enough that the entropy decreases again, because you complete all the structures. So it relaxes back to the original entropy. I can't go below that.
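The entropy curve and the Kullback-Leibler fit can be sketched on a toy chain. This is a discrete-time caricature with equal rates and an invented 3-state chain and "observed" distribution, purely to illustrate the calculation, not the talk's actual data:

```python
# Sketch: propagate a distribution through a Markov chain over structures,
# compute the Shannon entropy in bits, and find the residence time whose
# distribution best matches an observed one in the Kullback-Leibler sense.

import math

def step(p, T):
    """One step of the chain: p' = p . T (row-stochastic matrix T)."""
    n = len(p)
    return [sum(p[i] * T[i][j] for i in range(n)) for j in range(n)]

def entropy_bits(p):
    return -sum(x * math.log2(x) for x in p if x > 0)

def kl_bits(p, q):
    """KL divergence D(p || q) in bits; infinite if q misses support of p."""
    if any(x > 0 and q[i] == 0 for i, x in enumerate(p)):
        return math.inf
    return sum(x * math.log2(x / q[i]) for i, x in enumerate(p) if x > 0)

# States: 0 = bare root, 1 = half-grown, 2 = terminal; equal rates mean
# half the mass in each non-terminal state moves on per step, and the
# terminal structure absorbs.
T = [[0.5, 0.5, 0.0],
     [0.0, 0.5, 0.5],
     [0.0, 0.0, 1.0]]

p = [1.0, 0.0, 0.0]            # everything enters as the bare root
observed = [0.25, 0.5, 0.25]   # pretend "mucin-like" incomplete mixture

best = None
for t in range(1, 30):
    p = step(p, T)
    d = kl_bits(observed, p)
    if best is None or d < best[1]:
        best = (t, d)

print(best[0])                         # inferred residence time
print(entropy_bits(observed))          # 1.5 bits for this distribution
```

The same machinery distinguishes the three real cases: mucins match best at an intermediate time, HCG at infinite time, and the horse case never converges because of the runaway loop.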
This guy is the most interesting in my mind. This is real data. What we infer is that the horse gonadotropin Golgi looks just like the human gonadotropin Golgi, but for some reason the last two compartments have collapsed into a single compartment. And because that happened, two enzymes that were perfectly fine before now make a loop. And because they make a loop, there's a runaway reaction. And that's exactly what you see: this polymerization step. So these three data sets, and this is the reason I pulled them out, correspond one to one to our three proven sources of diversity: incompleteness, polymerization, and divergent reactions. So this is real data telling us that we have, in some sense, isolated the three sources of randomness correctly in the real data. OK, so I'll just end with this. I don't know how many of you know it, but of course, in Europe, IKEA is very popular. It's much easier to copy a template than to actually follow a recipe, as you all well know. And I think by studying glycans, we've learned a lot about how a cell does it. And to my mind, it's very beautiful that the solution requires intracellular compartments. Intracellular compartments were only invented by cells about two billion years ago. Before that, cells did not have intracellular compartments, and therefore did not have a social life of the type which eukaryotes are capable of, which includes things like sexual reproduction and multicellularity. So my claim is that the origin of the Golgi apparatus is deeply connected to the emergence of true multicellularity and sexual reproduction in eukaryotes. But this is a claim. So I'll stop there. Thank you. I hope that made sense, and I hope you enjoyed it.