It's by a group called Emerson, Lake and Palmer. I think they're all dead by now. All right, so anyway, with that rather strange introduction, it's four o'clock in the afternoon, so I'm allowed to make strange introductions. So today, while the magician's assistant keeps running back and forth across the stage in front of the fake magician, I would like to finish up what I was presenting yesterday. The part of yesterday that I did show was sort of standard stochastic thermodynamics 101, so to speak. This is some newer material. There's always this weird thing when you give lectures: how much do you present your own stuff, rather than trying to go after what you think is truly important, given that you think your own stuff is truly important? And so this is actually my own stuff, but I do think it's truly important. So, more strange preamble. But if you remember, I pointed out, I can't speak with this thing, excuse me, I pointed out yesterday that almost all stochastic thermodynamics today has considered the scenario where you are given a fixed initial distribution, and then we want to examine what happens as we change the actual physical process that works on that initial distribution over states. We can add multiple reservoirs, we can do things faster or slower, we can have periodic forcing, all kinds of complicated things you can do, but the totality of it was about varying the process with a fixed initial distribution. However, as I tried to emphasize, in very, very many situations in the real world, basically everything I would say to do with computation, whether you're building a computer or you are giving birth to one in the sense of gestating a new biological computer, in both of those systems the process is fixed. The actual individual gates in my computer, they're fixed ahead of time. Same thing, to some degree or other, depending on how much you might consider things like phenotypic plasticity.
A biological organism, the process that it runs, by which it actually transduces free energy in the environment into things like its own computation, that process is also fixed. What's varying greatly there instead is the initial distribution. So it's actually the exact complement of what stochastic thermodynamics has worked on almost exclusively. It's the complement that's of most interest. So what can we say then about the dependence of, in particular, entropy production on the initial distribution? I fix a process for the remainder of this lecture. The process is fixed, but we're varying the initial distribution, and how does that affect things? So to keep life simple, let's see, is the laser doing better? Oh. It cut off, I don't know why. Yep, it should be this one, Acrobat. Actually, concerning musical pieces, there's an album called Before and After Science, and I was seriously thinking about playing the "before science" part of it at the beginning of the very first lecture, and then there's the next track on that CD, which is the "after," and that could be at the end of the course. All kinds of deep symbolism, but in any case. So we're varying the initial distribution. We've got a fixed process. For simplicity, we're gonna be assuming a single reservoir, and I'm just going to choose units so that kT equals one throughout everything. Recall, I've now made you sick and tired, hopefully, of this formula. This is the entropy flow as a function of the rate matrices, which can be changing in time, and also the distribution at that moment in time. This is a very weak laser. The thing I want you to notice: this is a linear function of P. In other words, another way of saying that is, it's a linear function of the initial state that you begin with.
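To make that linearity concrete, here's a minimal numerical sketch (the rate matrix and distributions are made up for illustration; the entropy-flow-rate formula is the standard CTMC one, with kT = 1 and everything in nats):

```python
import numpy as np

# Entropy-flow rate for a CTMC with rate matrix W, where W[x, xp] is the
# jump rate xp -> x: sum over pairs of W[x, xp] * p[xp] * ln(W[x, xp] / W[xp, x]).
# Every term is proportional to one component of p, so the whole expression
# is linear in the distribution p.
def entropy_flow_rate(W, p):
    total = 0.0
    n = len(p)
    for x in range(n):
        for xp in range(n):
            if x != xp and W[x, xp] > 0 and W[xp, x] > 0:
                total += W[x, xp] * p[xp] * np.log(W[x, xp] / W[xp, x])
    return total

W = np.array([[-2.0, 1.0],
              [ 2.0, -1.0]])        # toy 2-state rate matrix (columns sum to 0)
p = np.array([0.3, 0.7])
r = np.array([0.9, 0.1])
a = 0.4
lhs = entropy_flow_rate(W, a * p + (1 - a) * r)
rhs = a * entropy_flow_rate(W, p) + (1 - a) * entropy_flow_rate(W, r)
assert abs(lhs - rhs) < 1e-12       # linearity in the distribution
```

The same linearity carries over to the time-integrated entropy flow, which is what the rest of the argument exploits.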
So for any given initial state, I can calculate what the total entropy flow, the total heat due to starting in that state, would be, and the actual total entropy flow under expectation is just the average over that. That's going to be very, very important. So, let's see. And so, because of that fact, we can define the heat that would be produced under expectation if we were to start in one particular state, x at time zero. So if we've got a distribution over states, the entropy flow is just going to be the average of this over all x at time zero. And then when we average over all x at time zero, what we're going to get for the entropy production is just the formula from before: it's the drop in the entropy of the total system minus the average of this. The key thing is that this is a linear function of P zero. The entropy term is not linear, but this one is. Okay, here's going to be the weird thing. In certain respects, when you look at how the entropy behaves under differentiation, no, it's not strictly speaking a linear function, but it acts just like a linear function. So actually the directional derivative of this is going to be essentially linear, and you'll see what I mean in a second. Okay, so define this distribution Q zero. That is the initial distribution that results in the minimal EP. It's not too hard to show that for any physical process there's going to be a unique such initial distribution, unless it's doing a permutation over states, unless it's actually implementing a dynamics which is exactly deterministic and invertible. For any other kind of a process, there's going to be a unique initial distribution such that if you start with that initial distribution, you end up with the minimal total entropy production.
Okay, another way to view it is that if you design a bit eraser, you design an AND gate, you design a human being, all of them you can design: if you happen to know the initial distribution of that particular system that you want to perform a computation, you can in theory design the process to be optimal for that distribution. Or, to flip it around, if you're given the process, the AND gate, the human being, what have you, ahead of time, there is going to be one initial distribution over its states such that it's going to generate the least heat in the form of entropy production, okay? So that's this creature. In general, it's also going to have full support. It's not going to be some particular edge case. Okay, so given that, notice we're now doing very, very simple elementary calculus. For any other initial distribution R zero, the directional derivative from Q zero, the optimal one, in the direction of R zero, evaluated at Q zero, because this is the minimizer, and because it turns out that yes, everything is differentiable and so on and so forth, this is going to be zero, no matter what R zero is. That's what it means for Q zero to be minimizing this: the directional derivative is going to be zero there. So this dot product is always zero, okay? Let's see, is this showing up? Okay. And the gradient that I'm taking here, this is the gradient over the space of probability distributions. Question. The whole thing will be zero as well, so just the way you can think of it, yes, yes and yes. Well, see, the thing of it is, we actually are restricted to be on the simplex, because probability distributions have to lie on the simplex. So in general, this gradient term right here, that's not going to actually equal zero in all components. But this dot product has to, because it's the minimizer.
That dot product, see, this is on the simplex and that's on the simplex. Because we're constrained to be on the simplex, this gradient is not going to be zero in general. This is a gradient over the full space. For example, let's say that you had three possible values of your system. In this case, the simplex sits in R three; it's just going to be a plane. And we are requiring that Q zero has got to be on that plane. But this gradient is over that entire space R three. So that gradient is in general not going to be zero at the point on the plane where it minimizes things. But this directional derivative has to be. You can also do this with Lagrange multipliers kinds of stuff, but this argument will go through. So this is actually always true for minimal points on any particular kind of a hyperplane or hypersurface through some higher dimensional space. An embedded manifold. That's a little bit beyond what we need; I'm not going into the messy details. Yeah, it's due to the fact that you have to be on the unit simplex. That means that in general you're not going to have that gradient be zero. Okay, good question, very good question. In fact, I had to remind myself what the answer to that was several days ago. So there you go. Any other questions? Notice, by the way, the implicit arrogance when I say, in response to a question you asked, that it's a good question. Pretty good, one of the advantages of being a professor. Okay, yes, so as I was just saying: because we're on a simplex, the gradient is not identically zero. Okay, now we're just going to be doing a little bit of algebra. Recall that we just said that the total entropy production, which we're at the minimum of, is given by this formula, and that's linear in P. So it's very, very easy. What is the gradient of this term with respect to the components of P? It's just going to be Q of X zero.
So the X zero is indexing the actual components of P, and the gradient of the expected Q under P is then just going to be that Q of X zero. Simple, here we go. Now let's look at the entropy term. This is where the entropy term is going to act sort of like it was linear. Sorry, there's junk on my screen that you get to not see, and it's kind of hiding things from me. There we go, that's better. Okay, so then when we evaluate the gradient of the entropy, remember this is the change in entropy from P at time zero to P at time one. We're going from time zero to time one. And so when you take that gradient with respect to P at X zero, you're going to get a conditional distribution, and basically the first step of your algebra is going to be this. Those terms, in the second step of your algebra, you end up with that. Okay, everybody pretty much with me? Okay, this is not working, so I gotta do it that way. That's not working either. Okay, so anyway, let's now just plug in. Sorry, this stuff is not behaving very nicely. So if we plug in, we know that R zero minus Q zero times this gradient has to equal zero. Here are the entire terms of the gradient; this is a single component of the gradient. So we just plug it in. Q zero times the gradient, what is it going to equal? It's actually itself going to be the EP at Q zero. And R zero times the gradient, when you go through the math just a little bit, you remember the definition of relative entropy, of KL divergence, from Gülce's presentation a couple of days ago. This is actually just going to be the difference of the KL divergences between R and Q at time one and time zero, where, to evaluate R one and Q one, you hit both Q zero and R zero with the exact same conditional distribution P of X one given X zero. So P of X one given X zero times Q zero, that gives you Q one.
P of X one given X zero times R zero, that gives you R one. Okay, so that's how these terms are defined right here. Note that this is completely general. It doesn't depend on the details of the process at all. All details of the process are only coming in through this value, sorry, that's a typo, that should be delta sigma of Q zero. All details of the process are coming in through delta sigma of Q zero. In other words, they are saying: what is the least amount of entropy production you could do over all initial distributions? That's where all the messy crap comes in. But aside from that, which is just an additive factor, the actual amount of entropy production for R zero is going to be just this drop in the KL divergences. So the difference between sigma of R zero and sigma of Q zero is just the drop in the KL divergences. That's the amount of extra EP due to the fact that you are not actually using the distribution that is optimal for your process. Take this computer: right now I'm using it, and I've got some distribution over inputs to this computer, by which I initialize its state, run a computation, initialize its state, run a computation; that's happening many, many times. We can assume that I'm just like an IID process, even if that's not really the case. But we could imagine, Apple of course wants nothing more than to build their computer specifically for David, so we can imagine that they actually built their computer such that David's distribution resulted in minimal EP for this computer. Then when Professor Marsili uses it, his initial distribution over inputs to this computer is different, and for that fact alone, there's going to be extra entropy production when he uses it compared to when I use it.
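To summarize the derivation in symbols (kT = 1, nats; the notation is my own reconstruction of what is on the slides):

```latex
% EP as a function of the initial distribution p_0, with Q(x_0) the
% expected entropy flow conditioned on starting in state x_0:
\[
  \sigma(p_0) \;=\; S(p_1) - S(p_0) + \sum_{x_0} p_0(x_0)\, Q(x_0),
  \qquad p_1(x_1) = \sum_{x_0} P(x_1 \mid x_0)\, p_0(x_0).
\]
% Differentiating component by component (the +1's coming from the two
% entropy terms cancel, which is the sense in which entropy "acts linear"):
\[
  \frac{\partial \sigma}{\partial p_0(x_0)}
  \;=\; \ln p_0(x_0) \;-\; \sum_{x_1} P(x_1 \mid x_0)\, \ln p_1(x_1) \;+\; Q(x_0).
\]
% Since q_0 minimizes sigma on the simplex,
% (r_0 - q_0) \cdot \nabla \sigma(q_0) = 0 for every distribution r_0.
% Expanding that condition and regrouping gives
\[
  \sigma(r_0) \;=\; \sigma(q_0) \;+\; D(r_0 \,\|\, q_0) \;-\; D(r_1 \,\|\, q_1),
\]
% where r_1 and q_1 are the images of r_0 and q_0 under P(x_1 | x_0).
```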
In particular, even if it's thermodynamically reversible for me, the best you can do based upon the second law, the exact same physical device, that same physical process that is thermodynamically reversible for one initial distribution, will be irreversible, will be generating strictly positive entropy production, for any other distribution, okay? That has massive consequences for things like designing computer circuits, consequences which nobody has yet figured out, has even started to calculate. Again, another plug if people are looking for PhD projects. Okay, so these are some references for it. Originally, this was some work that I did with a postdoc, Artemy Kolchinsky. There's been recent work by Paul Riechers and Mile Gu on it. More recently than the original paper, we extended this to quantum thermodynamics, we extended it to other classes of processes in time, we extended it to uncountably infinite state spaces, and a whole bunch of other mathematical junk. So we refer to this as the mismatch cost. As Gülce noticed yesterday, this has got a very, very interesting connection with information theory, because in information theory, if we have a channel and a coding system for that channel that is optimal for some distribution Q zero, but in fact you generate signals going down that channel according to P zero, the expected code length of your words coming out the other side of the channel is now going to be the relative entropy plus the entropy that it would have been if you were actually spot on Q zero. So in other words, that first relative entropy term here is in essence the extra expected length of code words if you have a code book that is optimal for a distribution Q zero, but you're actually running it with a distribution P zero. Remember, information theory is all about information transmission. We're not interested in that.
We're interested in information transformation, as computation. And so, well, it's very clean, very cute, and I do not yet know actually if there's any deeper significance to this. But what happens in the exact same kind of a mismatch scenario is that, instead of looking at just the relative entropy between P and Q, now it's the drop in the relative entropy between P and Q. Okay? Did you say anything about the underlying assumptions on dynamics, for example? Okay, we are gonna talk about the finite bath formalism tomorrow. And we were discussing just... This holds no matter what. Exactly. That's the point. Yeah, I want you to say it. Yes, exactly. So we are gonna see the finite bath formalism tomorrow. I mean, until this point in the course, we didn't account for the degrees of freedom of the bath, the heat baths, right? Because we assume that they're just straightforwardly infinite. But while we are deriving this, we're not actually assuming that. For example, in the finite bath formalism, when we start to actually take into account these degrees of freedom of the bath, we are going to assume that the overall dynamics of the joint system and the bath is actually governed by some Hamiltonian. So you have deterministic dynamics, not stochastic dynamics. But even in this formalism, if you take this kind of a derivation, you can still use this kind of a result. So yeah, I just wanted you to say it. Yeah, okay, that's a good point. So Gülce is actually referring to several things. There has been a smaller portion of current stochastic thermodynamics... so let's see, step back a second. I mentioned earlier that if you're going to have a Markov process over your system of interest, you need to have an infinite bath that has also got a time scale separation, so that you're losing all information about the fast degrees of freedom.
The bath's dynamics is infinitely faster than that of your system of interest, so that it's always at a Boltzmann distribution and contains no information about the previous states of your system of interest. Your bath, or baths plural. That's what allows you to treat the dynamics of your system of interest as though it were a first order Markov process, that is, a continuous time Markov chain. And as I emphasized a couple of days ago, that's everything from ecosystems to gene regulatory networks to, for that matter, financial markets. And, particularly relevant today, geopolitical systems: you can model them as continuous time Markov chains. But there's another body of approaches, another body of work in stochastic thermodynamics, where they say, wait a second, what if that's not true? And there are actually many reasons to know that it's never, strictly speaking, true in the real world. In any case, in the approach we've been talking about so far, you say: here's my system of interest, and it is coupled to one or more infinite baths. And so we actually don't directly see Liouville's equation or anything like that in this particular formalism; I'm just saying there's a CTMC over this. Another approach, now with Hamiltonian dynamics, is to say: here's a system of interest, and we have one or more baths, but they're all the same kind of size. Actually, let me make this somehow a little bit more graphic. I'll do better. So here, blue is infinite. White is all Hamiltonian dynamics; blue is not. This is finite, and that's finite. So in this alternative approach, what you do is you actually look at the dynamics of this. You assume that this SOI, system of interest, and these baths are isolated from the outside world. That's your assumption. This is then Hamiltonian dynamics.
Liouville's equation, if you're doing this in terms of phase space and so on, or Schrodinger's equation, if you do the quantum formulation of this; you can do it either way. And in this particular setup, a lot of the results that we've been deriving so far have analogs in this domain. In this domain, though, the evolution of the SOI in general is not Markovian. The reason being that, because the bath is finite, if you're in one particular state at one particular time, you're going to be having an effect on the bath. The bath is then going to be having an effect back on how you go from X of t to X of t plus one. The end result of this process is going to be that X of t plus one depends upon X of t minus one as well as X of t, because of what's going on through the finite bath. The finite bath is carrying information about your past. Therefore, the probability distribution of X of t plus one given X of t is not the same as the distribution of X of t plus one given X of t and X of t minus one. X of t minus one has extra information that's not in X of t, so it's not a first order Markov process, when you go to finite baths, okay? So that makes some things a lot harder, but here's the trick. Here's the slide that I was presenting before. I should have actually pointed this out; I was saying at the time that the details of the actual dynamics of the system don't matter, but here's the reason they don't matter. Notice that we're just looking at X zero and X one and a conditional distribution of X one given X zero, where X zero is the very beginning of the process and X one is the end of the process. What goes on between X zero and X one? It's just independent of things like, oh, the number of reservoirs you have or anything like that. There's no assumption of Markovian dynamics anywhere.
There's just some arbitrary God-given conditional distribution of X one given X zero. Because the entropy flow function is linear, and because entropy itself, the change in entropy, once you take its derivative, is indistinguishable from a linear function, the end result is that we have this form for the mismatch cost even if the intervening dynamics is not Markovian. And so tomorrow Gülce will be presenting some work on this, and then, in terms of using this formula, applying it to these kinds of situations, I'll be presenting stuff on that next week, when we're talking about basically using this to model deterministic finite automata, to analyze them. Okay, so that was kind of backwards and forwards. Yes, question. For example, if I have a hot cup of tea... In the sense that if you have a hot cup of tea, just a hot cup of tea. Yeah. At some point it will thermalize with the environment and it will be cold. And there's no way of knowing the initial state of that hot cup of tea. So in that sense, we're working with an infinite bath, in the sense that it forgets; the hot cup of tea itself is like an infinite bath. But the question is, if I'm now attaching something: so I've got over here a hot cup of tea and over there a cold glass of ice, let's say. So I've got a temperature difference; I can use that to run a heat engine. And at that point, what becomes important is actually: is the working of the internal parts of the heat engine slow compared to the thermal agitation time? The Maxwell distribution, I guess it was Maxwell who came up with the distribution over speeds of molecules as a function of temperature, for the hot tea and the cold ice. The actual system in between has to run on typical time scales that are much, much slower compared to those of these two. Okay, but regarding that, in terms of memory of the system.
So when that's the case, when the engine is much slower, that means that, in essence, the bath thermalizes, that cup of tea thermalizes, very, very fast. So by the time you get to an X of t that differs from this with any probability, we've waited long enough that this is not going to be almost surely identical; it's actually had time to change. Once we wait that much time, this is right back to being a hot cup of tea at its Boltzmann distribution again. And so it has no more information about what the previous state of your system was. And notice, there's actually a very, very deep philosophical issue here about what probability distributions mean. Because really, as a practical engineering issue, that information is still here, but in an extremely fine-grained, complicated distribution. What happens is, loosely speaking, there was an initial distribution for the bath that was like this. It was perturbed by interacting with X of t minus one. And what happened is, this actually became something that looked, you know, this is very, very sort of high level, but I think you can get the idea: let's say this started like that, and now it's become like this. What we do in physics is essentially to approximate: this is really just kind of the same thing, because we are not able to actually discern, or more importantly, we as engineers cannot exploit, any of this really fine-grained filigree detail in this particular actual distribution. So as far as our experimental apparatus is concerned, this is identical to that. It's not that they're approximately the same; it's that our fingers are too crude and rough to be able to get them into any of these holes. So essentially they are the same, as far as all of our devices are concerned. So there are these kinds of slippery, subtle things going on. Thank you. Okay. Sorry, this has been a bit of a bounce forward, bounce backward kind of a presentation, but in any case, as I said, that's called mismatch cost.
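Before turning to implications, the mismatch-cost decomposition itself can be checked numerically. This is a sketch under my own toy setup: the process is summarized by its end-to-end conditional distribution T, and q0 plays the role of the EP-minimizing initial distribution:

```python
import numpy as np

def kl(p, q):
    """Relative entropy D(p || q) in nats; assumes supp(p) is inside supp(q)."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def mismatch_cost(r0, q0, T):
    """sigma(r0) - sigma(q0) = D(r0 || q0) - D(r1 || q1), where r1 and q1
    are the images of r0 and q0 under T[x1, x0] = P(x1 | x0)."""
    r1, q1 = T @ r0, T @ q0
    return kl(r0, q0) - kl(r1, q1)

T = np.array([[0.9, 0.7],     # toy "partial eraser": both inputs pushed toward state 0
              [0.1, 0.3]])
q0 = np.array([0.5, 0.5])     # hypothetical EP-minimizing initial distribution
r0 = np.array([0.2, 0.8])     # the distribution actually fed in

assert abs(mismatch_cost(q0, q0, T)) < 1e-12   # no mismatch at the optimum
assert mismatch_cost(r0, q0, T) > 0.0          # data processing: KL can only drop
```

The last assertion is the data-processing inequality at work: a KL divergence can only shrink under a common stochastic map, so the mismatch cost is never negative.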
And now let's look at some of its interesting implications. Let's assume for simplicity that the system actually is thermodynamically reversible; in other words, it generates zero EP for the initial distribution Q zero. Then what we get when we write it down: the entropy flow, the heat flow out to the heat bath, is the change in entropy plus the entropy production. In this particular case, if we're at Q zero, that other term hanging on the end is going to be zero. So in this particular case, the actual entropy flow is going to be the difference of the cross-entropies; it's going to be the difference of the relative entropies plus the difference in the entropy of the system. And now let's see what the implications of this will be, for example, for Landauer's bound. Everybody likes to say kT log two. Well, let's even assume that you've got a process, yesterday there was some discussion about quasi-static processes versus more realistic ones in finite time, let's assume you can do things quasi-statically, slowly. And let's assume that for a uniform initial distribution over the state of your bit, you actually generate zero EP. So you achieve Landauer's bound: the total heat flow, the total EF, for that initial distribution is in fact kT log two. Well, now let's assume that somebody comes along and they start using your bit erasing system with a different distribution that's not uniform. Okay, you would actually expect that's going to almost always be the case. Well, in that case, let's say the distribution the eraser is optimized for is epsilon, one minus epsilon, rather than one half, one half. If we instead run it with a delta function about the first state, then as epsilon goes to zero, the mismatch cost term goes to infinity.
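A sketch of that divergence, with my own numbers. Since a perfect eraser sends every input to the same delta function, the final-time KL term in the mismatch cost vanishes, and the extra EP reduces to D(r0 || q0):

```python
import numpy as np

def kl(p, q):
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Eraser optimized for (eps, 1 - eps), actually fed a delta function on the
# first state: the mismatch cost is D(r0 || q0) = ln(1 / eps), which
# diverges as eps -> 0.
r0 = np.array([1.0, 0.0])
costs = [kl(r0, np.array([eps, 1 - eps])) for eps in (0.1, 0.01, 0.001)]
# costs = [ln 10, ln 100, ln 1000]: unbounded as eps -> 0
assert all(abs(c - np.log(1 / e)) < 1e-9 for c, e in zip(costs, (0.1, 0.01, 0.001)))
```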
So you take a Landauer bit eraser that is optimal for one initial distribution and run it with a different one. It's actually going to generate an arbitrarily large amount of heat, despite the fact that it's thermodynamically reversible for that one particular initial distribution. And here's something where life gets even more funky. Let me first state what the result is, and then I'll walk you through why it occurs. And this is part of why the thermodynamics of circuits is actually an extremely complicated thing that people know very, very little about. Here's the result. I give you two bit erasers. Let's say this first bit eraser is thermodynamically reversible for a distribution, I don't know, two thirds, one third. Okay, so zero EP if that's the initial distribution. Maybe I've designed it to be that way, maybe I was just lucky, whatever. And over here, let's say there's one that's also thermodynamically reversible, for one quarter, three quarters. And we just went through the little calculation that, well, if you were to use them with a different initial distribution, now they would generate EP, okay? So we have that; we're comfortable with that. Now let me run these two thermodynamically reversible systems in parallel, like in a digital circuit, or for that matter like in a human brain, or anything like this. Let's say that the joint distribution, so we'll call this right here x one, and this is x two, we're running these two bit erasers in parallel, where we're feeding the parallel bit eraser system inputs generated by some distribution P of x one comma x two. And let's set it up, choose this distribution, so that in fact the marginal P of x one is two thirds, one third, and the marginal P of x two is one quarter, three quarters. Okay, can anybody tell me, is this system going to generate EP or not? In general, if there's any statistical coupling between x one and x two, you're gonna generate EP.
And in fact, the amount of EP you're gonna generate is given by the mutual information between x one and x two. And it actually comes from mismatch cost; it's an example of mismatch cost. I'll work that through in a second, but let me first give the intuition. For a system to be thermodynamically reversible means that if you take the initial distribution and run the process forward to get the final distribution, you can then run the process backward to return to the initial distribution. Okay, here, I start with an initial distribution where they're coupled. I run the two bit erasers in parallel. I end up with a pair of delta functions. Those delta functions have no way of knowing what to do when you go backwards: the erasers will end up with these delta functions no matter what the initial distribution was. That's a bit eraser; it's gonna erase that bit no matter what your initial distribution was. So when you end up with delta functions, you've lost all information about the original statistical coupling. So if you were to try to run these delta functions backwards, depending on the precise details, what you would probably get, if you run the process forward and then run it backward, is the product distribution. You would lose all those statistical correlations. They are being lost when you do these kinds of things in parallel. So two thermodynamically reversible systems, you run them in parallel, and they're no longer thermodynamically reversible. This is due to mismatch cost. This means that there's non-zero EP in systems just due to the architecture of how they're arranged, in terms of parallel gates rather than doing everything serially all at once, okay? And here we can see it very easily. By the way, when I asked that question, people really should have just looked over at the slides, because I gave the answer there. But anyway, I think I give it on the next slide anyway.
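Here is a numerical sketch of that claim. The joint input distribution below is my own choice, built to have exactly the marginals from the example, two thirds, one third and one quarter, three quarters, while being statistically coupled:

```python
import numpy as np

def kl(p, q):
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Coupled joint over (x1, x2) with marginals (2/3, 1/3) and (1/4, 3/4):
pxy = np.array([[1/4, 5/12],
                [0.0, 1/3 ]])
px, py = pxy.sum(axis=1), pxy.sum(axis=0)
assert abs(px[0] - 2/3) < 1e-12 and abs(py[0] - 1/4) < 1e-12

# Each eraser was optimized for its own marginal and the two run independently,
# so the EP-minimizing joint input is the product of the marginals. Both
# erasers end in delta functions, so the final KL term vanishes and the
# mismatch cost is D(pxy || px * py), i.e. the mutual information I(X1; X2).
mismatch = kl(pxy.ravel(), np.outer(px, py).ravel())
assert mismatch > 0     # nonzero EP purely from the statistical coupling
```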
So since the gates are distinct, since I'm assuming that each of those gates is run separately from the other, the joint distribution for which the pair is thermodynamically reversible is gonna be the product distribution. So we're gonna have a mismatch cost: if we actually run it on an initial joint distribution which is not a product distribution, we're gonna have a mismatch cost in the normal way, which is going to be non-zero. And if you just work it through, not too much surprise, that's actually gonna be the mutual information, the initial mutual information. Can anybody make a guess as to what happens if I run three bit erasers in parallel? What is the mismatch cost, the unavoidable EP, now, as a function of the initial distribution? And this is actually a part of information theory that Gülce did not present, so this is in a certain sense a trick question, unless anybody has come across this concept. So there is a reply from the chat: total correlation. Yes, indeed. It's also sometimes called the multi-information. There are many, many ways to try to generalize the concept of mutual information, which is defined in terms of two random variables, to more than two random variables. There are many ways to do this. People are still having big fights about it; still lots of blood in conference rooms after people get together and argue about what's the quote best way to do it. One simple way to do it is as follows. One way to define the mutual information between two random variables X and Y is as the sum of the entropies of the marginals minus the entropy of the joint. When the variables are completely independent, the joint entropy is equal to the sum of S of X and S of Y, so you get zero. More generally, it's not hard to prove this is always non-negative, okay? So based upon this, what is a natural idea for how to generalize it to more than two random variables, that you think might be going on here?
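A minimal sketch of that generalization, the total correlation, the sum of the marginal entropies minus the joint entropy (the function names and example are mine):

```python
import numpy as np

def entropy(p):
    p = np.asarray(p).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def total_correlation(joint):
    """Total correlation (multi-information) of an n-variable joint
    distribution given as an n-dimensional array: sum of the marginal
    entropies minus the joint entropy. For n = 2 it is the mutual information."""
    n = joint.ndim
    marginals = [joint.sum(axis=tuple(j for j in range(n) if j != i))
                 for i in range(n)]
    return sum(entropy(m) for m in marginals) - entropy(joint)

# Three perfectly correlated fair bits: total correlation = 2 ln 2, which
# would be the unavoidable EP of erasing them with three independently
# optimized erasers running in parallel.
joint = np.zeros((2, 2, 2))
joint[0, 0, 0] = joint[1, 1, 1] = 0.5
assert abs(total_correlation(joint) - 2 * np.log(2)) < 1e-12
```

For an independent joint (the outer product of its marginals) the total correlation is zero, matching the n = 2 case in the lecture.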
This is what the person on the line said: if I have a set of n random variables, it's going to be the sum of the marginal entropies minus the joint entropy. Now, what happens if you start to be doing things that are more complicated than bit erasing? And what happens when you start to actually have circuits that are layered, where many things are going on at once, and where, depending on the actual values, different things might be happening? And so how should we design a circuit to minimize the EP given all these effects? That should be a topic in many, many computer science departments. Right now, there's nobody on planet Earth who's investigating that. Well, actually, with a few exceptions. And there are also some workshops going on. But let me put it this way: there is nothing known about that solution yet. And there's also, in computer science, what's called circuit complexity theory. You may have heard of complexity theory in computer science, which is like P versus NP and things like that; there's a version of it for circuits in particular. And it is very, very natural to try to extend circuit complexity theory to include not just the normal costs that you consider in computer science, but to also consider things like EP. And nobody knows really anything about it. These are very, very difficult problems, very difficult. Okay, so I already gave that intuition. And that's it for my slides. And next is going to be Gülce, I guess she's gone off to the restroom, presenting thermodynamic uncertainty relations for the rest of today. Are there any more questions at this point? Anything from the chat or anything like that? Nothing from the chat. Okay, so anyway, when she gets back, she'll take it over for the rest of this afternoon. Okay. Okay, we take a short break. We can take a break now, yeah.