to move on to John Baez, who will be discussing algorithmic thermodynamics. Thank you again. John, are you with us?

Yes, I'm here. Let me try to share my screen, I guess. Let's see. Looks like this is working. Does that look good? Is it visible?

It's currently not quite in full screen mode; there are black borders. All right, now it's good.

Okay, yeah, that's as full as the screen can get. Thanks very much. Well, I'm perhaps talking about something a bit different from anything that came before. The title is "Algorithmic Thermodynamics," and it's an attempt to blend ideas from algorithmic information theory with more traditional ideas of thermodynamics. I think what I'll do is give a mainly expository talk, so that everybody can learn the beautiful results that already exist, which my collaborator Mike Stay and I were building on, and then just near the end get into a bit of our new material. The new material really deserves extra thought from other people with new ideas, so hopefully some of you will have some.

Okay, let's dive in. In statistical mechanics and information theory, as I'm sure all of you know, we often define entropy for a probability distribution. There are quantum generalizations and so on, but this is the simplest notion of entropy. Kolmogorov complexity, on the other hand, is a notion of entropy that lets you define entropy for a single bit string of finite length. I'm just using strings of bits, zeros and ones, as a convenient representation of data, but there's nothing special about that; you could use natural numbers or anything else. That's shocking at first, and I'll explain it. But then it turns out, and this is the somewhat new point, that up to an error with a bound on it, Kolmogorov complexity can actually be seen as a special case of the entropy of a probability distribution. So although people traditionally think of Kolmogorov complexity as merely analogous to the more familiar concepts of entropy, it can actually be seen as a special case.

Here's how the idea goes. There's a concept called a Turing machine, and luckily you don't really need to know anything about Turing machines except that each one is an idealized gadget for doing calculations in a completely systematic way. As you probably know, Turing machines were invented before digital computers. You can imagine one as a digital computer; the only difference is that it has a potentially infinite memory, so it can never be overloaded. You could write a Turing machine that squares numbers, and in this mathematical idealization it could square arbitrarily large numbers and never crash.

A Turing machine M will in general compute a partially defined function from natural numbers to natural numbers: you put in an input, which is a natural number, and the machine may eventually halt and print out a natural number, but it can also get stuck in an infinite loop and never halt, in which case the answer is not defined. So in general we only get partially defined functions. And there are results from logic saying that you can never dodge the problem that some of your programs don't halt, at least not without significantly weakening the power of your Turing machine.
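As a small aside to make "partially defined" concrete, here is a toy Python illustration (my own example, not from the talk): a function that is conjecturally total but, for all anyone has proved, might fail to halt on some input.

```python
def collatz_steps(n):
    """Number of steps for the Collatz iteration to reach 1.
    No one has proved this halts for every n, so as far as we know
    this defines only a partially defined (partial) function."""
    steps = 0
    while n != 1:
        n = 3 * n + 1 if n % 2 else n // 2
        steps += 1
    return steps

print(collatz_steps(27))  # halts after 111 steps
```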
Okay, so we say that a partially defined function from natural numbers to natural numbers is partial recursive if it's computed by some Turing machine. And sometimes you'll have a function from the natural numbers to itself that's defined everywhere and computed by a Turing machine; that's called a recursive function.

Church and Turing came up with a thesis. This is not a theorem; it's the claim that any function from the natural numbers to itself that you can compute by any systematic procedure whatsoever is actually computable by a Turing machine, and so is recursive. This is not the sort of thing you can really prove, because "any kind of systematic procedure" is not a mathematical notion. But what people have done is make up various other concepts of computation that seemed like systematic procedures, and they've shown that anything those other procedures can compute can also be computed by a Turing machine. So by now people believe the Church-Turing thesis: nobody has ever been able to think of anything they can compute that you can't compute with a Turing machine.

However, it's important to know that there are lots and lots of non-recursive functions from the natural numbers to the natural numbers, and a lot of them are known explicitly. First of all, there simply have to be a lot of them, because there are only countably many Turing machines but uncountably many functions from the natural numbers to the natural numbers. So in one sense just a piddling fraction of functions are recursive. But you can also come up with specific examples of non-recursive functions.

So far I've been talking about using lots of different Turing machines to compute different functions. But to make more progress in this theory, people use a single universal machine, where you think of the input as a program, together perhaps with a number that you're going to feed into that program. It turns out to be more convenient to think of Turing machines as accepting bit strings rather than natural numbers as inputs, and I'll define the prefix-free concept in those terms. I'm just going to use the term "string" to mean a bit string: a finite, possibly even empty, list of zeros and ones. So the idea is that we stick a bit string into a Turing machine and have it spit out a natural number as its output.

Now, if you've got two strings x and y, you can make a string xy, their concatenation: you just stick the string x in front of the string y. We say a prefix of a string z is a string x that comes at the start of z, so that z is x followed by some other string y. And a prefix-free set of strings is a set of strings in which no element is a prefix of any other.

The idea is that we want to type instructions into our Turing machine in the form of a string, but it gets confusing if one string that's a valid set of instructions is contained at the start of some longer string that's also a valid set of instructions. In normal programming languages, we solve that problem by having something like "end" that tells you where the end of a program is.
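To make the prefix-free condition concrete, here is a minimal Python sketch (my own illustration, not from the talk) that checks whether a finite set of bit strings is prefix-free. It relies on the fact that, in sorted order, a prefix lands immediately before any of its extensions.

```python
def is_prefix_free(strings):
    """Return True if no string in the collection is a prefix of another."""
    strings = sorted(strings)  # a prefix sorts directly before its extensions
    return not any(b.startswith(a) for a, b in zip(strings, strings[1:]))

# {"0", "10", "110", "111"} could serve as a set of self-delimiting
# programs: you always know where each one ends.
print(is_prefix_free(["0", "10", "110", "111"]))  # True
print(is_prefix_free(["0", "01"]))                # False: "0" is a prefix of "01"
```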
Such an end marker prevents the awkwardness that would arise if some valid program were a prefix of another valid program. So that will be handy to do here.

A nice mathematical fact, which will turn out to be important when we get to entropy, is that if you've got a prefix-free set of strings and you sum two to the minus the length of the string over all those strings, that sum is finite. That's a fairly easy thing to check. Whereas if you summed over all strings, there would be 2^n strings of length n and the sum would diverge.

The domain of a Turing machine is the set of strings x for which the machine halts and prints out an answer M(x); in other words, it's the domain of the partial function that the Turing machine computes. A prefix-free Turing machine, then, is one whose domain is a prefix-free set. That's just a way of saying that certain strings count as valid programs that will halt and print out an answer, and those strings form a prefix-free set.

Then a prefix-free machine U is universal if, basically, it can do anything any prefix-free Turing machine can do. In other words, given any other prefix-free Turing machine M, there's a way to re-encode each input x into some string y such that U(y) equals M(x). This is just the idea of translating from one computer language into another: one language computes M(x), you translate the program into y, and U computes the same answer. But we also want y not to be too long compared with the original x, so we require a bound on its length. That's the concept of a universal prefix-free machine, and there's a well-known theorem that a universal prefix-free Turing machine exists; in fact, there are lots of them. Basically any normal programming language acts like a universal prefix-free Turing machine, although it's not literally a Turing machine, since it runs on your computer, which works a little differently from the abstract machine. And the bound says, for example, that with any reasonable computer language it could never happen that my language lets me write my programs much shorter than yours: the difference in how long our programs need to be is bounded by a constant.

Okay, so the Kolmogorov complexity of a natural number n is just the length of the shortest string x with U(x) = n. In common parlance, it's the length of the shortest program that prints out that number. For example, if I have an enormous number like 11111111111 and so on, a trillion ones, I can print it out with a very short computer program. Whereas a typical enormously long number would take an enormously long program to print out, because a typical enormous number contains a lot of data that cannot be compressed in any way. We could also talk about the Kolmogorov complexity of a string, because you can encode strings as natural numbers; indeed, you can define the Kolmogorov complexity of any sort of data whatsoever. The interesting point for people who like information theory is this: Shannon entropy works for probability distributions on the set of strings, whereas Kolmogorov complexity assigns an amount of complexity to an individual string.
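Collecting the definitions just given into symbols (my notation): for a prefix-free set $S$ the sum mentioned earlier converges, and in fact the standard Kraft inequality gives

$$\sum_{x \in S} 2^{-|x|} \;\le\; 1 .$$

The Kolmogorov complexity of a natural number $n$ relative to the universal prefix-free machine $U$ is

$$K(n) \;=\; \min\{\, |x| \;:\; U(x) = n \,\},$$

and universality means that for every prefix-free machine $M$ there is a constant $C_M$ such that each input $x$ can be re-encoded as a string $y$ with $U(y) = M(x)$ and $|y| \le |x| + C_M$.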
But there's a relationship between Shannon entropy and Kolmogorov complexity, actually proved by Kolmogorov: the Kolmogorov complexity of a long string randomly produced from some probability distribution is typically close to the Shannon entropy of that distribution. Here's how we make that precise. Suppose you have any probability distribution on k-bit strings; you can define its Shannon entropy in the usual way. Then choose a whole bunch of random strings, n of them, from this probability distribution and concatenate them: x1 x2 ... xn. You get a long string made out of pieces chosen from this distribution, and it will have some Kolmogorov complexity: there will be some shortest program that prints out that string, and the length of that program is the Kolmogorov complexity. The theorem is that, with probability one as n goes to infinity, the Kolmogorov complexity of this concatenated string approaches n times the Shannon entropy of the probability distribution used to create it. You expect that factor of n, of course, because as n increases we're making a longer and longer concatenated string. So this is the basic connection between algorithmic ways of thinking about complexity and Shannon entropy.

Now there's an interesting problem, though. We've got this great notion of the complexity of a single string; unfortunately, that function is not itself recursive. You cannot compute the Kolmogorov complexity of a string in general. In fact, there's even an upper limit on how complex you can prove anything to be. This is an amazing theorem. Pick whatever set of axioms you like for mathematics; if they're a finite set and they're consistent, then there's some number such that you can't prove any string whatsoever has Kolmogorov complexity bigger than that number. This is amazing because there have to be strings with arbitrarily large Kolmogorov complexity. So you know there are lots of strings with large Kolmogorov complexity, but you can't ever pick one and say, "That one, I know, is very complex." The reason is that there might be some short program that keeps running on and on, and you can't tell whether it's going to halt or not; it might, after eons, halt and print out your particular string, and you can't decide, using a computer, whether it's going to halt.

So it's annoying to have this nice concept be uncomputable. Levin therefore introduced time-bounded complexity. Instead of minimizing the length of the program that prints out a number, you minimize the length of the program plus something involving how long the program takes to run, and traditionally one uses the logarithm of the runtime. So suppose we have a universal prefix-free Turing machine U, and suppose it halts on some input x; let t(x) be the number of steps it takes before it halts. Then the Levin complexity of any number n is defined to be the minimum, over programs x that print out n, of the length of x plus the logarithm of how long x takes to run.
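In symbols, the two statements just described are (my transcription): if $x_1, \dots, x_n$ are drawn independently from a distribution $p$ on $k$-bit strings with Shannon entropy $H(p)$, then

$$\frac{K(x_1 x_2 \cdots x_n)}{n} \;\to\; H(p) \quad \text{with probability } 1 \text{ as } n \to \infty,$$

and the Levin complexity of a natural number $n$ is

$$Kt(n) \;=\; \min\{\, |x| + \log t(x) \;:\; U(x) = n \,\},$$

where $Kt$ is my shorthand notation for the Levin complexity.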
This basically gets rid of the uncomputability problem we had with Kolmogorov complexity, because programs that take a huge amount of time to run get penalized by the extra term. This quantity is not necessarily an integer, since it contains a logarithm, which I guess is just a tradition; to any number it assigns a positive real number called the Levin complexity. And the interesting thing is that the Levin complexity is computable by a computer program, to arbitrary accuracy. So unlike the Kolmogorov complexity, it's something you can actually compute.

Okay, so far none of that sounds much like traditional statistical mechanics, so let me show you how it's connected. We've got this universal prefix-free Turing machine that we've chosen; let X be the set of strings for which that machine halts and prints out a number, in other words the domain of the machine. In everyday language, you can think of X as the set of all programs that halt. Now let's make up a partition function: we sum, in the typical partition-function manner, an exponential over all of X. Here beta is a number vaguely reminiscent of inverse temperature, and it multiplies the log of the runtime of x, so this term penalizes programs that take a long time to finish running. And remember, |x| is just the length of the program itself, so there's another constant, gamma, used to exponentially penalize programs that are simply long.

This partition function converges whenever beta is greater than or equal to zero and gamma is big enough. You may remember I mentioned a little fact about the convergence of a sum over a prefix-free set early in my talk; this first fact follows from that. If moreover beta is strictly greater than zero, so that programs taking a long time to run are penalized, that is, they contribute only a small amount to the partition function, then Z(beta, gamma) is actually computable. But for beta equal to zero we run into the problem that it's not a computable function. It's still a well-defined function; you just can't compute it with any computer program.

So now we can march ahead and actually pretend we're doing statistical mechanics. You can form a Gibbs-type ensemble, where you take the usual Boltzmann-like factor and divide by the partition function. That gives a probability distribution on the domain of our Turing machine: a probability distribution on the set of programs that halt, where the probability of any particular program decreases exponentially with the logarithm of its runtime and also with the length of the program itself. As usual, this probability distribution is nice because it maximizes Shannon entropy subject to constraints on the expected values of the two quantities we stuck in the exponential, namely the log runtime and the program length. And one thing you can of course do with this probability distribution is sum the probabilities over all programs that have a specific output n. That gives the probability that a randomly chosen input, that is, a randomly chosen program, will cause our universal machine to print out n.
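The partition function and Gibbs ensemble just described are, in symbols,

$$Z(\beta, \gamma) \;=\; \sum_{x \in X} e^{-\beta \log t(x) - \gamma |x|}, \qquad p_{\beta,\gamma}(x) \;=\; \frac{e^{-\beta \log t(x) - \gamma |x|}}{Z(\beta,\gamma)} .$$

Here is a toy numerical sketch in Python. It is purely illustrative: the table of "programs," with their runtimes and outputs, is invented, since one cannot actually enumerate the domain of a universal machine.

```python
import math

# Invented toy data: each "program" is a halting bit string with a
# runtime t(x) in steps and a natural-number output.
programs = {
    "0":    {"t": 3,   "out": 2},
    "10":   {"t": 5,   "out": 2},
    "110":  {"t": 40,  "out": 2},
    "1110": {"t": 900, "out": 123456},
}

def weight(x, beta, gamma):
    """Boltzmann-like factor exp(-beta*log t(x) - gamma*|x|)."""
    return math.exp(-beta * math.log(programs[x]["t"]) - gamma * len(x))

def Z(beta, gamma):
    """Toy partition function: sum of weights over all halting programs."""
    return sum(weight(x, beta, gamma) for x in programs)

def prob_output(n, beta, gamma):
    """Probability that a program drawn from the Gibbs ensemble outputs n."""
    return sum(weight(x, beta, gamma) for x in programs
               if programs[x]["out"] == n) / Z(beta, gamma)

beta, gamma = 1.0, math.log(2)
print(prob_output(2, beta, gamma))       # large: several short, fast programs print 2
print(prob_output(123456, beta, gamma))  # small: one long, slow program
```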
So we've got this ensemble of programs, and we can ask: what's the probability that one of those programs will print out a given natural number n? For some simple natural number like two, you can imagine there are zillions of programs that will print out the number two, including lots of fairly short programs and programs that don't take long to run. So this probability will be large for a simple number like two. But if I give you a randomly chosen number with a trillion digits, the probability that a randomly chosen program prints it out is going to be very small.

There's a standard method in statistics of converting probabilities into something called the surprisal, or surprise: you just take minus the logarithm of a probability. So the surprisal S_{beta,gamma}(n) is the surprise you'd experience when a randomly chosen input causes your universal Turing machine to print out the natural number n. You would not be so surprised if it printed out the number two, but you'd be more surprised if it printed out some crazy number with a trillion digits that you happened to have decided in advance you were hoping for.

So here's the cool theorem: there's some constant C greater than zero such that, for any natural number n, the Kolmogorov complexity is within C of this surprisal, if you pick the right values of beta and gamma. Namely, pick beta to be zero, so you're not penalizing programs at all for taking a long time to run, and pick gamma to be log 2, which is just an artifact of our conventions; then the surprisal is within C of the Kolmogorov complexity. So what we're seeing is that Kolmogorov complexity, although we originally described it as something you compute, or rather attempt to compute, from a single string, and it's not even a recursive quantity, nonetheless has a statistical mechanical interpretation up to this bounded error.

Similarly for the Levin complexity: it too is within a constant of the surprisal, but now we adjust beta so that programs are being penalized for taking a long time to run. So we see that these two concepts of complexity are just two choices among a continuum: you could pick beta and gamma to be anything greater than zero and get a concept of complexity that pays a certain amount of attention to how long the program is and a certain amount of attention to how long it runs.
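Writing the surprisal as minus the log of the output probability, the theorem just stated reads (in my notation):

$$p_{\beta,\gamma}(n) \;=\; \frac{1}{Z(\beta,\gamma)} \sum_{x \,:\, U(x) = n} e^{-\beta \log t(x) - \gamma |x|}, \qquad S_{\beta,\gamma}(n) \;=\; -\log p_{\beta,\gamma}(n),$$

$$\exists\, C > 0 \;\; \forall n : \quad \bigl| K(n) - S_{0,\,\log 2}(n) \bigr| \;\le\; C,$$

with an analogous bound relating the Levin complexity to $S_{\beta,\gamma}(n)$ for a suitable $\beta > 0$, as stated in the talk.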
So now you could go ahead and do further things with this. You could say, "Wow, this is really analogous to statistical mechanics, so I could try to do some thermodynamics with it." I've got this ensemble of programs depending on the parameters beta and gamma, and I can look at quantities like the expected value of the logarithm of the runtime, or the expected value of the length of the program. Here, just to be cute, I'm making them analogous to the energy and volume of a cylinder of gas. That's not supposed to be a profound analogy; it's just that you're most familiar with the formulas relating these expected values to derivatives of the partition function from the case where you're playing around with a cylinder of gas, or a cylinder of stuff. So you can check the usual formulas, it's just math: this E is minus the derivative of the logarithm of the partition function with respect to one of our Lagrange multipliers, beta, and V is minus the derivative with respect to the other. Thanks to this, if you march on, all the usual formulas you're familiar with from thermodynamics will hold. You can define an algorithmic version of temperature to be 1 over beta, and a pressure by P over T equals gamma; I'm just mimicking formulas from thermodynamics and statistical mechanics. Then you get the usual relation dE = T dS - P dV, where S is now the entropy of the whole probability distribution on programs, the whole ensemble.
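The thermodynamic dictionary just described, collected in one place (my transcription of the formulas in the talk):

$$E \;=\; \langle \log t(x) \rangle \;=\; -\frac{\partial}{\partial \beta} \ln Z, \qquad V \;=\; \langle |x| \rangle \;=\; -\frac{\partial}{\partial \gamma} \ln Z,$$

$$T \;=\; \frac{1}{\beta}, \qquad \frac{P}{T} \;=\; \gamma, \qquad dE \;=\; T\,dS - P\,dV,$$

where $S$ is the Shannon entropy of the Gibbs ensemble of programs.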
So Mike Stay and I went and worked on this a bit more. We did some things like looking at thermodynamic cycles, analogous to those good old heat-engine problems you're subjected to when first learning thermodynamics, where we cycle around in the temperature-pressure plane, or however you want to parameterize that plane, and see how much work can be done, and so on. But I really think we're a little bit at a loss to figure out exactly where to go with this concept, and I would like to hook it up to other attempts people have made to connect algorithmic entropy, or Kolmogorov complexity, to thermodynamics. There are certainly other things people have done that involve all the same words, and I'd like to see how what we're talking about here is connected to that, but I must admit I haven't done it. So I'll stop here. Thanks.

Thank you very much, John, very provocative. Do we have any questions, particularly from junior people? Valmer, do you want to go ahead?

Yeah, hello. Very nice talk. I would have several questions. The first one is about the algorithm with the log runtime: why do you make the choice of a logarithm?

I don't really know. It wasn't my choice, it was Levin's choice, and I wonder about that too; the only reason I did it is to stick to tradition for people who already know this material. He was a really smart guy, so he probably had some good reason for it. But as far as I can see, it's a more impressive result this way: it's more impressive if you can make the thing computable by penalizing only with the log of the runtime than if you penalize with the runtime itself.

I guess that's a stronger result: the log version implies the other one, but probably not conversely.

That's true. And I actually feel that any function that approaches infinity would do, because all you're really trying to do is kill off those long-running programs. I'll have to check; maybe my intuition is inaccurate and you need at least a little bit of a rate of growth. But I agree with Thomas's answer that if you just used t(x), all the results I stated would still be true; it's the reverse implication that fails.

Because you later use the expectation of this value, and you use the ordinary linear expectation: what happens if you change it to a more general expectation? Can the Rényi entropy be recovered from this framework?

Oh, sorry, a Rényi-type entropy, did you say? Yes, I'm sure you could do that, you definitely could do that. All sorts of basic facts about Shannon entropy and Rényi entropy are true regardless of what your Hamiltonian is, and you can think of us here as just picking a peculiar Hamiltonian on the set of programs.

And one quick question: what happens with the chemical potential in this picture, in terms of pressure?

Yeah, in our paper we wanted to make it even cuter, so we included a chemical potential, which I didn't introduce here just to keep things a little shorter. Again, I want to emphasize that there's no profound reason we know of why we're making energy analogous to the log runtime and volume analogous to the program length, so in some sense we could have chosen anything to be conjugate to the chemical potential. The quantity conjugate to the chemical potential is the number of particles in your system, and so besides E and V we had a third quantity, I think we called it N, and the thing we chose to be analogous to N is just the output of the program. In other words, here we're picking a Boltzmann-like distribution that puts constraints on the expected values of the log runtime and the program length, and if you wanted a third quantity, the very natural one would be the output of the program.

Okay, thank you very much. One question from David. Hi, David. Well, you seem to have disappeared. David, do you want to ask a question? We can't hear you. Okay, I think in the interest of time we'll move on. There are two questions in the chat, John; I don't suppose you can reply.