Good afternoon. I'm actually going to survey two topics in Markov chains, two of my favorite topics. One is a subset of mixing, namely the cutoff phenomenon, and the other topic, which is not up here but we'll see later, is quite different: the rate of escape of Markov chains. That's quite demanding given that we have Anna here, who is one of the greatest experts on that topic, but I'll still present it. Today I want to start on cutoff, and some of this will also be useful for those coming to Eyal Lubetzky's lectures starting tomorrow, since he will be analyzing some examples of cutoff on random and Ramanujan graphs. These notes were prepared together with David Levin, and we have a book, Markov Chains and Mixing Times, also with Elizabeth Wilmer, with a second edition coming out; you can find it linked from my page or from David's page, it should be easy to find. Much of what I'll talk about today is there, since I'm not assuming that you have read it or remember it, but we'll see more advanced material as we move through the course. Throughout, I'll also be giving some problems that are a little more advanced, so those who already know the elementary material can spend the time solving them, and we'll discuss the solutions in the exercise session at the end. Before the definitions, here is a quote from Diaconis '96; Aldous, Diaconis, and Shahshahani were the initiators of this phenomenon in the early 80s, and until the last decade the way this phenomenon was understood was really through specific examples, by spectral analysis of eigenvalues and eigenvectors. Despite the overall title, in my lectures we are going to focus mostly on the non-spectral approaches, but we'll connect with the spectrum later. To fix notation: we look at ergodic, namely irreducible and aperiodic, finite Markov chains. From every state you can reach every state, and the chain is aperiodic, so you can think of it as there being some power of the matrix which is strictly positive. There is then a unique stationary distribution, denoted by pi, satisfying pi = pi P, so it's a row vector. We measure distance to stationarity in several ways, but the most important one will be the total variation distance. The total variation norm has multiple equivalent definitions: it is half the L1 distance, and the total variation distance between two measures is also the maximum over all subsets of the state space of the probability of the set under one measure minus its probability under the other. In other words, what we want to measure is d(t), the distance at time t, which is the maximum over starting states x and over subsets A of the state space of P^t(x, A) minus pi(A). We know in this setting of irreducible aperiodic chains that this tends to zero, and the goal is to quantify at what rate. The classical point of view was: fix a chain on n states, drive time to infinity, and see how fast this goes to zero; that is really determined by the top non-trivial eigenvalue. The modern point of view, initiated by Aldous and Diaconis but also closely related to statistical physics, is different: instead of driving the distance to zero, we will fix a goal.
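To make the object d(t) concrete, here is a minimal computational sketch (my own illustration, not from the book): it computes d(t) = max_x ||P^t(x,·) − π||_TV as half the L1 distance, for a small stand-in chain, the lazy walk on a 6-cycle.

```python
# Minimal sketch: d(t) = max_x || P^t(x,.) - pi ||_TV for a small chain,
# using the "half the L1 distance" definition. Example chain: lazy walk on a 6-cycle.
import numpy as np

def stationary(P):
    # left eigenvector of P for eigenvalue 1, normalized to sum to 1
    vals, vecs = np.linalg.eig(P.T)
    pi = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
    return pi / pi.sum()

def d(P, t):
    pi = stationary(P)
    Pt = np.linalg.matrix_power(P, t)
    # for each starting state x: half the L1 distance between P^t(x,.) and pi; take the max
    return 0.5 * np.abs(Pt - pi).sum(axis=1).max()

n = 6
P = np.zeros((n, n))
for x in range(n):                    # lazy walk: stay w.p. 1/2, step +-1 w.p. 1/4 each
    P[x, x] = 0.5
    P[x, (x + 1) % n] += 0.25
    P[x, (x - 1) % n] += 0.25

for t in (1, 5, 20, 50):
    print(t, d(P, t))
```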
So we want the distance to be less than a quarter, or one tenth, or one percent, some fixed epsilon, and we'll ask how long it takes to get there, but for a sequence of chains: we will not consider one chain on 10 or 100 states, but a sequence of growing chains. All of this will become clearer with the examples. A few other distances will come up, so let me recall them. Instead of the L1 distance one can look at the L^p distance, and the right normalization is to look at the density: we take the distribution at time t, P^t(x,y), divide it by pi(y) (pi is strictly positive), and look at the L^p distance, with respect to pi, between this ratio and 1. All of these are useful, but the most important are p = 1, 2, and infinity: for p = 1 you just get the L1 distance, twice the total variation, and the L-infinity distance measures the maximum deviation of the density from one. A somewhat more subtle definition, due to Aldous and Diaconis, is the separation distance. At first it looks just like L-infinity, but it's not: notice there is no absolute value, so separation measures to what extent the distribution is blanketing the space. We want to control states with too low probability, but we don't worry if some state has too high probability. It turns out that for reversible chains this is very close to the total variation distance, while the L-infinity distance can be much larger and decay more slowly; we'll discuss some examples of this. Finally, something very close to the total variation distance to pi is the total variation distance between two starting states: this measures how much we forget the initial state, so we start from x and from y and look at how the two distributions at time t differ in total variation; call the maximum of this d-bar. And d-bar is very close to d: recall d(t) was the distance to pi; then d-bar is at least d but at most twice d. I'll reiterate this later, but let me write it on the board. Both inequalities follow from the triangle inequality. One direction: if I measure the distance from P^t(x,·) to P^t(y,·), I can go through pi and use the triangle inequality, which gives d-bar at most 2d. In the other direction, take P^t(y,·) and average these measures over y with weights pi(y): each of them is a probability distribution started at y, and what measure do I get when I average? Pi, right — stationarity tells us that. So from convexity of the norm, or again the triangle inequality, you see that d-bar is at least d. Now, although d is usually what we care about, d-bar has some nicer properties; in particular it is sub-multiplicative, which is an easy exercise. Let me give one interpretation: what is d-bar of t?
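Restating these distances in formulas (my notation; the slides may use slightly different symbols):

```latex
\[
 d_p(t)=\max_x\Big\|\tfrac{P^t(x,\cdot)}{\pi(\cdot)}-1\Big\|_{L^p(\pi)},
 \qquad
 s(t)=\max_{x,y}\Big(1-\tfrac{P^t(x,y)}{\pi(y)}\Big),
\]
\[
 d(t)=\max_x\|P^t(x,\cdot)-\pi\|_{TV},
 \qquad
 \bar d(t)=\max_{x,y}\|P^t(x,\cdot)-P^t(y,\cdot)\|_{TV}.
\]
```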
d-bar(t) is the probability of failing to match the two chains under the best coupling. You look at the chain started from x and from y; you have two measures P^t(x,·) and P^t(y,·), and you look at a coupling of them, some joint distribution that projects to both of these measures, and you want to couple as well as possible, meaning that the measure of the diagonal should be as large as possible. The probability of that best coupling failing, the measure of the complement of the diagonal, is exactly d-bar, and with that interpretation it is easy to see the sub-multiplicativity. For those who are not familiar, this is the first exercise you can do, and all these initial exercises you can also look up in the Markov chain book. Now, separation distance: again it is a kind of pointwise distance, but only from below, and it is very easy to see that d(t) is bounded above by the separation distance, again just by averaging. Something more surprising is that the separation distance at time 2t is bounded by four times the total variation distance at time t. This is only true in the reversible case. So total variation tells us one thing and separation seems to tell us something different, because it looks at all locations y, and here we see that roughly they are equivalent. It is important that this uses reversibility: the rough idea is that to go from x to y in 2t steps, we can think of going forward from x and backward from y, and if we are mixed in total variation these two measures have to be close, so we can interpolate and get this fact. You can find this either in the Aldous-Diaconis paper where it is first proved, or in the unpublished Aldous-Fill notes, or in our Markov chain book. We are going to go to examples very soon, but I want to emphasize that in the examples we're not going to focus on one fixed chain and drive time to infinity; we're really going to think of a family of chains. The canonical, simplest examples will be cycles and hypercubes, and then we'll see other examples as well. When we have a sequence of chains we will sometimes emphasize this by writing T_mix^(n)(epsilon), meaning the mixing time in the n-th chain, the first time where the total variation distance in that chain goes below epsilon. Because of the sub-multiplicativity, the value of epsilon is sometimes de-emphasized, and we write T_mix to mean that we plug in epsilon equal to one quarter. So again, the point of view we'll discuss, following Aldous, Diaconis, and others, is how this scales: fix epsilon and let n go to infinity.
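Collecting the relations between these distances that were just stated (for the last one, reversibility is assumed):

```latex
\[
 d(t)\;\le\;\bar d(t)\;\le\;2\,d(t),
 \qquad
 \bar d(s+t)\;\le\;\bar d(s)\,\bar d(t),
\]
\[
 d(t)\;\le\;s(t),
 \qquad
 s(2t)\;\le\;4\,d(t)\quad\text{(reversible chains)}.
\]
```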
For instance, one of the simplest examples of an aperiodic irreducible chain is lazy random walk on a cycle of n states, and we want to understand the mixing time here. This is a very simple case; to make it aperiodic we make it lazy, so with probability one quarter you step right, with probability one quarter you step left, and with probability one half you stay in place. The lazy chain means that with probability half it stays where it is; that is a simple device to ensure aperiodicity. Once we look at the lazy random walk on the cycle, it is quite easy to see in many ways that the order of the mixing time is n^2. One way: we all know the central limit theorem and the local central limit theorem for random walk on Z, and random walk on the cycle is just that walk on Z reduced mod n; because the Gaussian distribution is roughly flat over an interval of one standard deviation, that is one way to see it, and we'll see better ways shortly. So the order of the mixing time is n^2, but the question we want to explore, which is easy in this case but we'll explore it for other chains, is how this depends on epsilon. Fix epsilon and let n tend to infinity: will the mixing time be asymptotic to a constant times n^2, where the constant depends on epsilon, or doesn't depend on epsilon? The answer differs from one class of chains to another. We'll see that for the cycle you cannot have such a relation with a constant independent of epsilon, while for other examples, those with cutoff, the constant won't depend on epsilon. One key tool used in this business is coupling. Given starting states x and y (this part is a little dry with definitions, it will come to life when we go to the examples), we'll have (X_t, Y_t), a joint Markov chain evolving in the product space, so that when we project onto the first coordinate it is a chain with transition matrix P started at x, and the second coordinate has the same transition matrix but is started at y. These two coordinates are not independent; they can have arbitrary dependence between them. Suppose this Markov chain in the product space has the property that after some random time tau, X_t and Y_t coincide; usually tau will be a stopping time, though sometimes in an enlarged filtration or probability space. If the chain is arranged so that X_t = Y_t after this random time, then we can bound the total variation distance between P^t(x,·) and P^t(y,·) very simply by the probability that tau is still bigger than t, because on the event that tau is at most t, X_t and Y_t agree; that is how tau is defined. It is an elementary calculation, written here, just using the definition of the L1 norm, that this total variation distance is bounded by the probability that this meeting time exceeds t. So d(t), our goal, the total variation distance to stationarity, is bounded by d-bar(t), the maximum over x, y of this quantity, which in turn is bounded by the maximum over x, y of the probability that tau exceeds t (and recall d-bar is at most 2d). This inequality means we can bound d(t) using the tails of the meeting time, if we can analyze these tails uniformly in the starting points x and y. Finally, let's see an example where this works: the most basic example in mixing, the lazy random walk on the hypercube.
The hypercube is just {0,1}^n — this picture would be the two-dimensional one — so every vertex has n neighbors, corresponding to flipping one of the n bits. But we're going to make it lazy: one move of the chain corresponds to picking a coordinate uniformly at random and, not necessarily flipping it, but randomizing it, replacing the bit in that coordinate by a uniform random bit. This suggests a very natural coupling. If we start at two states x and y, how are we going to couple the two chains? In both chains we're supposed to pick a coordinate and randomize it — well, let's do it together. We choose the same coordinate in both, say this one, and we replace whatever is sitting there in x and in y by the same new random bit. Any questions? This coupling relates to the classical coupon collector problem, because each time we're picking a coordinate at random, and by the time we have picked all the coordinates, which corresponds to collecting all the coupons, the two chains have coalesced: they've reached the same state, and they stay in the same state from then on. So we have a natural random time tau here, the coupon collector time, the time by which we have picked all the coordinates; at that time the two chains must have met. What is the asymptotics of that coupon collector time? (Is everyone awake? Yes, n log n, right.) Moreover, by a time a little more than n log n it is very likely that all the coupons have been collected. Let's go through this: the probability that tau is bigger than t is bounded above, by a crude union bound over the coordinates, by n(1 - 1/n)^t, and that crude bound is good enough for us. So if you take t = n log n + cn, this already tells you that d(t) is at most e^{-c}; just plug in this value of t. Think of n log n + cn where c is a large constant: at that time the total variation distance is very small, and you see that the epsilon comes in in the correction term, not in the first-order term. This suggests that maybe the mixing time is n log n, because this looks like the best possible coupling — but it is not: the mixing time is actually asymptotic to (n log n)/2. That is the first surprise in this business. However, we'll see later that this coupling is still significant: if you look at the separation time, the time where the separation distance goes below epsilon, then you do get n log n rather than (n log n)/2. On the hypercube one can compute the transition probabilities explicitly using the eigenvalues and eigenfunctions, which are all very easy to write down, and then computing P^t is just a matter of taking a matrix to a high power; this gives sharp estimates in this case, and we'll write down that calculation soon. The problem with that approach is that it requires detailed knowledge which is harder to get in more involved examples, so it is useful to look at the simple example of the hypercube from several perspectives and to get the sharp result in a non-spectral way, which will then generalize to other examples like Glauber dynamics for the Ising model.
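Here is a small simulation sketch of the coordinate coupling (my own illustration, not code from the book; the parameters are arbitrary). The coalescence time is exactly a coupon-collector time, so the printed ratio should come out close to 1.

```python
# Coordinate coupling on the lazy hypercube walk: at each step pick a uniform
# coordinate and set it to the same fresh random bit in BOTH copies. The two copies
# coalesce once every coordinate has been touched, i.e. at the coupon collector time.
import math
import random

def coalescence_time(n):
    x = [1] * n          # one copy started at all-ones
    y = [0] * n          # the other at all-zeros (worst case for this coupling)
    t = 0
    while x != y:
        i = random.randrange(n)      # choose the same coordinate in both copies
        b = random.randint(0, 1)     # and the same new random bit
        x[i] = b
        y[i] = b
        t += 1
    return t

n = 200
samples = [coalescence_time(n) for _ in range(200)]
print("mean tau / (n log n):", sum(samples) / len(samples) / (n * math.log(n)))
```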
So the naive coupling, which gave us the bound n e^{-t/n}, yields the n log n bound for the mixing time, but we'll see in the sharper analysis that instead of using this coupling all the way to the end, you really only want to use it until the distance is about root n, and then switch to a different way of bounding. Let me already illustrate how we improve this n log n bound. Again, this is a very special case with lots of symmetry, and there are many alternative things we can use. One way to use the symmetry is to look at the Hamming weight of a configuration, just the total number of ones, the sum of the coordinates; call that W_t. Also observe that, because of the symmetry of the hypercube, it doesn't matter which initial state we use, so it is convenient to fix the all-ones state as our initial state; it is trivial to check that the total variation distance to stationarity at time t doesn't depend on the initial state. So we fix the initial state to be all ones, and here is a useful observation exploiting the symmetry: the total variation distance between X_t — the lazy walk on the hypercube — and pi, which here is the uniform measure on the hypercube, is the same as the total variation distance between W_t, the Hamming weight at time t, and its stationary distribution. The Hamming weight, as we'll see, is itself a Markov chain, but a one-dimensional chain on the integers, and its stationary distribution is obtained from the uniform distribution on the hypercube: pi_W(k) is just the number of configurations with k ones multiplied by the weight of each such configuration, so (n choose k) 2^{-n}. Why is this equality of distances true? When computing the distance to pi, we are supposed to look at half the L1 distance, so we sum terms |P_1(X_t = y) - pi(y)| over y and divide by 2. But pi(y) is uniform, and this probability, when I start at the all-ones state, depends only on the Hamming weight of y: all states y with the same Hamming weight have the same probability. Because of that, when I sum over all y, I can sum level by level; the total contribution from vertices y of weight k is just |P(W_t = k) - pi_W(k)|. This simplifies the problem, reducing it to understanding the one-dimensional chain. The coupling we were using on the hypercube, also called the greedy coupling, looks like the best you can do, but once you make this observation you have reduced to a birth-and-death chain: W_t either stays in place with probability one half, or goes down with probability W_t/(2n) — because if the weight is k, to go down you have to pick one of the k coordinates carrying a one, probability k/n, and the new bit has to be a zero, probability one half — and it goes up with probability (n - W_t)/(2n). This is known as a birth-and-death chain: it only moves down or up by one. And now for this chain we see other possibilities for coupling; this chain is just a projection of the hypercube chain.
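In formulas, the projection step just described is (starting from the all-ones state):

```latex
\[
 \|P^t(\mathbf 1,\cdot)-\pi\|_{TV}
 \;=\;\|\,\mathbb P(W_t\in\cdot)-\pi_W\,\|_{TV},
 \qquad
 \pi_W(k)=\binom nk 2^{-n},
\]
and the projected birth-and-death chain moves as
\[
 \mathbb P(W_{t+1}=k-1\mid W_t=k)=\frac{k}{2n},\qquad
 \mathbb P(W_{t+1}=k+1\mid W_t=k)=\frac{n-k}{2n},\qquad
 \mathbb P(W_{t+1}=k\mid W_t=k)=\frac12 .
\]
```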
In the hypercube we saw the coupling bound n e^{-t/n}, and we are going to take t = (1/2) n log n; I'm going to omit writing integer parts, so t should be an integer, but you can add the integer parts yourself. At that time I don't want to use the bound for d(t) directly; I want a bound for an expected Hamming-weight distance. So here is a recap of the evolution of W_t, and the right thing to track — this is the switch I need to make, sorry, it is not said well on the slide — is to run the coupled chains started from all ones and from all zeros and look at the difference of their Hamming weights, call it D_t. Under the coordinate coupling, the expected change of this difference in one step is -D_t/n: the difference decreases by one exactly when the chosen coordinate is one where the two copies still disagree. This rate of decay corresponds to the coupling we discussed before, but it is elementary from the transition probabilities. So the expectation of D_t is n(1 - 1/n)^t, at most n e^{-t/n}, because D_0 was n. If we used only this inequality and waited until the expectation drops below 1, we would still have to wait n log n; instead we only wait until it drops to root n, which takes (1/2) n log n steps. So at time (1/2) n log n the expected distance is at most root n: we started with distance n, and at this time the expected distance is at most root n. Now, once the distance is about root n, continuing to use this contraction is too slow, and at that point what we want to use is the diffusive fluctuation of the walk. This is not explained well on the slide, so let me explain it on the board. Write the two variables, W_t started from all ones and W_t started from all zeros. Once we know their distance is at most root n, we switch to a different coupling, which is not obtained by projecting a coupling on the hypercube but lives directly on this birth-and-death chain. In this coupling, we first toss a fair coin to decide which of the two copies is going to move; the one that moves makes a non-lazy step with the correct probabilities — if it sits at k, it moves to k+1 with probability (n-k)/n and to k-1 with probability k/n — and the other one stays in place. So each copy, viewed on its own, is still the correct lazy birth-and-death chain. Now look at the difference between the two locations, or think of them as two particles. What you can check is that this difference is dominated by a simple random walk, and hence the time until the two coalesce is dominated by the time it takes a simple random walk started at root n to reach 0.
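To summarize the two-stage coupling bound just described (a sketch, with constants not optimized):

```latex
Stage 1 (coordinate coupling): with $D_t$ the Hamming-weight difference of the two copies,
\[
 \mathbb E[D_{t+1}\mid D_t]=\Big(1-\tfrac1n\Big)D_t
 \quad\Longrightarrow\quad
 \mathbb E[D_{t_1}]\le n\,e^{-t_1/n}=\sqrt n
 \quad\text{at } t_1=\tfrac12\,n\log n .
\]
Stage 2 (coin-toss coupling on the birth-and-death chain): $D_t$ is a nonnegative
supermartingale that moves by $\pm1$ at every step, so the time for it to hit $0$
from $\sqrt n$ is at most $Cn$ with probability tending to $1$ as $C$ grows. Together,
\[
 t_{\mathrm{mix}}(\varepsilon)\;\le\;\tfrac12\,n\log n + C(\varepsilon)\,n .
\]
```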
For a simple random walk on all of Z started at root n, the expected time to reach 0 is infinite, but by time Cn, for a large constant C, the probability of having reached 0 is close to 1. The key point is the drift: if the copy started from all ones is larger, then its drift to the left is stronger than that of the copy started from all zeros, so the difference between them has a non-positive drift; formally, the difference is a supermartingale, and so the time until they couple is of order n. So we can couple the birth-and-death chains by time (1/2) n log n plus order n. I will say more about this example from other methods, but any questions? (Question: you say it's dominated, but it actually has negative drift at all times — can't you just take the minimum absolute value of the negative drift?) Well, the problem is that when the particles get very close the drift is very weak: if they are adjacent it is one over n, and that is too weak, so you really get very little from the drift there. What you actually use is the diffusive nature: basically you replace the drift by zero drift and just say that, because the variance is positive, the two particles have a chance to move each step, so two particles that start at distance root n will meet in order n steps. In this coupling the difference is a single process, a supermartingale with non-trivial variance, and just from the reflection principle you can easily check the claim. So the strategy is: while they are far apart the drift is strong, and we use the drift until they get to distance root n; after that the diffusion is the dominant force bringing them together rather than the drift, and that is something that is harder to exploit directly on the hypercube, which is why we project to the birth-and-death chain and exploit it there. Any other questions? Okay, so the key result is that (1/2) n log n: what one finds, doing this argument with more care, is that T_mix(epsilon) is bounded above by (n log n)/2 plus a constant depending on epsilon times n — if you want to be sure they coalesce, you just have to wait a large constant times n, and this C(epsilon) is of order log(1/epsilon). The key point is that epsilon only affects the second-order term. It turns out there is also a lower bound of the same type, (n log n)/2 plus C(epsilon) n, and this will be the cutoff phenomenon. But before that, here is a contrasting example where there is no cutoff: random walk on the cycle. For the lazy random walk on the cycle I said that you can get a mixing time of order n^2 from the central limit theorem, but you can also easily get it using coupling. For the lazy walk on the cycle we use the same trick as before: we toss a fair coin to decide which of the two particles moves, and the other stays in place; then the distance between the two particles is itself a simple random walk on the interval {0, ..., n}, and once that distance reaches either 0 or n they have coupled. Now the classical gambler's ruin estimate tells us that for a gambler
starting at k, who stops at either 0 or n and performs simple random walk (not lazy now), the expected time until he reaches the boundary is k(n - k), which is always at most n^2/4. So the expected coupling time, the coalescence time, is at most n^2/4, and this gives a bound of order n^2 for the mixing time to within a quarter. If you really want to mix, that is, to get T_mix(epsilon) for small epsilon, this alone is not enough. It is easy from this argument to get a bound of n^2 times log(1/epsilon): you run repeated experiments, each time you succeed in coupling with probability at least one half by time 2n^2, so the chance you still haven't coupled after k such experiments is exponentially small in k; this easily gives T_mix(epsilon) at most of order n^2 log(1/epsilon), and that log is really needed in this problem. So the difference between the hypercube and the cycle is that on the cycle the epsilon affects the first-order term of the mixing time, it multiplies the n^2, while on the hypercube it does not. The holy grail of the topic is, starting from these two very elementary examples, to understand for more complicated groups and more complicated chains when there is cutoff and when there is not. Let me give one simple example that was resolved only this year, known as the random-to-random shuffle: you have a deck of n cards, one on top of the other; you pick a card uniformly at random, take it out of the deck, and insert it in another uniformly random place. How long does this take to mix? It is pretty easy to guess, and not so hard to prove, that the order is n log n, but what is the constant in front, is it concentrated, is there cutoff? That was open for more than twenty years and was solved just this year — I'll give the citation later — and the answer is (3/4) n log n for that chain. It is still one of the examples that is understood only by very specialized arguments: the upper bound uses representation theory and the lower bound a careful definition of the right event; there is no general theory that can solve this type of example for us. So that's the holy grail: to understand when we can expect cutoff and when we can prove it. For the cycle, here is the elementary argument why the mixing time is indeed of order n^2 and not less: if the time is less than n^2/32, then the chance that you reach as far as n/4 is too small, by Chebyshev, so the mixing time is at least n^2/32. This type of argument won't identify the right constant, but it gives the order. Here is one general tool for lower-bounding the total variation distance — not a sharp estimate, just what you get from Chebyshev. If you have two distributions — for us one will be the stationary distribution and the other the distribution at time t — and the means of some statistic differ by r times the standard deviation, then the total variation distance is at least 1 minus a constant over r^2. The way you check this: remember one of the two definitions of total variation, the maximum over sets A of mu(A) minus nu(A); you just have to define a good event A.
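A precise form of this statement, the one the Chebyshev threshold argument described next yields (the book has a sharper variant): if f is any statistic and sigma_* is the larger of the two standard deviations, then

```latex
\[
 \big|\mathbb E_\mu f-\mathbb E_\nu f\big|\;\ge\; r\,\sigma_*
 \qquad\Longrightarrow\qquad
 \|\mu-\nu\|_{TV}\;\ge\;1-\frac{8}{r^{2}} .
\]
```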
One way is to take a threshold theta which is the average of the two means, (E_mu f + E_nu f)/2, and let the event A be the set of x with f(x) bigger than theta. One mean is on one side of theta and the other is on the other side, so by Chebyshev the probability that f(x) exceeds theta is small under the measure with the smaller mean, and the probability that it is below theta is small under the measure with the larger mean. Using Chebyshev twice with this threshold, you get this type of bound; if you haven't seen it, go through the Chebyshev argument, and in the book we have this as well as a sharper bound of this type — you can do better than Chebyshev here. This can be applied in many examples, for instance on the hypercube. Look at the Hamming weight: we can write it as a combination of two contributions. If R_t is the number of coordinates that have not yet been updated, then all of those still carry a 1, because we started from all ones, and of the coordinates that have been updated, in expectation half are 1; so the conditional expectation of W(X_t) given R_t is (R_t + n)/2. Since the expectation of R_t is n(1 - 1/n)^t, the expectation of W(X_t), starting from all ones, is (n/2)(1 + (1 - 1/n)^t). It is very easy to see that in the stationary distribution the variance of W is n/4, and it is possible to check that the variance is at most n/4 for all t; I leave this verification to you, you can get it from this formula. Once you know that, you can use the previous lemma to lower-bound the total variation distance: the expectation of W under pi is just n/2, the expectation starting from all ones is as written, so the difference of the means is (n/2)(1 - 1/n)^t, which is sigma times root n times (1 - 1/n)^t with sigma at most root n over 2. If we take t = (1/2) n log n minus alpha n — oops, there was a typo on the slide, there is no extra n, it is (1/2) n log n minus alpha n — then the means differ by roughly e^alpha standard deviations, so if alpha is very large, d(t) at that time is close to 1. This is the other part of what I told you: at time (1/2) n log n, with a correction of linear order, d(t) is still close to 1. So for the hypercube the total variation (L1) mixing and the L2 mixing both occur at the same time, (1/2) n log n, with corrections of order n — we'll see the L2 calculation shortly — so there is a window of order n. This is an example of the cutoff phenomenon, which I will define next. One thing to notice: in this chain the mixing time is of order n log n, while the relaxation time, defined as the inverse of the spectral gap, is of order n. Let me write this: the relaxation time is 1 over the spectral gap where, for a stochastic matrix with largest eigenvalue lambda_1 = 1, I'm going to use not lambda_2 but lambda_star, the largest absolute value over eigenvalues different from 1 — sometimes an eigenvalue can be negative, so we take the largest absolute value — and the gap is 1 minus lambda_star. For the hypercube the relaxation time will be n, since this largest absolute value equals the second eigenvalue, 1 - 1/n, and all the eigenvalues are non-negative.
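Putting the numbers from this paragraph into the lemma (a rough version, ignoring the difference between (1 - 1/n)^t and e^{-t/n}):

```latex
\[
 \mathbb E_{\mathbf 1}[W(X_t)]-\mathbb E_\pi[W]=\frac n2\Big(1-\frac1n\Big)^{t},
 \qquad
 \sigma_*\le\frac{\sqrt n}{2},
\]
so at $t=\tfrac12 n\log n-\alpha n$ the difference is about
$\tfrac{\sqrt n}{2}\,e^{\alpha}=\sigma_* e^{\alpha}$, i.e.\ $r\approx e^{\alpha}$,
and the lemma gives $d(t)\ge 1-8e^{-2\alpha}$, close to $1$ for large $\alpha$.
```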
Maybe one thing I should comment on: we work a lot with lazy chains. It is a convenient way to ensure aperiodicity, and of course it also kills all the negative eigenvalues: going to a lazy chain just averages the transition matrix P with the identity, and since a stochastic reversible matrix is self-adjoint with real eigenvalues in [-1, 1], averaging with the identity moves the eigenvalues into [0, 1], so you don't have to worry about negative eigenvalues. I point this out because it is part of a general phenomenon: here the relaxation time on the hypercube is n while the mixing time is of order n log n. Now here is the general definition of cutoff that I have alluded to before. This d is really d_n, the total variation distance in the n-th chain, and there is a picture here which is a good one to look at: what we want is that the total variation distance descends very rapidly from near 1 to near 0. There are several equivalent ways to write this. One is: there is a sequence t_n, which formally represents the mixing time, and a window w_n, with w_n = o(t_n) — this is not written on the slide but let me add it, w_n should be little o of t_n — such that at a time which is t_n minus a large constant times w_n the total variation distance is close to 1, and at t_n plus a large constant times w_n it is close to 0. That is cutoff. An equivalent definition, maybe easier to remember: cutoff means that T_mix^(n)(epsilon) does not depend on epsilon to first order; in other words, the ratio of T_mix^(n)(epsilon) to T_mix^(n)(1 - epsilon) should go to 1 as n goes to infinity. It is easy to see that this is equivalent. Another useful notion is precutoff, which does not require this ratio to go to 1, only that it does not blow up with epsilon: formally, for any fixed epsilon, the limsup in n of this ratio should be bounded by a constant independent of epsilon. There are a few examples that have precutoff but not cutoff; often what happens is that a chain, or a sequence of chains, probably has cutoff but we can't prove it, and the substitute is to prove precutoff. In fact, for many chains where we conjecture cutoff we know precutoff but just can't refine it. Here is yet another way to think of cutoff: if you look at the mixing time t_n of the n-th chain and evaluate the distance at a constant multiple of it, then the limit in n of d_n(c t_n) should be 1 if the constant c is less than 1, and 0 if c is bigger than 1.
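One way to record the definitions just given (my transcription of the slide into formulas):

```latex
\[
 \text{cutoff:}\qquad
 \lim_{n\to\infty}\frac{t^{(n)}_{\mathrm{mix}}(\varepsilon)}{t^{(n)}_{\mathrm{mix}}(1-\varepsilon)}=1
 \ \ \text{for every }\varepsilon\in(0,1),
 \qquad\text{equivalently}\qquad
 d_n(c\,t_n)\to
 \begin{cases}1,& c<1,\\[2pt] 0,& c>1,\end{cases}
\]
\[
 \text{precutoff:}\qquad
 \sup_{\varepsilon\in(0,1/2)}\ \limsup_{n\to\infty}
 \frac{t^{(n)}_{\mathrm{mix}}(\varepsilon)}{t^{(n)}_{\mathrm{mix}}(1-\varepsilon)}<\infty .
\]
```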
I told you that for the hypercube there is a very easy argument using eigenvalues. Let's go through it, because it actually bounds something stronger: the L2 distance rather than just the L1 distance. The drawback of this argument is that it is specialized; it requires detailed knowledge of the eigenvalues and eigenvectors. So we have P reversible, with eigenvalues 1 = lambda_1 > lambda_2 >= ... >= lambda_n, and f_j the eigenfunction corresponding to lambda_j, normalized in L2(pi). Then four times the total variation distance squared — the factor four because total variation is half the L1 distance — is bounded by the L2 distance squared, and this can be written explicitly: you just diagonalize, and you easily see that the L2 distance squared is the sum over j of f_j(x)^2 lambda_j^{2t}, over the non-trivial eigenvalues. Now, with the relaxation time defined as 1 over 1 minus lambda_star, where lambda_star is the maximum absolute value over the eigenvalues different from 1, it is easy to prove from the diagonalization an upper bound: T_mix(epsilon) is at most T_rel times a logarithmic factor with two terms in it, the epsilon that we want and pi_min, the minimum stationary probability. This is a general upper bound; in some cases it is sharp, but in many it is off. Even on the hypercube this bound is too generous: the relaxation time is n, the log(1/epsilon) should be there, but log(1/pi_min) is of order n, since pi_min is 2^{-n} in dimension n, so this gives an upper bound of order n^2 on the mixing time. One example where it is sharp, and you'll see this later in Lubetzky's lectures, is expander graphs: expanders have bounded degree, relaxation time of order 1, and pi_min of order 1/n, so for an expander on n nodes this general bound gives order log n for the mixing time, which is the right order. So there are examples where this bound gives the right order, but often it is too generous. If you compute the distance on the hypercube for the lazy walk using eigenvalues: the eigenvalues, as I said, are all in [0, 1], and in fact they are 1 - j/n with multiplicity n choose j; that is an easy exercise. For instance, the largest nontrivial eigenvalue is 1 - 1/n with multiplicity n, because each of the n coordinate functions gives an eigenfunction with this eigenvalue; in general the eigenvalue 1 - j/n has multiplicity n choose j, and the eigenfunctions are characters, which you can write down explicitly. If you use that, the L2 norm squared is a binomial sum, which you can bound by (1 + e^{-2t/n})^n - 1, and this bound is close to sharp. If t = (1/2) n log n + cn, then e^{-2t/n} = e^{-2c}/n, so plugging in and using 1 + x bounded by e^x, the right-hand side is at most exp(e^{-2c}) - 1; the key point is that as c grows this tends to 0. So at time (1/2) n log n + cn we are controlling not only the L1 distance but the L2 distance as well. Any questions?
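As a numerical sanity check of this spectral bound (my own illustration, not from the lecture notes; n and the times are arbitrary), one can evaluate the binomial sum directly and compare it with (1 + e^{-2t/n})^n - 1 around the cutoff time:

```python
# For the lazy walk on {0,1}^n started at all-ones:
#   4 d(t)^2 <= || P^t(1,.)/pi - 1 ||_{L2(pi)}^2 = sum_{j=1..n} C(n,j) (1 - j/n)^{2t}
#             <= (1 + e^{-2t/n})^n - 1 .
import math

def l2_distance_sq(n, t):
    return sum(math.comb(n, j) * (1 - j / n) ** (2 * t) for j in range(1, n + 1))

n = 200
for c in (-2, -1, 0, 1, 2, 3):
    t = int(0.5 * n * math.log(n) + c * n)      # around the cutoff time (1/2) n log n + c n
    exact = l2_distance_sq(n, t)
    crude = (1 + math.exp(-2 * t / n)) ** n - 1
    print(f"c={c:+d}  t={t}  L2^2={exact:.4g}  bound={crude:.4g}")
```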
Maybe before going on with the examples, let me connect to something Charles was telling you about yesterday, the lamplighter groups. That is a good, interesting example where L1 and L2 are genuinely different, so let's look at the lamplighter group over the cycle. What is an element of this group? I won't draw the whole group, I'm just drawing one element: we have a cycle of n nodes, and an element consists of a bit for each node, which you can think of as a lamp being on or off, together with a marker on one of the nodes. So this picture is not the lamplighter group, it is a picture of one element. I don't care now about the group operation, just about the graph, and to define the graph I only have to tell you, for a given vertex, who its neighbors are: there are three, one obtained by flipping the bit where the marker is, and the others obtained by moving the marker left or right. The easiest chain to analyze is not the simple random walk on this graph but what is called randomize-move-randomize: given where the marker is, first randomize the lamp there, then have the marker make one lazy step, and then randomize the lamp at the new position. This is just a technicality, to make the chain nice and reversible and to ensure that all the locations the marker visited have randomized lamps. So what is the mixing time in this example? It is pretty easy to see that the mixing time in total variation is of order n^2, because by time of order n^2 — say 100 n^2, with high probability — the marker will have moved around the whole cycle and has had time to randomize all the lamps. I'm not after the constants now: this example does not have cutoff, and I just want to understand orders of magnitude. So in this example the total variation mixing time is of order n^2. But now suppose we want to understand the mixing time in L2; this, I claim, has a higher order. To see that, consider the event that the marker stays in one half of the cycle, say this half. That event has at least constant probability if I wait only a time period of order n^2. Assume the initial state was the all-zeros state with the marker here, and the marker stayed in this half; then all the lamps in the other half, the lower half, stayed zero, which means the distribution is concentrated not on 2^n states but on this event, a tiny collection of about 2^{n/2} times n states, a tiny part of the state space. Then the density is exploding: a probability measure carried by only about 2^{n/2} states out of 2^n has a huge density, and the L2 norm is exponentially large at this time. This remains true as long as there is a substantial probability of staying in one half, so the L2 mixing can only occur once the probability of staying in a half is exponentially small, and for that you need to wait of order n rounds, where each round takes time n^2. It turns out that the L2 mixing time for this example is indeed of order n^3. So the L1 mixing time is of order n^2 and the L2 mixing time is of order n^3, which is the maximum possible ratio between them, because it is the log of the size of the state space. On the hypercube that does not happen: the L1 and L2 mixing occur not just at the same order but really at the same time, with the same constant.
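A sketch of the estimate behind that claim (with c denoting the constant-order probability that the marker stays in a fixed half of the cycle up to a time of order n^2): if A is the set of states whose lamps are all off on the lower half, then pi(A) is at most 2^{-n/2}, and by Cauchy-Schwarz

```latex
\[
 \Big\|\frac{P^t(x,\cdot)}{\pi}-1\Big\|_{L^2(\pi)}^{2}
 \;=\;\sum_y\frac{P^t(x,y)^2}{\pi(y)}-1
 \;\ge\;\frac{P^t(x,A)^2}{\pi(A)}-1
 \;\ge\;c^2\,2^{\,n/2}-1 ,
\]
```

which is exponentially large as long as staying in a half-cycle still has probability at least c.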
Alright, so I want to mention the spectral necessary condition for cutoff, which is that the mixing time and the relaxation time — here I wrote relaxation time minus 1, just because of trivial two-state examples; in all examples of interest the relaxation time goes to infinity, so the minus 1 doesn't matter — should be of different orders. This is a necessary condition even for precutoff, and it is already enough to exclude cutoff on the cycle: on the cycle the largest nontrivial eigenvalue is 1 minus a constant over n^2, so the relaxation time is of order n^2, and so is the mixing time; they are of the same order, hence there is no cutoff. Why is this true in general? Basically because of the following inequality: T_mix(epsilon) is always at least — I gave you the upper bound before, relaxation time times log(1/(epsilon pi_min)) — but there is also a lower bound, the relaxation time (minus 1) multiplied by log(1/(2 epsilon)). Again, you can find this in my Markov chain book, but the idea behind the inequality is to use the second eigenfunction. Let me explain the intuition: take the stationary distribution and perturb it by a small constant times the second eigenfunction. Waiting one relaxation time only shrinks this perturbation by a factor of e, so the perturbation does not disappear, and you won't mix until the perturbation becomes smaller than epsilon; since it is only squeezed by a factor e in every round of one relaxation time, you need about log(1/epsilon) rounds until it becomes negligible. That is the idea behind this lower bound on the mixing time, which is written out as a simple argument in the book. Then, if you look at the ratio T_mix(epsilon) over T_mix — where T_mix is just T_mix(1/4) — you get a lower bound of order log(1/(2 epsilon)) under the assumption that the mixing and relaxation times are of the same order. But this kind of inequality tells you that you don't have cutoff, and not even precutoff, because the ratio is bounded below by a quantity that blows up as epsilon goes to 0. So: for the cycle, both mixing and relaxation time are of order n^2, and there is no cutoff; for the hypercube the ratio is of order log n, it goes to infinity, and there is cutoff. If you look at many examples, this suggests the conjecture that this ratio tending to infinity is the condition for cutoff. We know it is essentially necessary (with the minus 1), and it is usually sufficient, but there is a warning attached to this "usually": there are various weird counterexamples. One thing we are still missing is an exact condition that will inform us of cutoff without calculating exact constants: to find where the cutoff is you have to do precise calculations, but the hope is to find qualitative properties of a chain that tell us when there is cutoff and when there is not, and philosophically I think by now we understand more than the theorems express. Cutoff occurs when the state space has a lot of independent, or weakly dependent, variables, and mixing means that most of these have to mix separately.
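In formulas, the lower bound just described reads

```latex
\[
 t_{\mathrm{mix}}(\varepsilon)\;\ge\;\big(t_{\mathrm{rel}}-1\big)\,\log\!\Big(\frac1{2\varepsilon}\Big),
\]
```

so if t_rel and t_mix are of the same order along the sequence, the ratio t_mix(epsilon)/t_mix(1/4) is bounded below by a quantity that blows up as epsilon tends to 0, ruling out even precutoff; hence the "product condition" t_rel = o(t_mix) is necessary for (pre)cutoff.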
While if your state space is low-dimensional, or tightly correlated, then you don't expect cutoff. Again, on the cycle I said there is no cutoff, and the same holds for a torus of bounded dimension: instead of the cycle Z mod n, look at (Z mod n)^d for fixed dimension d, and do lazy simple random walk there; for any fixed d this does not have cutoff. However, you can let n tend to infinity and also let the dimension tend to infinity, and it turns out that if the dimension tends to infinity with n, no matter how slowly, then there is cutoff. So somehow cutoff captures high-dimensional behavior of the chain. From the spectral point of view, cutoff represents the fact that mixing is not the work of just one eigenfunction; there has to be cooperation among many important eigenfunctions. One signature of cutoff is high multiplicity of the largest nontrivial eigenvalue, but that is not necessary for cutoff, because you might instead have many eigenvectors corresponding to nearby eigenvalues; exactly quantifying the condition is still a challenge. Here is a nice example by Aldous showing that having mixing and relaxation times of different orders is not enough to give cutoff. This will be a reversible chain — I'm describing transition probabilities, and the actual chain will be the lazy version, so with probability half we stay in place and with probability half we move according to these rules. There is a tail here, a path of length 5n, where we move with probability one third to the left and two thirds to the right, so there is a strong drift to the right. Then the path breaks into two pieces: the lower path has length n, with probabilities one fifth to the left and four fifths to the right, and the upper path has length 2n — you don't see it in the picture, but it is length 2n — with one third to the left and two thirds to the right. So the upper branch is longer and slower, the lower one shorter and faster, but these probabilities are planned exactly to make the chain reversible. How does the stationary distribution behave? Because of these transition probabilities, the stationary measure doubles at every step as you go right along the tail; along the bottom path it has to grow by a factor of four at every step, to compensate for those probabilities, and along the top it doubles at every step. Because the top path has double the length, 2n, these two are consistent, and when you reach the merge point its stationary measure is about 4^n times larger than at the branch point (there is an extra bounded factor, which is not important). So this defines a reversible chain, and this chain is actually a weighted expander: by that I mean that almost all the stationary measure is concentrated near the rightmost node, and if I take any set of measure at most one half — so it cannot contain all the nodes on the right — its boundary has measure at least a constant times the measure of the set. Hence, using the Cheeger inequality, which I haven't explained here but you can find in many sources including our Markov chain book, the relaxation time of this chain is of order one: the largest nontrivial eigenvalue is separated from one. The mixing time, of course, is not of order one; it is much larger.
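Writing out the reversibility check mentioned above: for a nearest-neighbour edge, detailed balance forces

```latex
\[
 \pi(x)\,p(x,x+1)=\pi(x+1)\,p(x+1,x),
\]
```

so on the stretches with probabilities 2/3 right and 1/3 left the stationary measure doubles at every step, and on the stretch with 4/5 right and 1/5 left it grows by a factor of 4 per step; the top branch (length 2n, factor 2 per step) and the bottom branch (length n, factor 4 per step) therefore both gain the same total factor 4^n, which is what makes the two branches consistent at the merge point.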
It is clearly of order n: the worst starting point is at the left end, and almost all the stationary measure is concentrated at the right, so the mixing time is determined by the time it takes a particle to go from here to there. However, that time depends on whether the particle takes the bottom path or the top path, so the total variation distance to stationarity behaves in a funny way; let me explain it in this picture. When the particle reaches the branch point, it takes the top path with positive probability and the bottom path with positive probability — not exactly half-half. Ignoring the laziness (everything should be doubled), the initial tail takes time 15n, since you move at speed one third along a path of length 5n; if the particle then chooses the top path, that takes another 6n, since you move at speed one third along a path of length 2n; on the bottom path it is shorter and you move faster, so it only takes 5n/3. The point is that these two possibilities take different multiples of n. So if I look at the total variation distance to stationarity, it starts near 1, and at a time which is about 2(15n + 5n/3) it drops — but not to near zero; it drops to a constant, corresponding to the probability that the particle took the slow road — and then finally, when the time for the slow road arrives, which is about 2(15n + 6n), the total variation distance goes down to near zero. So in this example there is no cutoff — this is what it looks like, with everything doubled to account for the laziness. This is just what I explained in words, and in this example, if you look at it closely, there is a precutoff but no cutoff. (Question: are there such examples where the stationary distribution is uniform?)
Yes — I was debating whether to give one, but let me give an example modified from an idea of Igor Pak. Start with the hypercube. I'll just describe it in words: with probability 1 - 1/L make one move of the lazy hypercube walk, and with probability 1/L jump to a uniformly chosen configuration. This can be done on other state spaces, but to be concrete take the hypercube, so the state space is still the hypercube, and I still have to tell you what L is: L will be the geometric mean of the mixing time and the relaxation time, so L = n sqrt(log n). What is the relaxation time of this chain? It is easy to see it is still of order n, as it was: if you look at how the eigenvalues change, each nontrivial eigenvalue is just multiplied by 1 - 1/L, so the spectral gap stays of order 1/n. But the mixing time goes down from order n log n to order n sqrt(log n), because the main thing governing the mixing is no longer mixing within the hypercube — by time n sqrt(log n) the walk would not have mixed on its own — but the first uniformizing move, which happens by time of order L. Because this is just the waiting time for a rare event, the total variation distance has an exponential profile rather than a cutoff: the mixing time is of order L, but T_mix(epsilon) is of order L log(1/epsilon). So in this example the stationary measure is uniform and there is no precutoff, even though the mixing time, of order L, is of larger order than the relaxation time. So we do have such examples, but you see they are somewhat contrived. We are out of time, but as I said there are other interesting counterexamples: the one you saw is modified from an example of Pak, and there are also interesting examples by Lacoin. Let me end with two exercises you can think about. First: take a binary tree of depth k, with n vertices, so n is about 2^k, and try to figure out whether lazy simple random walk on this finite binary tree has cutoff. I'll tell you the answer: show that there is no cutoff; one way is to check the relaxation time and the mixing time and show they are of the same order, in fact both of order n, the number of nodes. The second exercise, which is harder, is: find a sequence of trees on which lazy simple random walk has cutoff. It is quite surprising that such trees exist: if you do simple random walk on a path there is certainly no cutoff, it is just like the cycle, so it is a little surprising that you can write down a sequence of trees — without weights, and let me emphasize bounded degree — on which lazy simple random walk has cutoff.
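To record the computation behind the modified-hypercube example above (a sketch; the lecture only stated the conclusions, and the notation Pi for the matrix with all rows equal to pi is mine): the modified chain is

```latex
\[
 \tilde P=\Big(1-\tfrac1L\Big)P+\tfrac1L\,\Pi,
 \qquad L=n\sqrt{\log n},
\]
```

and since Pi fixes constants and annihilates everything orthogonal to them, every nontrivial eigenvalue lambda of P becomes (1 - 1/L) lambda; with lambda_2 = 1 - 1/n the spectral gap is about 1/n + 1/L, which is of order 1/n, so t_rel stays of order n. On the other hand, after the first uniformizing move the chain is exactly stationary, so d(t) is at most (1 - 1/L)^t; and until that move the chain is just the lazy hypercube walk, which at times of order L = o(n log n) is still far from mixed, so d(t) is roughly (1 - 1/L)^t in the relevant range — an exponential profile, giving t_mix(epsilon) of order L log(1/epsilon) and hence no precutoff even though t_rel = o(t_mix).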
So that's enough for today. In my next lecture — which is not tomorrow — I want to tell you a little bit about the rate-of-escape questions as well, and then we'll come back to cutoff in the last lecture. (Chair: thanks a lot for the beautiful lecture; are there questions?) (Question: at some point you mentioned the connection between the spectrum of the operator, the multiplicity, and the ratio with the relaxation time; do you think one can express that connection through bounds on the spectral density close to the maximum?) We don't have that; there should be some connection like that, but we don't know the right formulation yet. (Question: does it depend on the structure of the eigenstates or just on the spectral density — do you think it depends on the localization of eigenstates?) In general it must depend on it, but under some symmetry assumptions maybe you can remove that; it is still not fully understood. (Chair: so we have a coffee break of 30 minutes and we reconvene at 4 o'clock; thanks again.)