Thank you to the organizers for giving me the chance to speak at this famous institute. This talk consists of two case studies. The first one is about compressed sensing. For this problem, L2 minimization is quite easy, but its solution is dense, and usually people just discard this dense solution. The question we ask is whether such a dense solution is actually helpful for solving the sparse problem. The second story is related: it concerns the binary perceptron in the teacher-student setting, and we ask how to learn as fast as possible, that is, how to design the questions, the quiz, in active online learning.

Now the first story begins. We proposed a heuristic algorithm called SSD, shortest-solution guided decimation. This work was done in collaboration with my student Mutian Shao and the researcher Pan Zhang.

What is compressed sensing? We have already heard a lot about this. The problem is quite simply defined: you have a matrix D and a sparse signal, and the task is to reconstruct the original sparse signal from the measured vector z. Each row of the matrix can be regarded as one measurement. The number of measurements is m (in this example, 4) and the dimension of the instance is n (in this example, 10). So there are two parameters: the measurement density alpha = m/n, and the sparsity rho of the signal, defined as the number of non-zero elements divided by the total number of elements (in this example, 0.3). The task is: given the measurement vector z and the matrix D, reconstruct the hidden planted solution h0. A minimal sketch of this setup is given below.

There has been a huge development in recent years, actually starting from the 1980s, and there are many excellent algorithms: for example orthogonal matching pursuit (OMP), the similar orthogonal least squares, and L1-based minimization. In statistical physics we also have approximate message passing (AMP), and even a full statistical-physics analysis, but it seems the story is still not completely settled. What is unsatisfactory is that most previous theoretical and computational studies assume the measurement matrix D is random and satisfies the condition called RIP, which basically means that if you pick a small number of columns of the measurement matrix D, these columns should be nearly independent. But in many situations the measurement matrix may be highly structured, with strong correlations among the columns. For example, in quantum Monte Carlo simulations you may want to extract a spectral density from the measured Green's function; this is a physics problem, and the kernel matrix K is certainly quite non-random. If you apply AMP and similar methods to such a highly structured matrix, it seems they simply do not work. So this whole work is an attempt to offer a solution for the case where the matrix is highly structured.
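To make the setup concrete, here is a minimal sketch in Python/NumPy of a compressed sensing instance and of the dense shortest solution; the sizes, the Gaussian measurement matrix, and the Gaussian values on the support are my own illustrative assumptions, not the example from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

n, m = 1000, 200            # signal dimension n, number of measurements m
alpha = m / n               # measurement density alpha = 0.2
rho = 0.1                   # sparsity: fraction of non-zero entries

# Sparse planted signal h0 with rho*n non-zero Gaussian entries.
h0 = np.zeros(n)
support = rng.choice(n, size=int(rho * n), replace=False)
h0[support] = rng.standard_normal(support.size)

# Random Gaussian measurement matrix D and measurement vector z = D h0.
D = rng.standard_normal((m, n)) / np.sqrt(n)
z = D @ h0

# Shortest (minimum-norm) solution g = D^+ z of the underdetermined
# system D h = z: it reproduces z exactly but is dense.
g = np.linalg.pinv(D) @ z
print("residual |D g - z| =", np.linalg.norm(D @ g - z))
print("non-zero entries of g:", np.count_nonzero(np.abs(g) > 1e-12))
```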
I am a beginner in this compressed sensing business, so I just started from the beginning: what is a measurement? We have z = D h, where z is a vector of dimension m. If you look at this measurement through the singular value decomposition of D, compressed sensing is actually very simple. First you project the original vector h onto an orthonormal basis v1, v2, ..., and obtain the coefficients; then you stretch each coefficient by the corresponding singular value; and then to each stretched coefficient you attach an output vector. That is the measurement. A second point is that you can design many measurement matrices sharing the same singular values by choosing the two sets of orthonormal vectors; such a matrix does not need to be random and independent at all.

Given such a matrix, when the number of measurements is less than the data dimension, the general solution of z = D h is h = g plus a null-space term. Here g is the shortest solution: this underdetermined problem has many solutions, but one of them has the smallest norm, and this one is called g. It is simply the pseudo-inverse of D applied to the measurement vector, and it is easy to obtain: you just perform L2 minimization. But g is dense, every element is non-zero, so it is not what you want. So we asked the question: although this L2-minimization solution g is dense, does it still contain some information about the sparse solution?

We performed some hand-waving analysis. If we express the shortest solution g, we find it is just a superposition of m vectors, and its i-th element can be decomposed into two parts, g_i = g_i^A + g_i^B. The part g_i^A is related only to the i-th element h_i^0 of the hidden planted solution, while g_i^B is a noise term containing the contributions of all the other elements of the planted solution. One finds that g_i^A and g_i^B are of the same order, so from a single element you cannot extract any information. But now rank the elements of g by absolute value, the maximum, the next maximum, and so on, and look at the index l with the maximum value. In this analysis one finds a very important property: g_l^B is essentially independent of the index l and has a fixed typical magnitude (its sign, of course, may be plus or minus), while g_l^A depends only on h_l^0. If you require |g_l| to be maximal, this indicates that g_l^A and g_l^B should have the same sign; and since g_l^B is more or less independent of the index l and is non-zero, it indicates that g_l^A should also, highly likely, be non-zero. So the hand-waving argument is: if you look at the shortest solution g and pick the element with the largest absolute value, then highly likely the corresponding element h_l^0 of the original solution is also non-zero.

This is the idea of our algorithm. We have checked it in simulations (a minimal version of the check is sketched below), and indeed the top-ranked index is, with high probability, a non-zero index; this probability even increases as the system size, the dimensionality, increases. Based on this, we designed a very simple algorithm with two steps. The first is a decimation routine, which tries to figure out which entries of the planted solution are highly likely to be non-zero; the second, a backtracking step, determines the values of these non-zero entries.
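Continuing the snippet above (reusing n, m, rho and rng), here is a minimal version of the numerical check just described, estimating how often the top-ranked entry of g falls on the true support; this is my own illustrative re-implementation, not the authors' code:

```python
# How often does the largest-magnitude entry of the shortest solution g
# fall on the true support of the planted signal h0?
hits, trials = 0, 200
for _ in range(trials):
    h0 = np.zeros(n)
    support = set(rng.choice(n, size=int(rho * n), replace=False))
    h0[list(support)] = rng.standard_normal(len(support))
    D = rng.standard_normal((m, n)) / np.sqrt(n)
    g = np.linalg.pinv(D) @ (D @ h0)          # shortest solution
    if np.argmax(np.abs(g)) in support:
        hits += 1
print("P(top-ranked index is truly non-zero) ~", hits / trials)
```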
The algorithm goes as follows. Given the problem instance, we compute the L2-minimization solution g and find its largest-magnitude element; we assume that the corresponding element h_l^0 of the planted solution is non-zero. We then delete that column from the matrix D, which simplifies the problem, and we repeat the process. (A compact code sketch of the whole loop is given at the end of this story.)

We did some simulations, no theory, only simulations, in two settings. In the first, the RIP condition holds: the measurement matrix is random, i.i.d. Gaussian distributed, and the ratio Q of the maximum to the minimum singular value of the matrix is of order one. This Q is our measure of structure: Q of order one means the matrix is quite random. In the second setting the RIP condition is violated: we take D to be the product of two matrices, D = D1 D2, where D1 has dimension m by r and D2 has dimension r by n, both Gaussian. In this case, as r approaches m, D becomes highly correlated and the singular-value ratio Q becomes very large.

These are the simulation results for random Gaussian matrices at measurement density m/n = 0.2, scanning the sparsity of the signal. We find that our algorithm outperforms L1 minimization; it slightly outperforms orthogonal matching pursuit, since they have a similar philosophy, and it also slightly outperforms orthogonal least squares. It is slightly worse than AMP: AMP is slightly better, but AMP needs prior information about the signal, and we use no prior information at all.

If we go to a highly correlated matrix D, however, the story changes. Again the measurement ratio is 0.2. As the sparsity of the signal increases, the performance of our algorithm does not change: it works well as long as the sparsity is below a certain critical value. But AMP fails already when the sparsity is very low; for this matrix, when we run AMP, it fails completely. Here are some more results, scanning the correlations in the matrix from highly random to highly structured: SSD seems completely insensitive to the structural correlations in the matrix, while for all the other algorithms we tested, the performance deteriorates badly once correlations appear.

The speed of SSD is actually comparable to OMP, because g can be recomputed repeatedly by an accelerated dual-ascent process, or even by more advanced methods, for example the ones the previous speaker talked about. Overall the algorithm is about six times slower than OMP.

To summarize story one: we designed a very simple algorithm that uses the shortest, but easy to obtain, solution to guide the search for sparse solutions, and we have some indication that it is highly tolerant to structural correlations in the measurement matrix. At the moment there is no theoretical analysis. I will move to story two, but if you have questions, please ask. [Inaudible question.] This is the paper; it was published last year. As usual, no feedback from colleagues.
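To close story one, here is a compact sketch of the SSD decimation-plus-backtracking loop as described above. The residual-based simplification step, the least-squares backtracking, and the assumption that the number k of non-zero entries is known are my own simplifications, not necessarily the exact scheme of the paper:

```python
def ssd(D, z, k):
    """Shortest-solution guided decimation (illustrative sketch).

    D: (m, n) measurement matrix; z: (m,) measurement vector;
    k: number of entries to declare non-zero (assumed known here).
    """
    n = D.shape[1]
    active = list(range(n))      # columns not yet decimated
    chosen = []                  # indices declared non-zero
    z_res = z.copy()
    for _ in range(k):
        # Shortest solution of the current, simplified problem.
        g = np.linalg.pinv(D[:, active]) @ z_res
        l = active[int(np.argmax(np.abs(g)))]     # top-ranked index
        chosen.append(l)
        active.remove(l)
        # Simplify: remove the best current fit of the chosen columns.
        coef = np.linalg.lstsq(D[:, chosen], z, rcond=None)[0]
        z_res = z - D[:, chosen] @ coef
    # Backtracking: determine the values on the selected support.
    h = np.zeros(n)
    h[chosen] = np.linalg.lstsq(D[:, chosen], z, rcond=None)[0]
    return h
```

Under the assumptions of the earlier snippets, a call like `h_hat = ssd(D, z, k=int(rho * n))` should recover the planted signal in the easy regime.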
So, coming to the next story: active online learning in the binary perceptron. I think science usually proceeds like this: first you have some observations, which is passive; then you form some hypothesis and refine it by experiments, by designing queries; and finally you may reach an abstraction and get a theory. We look at this process through a very simple model, the perceptron, which has also been discussed here quite a lot. The perceptron is the building block of multi-layer neural networks. It is a many-to-one unit: you have an n-dimensional input, weights on the edges, and an output; in this binary case the output is just the sign of the inner product.

We study the problem in the teacher-student scenario. The teacher's weight vector T = (T_1, ..., T_n) is hidden from the student, but the student can ask questions: it inputs a binary question (xi_1, xi_2, ..., xi_n), the teacher gives out the answer sigma, either +1 or -1, and the student tries to guess the teacher's vector T from these questions. The problem is how to learn as quickly as possible. This question was actually asked by many people in the 1990s.

In mathematical terms: you ask the mu-th question, a binary vector xi^mu, and the teacher gives the mu-th answer sigma^mu, also binary. Given these P questions you can define the version space, with its partition function: the set of all weight vectors J that are consistent with the P questions. From this you can compute the probability that a given vector is the teacher's: a vector consistent with the P patterns contributes one, and the denominator is the total number of such vectors. From this you can compute the mean value of each weight, and you then get a very simple algorithm: take the sign of the mean value of J_i as the guess for the i-th element of the teacher's weight vector. Using spin-glass theory, or belief propagation, you find that in the online iteration process, where the questions are asked one after another and each question is used only once, the weight evolves according to a simple equation: after the (P+1)-th question, the mean value J_i^(P+1) equals J_i^P plus a Hebbian learning term.

We used this algorithm to perform online learning, in two modes. The first is random: the questions xi^1, xi^2, ..., xi^P are just randomly chosen; the student does not think about the problem at all. In the second mode the questions are carefully designed: you ask one question, get the answer, guess the teacher's weights, and then design the next question, and so on. For quite small systems, say n less than 25, you can perform an exhaustive search. You find that random learning is not so good: the error decreases slowly, and even after asking 2n questions it is still non-zero. But if you design the questions and always ask the best one, you need only about n questions to completely and exactly figure out the teacher's hidden vector. (A simplified sketch of the passive, random-question mode follows.)
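Here is a minimal sketch of passive online learning in this teacher-student setting. The plain Hebbian step below is a stand-in I am assuming in place of the exact belief-propagation-derived update with its R_P factor, and the sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 501                                  # input dimension (odd avoids ties)
T = rng.choice([-1.0, 1.0], size=n)      # hidden teacher weight vector

# Passive online learning: random binary questions, each used once.
Jmean = np.zeros(n)                      # student's running mean estimate
for _ in range(2 * n):                   # ask P = 2n random questions
    xi = rng.choice([-1.0, 1.0], size=n)        # random question
    sigma = np.sign(T @ xi)                     # teacher's answer (+/-1)
    # Simplified Hebbian step, standing in for
    # J_i^(P+1) = J_i^P + R_P * (learning term).
    Jmean += sigma * xi / np.sqrt(n)

print("relative error after 2n random questions:",
      np.mean(np.sign(Jmean) != T))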
For large systems, in simulations of passive learning with random patterns and this MAP strategy, the relative error decreases towards zero but does not reach it: even when the number of questions per dimension approaches 4.5, there is still a tail. This means that in passive learning, although the generalization error can go to zero, it is still extremely difficult to completely figure out the teacher's weights. Here the rate of success means that you recover the teacher's weights completely, without any error, and the horizontal axis is the fraction of questions, the relative number of questions you ask. As the dimension of the system increases, the success rate actually decreases. So with random online learning it is probably impossible to learn the teacher's weights completely; some mistakes always remain. That may not be good, because in physics we usually want the grand theory, but maybe it is simply impossible here.

So we turn to active learning: we want to learn as quickly as possible, and without error. The question is how to design the questions, and the idea is very simple. When you ask the next question xi^(P+1), you should make sure it cuts the current solution space into two equal halves, so that with each question the size of the solution space shrinks by one half; this is the best you can do. This idea is not new, it goes back to 1972, but I only learned of it in 2018, so I am actually very slow. The exact prescription is intractable, because you would have to enumerate all candidate solutions and then cut them in half. How to make it practical? After some derivation, one finds that you can simply require the newly designed vector xi^(P+1) to be orthogonal to the mean vector J^P obtained after the P-th question. This is quite similar to, but still a little different from, the early work of 1992: the difference is that there they used not the mean value but the actual student vector at the P-th stage.

[Question: is this J_i?] Yes. J_i is the i-th weight, the weight on the i-th edge between input and output, and in my case it is +1 or -1. And J_i^P means the student's estimated mean value of J_i after online learning of P patterns.

With this trick we can design an active learning algorithm. First, obtain a vector xi^(P+1) orthogonal to the current mean vector; there are many such vectors (one simple construction is sketched below). Then input this vector to the teacher and get the feedback sigma^(P+1). Then use the update equation to refresh the mean values of the weight vector, and repeat. We ran simulations for different sizes and found that the relative error now decreases to zero at about P/n = 2.2. [Question about R_P.] Yes, R_P is just a small numerical constant, not a linear rate: when you use spin-glass theory to compute the evolution of the mean of J_i, this numerical value R_P appears, and it is more or less independent of the process. The key points are, first, that a little more than 2n questions now suffice to drive the relative error to zero; and second, that with this active learning the error becomes exactly zero, meaning you completely learn the teacher's signal. And this actually seems to be a phase-transition phenomenon: if the number of questions is below this value, the fraction of exact successes approaches zero, but if you ask enough questions, the success rate becomes one.
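Continuing the sketch above (reusing n, T and rng), here is one simple way, my own assumed construction rather than the paper's, to realize the design criterion: build a +/-1 question approximately orthogonal to the current mean vector by a greedy sign-balancing pass.

```python
def design_question(Jmean, rng):
    """Return a +/-1 question approximately orthogonal to Jmean.

    Greedy balancing: visit coordinates by decreasing |Jmean_i| and
    pick the sign that keeps the running inner product near zero.
    """
    xi = np.empty(Jmean.size)
    running = 0.0
    for i in np.argsort(-np.abs(Jmean)):
        s = -np.sign(running * Jmean[i])
        if s == 0.0:                  # tie or zero weight: random sign
            s = rng.choice([-1.0, 1.0])
        xi[i] = s
        running += s * Jmean[i]
    return xi

# Active online learning: design each question, then update as before.
Jmean = np.zeros(n)
for _ in range(int(2.3 * n)):
    xi = design_question(Jmean, rng)
    sigma = np.sign(T @ xi)
    Jmean += sigma * xi / np.sqrt(n)    # same simplified Hebbian step

print("relative error with designed questions:",
      np.mean(np.sign(Jmean) != T))
```

With this simplified Hebbian update the error will not become exactly zero as in the full algorithm of the talk, but the designed questions are expected to drive it down faster than random ones.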
Here, success means exact success. In Chinese one would call it "dunwu", sudden enlightenment: suddenly you get the whole message, but you need about 2.25 n questions to completely reconstruct the teacher's signal. This is achieved by this kind of Bayesian optimal statistical learning. It is of course still not as good as deductive reasoning: with human-powered deductive reasoning, you would need to ask only about n + log2(n) questions to get the signal.

OK, the summary of story two: active learning is much more efficient, so it pays to ask the questions in a different way, but you have to design the questions. Also, there seems to be a kind of dynamical phase transition in this online learning process. Again, there is no theory. I stop here; if you want the details, there is a paper written by me and published in the best theoretical-physics journal in China, Communications in Theoretical Physics, a journal in which our professor here has also published, so you should follow his lead. Thank you for listening. Thank you.