So, hello everybody. I would like to thank the co-organizers for letting me fill this slot. I am aware of the fact that I stand between you and lunch, and we have heard a lot of different ideas so far, so I would like to take a somewhat higher-level view of algorithmics in data science and then come to some validation issues we have been looking at over the last couple of years. You might also call it "quo vadis, artificial intelligence" — I think it relates to these issues, but we are still very far away from answering them.

We live in the time of data. We had these machines in the 1950s, and today we have these big data centers; they run algorithms and they produce value. I would like to point out that this mapping from data to value is a completely discriminative view, and what we want is to get under control how that value is generated out of data. For personalized medicine, you start with hundreds of gigabytes of data, you annotate this data with information from the doctors, you generate knowledge out of it — say, a broken pathway for clear cell renal cell carcinoma — and then you generate value, and concretely that value is the survival probability of the patient. This value costs you maybe 10 kilobytes, and you started out with 100 gigabytes. The central challenge of our field is to identify those bits which have to be preserved along this pipeline. They are all that matters for the patient, for the doctors, for the insurance companies, and so on. If you miss these bits and you are classified as belonging to the green cohort but you actually belong to the red cohort, then chemotherapy might be ineffective for you and the last half year of your life is hell. So this is not an irrelevant bit; this is true value. And I am not saying that you have to maximize the probability of survival — it could be any other measure you commit to. This is one of the challenges.

In essence, if you use the strategy of data science, you want to estimate a gigantic conditional probability: you condition on your measured health status and you estimate the probability of that value, and you might even deal with interventions by drugs in order to understand what you are doing in these pipelines.

Along the way I would like to show you a concept for how to validate algorithms. There are some considerations which I think are non-standard, both for machine learning and for the broader science community, and they point towards the essential role of algorithms in this endeavour; then I will talk about examples, if I get to that point.

Random input implies random output. You see that here with this gene expression data: the data are drawn from a probability distribution defined by your experiment. The biologists tell us to look at correlations, so you convert this into a graph picture; the biologists also like to know what the communities are, so you do your graph labeling, your colouring, and you find these communities. The question is which algorithm you should use, because this is an element of exploratory data analysis. People have cost functions for it, but you cannot compare these cost functions, because there is no absolute level of comparison. This is a problem. Many people consider it an art and not a scientific question.
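To make this exploratory pipeline concrete, here is a minimal sketch assuming synthetic expression data, an arbitrary correlation threshold of 0.6, and networkx's greedy modularity communities — each of these choices is precisely the kind of unvalidated decision in question.

```python
# Minimal sketch of the exploratory pipeline: expression data -> correlation
# matrix -> thresholded graph -> communities. The threshold and the community
# algorithm are arbitrary modelling choices -- exactly the validation problem.
import numpy as np
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

rng = np.random.default_rng(0)

# Synthetic "gene expression": 60 genes in 3 latent groups, 40 samples.
n_genes, n_samples = 60, 40
latent = rng.normal(size=(3, n_samples))
X = np.vstack([latent[g // 20] + 0.5 * rng.normal(size=n_samples)
               for g in range(n_genes)])

# Correlation matrix -> graph (the 0.6 threshold is an unvalidated choice).
C = np.corrcoef(X)
G = nx.Graph()
G.add_nodes_from(range(n_genes))
for i in range(n_genes):
    for j in range(i + 1, n_genes):
        if abs(C[i, j]) > 0.6:
            G.add_edge(i, j)

# Community detection = the "graph colouring" step of the pipeline.
communities = greedy_modularity_communities(G)
print([sorted(c)[:5] for c in communities])  # first members of each community
```

Change the threshold to 0.4, or swap the community algorithm, and the communities change — and there is no absolute level of comparison between the variants.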
We know that quite often you destroy the essential bits in the preprocessing of your data, and the fluctuations caused by choosing the wrong model later in the pipeline are then minor. But people tend to focus on where they can publish, and publishing on preprocessing is usually much harder than publishing on the nice models you invent later on. The problem is: if you cannot measure the information loss relative to your goal, then it is arbitrary — you are basically believing that you are doing the right thing. In philosophy this is called epistemic uncertainty; the fluctuations in the data, by contrast, are aleatoric uncertainty.

The output is a random variable drawn from a conditional probability distribution in which you condition on the data. Algorithms in data analytics map from the input space X to the output space C — I use C because I worked on clustering for a long time, so C is the clustering — and the algorithm returns a terminal answer c(X). That gives you a probability distribution over possible answers: the answers of the algorithm form a distribution because you have a distribution over your experiments. And for typical results of the experiments you should be robust against these fluctuations, because an experiment is characterized by its signal — I assume that the signal is stable, but the fluctuations are not. The fluctuations tell me how precisely I can make a claim. Now, why do we go to the terminal point of the algorithm at all? Maybe it is better to have a posterior which already regularizes your results. Then you would have a distribution over answers when you sample from this posterior, and you take into account that the random data are distributed according to this probability.

So — and this is the design aspect — there are a number of core questions which come up in computer science but are more general for science. First, Kolmogorov told us that algorithms with random variables as input compute random variables as output. It is a triviality in that sense, if you believe in probability theory, but it is not paid attention to in the design of a lot of algorithms. Second, Shannon told us that algorithms have to compute typical solutions. You pay a high price for a guarantee against atypical solutions, and if you want that, you have to have good reasons. In cryptography you have good reasons, because you have powerful, intelligent opponents who might focus on exactly those solutions. Nature is not malicious, so in the typical cases of machine learning, where you get data from random experiments, this should be taken into account. Third, algorithms have to generalize — that is an issue: they have to be robust against noise models and mismatch. And if we want to make the dream of artificial intelligence come true, then our algorithms have to autonomously improve their performance, and we have to know what the meta-rule is in order to make progress on the resolution of the output space.
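A toy illustration of the first point — the same algorithm on two independent draws of the same experiment gives different answers — in a minimal sketch assuming scikit-learn; the Gaussian mixture, the noise level, and the choice of k are illustrative assumptions.

```python
# "Random input implies random output": train the same algorithm on two
# independent draws of the same experiment and compare the answers on
# fresh data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 1.7]])

def draw_experiment():
    """One run of the experiment: noisy samples around fixed sources."""
    return np.vstack([c + 0.8 * rng.normal(size=(50, 2)) for c in centers])

x_prime, x_dblprime, x_test = (draw_experiment() for _ in range(3))

for k in (3, 5):  # k=3 matches the sources, k=5 over-resolves them
    km1 = KMeans(n_clusters=k, n_init=10, random_state=0).fit(x_prime)
    km2 = KMeans(n_clusters=k, n_init=10, random_state=0).fit(x_dblprime)
    agree = adjusted_rand_score(km1.predict(x_test), km2.predict(x_test))
    print(f"k={k}: agreement across the two draws = {agree:.2f}")
```

With k = 3 the two answers typically agree well on fresh data; with k = 5 the extra clusters are placed by the noise, and the agreement drops — the fluctuations, not the signal, determine part of the answer.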
Now to the notion of typicality — at least in computer science it is not too widely known. Imagine a random coin flip, 60% heads, 40% tails. Which sequences do you want to report? We can discuss for a long time why you would want to report a sequence at all, but it is a caricature of a situation in engineering: you have a random optimization problem and you want to report the best solution, the maximum likelihood solution. And the maximum likelihood solution gives you the all-ones sequence. It is a classical case where the most probable sequence is atypical, and you should not report it. So from a computer science point of view it is not at all clear what it actually means that your algorithm is correct. I would say that an algorithm which finds the global minimum, where the global minimum is atypical, is wrong. Correctness statements about algorithms have to be rethought in this context.

So my claim is: machine learning is not optimization. First of all, you cannot do what you want to do, because you cannot actually minimize the expected risk. And what you can do — minimizing the empirical risk — might not be the most relevant thing. I think what machine learning algorithms ultimately do is localize solutions in the solution space. They pursue a metric goal: to be repeatable under uncontrolled fluctuations of the input. Optimizing a cost function, by contrast, means finding a good interpretation according to a partial order on your solution space, and that is a problem for the following reason — we should pay attention to this localization.

Why is it a problem? The standard setting says: you have training data, validation data, and test data; you have a bunch of different cost functions — as in my graph clustering problem for gene expression data — and candidate models. Which one should you choose? Every one of these candidate models gives you a conditional probability distribution, and then you sample from it. What people advocate is that you evaluate your cost function on the validation data X'' for solutions drawn from your training data, that is, solutions which come from the probability distribution P(c | X'). Training-data solutions are checked against validation costs, and you choose the one with the lowest cost. That is the standard view, and you use classical optimization to robustify yourself.

Here is my summary for students: this violates an important wisdom of modeling — use small numbers when you have large uncertainties. Cost functions do exactly the opposite: they return the largest values when you are farthest from your target, so you use the largest numbers of your risk exactly where the uncertainties are largest. And why is this a problem, despite the fact that we have been so successful in the physical sciences with these concepts for thousands of years? For the following reason. Suppose you do not know what the correct cost function is and you want to learn it from the data. Take this to be your risk function and this the probability distribution I draw from, and I want to know how precisely I should localize this distribution. Since the risk function is convex here, if I make the distribution more peaked, what I gain on one side overcompensates what I lose on the other. So there is a clear tendency to make it narrower and narrower and to go for empirical risk minimization. If I select models according to this concept, I have a bias which comes from the way I formulated the selection concept.

That is why I think we should go back to information theory and look at what is done there. In information theory you would look at the posterior on the training data and the posterior on the test data; if they agree, then the communication was robust — if they agree, it is good. And you sum up all those solutions which have exactly this property. It is a kind of cross-correlation experiment which tells you what the right thing to do is.
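Back to the 60/40 coin for a moment — a numerical caricature of the typicality argument (the sequence length and the interval around 0.6 are arbitrary):

```python
# The 60/40 coin: the single most probable sequence (all heads) is atypical.
import numpy as np

n, p = 100, 0.6
rng = np.random.default_rng(2)

# Log-probability of the maximum likelihood sequence: all heads.
logp_ml = n * np.log(p)

# Log-probability of one typical sequence: about 60 heads, 40 tails.
logp_typical = n * (p * np.log(p) + (1 - p) * np.log(1 - p))  # = -n * H(p)

print(f"log P(all heads)       = {logp_ml:.1f}")   # ~ -51.1: most probable
print(f"log P(typical sequence) = {logp_typical:.1f}")  # ~ -67.3: less probable

# Yet "all heads" is essentially never observed: the typical head fractions
# dominate because there are exponentially many typical sequences.
frac = rng.binomial(n, p, size=100_000) / n
print("P(0.55 <= fraction <= 0.65) ~", np.mean((frac >= 0.55) & (frac <= 0.65)))
print("P(fraction == 1.0)          ~", np.mean(frac == 1.0))
# the last estimate prints 0.0; the true value is 0.6**100 ~ 6.5e-23
```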
Now you might say: okay, this is arbitrary — I believe in this, the rest of the community believes in that, why should we care? Well, first of all you get a nice maximum. Second, you can look into the sensitivity of these two criteria. Assume I can represent X', my training data, and X'', my validation data, by the average of the data plus deviations — I go into the center-of-mass coordinate system, if you like — and then I expand both measures for small perturbations, for small fluctuations. The validation error is linear in the fluctuations, and the agreement score is quadratic, very simply because the score is symmetric in the two components (sketched schematically at the end of this passage). So there is less sensitivity to the fluctuations in the posterior agreement — which you might call a formalization of cross-correlation — than in the validation error. That is another argument: error minimization is more sensitive than score maximization.

Computationally, we know that error minimization with convex functions is easier than score maximization, because these scores are non-convex. But they better reflect the situation in poorly modeled areas of science, where you have a chance of giving a local description but not a global one. When we do genome analysis, two very similar species can be compared nicely, and individuals of the same species even better; but if you compare my genome with that of a worm, it comes down to guessing what the differences mean. So we have very large uncertainties there. These uncertainties might be very unlikely under the inference, but they still leave their footprint in model selection.

In the usual maximum-entropy, physics-motivated approach, you have a risk function and you go to the Gibbs distribution as your posterior. You might make the temperature parameter time-dependent, as in simulated annealing; then you have this P_t(c | X), and you want to stop at the right time, when you have the optimal resolution. That, I think, is the challenge in this context. Robustness advocates the maximum-entropy approach, but there might be situations where algorithms do not follow the maximum-entropy trajectory and still have some advantages — and if those advantages give you more stability, go for them.

Here you see how this works for K-means clustering. I use K-means because it is easy to visualize; you see these pictures everywhere. The first split is into the three clusters which correspond to the sources, but I use five prototypes in my hypothesis class. So at a certain level you get these bifurcations up here, and these are primarily driven by random fluctuations — this is the overinterpretation due to the lack of stability. But it is not only about stability. For a number of years, almost a decade, I believed in stability, but it is only one side of the equation. The richness of the hypothesis class, and the versatility of your algorithm in exploiting this richness, is the other side: obviously, an algorithm which gives you a much more informative answer should be given more credit with respect to errors. So stability is relative to how much you want to say.
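The linear-versus-quadratic sensitivity claim, in a schematic reconstruction — the coordinates and normalizations here are mine, not a quote from the slides:

```latex
% Schematic reconstruction of the sensitivity expansion (notation mine).
% Center-of-mass coordinates for the two data sets:
X' = \bar{X} + \delta, \qquad X'' = \bar{X} - \delta .
% The validation error of training solutions is generically linear in the
% fluctuation:
R\bigl(c(X'), X''\bigr) \;=\; R\bigl(c(\bar{X}), \bar{X}\bigr) + O(\delta).
% An agreement score is symmetric under exchanging X' and X''
% (i.e. under \delta \to -\delta), so its first-order term must vanish:
S(X', X'') = S(X'', X') \;\;\Longrightarrow\;\;
S(X', X'') \;=\; S(\bar{X}, \bar{X}) + O(\delta^{2}).
```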
In the next two slides I sketch graphically how this information-theoretic approach works and how it follows the concept I am advocating. It uses cost functions only indirectly, via the Gibbs distribution, which converts the cost function — with its tremendous growth in values — into a score function that squeezes these large deviations from your target values into the interval between zero and epsilon.

I have a data space and a hypothesis class; I map my graph to labeled subsets of vertices. This is my probability distribution for the gene expression data, and I map it down to a distribution in the output space. The real question is that this distribution here is a function of the algorithm: how broad should it be? Should it be narrower? Should it be wider? Let us assume this is the right size. Now I need alternatives — you might do a permutation test, or something else. So assume we know a set of transformations which shift this distribution over here. You map this distribution to the output space with your algorithm, you get a representative here, you do this m times, and you get a covering. I leave open for the moment what these transformations are, but they are not supposed to touch my measurements; they are only supposed to change my representation of the output space — the hypothesis class — relative to the data space.

So this is reality: I have my distribution, but I can only sample from it, so this is an individual experiment, one graph. You map it to the output space; let us assume a deterministic algorithm, and this is the answer. You get two more samples, and this is what you get. I was arguing that returning only one sample is the wrong thing for such an algorithm to do, because this sample is a random variable; you want to capture the probability distribution somehow. So let us assume we have this probability distribution in the form of a Gibbs distribution — if you have a cost function, that is it. Again: what is the right width? And these widths correspond to different algorithms. Again we cover both input and output space, and now we need the test; that might be a permutation test, for instance.

So I get a second sample, the red sample here, from my distribution — that is the control experiment. I apply one of the transformations which I have pre-selected, a random one, and I map this sample, which was effectively drawn from this distribution down here, to the output space. The decision process is now: can you recover the transformation when you get fresh fluctuations X'' in the control experiment? The decoder sees only τ_s applied to X'', and since this transformation is unknown and you have fresh fluctuations, you now have a way of probing what is stable between X' and X'' — and that is exactly the signal. It is not the fluctuations.
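Here is a minimal instance of this decoding game; the softmax posterior, the permutations playing the role of the pre-selected transformations, and all the sizes are my own toy stand-ins, not the construction on the slides.

```python
# Toy decoding game: a sender applies an unknown transformation tau_s to
# fresh data X''; the decoder must recover tau_s by posterior agreement
# with the training data X'.
import numpy as np

rng = np.random.default_rng(3)
d, m, trials = 30, 40, 400        # answers, transformations, repetitions

def posterior(x, beta):
    """Gibbs/softmax posterior over the d possible answers; beta sets its width."""
    w = np.exp(beta * (x - x.max()))
    return w / w.sum()

for beta in (0.1, 3.0, 300.0):    # broad / moderate / very peaked posterior
    errors = 0
    for _ in range(trials):
        mu = rng.normal(size=d)                  # the stable signal
        x1 = mu + 0.5 * rng.normal(size=d)       # training data X'
        x2 = mu + 0.5 * rng.normal(size=d)       # fresh control data X''
        taus = [rng.permutation(d) for _ in range(m)]  # pre-selected transformations
        s = rng.integers(m)                      # sender applies tau_s
        p2 = posterior(x2[taus[s]], beta)        # decoder only sees tau_s(X'')
        # Decoder: pick the transformation whose training posterior agrees most.
        scores = [posterior(x1[t], beta) @ p2 for t in taus]
        errors += int(np.argmax(scores)) != s
    print(f"beta={beta:6.1f}: transformation error rate = {errors / trials:.2f}")
```

With a very peaked posterior the decoder starts confusing transformations — the output space is resolved more finely than the fluctuations allow, which is the overfitting regime; with broader posteriors decoding typically succeeds here, at the price of a lower resolution.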
If my transformations resolve the output space so finely — at such a fine resolution that you confuse the mappings due to these fluctuations — then you are overfitting. And that now gives me a quantitative criterion, which is the following: you take your output distribution on the validation data (that was the red distribution), you compare it with all the distributions generated by your different transformations of the distribution on the training data, and you choose the transformation which maximizes this quantity. If you choose your transformations randomly, as in Shannon's random codebook idea, you can now build an argument for deriving a quantity to use as a scoring function for different models. In formulas, this is what you basically have to evaluate. Second, whenever your estimator τ̂ differs from τ_s, you have recovered the wrong transformation, and the wrong transformation means you are in the overfitting regime; so you calculate the probability of error, and that error has to go to zero. Now I am completely in the Shannon picture of how to derive a criterion.

Here are the different steps — I do not want to go into them. The theorem which we put together says that the probability of an error is bounded by e^(−(I − log m)), where log m is the rate, and it has to go to zero. What I am after is not the communication protocol; what I am after is this quantity I. That is the one I have to maximize in order to compare different models: if an algorithm wants to improve itself, then after the change of algorithmic behavior the new I' should be higher than the old I. This is the score function. And the kernel k here — if you look at that function, it is the logarithm of the cardinality of the output space (I am talking about a discrete output space, about the labelings in graph coloring) times a quantity between zero and one. This correlation measures how much the resolution of the output space is actually reduced. And it does not matter in this picture how you generate this correlation: you can have an algorithm guided by a cost function or any other kind of algorithm, as long as you can monitor how the algorithm estimates its own uncertainty — and that is essentially the algorithm's posterior probability distribution for your solution c conditioned on the experiment, X' or X''. As long as you can calculate this, you are in business with this quantity.

I know it is difficult, but that is what you should expect from a data science algorithm: it should know what the answer is, it should know how certain it is about the answer, and it has to give you this uncertainty, because the answer is a random variable anyway. And maybe this is provocative, and I hope to stir a discussion, but I think this is what we have to deliver. Posteriors should agree — I told you that already — and the optimal posterior would be the argmax over the parameter T, where you look at the expectation over experiment pairs — training and validation data set pairs — of the logarithm of the cardinality of your solution space (the output space of the algorithm) times this kernel. The problem is that you cannot evaluate it. But now I can ask what the open challenge is: the open challenge is to find a set of algorithms which can be written as a sequence of posterior probability distributions up to a point T*, where T* is the probability distribution with the best possible resolution.
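Assembled from the description above — the constants, base of the logarithm, and normalizations are my reconstruction, not a quote from the slides — the criterion reads roughly as follows:

```latex
% Shannon-style error bound for the decoder at rate log m (my reconstruction):
P\bigl(\hat{\tau} \neq \tau_s\bigr) \;\lesssim\; e^{-(I - \log m)}
\;\longrightarrow\; 0 \qquad \text{whenever } \log m < I .
% The score: log-cardinality of the output space times a correlation
% kappa in [0,1] measuring the effective reduction of resolution,
I \;=\; \log\lvert\mathcal{C}\rvert \cdot \kappa ,
\qquad
\kappa \;\propto\; \sum_{c\in\mathcal{C}} p(c \mid X')\, p(c \mid X'') .
% Model/posterior selection over a family indexed by T (e.g. temperature):
T^{*} \;=\; \operatorname*{arg\,max}_{T}\;
\mathbb{E}_{(X',X'')}\!\bigl[\, \log\lvert\mathcal{C}\rvert \cdot \kappa_{T}(X',X'') \,\bigr].
```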
I took this long short-term memory picture up there because Schmidhuber claims it is the universal machine of learning algorithms. Okay — I showed you this before, and the problem is that we have to do something. I think we are now in the realm of standard statistical learning theory, where you say: this expected quantity which I am supposed to maximize, I have to lower-bound by a sample average minus a penalty. Where do we stand right now? I do not know what the penalty is, and we operate with L = 1 — that is the situation — and we get away with very good results, at least with results we could test, because the set of algorithms we are exploring right now is very limited: one parameter adjustment, or maybe a couple of parameters.

Let me show you what the kernel looks like for Gaussian posteriors. When they are very broad, these kernels have a very small value when you multiply the curves and integrate; here you have a maximum; and here you are overfitting and again get a small value. So you get a nice peak at the right resolution for your Gibbs distribution (a short numerical sketch follows at the end of this passage). In some sense — if you have read the book by Mézard and Montanari — one question left open there, when they talk about learning, is how you determine the temperature of your learning machine given the input fluctuations. And that has to depend on the algorithm.

Let me just show you a pipeline for which we used this. It is in computational neuroscience; it apparently has an impact on research and also on the treatment of neurodegenerative diseases — I am not a neurologist, but that is how it is used. You have fMRI data; you produce diffusion-weighted images, one to ten gigabytes; you construct the diffusion tensors per voxel; you use a tracking algorithm to go from voxel to voxel to see where the diffusion pathways are, and you identify these diffusion pathways with the white-matter connectivity. Then you try to find a clustering algorithm which uses this connectivity matrix and finds groups, and you map these groups back to the surface of the brain. This is the challenge. The pipeline — like my first pipeline on cancer detection — has dozens of algorithms, and some people swear by their algorithms as if they were religious items. The point is that these pipelines are hand-tuned by humans who have an intuitive understanding, but it is not quantitative, and in many applications it is way too complex to actually see what is going on. This has to be automated.

Let me briefly brush over it: the biology is mapped to a graph problem in this context. If you look at a very simple area down here, you see that the capacity first goes up, then goes down, and then it breaks off here. It corresponds to point A here, where you see only a little bit of separation; then you see a good separation; then you see the fluctuations, and the respective coloring down there tells you how the fluctuations kick through. You see it more clearly in this picture here — sorry, a problem starting the video; let us see if it is working... no, the video does not seem to run, so let me just explain. This is what happens at the optimal resolution, when the generalization capacity is maximal: you get this type of partitioning from the pipeline. You should see the dynamics here, but that does not seem to work right now.
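The Gaussian-kernel picture can be reproduced in a few lines; the normalization is my own, with the kernel taken as the overlap integral of two Gaussian posteriors centered on the two noisy estimates.

```python
# Kernel for two Gaussian posteriors p(c|X') = N(0, s^2), p(c|X'') = N(delta, s^2):
# overlap integral as a function of the width s. Too broad and too narrow
# both give small values; the maximum fixes the resolution.
import numpy as np

delta = 1.0                       # distance between the two noisy estimates
sigmas = np.linspace(0.05, 5.0, 200)

# closed form: int N(c; 0, s^2) N(c; delta, s^2) dc
kernel = np.exp(-delta**2 / (4 * sigmas**2)) / np.sqrt(4 * np.pi * sigmas**2)

best = sigmas[np.argmax(kernel)]
print(f"kernel peaks at sigma = {best:.2f}; "
      f"analytically delta/sqrt(2) = {delta / np.sqrt(2):.2f}")
```

Narrower than the peak, the exponential mismatch term kills the overlap (overfitting); broader, the 1/sigma normalization does (no resolution left).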
Okay. What I would like to emphasize: if you validate these types of information processing pipelines, there are two steps — the statistical validation and the scientific validation. Scientific validation means you have to make sure that the final value — which is up to human judgment, because a human is usually the user of these information processing results — is scientifically validated. But there is no point in scientifically validating a result which is not statistically solid. I think once we have the statistical validation, we can autonomously improve these algorithms, and then we go in a direction which actually allows us to control these information processing pipelines. We might not understand the true model represented by these algorithms, because it is too complex, but we can control its reliability — and reliability comes from repeated experiments.

How much time do I have? Fifty minutes? Okay. So let me come to a problem which puzzles me a lot, because it relates statistics and computational complexity: sparse minimum bisection. We are back in the camp of optimizing cost functions. What is sparse minimum bisection? This is work with Alex Gronskiy, Wojciech Szpankowski, and master students. You have this graph; you select a subset of the vertices, and I ask you to find a bisection of that subset. Contrary to the picture here, imagine that almost all the vertices are gray and only a few are yellow — I am looking for a sparse bisection: find the minimum bisection over subsets U. This is the bisection here, and for the scaling we let the cardinality of these subsets, the blue one and the red one, scale as n^(2/7) — that exponent comes out of the proof technique. We conjecture that this is an NP-hard problem, but it is more than NP-hard: it is also a sparse problem. You first have to find the two subsets and then you calculate the bisection on them, so the optimization problem is mixed with a detection problem (a brute-force toy of exactly this two-layer structure follows below).

Now you want to compare this with something, and the model I compare it with is the random energy model. The random energy model is, for the purpose of search, very uninteresting, because by construction it does not allow any search: you choose the energies of 2^n states at random, so you might have looked at 2^n − 1 states and you still do not know where the global minimum is — it could be the last state, and you have no indication from the states you saw before, because the last one is statistically independent of all the others. So much for the REM. We assume sparsity, and there are two theorems: one says that SMBP is upper-bounded by the REM at rescaled temperatures — to be precise, the free energy of SMBP is upper-bounded by the free energy of the REM — and the other that it is also lower-bounded by the free energy of the REM.

What confuses me is the following. The free energy is a moment-generating function for the probability distribution. Where should an algorithm find the evidence to search in the sparse minimum bisection problem, when its free energy is asymptotically like the REM free energy — and yet in the sparse problem you can search? That is the issue: there is no gap. If you can search in sparse minimum bisection but not in the REM, then some property which leaves no footprint in the free energy must be helping you, and I find it hard to believe that such a property exists. So at least this is a direction in which to search, and to use information-theoretic ideas to make progress.
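To pin down the object just defined, here is a brute-force toy of the two entangled combinatorial layers — choose the subset, then bisect it. Sizes and the edge probability are illustrative; the enumeration is exponential, which is the point.

```python
# Brute-force toy of sparse minimum bisection: pick a subset U of 2k vertices
# (detection), then split U into two halves minimizing the cut (optimization).
import itertools
import numpy as np

rng = np.random.default_rng(4)
n, k = 10, 2                      # n vertices; bisect a subset U of size 2k
A = rng.random((n, n)) < 0.4      # Erdos-Renyi adjacency
A = np.triu(A, 1)
A = (A | A.T).astype(int)

best_cut, best_parts = np.inf, None
for U in itertools.combinations(range(n), 2 * k):     # detection layer
    for left in itertools.combinations(U, k):         # optimization layer
        right = tuple(v for v in U if v not in left)  # (each split seen twice;
        cut = sum(A[i, j] for i in left for j in right)  # harmless in a toy)
        if cut < best_cut:
            best_cut, best_parts = cut, (left, right)
print("sparse minimum bisection:", best_parts, "cut =", best_cut)
```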
Clearly this relates to what Afonso was telling us on Monday; it has to do with hypothesis testing, where, due to the sparsity of the problems, many solutions are statistically independent of other solutions because they do not share any input parameter defining the costs.

Q: Sorry — I can in principle get a model with the same free energy as the REM, but where the assignment of states to energies allows me to find a ground state very easily. So from the fact that you bound the free energy of this problem by that of the REM — I am not sure that you have matching bounds.

A: Matching lower bounds, yes.

Q: But my question is whether the free energy tells you something about search or not. I agree that in the REM you cannot search, but I do not think the answer is so simple.

A: The free energy is basically a moment-generating function for the probability distribution, and clearly the search is characterized by the posterior probability distribution; but the free energy is the stochastic picture of a partial order of the states over the full configuration space.

Q: When you have what we call the dynamical phase transition — a transition from a replica-symmetric phase to a phase where the phase space shatters into an exponential number of states — that makes a big difference for the dynamics, but the free energy just changes by epsilon. You go epsilon below, epsilon above: the free energy changes by epsilon, and the configuration space is completely changed. So I do not see how you can relate the free energy value so tightly to the energy landscape, which in turn strongly conditions the search algorithm.

A: Okay — the theorem says that in the asymptotic limit it is the same. But I wanted to start a discussion.

Q: Then you achieved your goal.

A: I left physics some time ago, so it is clear that people here may know more about this.

What you can do for the sparse minimum bisection problem is the following: you choose a random instance x, you add noise to it and get x', and you add different random noise to the same instance and get x''. Now you have the phase transitions: in the high-temperature phase you see nothing — that is the paramagnetic phase; in the middle phase you see what is common between x' and x''; and in the very-low-temperature phase you basically see the noise realization, the differences between x' and x''. This is the picture: depending on the signal-to-noise ratio you get these behaviors for the different values of this gamma, where gamma relates the variance of x and sigma-tilde here. We also applied this to the community detection problem, because the minimum bisection problem has too complicated a combinatorics, and we hope to make progress on this problem in the near future, because it defines information radii in the solution space around a potentially planted solution. In some sense, what I just showed you is: the planted solution is the signal, and the noise you add to it — giving you the training and validation data sets x' and x'' — is the perturbation which is unreliable from experiment to experiment. Okay — I seem to be getting some more information here, so maybe some of these questions are already answered. It might not be such a miracle, but I think along these lines one can make progress in understanding search complexity in these highly randomized spaces with very localized planted solutions.
Now let me get back to the overview picture. We believe that learning machines achieve this performance because they imitate humans. These are the pictures of the deep networks of the 1990s from Yann LeCun, and this is how it looks today; I do not have to show you how powerful they are. You see it here: if you look at this ICLR paper and you want to hallucinate your way from this rooster to a wine glass and back to the animal, this is what you get in terms of trajectories. The mimicking technology is enormously powerful, and it also works on flying objects.

But what is missing — and this is now philosophical — is the scientific method. You see, a scientist does not need a guide dog to lead the way through research. We do not actually need colleagues; they are helpful, but you can work on your own. The scientific method basically tells you: you ask questions, you propose hypotheses, you conduct experiments, you analyze the results, you go back; at a certain point you define a theory, and you go back again. That is the Wikipedia version of the scientific method — and today we have algorithms at every one of these edges. If we do not learn how to validate these algorithms, we will not be in the business of autonomously generating knowledge. And this is the dream of artificial intelligence: that due to the repeatability of observations and conclusions about the real world, about experiments, we can build our internal hypothesis classes. I think this is the open challenge we face, and if we talk about the science of data science, then clearly this should be answered in some way or another. I am not claiming that the proposal I showed you is the ultimate answer, but it has to go along these lines. The only theory I know which actually focuses on localizing solutions in an output space is information theory, because when you can reliably localize, you have codes available, and you can use these codes for communication. The communication metaphor is just an abstraction for making a decision process rational. I agree with Tali that it should be rate-distortion theory, and Helmut has pointed out that there is a lot of work by Kolmogorov and others which can be used. But this is the challenge. And for the statistical learning theory community: all the hypothesis classes we are talking about in combinatorial optimization have infinite VC dimension, so you cannot hope that in the asymptotic limit an individual solution is the right answer — you will always end up with a distribution.

Why is this necessary? This is now a more political slide, which I find very nice; I got it from Alessandro Curioni of IBM Research in Rüschlikon, near Zurich. This is what Americans were dying of in 2016; this is what Americans actually searched for on Google — you see which of the columns gained the most attention and which lost; and this is what the newspapers report on. Everybody knows that burger eating is not good, so that went down; cancer is gaining in importance; what gained dramatically in importance is terrorism, while the actual danger was around 0.01%. That is how good humans are at estimating risks. I think we need technology in such a complicated world to help us, because it is just too difficult, and we cannot afford mistakes on very important questions these days. That is why I believe this type of research is absolutely fundamental for the future of humanity.
So, what is the outlook? Algorithms are models of posteriors, and they localize in the solution space. I hope I gave you some indication of why concentrating entirely on cost functions is a dangerous thing. We have a case study: when you minimize Hamming distance, as you do in coding, then the concept I showed you — finding the right width of your probability distribution for the Hamming cost — gives you the channel capacity of the binary symmetric channel, while the out-of-sample risk minimization setting gives you the wrong capacity, a higher value. So you see already in this very simple case that that concept does not get close to channel capacity. Learning requires the validation of algorithms; I believe this is the ultimate challenge, because we will have algorithms which invent algorithms, and understanding those algorithms to the degree that the GDPR — the European data protection law — demands is, I think, already outdated: we will never understand these algorithms if they are sufficiently complex problem solvers. So I believe the optimal resolution in the hypothesis class is something we should determine for algorithms. And I am a strong believer that the scientific object to be investigated is the pair of an input distribution and an algorithm. This pair is the scientific object — not the algorithm in isolation. It also gives you something like structure-specific information, because I only measure the input bits which matter for the output; and since the input space is much, much larger than the output space in most applications I know, you get something like a context-sensitive information measure, where the context is defined by your hypothesis classes. I hope that one day it relates to statistical complexity — I still have to sort out this puzzle with the free energies. With that, I thank you for your attention. Thank you very much.