Thank you very much, it is a pleasure to be here, and let me thank the organizers. This is joint work with several people, whom I will mention as we go along. The question I want to start from is why so many of the systems we look at, from biology to economics, display broad, power-law-like distributions without anyone tuning a parameter. The claim I want to make is that samples which encode the maximal amount of information about the process that generated them look exactly like this: they show broad frequency distributions, and this gives us a precise, model-free sense in which we can talk about statistical criticality. From this point of view there are two further ideas I want to explore. The first is minimum description length, which tells you what an optimally compressed code for a data set looks like, and the second is deep learning, which is a scheme that is supposed to extract efficient representations of the data. Before that, let me give the motivation this work actually started from. We were looking at proteins, molecules that perform some function in an organism. The function is encoded in the sequence, and the sequence accumulates mutations, so if you align many sequences from the same protein family the question is: which positions are relevant for the function? The first point, and I will come back to it, is that relevance should be something intrinsic: we would like a notion of relevance that does not require knowing in advance what the variable is relevant for. The second point is an empirical observation. In many of the systems where we came into contact with data, the variables that are intrinsically relevant to the system, in biology, in ecology, in finance, display very broad distributions, without any parameter being tuned. For macroscopic systems in physics, instead, the typical behaviour dominates and distributions are sharply peaked; broad distributions in physics are special, they are what you see at a critical point.
This is something a physicist finds surprising, because in physics a critical point requires fine tuning of parameters, so the question is why so many systems outside physics seem to sit at what looks like statistical criticality without anybody tuning anything. Let me make the distinction I have in mind concrete with a simple example. If you look at the distribution of people that live in cities, this distribution is broad. But to say where a person lives you could also give me the zip code. The zip code is much more precise information: if you want to locate a person, the zip code does a better job. Yet the distribution of population across zip codes is probably not very broad, because zip codes are designed, say so that the postal service works efficiently, and a very uneven load would not be efficient. So you see a difference between variables which could be chosen by an engineer, which tend to be distributed rather evenly, and variables that are intrinsically relevant to the system, which show these broad distributions. This is more or less the theme that I want to discuss. By the way, if you wonder where I live, it is in a place which has a very nice zip code. OK, so the basic idea is very simple. You have a data set, a sample of N points, and there is no special meaning in the order of the samples. You think of each point in your data set as being generated independently from some unknown distribution, and if you want to understand how much information your data set contains, there is a precise way to quantify it: the number of bits you need to encode the sample, which is essentially its entropy. Some of these bits are just noise, and some of them instead contain information about the underlying distribution, and you would like to find out what is an upper bound to the number of bits that can tell you something about the generative model. The way to get at this is to look at the frequencies: you count how many times each outcome occurs in the sample, some outcomes occur once, some twice, three, four, six times, et cetera, and then you ask how many distinct outcomes are observed exactly k times. The entropy of this frequency distribution is what I will call the relevance, and the point is that it is an intrinsic notion: it does not depend on how you label the states, and it bounds the information that the sample carries about the generative model.
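Just to fix notation for what follows, here is one standard way to write the two quantities described above (my own summary, not a slide from the talk): k_s is the number of times state s appears in a sample of size N, and m_k is the number of distinct states that appear exactly k times.

```latex
\[
  \hat{H}[s] \;=\; -\sum_{s} \frac{k_s}{N}\,\log\frac{k_s}{N}
  \qquad \text{(resolution: bits needed to specify a sampled state)}
\]
\[
  \hat{H}[k] \;=\; -\sum_{k} \frac{k\,m_k}{N}\,\log\frac{k\,m_k}{N}
  \qquad \text{(relevance: entropy of the frequency distribution)}
\]
```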
Let me show you what this looks like in a concrete example. Take the roughly 4,000 stocks traded in New York and cluster them. If you stop the clustering at some level, then k_s is essentially the number of firms that end up in cluster s, and from these frequencies you get a certain value of this H of k. Now these curves here give you a tradeoff between resolution and relevance, because along each of these curves what I am changing is the resolution: as I merge states I lower the resolution, and the relevance first increases and then decreases. At one extreme, with no compression at all, every point is distinct, the resolution is maximal and the frequencies carry no information; at the other extreme, when you compress everything into a single state, you have thrown the information away. In between there are samples that are maximally informative at a given resolution, and these maximally informative samples are the ones that look critical, statistically critical: their frequency distributions are power laws, whose exponent is controlled by this exponent mu, the slope of the curve, and the whole construction can be worked out analytically. And there is a special point, where mu is equal to 1, where the slope is equal to 1, where essentially beyond this point your compression changes character: when you compress up to this point, you don't lose information, you actually gain information, and beyond this point you lose information. This point here actually happens to coincide precisely with Zipf's law, which is characterized by a frequency distribution in which the number of states with a given frequency is proportional to the frequency to the minus 2. In some sense this tells us that Zipf's law, in this picture, corresponds to optimal compression, or the limit of lossless compression. OK, so, if you have questions on this first part, maybe ask them now, otherwise let me go on. And of course, if this is true, then you expect to find these features in systems that are expected to code efficiently for some function, and you should also be able to use this principle in order to find relevant variables. OK, so these are the two things that I want to discuss.
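As a concrete illustration of the bookkeeping, here is a minimal sketch (my own toy code, not from the talk) that computes the resolution and the relevance from a vector of state labels, for instance the cluster labels assigned to firms:

```python
import numpy as np
from collections import Counter

def resolution_and_relevance(labels):
    """Compute (H[s], H[k]) in bits for a sample of discrete state labels.

    k_s  = number of observations falling in state s
    m_k  = number of distinct states observed exactly k times
    H[s] = -sum_s (k_s/N) log2(k_s/N)       (resolution)
    H[k] = -sum_k (k*m_k/N) log2(k*m_k/N)   (relevance)
    """
    counts = np.array(list(Counter(labels).values()), dtype=float)   # k_s
    N = counts.sum()
    p_s = counts / N
    H_s = -np.sum(p_s * np.log2(p_s))

    freq_of_freq = Counter(counts.astype(int))                       # k -> m_k
    q_k = np.array([k * m / N for k, m in freq_of_freq.items()])
    H_k = -np.sum(q_k * np.log2(q_k))
    return H_s, H_k

# Toy usage: a made-up clustering of 4000 "firms" into 50 clusters
rng = np.random.default_rng(0)
labels = rng.integers(0, 50, size=4000)
print(resolution_and_relevance(labels))
```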
So, first of all, you find this Zipf's law in many systems that are supposed to provide an efficient representation of something: first of all in language; then in the immune system, which is supposed to give an efficient representation of the space of bad guys out in the environment, for this particular fish; and also in the firing patterns of the retina, a system which is supposed to provide an efficient representation of visual stimuli. I have to add this because these are figures I took from the papers of Leonardo Dima. And you find this broad distribution in the Google matrix, suggesting that PageRank is actually extracting an efficient representation from different systems. But there are a couple of schemes that are really designed to extract efficient representations, so I want to discuss them in a little bit of detail. One of them is minimum description length, and I'll try to give you a very short tutorial of it.

So imagine the following problem. There is one person, Alice, who is doing an experiment, and this experiment is testing a theory, say the standard model if you are at the LHC, or whatever. This theory depends on a number of parameters, and because the experiment is meant to really probe this theory and to estimate these parameters, the parameters are not known in advance. After Alice has run the experiment, she wants to send the data to Bob, the theorist, who is going to analyze it. The question is that Bob has to decide how many bits he has to set aside on his computer in order to store this data set. This would be an easy question if he knew the parameters, because then he would know the distribution, and if you know the distribution, minus log base 2 of the probability of the sample gives you the number of bits that you need. But how do you set aside this space before knowing theta? The way in which minimum description length attacks this problem is by a minimax strategy, where essentially you first compute the regret: imagine that Bob decides to use some coding distribution, some probability, to code the samples; minus the log of that probability is the number of bits he will need, and his regret is the difference between the number of bits that he will need and the optimal number of bits that he would actually need if he knew the parameters, that is, with the maximum-likelihood estimates of the parameters. So the idea of minimum description length is that Bob assumes Alice is sending him the worst possible sample for his choice of coding distribution, and he minimizes his regret on that worst case: find the optimal P for the worst possible sample. This gives the solution, and the solution has an explicit form, which is called the normalized maximum likelihood. OK? Now, in this normalized maximum likelihood the regret, which is the log of this normalization (sorry, there is no minus sign, there is a plus sign), is a measure of the complexity of the model, and in this theory, which goes back to Rissanen, the penalty is exactly the same as the one you get in the Bayesian information criterion, plus a term that depends on the susceptibility matrix, the Fisher information. OK? So now the question I am going to ask is: let's take this as a generative model for samples, let's draw samples at random from this probability distribution assuming different models, and let's see what the samples look like.
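For completeness, here is the normalized maximum likelihood distribution and the usual asymptotic form of its complexity penalty, written out in standard notation (my reconstruction of the formulas referred to above, not the speaker's slides; here \(\hat\theta(x)\) is the maximum-likelihood estimate for sample \(x\), \(N\) the sample size, \(d\) the number of parameters, and \(J(\theta)\) the Fisher information):

```latex
\[
  P_{\mathrm{NML}}(x) \;=\; \frac{P\bigl(x \mid \hat\theta(x)\bigr)}
        {\sum_{x'} P\bigl(x' \mid \hat\theta(x')\bigr)},
  \qquad
  \text{regret} \;=\; \log \sum_{x'} P\bigl(x' \mid \hat\theta(x')\bigr),
\]
\[
  \text{regret} \;\simeq\; \frac{d}{2}\,\log\frac{N}{2\pi}
  \;+\; \log \int \! d\theta \,\sqrt{\det J(\theta)}
  \qquad (N \to \infty),
\]
% i.e. the BIC-like (d/2) log N penalty plus a term involving the Fisher information.
```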
So, if you do this, what you find is that for different models, the Dirichlet model, which is the simplest possible model you can think of, where every outcome just has a given probability, or a simple paramagnet with different numbers of spins, or an Ising model, or an SK model, or a restricted Boltzmann machine, in all these cases the value of this relevance that you get is very close to the maximum, and it is very far from what you would get in random samples. OK? This is true for all these results, and actually, if you look at the distribution of the frequencies in the samples, you find these power laws at different points of these curves. What we vary along the curves is, in this case, the number of states relative to the number of samples; in this case, the number of spins; and in this case, the number of spins in the visible layer, with the number of spins in the hidden layer kept fixed. OK, so the first conclusion is that samples generated by this scheme, which is supposed to extract an optimally efficient representation, actually exhibit these features.

The second thing we can do is to study large deviations of normalized maximum likelihood samples, and ask what samples which would require a little bit more, or a little bit less, bits to be coded look like. You do the usual thing: if you want to know the probability of getting a sample with a certain coding cost E, then essentially you have to build the partition function, the generating function, and then do the Legendre transform, because this is standard large deviation theory. Now, if you do this, what you realize is that as soon as this beta is negative, that is, as soon as you look at samples that would require fewer bits, you find a very sharp phase transition where essentially the samples localize on one single outcome. Here we measure the value of this H of s for samples drawn from this distribution, for distributions like this one, and also the frequency of the most frequent state, and you see that when beta is negative essentially you see only one state in your samples. So you have a sharp localization transition. What this means is exactly that there are no other codes, no other distributions, that would provide a better coding, which is what you would expect, because this is an efficient code, OK. But from our point of view this is interesting because it provides a very precise sense of what criticality means here, because essentially we identify what is the main parameter that you have to change in order to sweep across the phase transition, and so I think this is interesting.
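The construction sketched here can be written compactly as a tilted ensemble; this is my paraphrase of standard large-deviation bookkeeping rather than the speaker's exact formulas, with E(x) the coding cost under the normalized maximum likelihood code:

```latex
\[
  E(x) \;=\; -\log P_{\mathrm{NML}}(x), \qquad
  Z(\beta) \;=\; \sum_{x} P_{\mathrm{NML}}(x)\, e^{-\beta E(x)}, \qquad
  P_\beta(x) \;=\; \frac{P_{\mathrm{NML}}(x)\, e^{-\beta E(x)}}{Z(\beta)},
\]
\[
  \Pr\bigl[E(x) \approx E\bigr] \;\asymp\; e^{-I(E)}, \qquad
  I(E) \;=\; \sup_{\beta}\,\bigl[-\beta E \;-\; \log Z(\beta)\bigr].
\]
% Tilting beta away from zero biases the ensemble toward samples with atypically low or
% atypically high coding cost; the localization transition discussed above appears as a
% singularity of log Z(beta) on the low-cost side.
```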
So the second example is deep learning, and what we did, with a Korean student and his supervisor, was to study different architectures of restricted Boltzmann machines, or convolutional networks, where essentially we trained these architectures on different datasets. This is the example of the MNIST dataset, which is a dataset of handwritten digits. The idea is that the architecture has a visible layer where you present the different inputs, then there are connections between each layer and the next one, and you learn these connections in such a way as to maximize the likelihood of the sample. OK, so what happens in this particular case is that when you present one particular sample, one particular data point, the distribution of the spins, of the variables, in the hidden layers is very polarized, so that essentially you can map input data one-to-one to states in the hidden layer. So for each of the hidden layers you can compute the entropy of the states and the entropy of the frequencies, the resolution and the relevance, and this is what you get: in this deep network of restricted Boltzmann machines the different layers all have representations that are very close to the maximal possible value of the relevance, and if you look at the distribution of the frequencies in these layers, you find these power laws. This is very different from what you get if you build another representation, for example by clustering the inputs with k-means (these are the yellow triangles), and it is also not what you get before you train the network: before training, you either get most of the layers at a very high resolution or at a very low resolution. The same is true if you learn on reshuffled data or on completely random data, which means that the representations extracted here really are a hierarchical set of representations, which you can think of as actually being present in the data, and what the DBM is doing is precisely to extract them. There is one layer here which is very close to having a Zipf distribution, with mu very close to one, and this is actually the layer for which the classification error is still small, where essentially the compression is at the optimal point: you have a compressed representation, but you are still not making too many classification mistakes. And you can also see that this particular layer has the best generalization ability, or rather generation ability, because you can start from the equilibrium distribution at a particular layer, project back to the visible layer, and see what distribution of digits you get. The sixth layer produces a distribution which is very close to the distribution on which the system has been trained, whereas layers that are, say, too shallow or too deep produce either very stereotyped figures or very noisy figures.
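A rough sketch of the bookkeeping for one layer, just to make the mapping concrete (this is not the authors' code; I assume a trained weight matrix W and bias b are available, and I simply threshold the hidden activations, which stands in for the "very polarized" hidden distributions mentioned above):

```python
import numpy as np
from collections import Counter

def layer_states(X, W, b):
    """Map each input row of X to a binary hidden state of one layer (deterministic thresholding)."""
    H = (X @ W + b) > 0                      # 0/1 hidden units
    return [tuple(row) for row in H.astype(int)]

def resolution_and_relevance(states):
    """H[s] and H[k] in bits for a list of hashable states (same formulas as in the earlier sketch)."""
    k_s = np.array(list(Counter(states).values()), float)
    N = k_s.sum()
    H_s = -np.sum(k_s / N * np.log2(k_s / N))
    m_k = Counter(k_s.astype(int))
    q = np.array([k * m / N for k, m in m_k.items()])
    H_k = -np.sum(q * np.log2(q))
    return H_s, H_k

# Toy usage with random weights standing in for a trained layer:
rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(1000, 784))     # e.g. binarized MNIST-like inputs
W = rng.normal(size=(784, 20))
b = rng.normal(size=20)
print(resolution_and_relevance(layer_states(X, W, b)))
```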
OK, so let me come to applications to real systems. Another system where you should expect efficient representations is the brain. This particular work was done in collaboration with Yasser Roudi, and mostly done by Ryan Cubero, a PhD student that we share with Yasser. Yasser is in the institute where grid cells were discovered. Grid cells are particular neurons, in the part of the brain responsible for spatial cognition, and if you record the firing pattern of one of these neurons while the rat is moving, what you find is that these grid cells fire at the nodes of a hexagonal lattice, and different grid cells correspond to different orientations of this hexagonal lattice and to different lattice spacings. But when you record, there are also neurons that do not correspond to anything. So the issue is that when you do the experiment, in general you do not know in advance what the neurons are coding for; in particular, in this data set, you find that some of the neurons are also coding for the direction of movement of the animal, and some are not. So the question is: how do you find the relevant neurons when you don't even know what the covariate is? The idea that Ryan came up with is to divide the time series into intervals of size dt, to consider each of these intervals as a different state s, and to count how many times the neuron fires in that particular state. From this you can compute a frequency distribution, you can compute the resolution, and you can compute the relevance, and you see that as you change dt you change the resolution, and the relevance tells you how variable the dynamical state is at that particular resolution. This is the example for two particular neurons: if you take a neuron like this one, you get a curve which is this green one, and if you take a neuron like this one, which is an interneuron, you get a curve which is like this one, and you trace these curves by varying dt; this end is very small dt, and this end is very large dt. So the point is that if you want a measure of how dynamically relevant a neuron is across scales, what you can do is essentially compute the area under this curve, and this is what we call the multi-scale relevance. This is the result of the calculation: the multi-scale relevance is on the x-axis, compared with the mutual information with the position of the rat and with the direction of movement, the angle of the direction of movement. As you see, neurons that for us have low multi-scale relevance do not contain any spatial information, any information either on position or on direction of movement, whereas those that have high multi-scale relevance contain information on the direction of movement and/or spatial information. We found the same thing in a large data set of neurons recorded in a particular part of the brain called the medial entorhinal cortex, in two different regions. If you take the top neurons according to our multi-scale relevance and try to decode the position or the direction of movement, you find that these neurons do as well as the neurons which give you the maximal amount of information on the spatial position. The only difference is that in order to find the top relevant neurons, in our case, we don't need to know the position, we only need the time series of the spikes, whereas in order to find the most spatially informative neurons you need to compute the mutual information with space, and hence know the position. And the same holds for head direction.
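A minimal sketch of this construction for a single spike train (my own toy version, not the published code): for each bin size dt the time bins play the role of states, the spike count in a bin is that state's frequency, and the multi-scale relevance is approximated here by the raw area under the relevance-versus-resolution curve traced out as dt varies (the published definition may normalize the axes differently).

```python
import numpy as np
from collections import Counter

def res_rel_from_counts(k_s, N):
    """Resolution H[s] and relevance H[k] (bits) from per-bin spike counts k_s; N = total spikes."""
    k_s = k_s[k_s > 0]
    p = k_s / N
    H_s = -np.sum(p * np.log2(p))
    m_k = Counter(k_s.astype(int))
    q = np.array([k * m / N for k, m in m_k.items()])
    H_k = -np.sum(q * np.log2(q))
    return H_s, H_k

def multiscale_relevance(spike_times, T, dts):
    """Trace the (resolution, relevance) curve over bin sizes dt and return the area under it."""
    N = len(spike_times)
    pts = []
    for dt in dts:
        edges = np.arange(0.0, T + dt, dt)
        k_s, _ = np.histogram(spike_times, bins=edges)
        pts.append(res_rel_from_counts(k_s.astype(float), N))
    pts = sorted(pts)                                   # order by increasing resolution
    xs, ys = map(np.array, zip(*pts))
    return float(np.sum(0.5 * (ys[1:] + ys[:-1]) * (xs[1:] - xs[:-1])))   # trapezoid rule

# Toy usage: a Poisson-like spike train over 600 s, scanned over a range of bin sizes
rng = np.random.default_rng(2)
spikes = np.sort(rng.uniform(0, 600, size=2000))
print(multiscale_relevance(spikes, T=600.0, dts=np.logspace(-2, 2, 30)))
```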
The last example I want to talk about is using this idea to find relevant positions in proteins, which is the problem I started with. The idea is very simple: for any particular subset of positions that you choose, you can count how many times you see each particular sub-sequence, you can compute the relevance and the resolution, and you can try to find the subset of positions which carries the maximal amount of information, which is most relevant. These are little examples: if this is your system and you want to choose 6 out of these 12 variables, then if you choose this subset you get a certain frequency distribution that corresponds to a certain point in this diagram, and if you choose this other one you get a different distribution. The idea is that we just do gradient ascent over subsets of n positions, trying to maximize the relevance, and we do this many times, let's say 100 times; then we count how many times a certain position appears in the maximizer of this function, and this gives us a relevance score for that position. What we find is that this method, at least in the examples we looked at, divides sharply between relevant and irrelevant positions. The relevant positions contain highly conserved sites: this is the site entropy, the entropy of the amino acid at a given position, and these are the measures of relevance, so the most relevant positions are very conserved; but we also find highly variable positions, and these green ones, which are also marked here, are the biologically relevant ones. At the top here we also show the most relevant positions, and they actually contain the most biologically relevant sites. So it is a method which is very different from correlation-based methods and gives a very different result, and it is not completely hard to see why: with a correlation-based method, like singular value decomposition, what you try to explain is the variation of the sample, whereas here we try to capture correlations in conservation, so we get very different results. But we can also capture what is similar across different subfamilies of proteins. An interesting point is that this technique is able to capture information which goes beyond single-site and pairwise correlations, because you can run the same method on the original data and on data where you reshuffle the alignment while keeping single-site frequencies and pairwise correlations, and you see that you get a very different result.
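To make the search concrete, here is a toy sketch of the kind of procedure described above (my own simplified version, not the published algorithm; a greedy local search stands in for the "gradient ascent" mentioned): repeatedly pick a random subset of n alignment columns, hill-climb by swapping columns whenever that increases the relevance of the induced sub-sequence frequencies, and score each column by how often it ends up in the optimized subset.

```python
import numpy as np
from collections import Counter

def relevance(columns, msa):
    """Relevance H[k] (bits) of the sub-sequences obtained by restricting the alignment to `columns`."""
    subseqs = [tuple(row) for row in msa[:, sorted(columns)]]
    k_s = np.array(list(Counter(subseqs).values()), float)
    N = k_s.sum()
    m_k = Counter(k_s.astype(int))
    q = np.array([k * m / N for k, m in m_k.items()])
    return -np.sum(q * np.log2(q))

def score_positions(msa, n=6, restarts=100, rng=None):
    """Count how often each column appears in a locally optimal, maximally relevant subset of size n."""
    if rng is None:
        rng = np.random.default_rng()
    L = msa.shape[1]
    score = np.zeros(L)
    for _ in range(restarts):
        subset = set(rng.choice(L, size=n, replace=False))
        improved = True
        while improved:                                   # first-improvement hill climbing on H[k]
            improved = False
            best = relevance(subset, msa)
            for i in list(subset):
                for j in range(L):
                    if j in subset:
                        continue
                    cand = (subset - {i}) | {j}
                    if relevance(cand, msa) > best:
                        subset, improved = cand, True
                        break
                if improved:
                    break
        for i in subset:
            score[i] += 1
    return score / restarts

# Toy usage: a random "alignment" of 300 sequences over 12 positions with a 4-letter alphabet
rng = np.random.default_rng(3)
msa = rng.integers(0, 4, size=(300, 12))
print(score_positions(msa, n=6, restarts=10, rng=rng))
```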
OK, so this is my conclusion. What I have tried to convince you of is that this very simple measure, the entropy of the frequencies, is a model-free measure of intrinsic relevance that tells you an upper bound to the amount of information your sample contains about the generative model; that samples for which this relevance is maximal at a given resolution have these broad, power-law frequency distributions; and that these distributions show up in normalized maximum likelihood, which is the distribution predicted by minimum description length. This is interesting because it gives us a precise sense in which we can define statistical criticality. These power-law distributions also come out in deep learning, and one issue is whether this could be thought of as a design principle for these architectures, but this goes a little bit beyond our understanding of deep learning. Then there are a number of applications to real data that we are working on, including in neuroscience and in sequences, and there are a number of other extensions that we are also working on, in order to have better, simpler methods to do inference on high-dimensional data. So, thank you very much.

So, for deep learning, can you propose some more efficient structures for the network?

So, we didn't study this in great detail, actually. We only did some tests, but the idea is that, depending on the number of layers that you choose, you can get a finer set of representations on this diagram or a coarser one, and whether this can be used as a way to design these networks is probably an interesting issue, but we haven't gone that far at the moment.