Hello everyone, and thank you, Jacopo, for taking care of the organization. Today we have Tianqi Hou, a student who has just arrived in the QLS group and will be among us for at least a year. Tianqi is a statistical physicist specializing in the statistical physics of learning. In particular, he has studied restricted Boltzmann machines a lot, which I think will be the subject of this talk. So Tianqi, welcome, and thank you very much.

Yes, thank you very much, and thank you for giving me this opportunity to share some of my research. Now I will share my screen. Everyone can see my screen, right?

Yes, we can see the slides, but the voice is a bit unclear; there is some echo in the room, so maybe you could open the window.

Okay. So, this seminar is about the physics of unsupervised learning in neural networks. I'm Tianqi, from the Hong Kong University of Science and Technology, and I am a physicist. I want to share some of my findings about a physics understanding of unsupervised learning.

Let me give some introduction to what we want to research. We all know that AI and deep learning took a lot of inspiration from neuroscience, and now deep learning can also shed light back on neuroscience, so I want to share some views on how these two fields interact. Convolutional neural networks took much of their inspiration from the human vision system, but now we find that deep neural networks are in turn a very good framework for modeling the vision system and for understanding information processing in our brain. For example, in deep learning a convolutional neural network has a hierarchical architecture for processing an image, and in our brain it is similar: from the retina to the LGN, then to V1, V2 and other brain areas.

When we talk about machine learning or deep learning, I think there are three aspects to focus on: the expressive power of the model, the generalization power of the model, and, most importantly, whether we can put forward an algorithm to train it. But underneath these three aspects there is a very basic question: how much data do we need? Is there a simple, toy model that gives an analytical result for the minimal data size? If we knew that, we could say how trainable and how expressive a model is given how much data we provide. Now, there are many different kinds of learning, and here I focus on unsupervised learning.
For unsupervised learning, I think the definition is easy: we only have the raw data and we want to search for the hidden features in the data, but we have no labels. There is a very famous picture by Yann LeCun, like the one I show: reinforcement learning is just the cherry on the cake, supervised learning is the icing, but unsupervised learning is the cake itself; it is fundamental for understanding human intelligence. There is another quote here making a similar point: for a human being, say a baby coming into the world, facing lots of data without labels is the first thing we do. So if we understand unsupervised learning, we can get a deeper understanding of the other kinds of learning. Our question is: can we find some basic rules that govern unsupervised learning? So next we focus on a very basic and simple model, to see if we can get some...

Excuse me, may I ask a question? Could you go back to the nice cake? Just to put your considerations in perspective: unsupervised learning also attempts to make sense of the design on the cloth on the table, and of the cup that holds the cake; all these things matter, right? But you were pointing arrows at the cake, which means your attention is already directed to something rewarding. So I think it is a bit diminutive to say that reinforcement learning cares only about a limited amount of information; the way you perceive an image is already strongly conditioned by the meaning you attach to the objects inside it.

Yes, I agree. I just want to say that unsupervised learning is, I think, more basic, and helps us understand the other kinds of learning; that is what I want to stress.

Sure.

So that was the background of why we are interested in this model. Now I turn to our topic today, the restricted Boltzmann machine. A very famous paper (okay, not the first paper, but a famous one) is by Hinton in Science in 2006, if I remember correctly. That paper uses restricted Boltzmann machines to reduce the dimensionality of data, and the model is very simple. As I show, the RBM is just a two-layer neural network: the first layer is the visible layer and the next layer is the hidden layer, and they are connected by the weights W shown in the figure. The restricted Boltzmann machine is an energy-based model: W connects the visible units and the hidden units, and there are biases, external fields, on the visible and hidden units. Based on this energy function we can use several famous algorithms, for example contrastive divergence and its variants, which are all based on MCMC (Markov chain Monte Carlo) sampling from the model distribution; you use these samples to update the weights so as to improve the log-likelihood.
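For reference, the standard RBM setup being described here is: energy E(v, h) = -v·Wh - a·v - b·h over visible units v and hidden units h, Gibbs distribution P(v, h) = e^{-βE(v,h)}/Z, and contrastive divergence approximating the log-likelihood gradient ⟨v hᵀ⟩_data - ⟨v hᵀ⟩_model with a single Gibbs step (CD-1). Below is a minimal sketch for ±1 units; this is my own NumPy illustration, not the speaker's code, and the conventions and names are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_pm1(mean):
    """Sample +/-1 units with the given means, mean = tanh(beta * field)."""
    return np.where(rng.random(mean.shape) < (1.0 + mean) / 2.0, 1.0, -1.0)

def cd1_step(W, a, b, v_data, beta=1.0, lr=0.01):
    """One contrastive-divergence (CD-1) update for an RBM with +/-1 units."""
    # Positive phase: hidden means and a hidden sample driven by the data.
    h_mean_data = np.tanh(beta * (v_data @ W + b))
    h_sample = sample_pm1(h_mean_data)
    # Negative phase: one Gibbs step back to the visibles, then hidden means.
    v_model = sample_pm1(np.tanh(beta * (h_sample @ W.T + a)))
    h_mean_model = np.tanh(beta * (v_model @ W + b))
    # Log-likelihood gradient ~ <v h>_data - <v h>_model (factor beta absorbed in lr).
    batch = len(v_data)
    W += lr * (v_data.T @ h_mean_data - v_model.T @ h_mean_model) / batch
    a += lr * (v_data - v_model).mean(axis=0)
    b += lr * (h_mean_data - h_mean_model).mean(axis=0)
    return W, a, b
```

Iterating cd1_step over minibatches would be the training loop alluded to above.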
This is the basic restricted Boltzmann machine, and on top of it there are more advanced versions, the deep Boltzmann machine and the deep belief network. But if we want to use physics to derive analytical results for this model, I think it is too complicated, because there can be many visible units and many hidden units, for example 1000 and 2000, and I feel that is difficult to treat analytically. So can we turn to a very basic model, for example with only one hidden unit and high-dimensional data?

Are you still stuck on the slide "reducing the dimensionality of data"? I don't know if it's normal on my side.

Sorry, can you repeat?

Are you still on the first slide, on the RBM, or are there other slides? Sometimes the stream gets stuck in blocks.

No, I'm just on the first slide about the RBM. Okay. I just want to say that in Hinton's paper he talked about the algorithm based on this energy function, but my question is: can we put forward a very basic model that lets us analytically derive how much data we need for training? In his case there are 2000 and 1000 units, but we want to consider a very simple case with a single hidden unit. If the hidden unit is binary, +1 or -1, we call this the one-bit RBM; my collaborators have a previous paper about it. In the one-bit RBM the only weight, the feature, is ξ; σ is the data, and a is the data index. We provide M data samples, and the dimension of the feature ξ is N. With a single binary hidden unit we can write the posterior distribution of the weight as shown, where Z is the partition function. Solving the model analytically, we get α_c, the critical data density, that is, the minimal amount of data: it equals β^{-4}, where β is the inverse temperature, which sets the noise level of the model. If the temperature is high, the noise is large, so we should provide more data for the model to extract useful information. I show a figure here: at different temperatures there is a phase transition, a second-order transition, because it is continuous, and the numerics and the theoretical prediction give the thresholds at different temperatures, first at 1 and then around 2.5. The threshold is where the order parameter, the overlap, departs from zero; if this order parameter is zero, the Boltzmann machine has not learned anything from the data.

Can you remind me: β is the temperature of the model with which we generate the data?

Yes, β is the inverse temperature, and the data are also generated at this same temperature.
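Written out, a plausible reconstruction of the one-bit RBM posterior on that slide; the 1/√N scaling and the ±1 conventions are my assumptions, following the standard treatment of this model:

```latex
P(\sigma \mid \xi) = \frac{2\cosh\left(\beta\,\xi\cdot\sigma/\sqrt{N}\right)}{Z(\xi)},
\qquad
P(\xi \mid \{\sigma^a\}) \propto \prod_{a=1}^{M} \cosh\left(\beta\,\xi\cdot\sigma^a/\sqrt{N}\right),
```

where for binary ξ the per-sample normalizer Z(ξ) does not depend on ξ and drops out of the posterior. In this notation the continuous transition quoted in the talk sits at α_c = β^{-4}, with α = M/N.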
Excuse me, what is on the horizontal axis?

The x-axis is α. I define α = M/N, where M is the number of data samples and N is the dimension of the feature. For this line, β = 1, so α_c = β^{-4} = 1, and you can see the numerics match it well. The other line is β = 0.8; putting 0.8 into this expression gives α_c = 0.8^{-4} ≈ 2.44, the position of the threshold, the critical point.

Okay, so this is the one-bit RBM. We call this spontaneous symmetry breaking, because below the threshold it is just a random guess: here we only consider the binary case, there is no bias for a weight element to be +1 or -1, and we do not know whether each element is positive or negative, so we see a random guess. But once the data exceed the threshold α_c, spontaneous symmetry breaking appears and the machine begins to learn something from the data.

Based on this, others have done numerical work, for example Barra and Peter Sollich and collaborators, who trained restricted Boltzmann machines. Here P is the number of hidden units, and they found that no matter whether P equals 1, 2, 3, 5 or 10, the critical point sits at a very similar position (their γ is what I call α). But they did this by simulation and did not give an analytical result. That made me think: can we find a formula, can we analytically determine α_c when there are more hidden units? And if the hidden features have some correlation, how does that affect α_c? That is what I wanted to solve next. They proposed that the threshold does not depend on the number of hidden units; can we prove it? That is my research interest.

So as the next step we try to prove it, in the physicist's (non-rigorous) way. Considering more hidden units, we start with two. You can see the RBM here: σ^a is the input data, there are two hidden units, and ξ¹ and ξ² are the two features connecting them. Based on this we can build a factor graph: each weight index i is a variable node carrying two elements, one from feature one and one from feature two, and each data sample is a factor, like a constraint on the values of the weight elements. If there is no data to constrain them, every element of the weight is +1 or -1 at random; but as more data arrive, a bias appears.
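Staying with the one-bit model for a moment, here is a toy numerical check of that threshold: Metropolis sampling of the posterior written above, measuring the overlap with the teacher feature. This is a sketch of mine under the same assumed conventions, not the speaker's code:

```python
import numpy as np

rng = np.random.default_rng(1)

def generate_data(xi_true, M, beta):
    """Sample M data vectors from the one-bit RBM with teacher feature xi_true."""
    N = len(xi_true)
    h = rng.choice([-1.0, 1.0], size=M)  # hidden unit is marginally uniform
    p_plus = 1.0 / (1.0 + np.exp(-2.0 * beta * np.outer(h, xi_true) / np.sqrt(N)))
    return np.where(rng.random((M, N)) < p_plus, 1.0, -1.0)

def posterior_overlap(data, xi_true, beta, sweeps=200):
    """Metropolis sampling of P(xi | data) ~ prod_a cosh(beta xi.sigma^a/sqrt(N))."""
    M, N = data.shape
    xi = rng.choice([-1.0, 1.0], size=N)  # random initial feature
    X = data @ xi / np.sqrt(N)            # fields X_a = xi.sigma^a/sqrt(N)
    for _ in range(sweeps):
        for i in rng.permutation(N):
            Xp = X - 2.0 * xi[i] * data[:, i] / np.sqrt(N)  # fields if xi_i flips
            dlogp = np.sum(np.log(np.cosh(beta * Xp)) - np.log(np.cosh(beta * X)))
            if np.log(rng.random()) < dlogp:                # Metropolis acceptance
                xi[i] = -xi[i]
                X = Xp
    return abs(xi @ xi_true) / N          # overlap, up to the global +/- symmetry

# Toy run: below alpha_c = beta**-4 the overlap should stay near zero, above it grow.
N, beta = 200, 1.0
xi_true = rng.choice([-1.0, 1.0], size=N)
for alpha in (0.5, 2.0):
    data = generate_data(xi_true, int(alpha * N), beta)
    print(alpha, posterior_overlap(data, xi_true, beta))
```

Finite-size effects smear the transition, but the qualitative change across α_c should be visible.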
So if you give me a lot of data, we can infer the features from the data. Based on Bayes' rule we can write this down: this is the likelihood and this is the posterior. What we find, different from the one-hidden-unit case, is that there are two partition functions. The first partition function comes from the likelihood, and it depends on ξ¹ and ξ². The second one, the sum outside, does not: it is the average over all ξ¹ and ξ². We have to solve these two partition functions; if we get analytical expressions for both, we can get the results. Previously in physics we usually dealt with a single partition function, but now we have two. Can we solve it?

Sorry, can you explain again the meaning of these two partition functions?

Yes. The expression P(σ^a | ξ¹, ξ²) is a probability distribution, right? So it has its own partition function, and that one depends only on ξ¹ and ξ². Then we have M data samples, we take the product over them, and to obtain the posterior we must sum over ξ¹ and ξ² in the denominator, so the denominator becomes another partition function. The outer sum runs over all features ξ¹ and ξ², while for Z(ξ¹, ξ²) we sum over the data configurations σ at fixed features, not over the features; Ω instead sums the data term over the features. So there are two partition functions to deal with.

I'm clear now, I think.

Okay. Looking at these two partition functions: since there are only two hidden units and N is very high-dimensional, we can Taylor-expand and find an analytical result for the partition function Z; it depends only on q, the correlation between the two features, and this q will be the parameter we use from now on. So now we know how to handle the two partition functions: in our case Z has an analytical result in the large-N limit, and the only remaining partition function is the outer Ω. Knowing this, we can use an algorithm to train the model: we use the cavity method. I will not show the details, but we define messages from factor to node and from node to factor, train this model, and obtain the marginal probability of each element of ξ¹ and ξ², like a magnetization, which is what we use to infer. So now we can learn the features from the raw data.
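In symbols, my reconstruction of the two partition functions the question was about (the notation is assumed, the slide may differ):

```latex
P(\sigma \mid \xi^1,\xi^2)
= \frac{\sum_{h^1,h^2=\pm 1} e^{\beta\,(h^1\xi^1 + h^2\xi^2)\cdot\sigma/\sqrt{N}}}{Z(\xi^1,\xi^2)},
\qquad
P(\xi^1,\xi^2 \mid \{\sigma^a\})
= \frac{P(\xi^1,\xi^2)\prod_{a=1}^{M} P(\sigma^a \mid \xi^1,\xi^2)}{\Omega},
```

where Z(ξ¹, ξ²) sums over the data configuration σ and the hidden units at fixed features, while Ω sums the numerator over all features ξ¹, ξ². For binary features, Z depends on ξ¹, ξ² only through the overlap q = ξ¹·ξ²/N, which is what makes the expansion at large N tractable.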
But if we want to prove the phase transition, locate the critical point, and get the whole phase diagram, we need the replica method from statistical physics, to obtain an analytical expression for the free energy of the partition function. In this free energy we average over the architecture, I mean the distribution of the weights, and over the data samples; this is the quenched average. Then comes the replica trick: we copy the system n times and let n go to zero, and we get an analytical expression for the free energy.

For the replica method we have to define the order parameters, and here we need eight of them; let me show you. We use a teacher-student scenario: there is a teacher with two features, which we call ξ^{1,true} and ξ^{2,true}, and our student also has two features, ξ^{1,γ} and ξ^{2,γ}, where γ is the replica index. The student's two features have an overlap, which we call R; that is one order parameter. The student features also overlap with their own teacher features, giving T1 and T2, and crossing over, the student's feature one with the teacher's feature two and vice versa, gives τ1 and τ2. Each student feature has a self-overlap across replicas, like the Edwards-Anderson parameter, which we call q1 and q2; and the two student features taken from different replica indices have overlap r, while within the same replica index we call it capital R. With these eight order parameters defined, we also introduce the eight conjugate order parameters in the conjugate space, and we solve the resulting equations.

Finally, after some calculation, we find a very elegant formula for α_c. Here q is the correlation between the two features, and α_c depends on β, the inverse temperature, that is, the noise level, and on q, the feature overlap. In the two-hidden-unit case, if q equals zero, meaning the two features are orthogonal, then α_c = β^{-4}, equal to the one-unit case. But if q is not zero, α_c is reduced: correlated hidden features reduce the minimal data needed to learn, because the two related features provide information about each other. I think this makes sense, but here we express the result analytically. I show a figure of how β and q influence α_c, and there is another very interesting point in this figure: when q is very small, α_c drops very sharply, while when q becomes larger it no longer decreases much, finally saturating at about half the original level. This made me feel that weak correlations already help a lot.
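For reference, the eight order parameters just listed, written out from the verbal definitions (my notation; γ and γ' are distinct replica indices, s ∈ {1, 2} labels the feature and s̄ is the other feature):

```latex
T_s = \frac{\xi^{s,\gamma}\cdot\xi^{s,\mathrm{true}}}{N},\quad
\tau_s = \frac{\xi^{s,\gamma}\cdot\xi^{\bar{s},\mathrm{true}}}{N},\quad
q_s = \frac{\xi^{s,\gamma}\cdot\xi^{s,\gamma'}}{N},\quad
r = \frac{\xi^{1,\gamma}\cdot\xi^{2,\gamma'}}{N},\quad
R = \frac{\xi^{1,\gamma}\cdot\xi^{2,\gamma}}{N}.
```

As for the elegant α_c formula, the talk quotes only its limits: α_c = β^{-4} at q = 0, decreasing with q and saturating near half that value as q → 1. A form like α_c = β^{-4}/(1 + q²) would be consistent with both limits, but that is my guess; the exact expression is in the paper.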
Indeed, we know from neuroscience that the correlation between different synapses is of the order 1/√N or 1/N, and we see in our model, which is just a machine-learning model, that weak correlation is not equal to no correlation. It is significantly important, because the weak correlation here already shows its power to reduce the data you need; you can see the drop is very sharp. So I think this is the first surprise: the model we consider has such a property.

Next: we have the eight order parameters, and we derive the saddle-point equations for them; these are the equations governing the learning. Since there are also eight conjugate order parameters, there are actually sixteen order parameters altogether. Solving these equations tells us how the order parameters change as α grows, and thus how the system behaves, because now the student features can each learn differently from the two teacher features, and we can see what happens. What we find is the whole phase structure of the system; there is a picture here. The first threshold is the α_c of the spontaneous symmetry breaking, exactly what we obtained before, so we call it the SSB threshold. When α is smaller than this α_c, all the order parameters equal zero: nothing happens, we stay in the symmetric state. When α exceeds α_c, beyond the spontaneous symmetry breaking, we find that T1, T2, τ1 and τ2 all become equal, and q1 = q2 = r. It means that in this region there is a symmetry among the students and among the teachers, so the students cannot distinguish each other. The difference between T1, T2 and τ1, τ2 is what distinguishes feature one from feature two, but in this region there is no difference: the students begin to learn, but they learn only the common part. If the students learn only the common part, they cannot be distinguished, because on their different parts they are still at random; the overlap there is zero. So if they only learn the common parts, all the overlaps show this symmetry. I think it is a surprise that the system automatically learns the common part first. Maybe it makes sense: if you first learn math and then turn to physics, at first you learn what the two fields have in common, and then you go on to the different aspects of each. But I find it interesting that this comes out of the equations: the learning of this system shows the same behavior, first the common parts. Next, when α exceeds another threshold, we see that q1 and q2 stay equal but are no longer equal to r; and when α grows even bigger, there is a bifurcation here.
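Summarizing the signatures described here and just below in my own shorthand (PSBt is the teacher-level breaking discussed next):

```latex
\begin{aligned}
&\text{random guess } (\alpha < \alpha_c): && T_s = \tau_s = q_s = r = R = 0,\\
&\text{SSB}: && T_1 = T_2 = \tau_1 = \tau_2 > 0,\quad q_1 = q_2 = r,\\
&\text{PSBs}: && q_1 = q_2 \neq r \quad (\text{students specialize}),\\
&\text{PSBt}: && T_s \neq \tau_s \quad (\text{teachers distinguished}).
\end{aligned}
```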
You can see that r has a turnover here, while q1 and q2, the self-overlaps of the two student features, keep growing; in this region q and r separate, because the students begin to learn the different parts. And when r turns over, going from increasing to decreasing, it means the two teachers become separated too: now the students have begun to know that they are different from each other, and they have also begun to know that their teachers are different, so everything is distinguished. The reason there are two solutions is the permutation symmetry: the first teacher feature could be your teacher, or the second one could. Because of this permutation symmetry there are two realizations of the solution, but their free energies are the same.

We also have numerical results: this is the SSB threshold from the numerics, and this is what we predict for the permutation symmetry breaking, which we call PSB. Finally, when α becomes larger and larger, the small r converges to the true correlation level, which we set to 0.3 in the theory. When α is large, meaning we provide a lot of data, the two learned features become very similar to the true teacher features, and their overlap approaches the true overlap of the teachers. The numerics match our theory very well. You can also see that the red curve is the common part and the black one is the different part: at the beginning, when α is small, only the common part is learned, and after q and r separate, the two students begin to learn the different parts. Beyond that threshold the common part stays quite stable while the different parts are learned more and more.

Based on all this we can draw the whole phase diagram of the model in terms of two parameters, β and q. First there is the random-guess phase: the machine knows nothing, and all the order parameters equal zero. After that we enter the SSB phase, spontaneous symmetry breaking, where the students begin to learn the common part and the overlaps are no longer zero. After the SSB, the two students begin to learn that they are different and start learning the different parts; that is PSBs. And after that we reach PSBt, where the students realize that not only are they different, their teachers are also different, and they learn that too. So we can draw the whole phase diagram from these saddle-point equations. The permutation symmetry we discuss here is very common, and with more data and more hidden units there will be more complicated symmetries. So we think that maybe unsupervised learning equals SSB plus PSBs plus PSBt; maybe we have found something about the nature of unsupervised learning here.
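As a toy illustration of how such overlaps can be measured in a simulation, here is a sketch; the names and the classification heuristic are mine, not the speaker's:

```python
import numpy as np

def order_parameters(xi1, xi2, xi1_true, xi2_true):
    """Overlaps from one posterior sample (the replica overlaps q_s, r need two samples)."""
    N = len(xi1)
    return {
        "T1": xi1 @ xi1_true / N, "T2": xi2 @ xi2_true / N,      # student vs own teacher
        "tau1": xi1 @ xi2_true / N, "tau2": xi2 @ xi1_true / N,  # student vs other teacher
        "R": xi1 @ xi2 / N,                                      # student vs student
    }

def rough_phase(ops, tol=0.05):
    """Heuristic phase label following the talk's signatures (teacher level only)."""
    if max(abs(ops["T1"]), abs(ops["T2"]), abs(ops["tau1"]), abs(ops["tau2"])) < tol:
        return "random guess"
    if abs(ops["T1"] - ops["tau1"]) < tol and abs(ops["T2"] - ops["tau2"]) < tol:
        return "SSB: only the common part is learned"
    return "PSBt: students aligned with distinct teachers"
```

Distinguishing PSBs would additionally require the overlap r between features drawn from two independent posterior samples, which a single sample cannot provide.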
I think we have a few contributions here. First, we show that if the feature overlap equals zero, the critical data size does not depend on the number of hidden neurons. So far we have only proved this for two hidden units; we did not try five, ten or more, but in the two-unit case the critical point matches the conjecture of Barra and Sollich well. Second, weak correlation is not no correlation: weak correlations reduce the required data size very significantly. And finally, we think unsupervised learning may be universal in the sense that it exhibits the SSB phase and the PSB phases.

The next thing we consider is the role of prior knowledge. Based on the previous model, we add some prior knowledge to the system: we tell the system the true correlation level of the features, which is like the Bayes-optimal case. This α_c is what we obtained before, and with the prior knowledge α_c is reduced further: providing the correct prior knowledge helps the system reduce the minimal data it needs to learn from the data. I show figures of the difference between the prior-free case and the case with a prior: with the prior, the minimal data size is reduced. And here is the phase diagram with prior knowledge: these two lines are the prior-free case, and these two lines are with prior knowledge. You can see that with prior knowledge the PSBt and PSBs transitions merge, so in the permutation-symmetry-breaking phase there are no longer two branches, only one: the point where the yellow, black and red curves separate and the point where r turns over merge into a single point. We believe this is because with the prior we are in the Bayes-optimal case, there is more symmetry, and thanks to this property the phase diagram is simpler. So this is what I wanted to introduce; we have published it in a paper. We found this permutation symmetry breaking, and I think these are physics laws that are very important for understanding unsupervised learning. Thank you very much.

Thank you very much. If anyone has specific questions, please feel free.

Thank you. I did not understand how the structure of the data enters your calculation.

Oh, you mean our replica calculation here?

No, in general. I would imagine that data that have a lot of structure and a low-dimensional nature should be easier to learn than unstructured, noisy data.

Yes. For now we wanted to focus on one thing: the permutation symmetry, which I think is very common in unsupervised learning. We want to know how this permutation symmetry influences learning, what happens if we take it into account, and
how the phase diagram will look; that is what we consider here. As you can see, we introduced a lot of order parameters to track how this permutation symmetry changes with α.

Thanks. Maybe, and I don't know if this is related to Matthew's question, but what you consider is a teacher-student scenario, right? So the data have essentially the same structure as the model you are using to learn, right?

Yes, we consider the teacher-student scenario. You can see here: we have a teacher with two features, ξ^{1,true} and ξ^{2,true}, and the student also has two features, ξ¹ and ξ². Because the distribution is an even function, if we exchange ξ¹ and ξ², writing ξ¹ first and then ξ² or the other way round, the distribution is the same. So the model shows a permutation symmetry, and we want to know how this symmetry influences the system; that is the key point of what we want to do. To say it more precisely, there are actually two symmetries: first, if you replace ξ¹ by its negative, the distribution stays the same; and if you take first ξ² and then ξ¹, the distribution is also the same. So there is a lot of symmetry, and we want to see how this symmetry influences the learning process.

I have a question. You consider an unsupervised learning setting; I know a bit more about supervised learning, and for example in committee machines there is a transition where, if you consider a few hidden neurons, at some point you have a so-called specialization transition: the hidden neurons start to align with different features, while before the transition all these neurons essentially encode the same feature. Is that connected to this permutation symmetry breaking in your model? Is it of the same nature?

Yes, sure. The committee machine you mention is the multi-layer case, and the permutation symmetry is not only in unsupervised learning: in the supervised case, as you say, the two-layer committee machine has hidden units that also show a permutation symmetry. I believe, if I remember correctly, Sompolinsky and collaborators had papers many years ago about the permutation symmetry of committee machines in the supervised case, but no one had considered it in the unsupervised case, so that is what we focus on.

And could you also consider settings outside the teacher-student setting? I mean, the mismatched case?

Yes. Actually, if we provide the true prior knowledge of the overlap of the two features, that is the Bayes-optimal, matched case. But the first setting I considered here is a mismatched case, because the student model is different from the teacher's: for the teacher model we can set the correlation level of the two features,
for example equal to 0.3 or 0.5; you can set it for the teacher, but the student does not know this information. Because of this, the Nishimori identity does not hold and we lose those nice properties, so we get a rather complicated phase diagram. But if we set up the matched case, that is, if we provide the true prior knowledge and we are in the Bayes-optimal case, the phase diagram becomes much simpler: there are only three phases, the random guess, the SSB, and the PSB, and the two branches of the PSB disappear because of the Bayes-optimal setting.

Any other questions? Okay, if there are no questions, maybe we can stop here. Thank you very much.

Thank you. Thanks very much. Yes, thanks.