Welcome everyone to the second talk of the day and our first keynote speaker on this third day of our summer school. It is my great pleasure to welcome Sepp Hochreiter, one of the pioneers of deep learning. Sepp is now directing the ELLIS Unit Linz, the AI Lab of the Linz Institute of Technology, and the Institute for Machine Learning at the Johannes Kepler University Linz, so he has numerous important roles. He is also a founding director of the Institute of Advanced Research in Artificial Intelligence (IARAI). He is known for his fundamental contributions to the field as an inventor of the LSTM, long short-term memory, in 1997. Later he did a PhD at the Technical University of Munich, was a postdoc in Colorado, and was a scientific assistant and working group leader at TU Berlin. He then moved to the Johannes Kepler University Linz and became professor and head of the Institute of Bioinformatics; as he just mentioned in the pre-talk, his focus was bioinformatics for a while, and recently he has made a big comeback in AI, you could say. He has received numerous honors for his accomplishments: in 2018 he was Austrian of the Year in the category research, to name one highlight, and this year he received the IEEE CIS Neural Networks Pioneer Award. This is a very fitting prize, I believe; he is really one of the big pioneers of neural network research, and we are very happy to have him here. I should not forget to mention that he is also on the board of directors of the ELLIS initiative, as you can see in his virtual background, so he is also shaping this European network of excellence in machine learning. We are very happy that you are here, and we are excited to follow your talk over the next hour.

Thanks, Karsten. Thank you for this kind introduction; I am very honored to hear all of this. Yes, my LSTMs are already used in medicine and in bioinformatics, but today I want to introduce something else, which can also have multiple applications in medicine and bioinformatics, as I will show you. I already mentioned ELLIS, it is also in my background: check out ELLIS, perhaps you can become an ELLIS PhD student. There are big programs there: there is a health program in ELLIS, and we have also applied for a molecules program, which is about drug design. There is a lot going on in ELLIS, and Karsten is in the center of everything; you can ask him how to become part of the society. You are invited. We need smart people, innovative people, motivated people, so please join the ELLIS network, especially for health and molecules.

Today I am going to talk about modern Hopfield networks. I have to put these pictures away, sorry for that. I want to introduce a new paradigm, and the paradigm is to use associative memories in deep learning architectures. It is sort of new, although it is already partly known from the attention mechanism of the transformer architecture. Here we have a standard deep learning architecture: there is an input, perhaps a matrix of different vectors. We push this matrix of vectors through a layer and get another matrix of vectors, each vector is pushed through separately, we do this layer by layer, and at the end we combine the resulting matrix of vectors into a final output to make a prediction.
Often the matrix is just a single vector, but in general you can push a whole set through the deep neural network. That is what is done right now, but I think it is not sufficient, it is just set function approximation; we should go one step beyond that. That is my idea here: I have a deep learning architecture, but in every layer I have an associative memory. I equip every layer with an associative memory to give it additional information, to enrich the processing with external information. This external information can be anything. It could be intermediate results from previous layers, we know this from attention mechanisms, but it could also be truly external information like names of diseases, expression profiles as references, pathway information, and things like this. This is what we humans do: we are not pure function approximators, we always process within a context, we always have associations, something reminds us of previous experiences. I want to build such networks, networks which are enriched by a context and process within this context, external context or internal context. Then deep learning goes beyond convolutions: we can include attention mechanisms and memories, we can also process point sets, all this kind of stuff.

So the idea is to integrate modern Hopfield networks into the layers of deep learning architectures, and this gives us many new functionalities. We can do association of two sets, we can do pattern search in sets, we can do pooling operations, we can build memories: we can put a sequence into the memory and then do what LSTMs or gated recurrent units are doing. We can do something like vector quantization; of course we can do transformer attention, but that is well known, it is only one aspect of this. We can do sequence-to-sequence learning, all kinds of point set operations, multiple instance learning, and we can implement at least some models like SVMs or k-nearest-neighbor methods. And this is possible in every layer, thanks to modern Hopfield networks.

Modern Hopfield networks were introduced in the seminal paper of Krotov and Hopfield; this is the original Hopfield of the Hopfield networks of 1982. He is now, I think, 87 years old, but still had this new paper in 2016. And these modern Hopfield networks were generalized by Demircigil et al. in 2017. What is new with these modern Hopfield networks is that they tremendously increase the storage capacity, the storage capacity is much, much larger, and there is a second property which is important for us: they converge extremely fast.

So what are standard Hopfield networks, which you know from old textbooks? They are associative memories, and associative memories are known from the 60s and 70s, this is really old stuff. They were popularized by John Hopfield in 1982 and became famous under the name Hopfield networks. In a standard Hopfield network as introduced by John Hopfield in 1982 we have N patterns x1 to xN, and these are binary patterns, I have to emphasize this: they live in d-dimensional space and the components of these patterns are either minus one or plus one. If we have patterns we want to store, we construct a weight matrix; the patterns are stored in the weight matrix, which is nothing else than the sum of the outer products of the patterns we want to store.
After having constructed this weight matrix, you have a state or query pattern ξ. It is also binary and also lives in the d-dimensional space, and we have an update rule: we take ξ, multiply it with the weight matrix W which stores the patterns, subtract some constant vector, and take the sign (or a step function whose output we multiply by two and subtract one, so that we are back at minus one or plus one). That gives us the new ξ, a binary vector, from the old ξ. If we do this iteratively, it is known for standard Hopfield networks that an energy is minimized. The energy, E_b for binary energy, is often written differently, but I write it like this because it is an interesting way to write it: ξ, the state or query pattern, enters a dot product with a stored pattern x_i, this dot product is squared, and I sum over all stored patterns; that is the energy. There is also a linear term, but that is not the important part here.

Now to the properties of standard Hopfield networks. I have stored something, for example this Homer image, and now I give in as the state pattern ξ only half of Homer, the other half is masked out, and I retrieve Homer again. That is the best case, but normally things go bad for standard Hopfield networks. Here I have stored three patterns, and what I get out is a different pattern, actually an inverted, complementary version of one of the patterns, so I do not get Homer out. This is a problem because the patterns are highly correlated. The same here, where Marge is stored two times: I give in half of Homer and I get out Marge, because it is the dominant pattern, it is stored much more strongly in the weight matrix, and the energy minimization goes to that energy minimum because it is more dominant. There are a lot of spurious minima when we store more patterns. Again the state pattern ξ: I give half of Homer to the network, and what I get out is some kind of average of everything. If you look here, you see it is an average over all kinds of patterns. That is not what we want to have; we want Homer, the full picture, retrieved, but we get an average of everything.

This is also expressed in the storage capacity. Hopfield set the diagonal of the weight matrix to zero; you can do that, it does not change much. With this kind of Hopfield network the storage capacity, meaning how many patterns I can store and retrieve without interference, is roughly proportional to the dimension d of the space: without retrieval errors you have to divide by log d, and if you allow a small retrieval error once in a while, the storage capacity is about 0.14 times the dimension of the space in which the binary patterns live.
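To make this concrete, here is a minimal sketch of such a classical binary Hopfield network. This is my own toy example, not code from the talk; the pattern sizes and the synchronous update (instead of the asynchronous, one-component-at-a-time rule described above) are simplifying assumptions.

```python
# Classical binary Hopfield network: Hebbian storage and sign-update retrieval.
# Toy sketch; a synchronous update is used for brevity, while the classical
# convergence results hold for asynchronous (one component at a time) updates.
import numpy as np

rng = np.random.default_rng(0)
d, N = 64, 3                                   # pattern dimension, number of stored patterns
X = rng.choice([-1.0, 1.0], size=(N, d))       # binary patterns x_1 .. x_N

W = sum(np.outer(x, x) for x in X)             # weight matrix: sum of outer products
np.fill_diagonal(W, 0.0)                       # zero diagonal, as Hopfield did

def update(xi, W, b=None, steps=10):
    """xi <- sign(W xi - b), iterated a few times."""
    b = np.zeros_like(xi) if b is None else b
    for _ in range(steps):
        xi = np.sign(W @ xi - b)
        xi[xi == 0] = 1.0                      # break ties toward +1
    return xi

query = X[0].copy()
query[d // 2:] = 1.0                           # "mask out" half of the pattern
retrieved = update(query, W)
print("overlap with stored pattern:", retrieved @ X[0] / d)   # close to 1.0 if retrieval worked
```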
In 2016, in the seminal paper of Krotov and Hopfield, modern Hopfield networks were introduced. I show you the energy function; these modern Hopfield networks of Krotov and Hopfield are still binary networks, and the energy is almost the same as for the standard Hopfield network. Look at this: there we had the dot product squared; here we have a dot product, but we apply a function capital F to the dot product. What can F be? Krotov and Hopfield said it can be a polynomial, it can be the dot product to the power of a: for a equal to two we recover the standard Hopfield network, but now we can take a equal to five or ten or whatever and get a new energy function.

We can also minimize this energy function. We look at component number j of ξ, the current state of the Hopfield network, and check the energy once with ξ_j equal to plus one and once with ξ_j equal to minus one. If I want to update only the j-th component of ξ, I choose of the two the one with the lower energy, so perhaps I have to flip ξ_j or it remains the same. I update this one component ξ_j, and doing this iteratively, always picking another component at random, converges to an energy minimum; that has been shown. It has also been shown that the energy minima correspond to the stored patterns. But now the storage capacity, how many patterns I can store, is polynomial in the dimension of the space; we are still in the binary case. Without errors I get roughly d to the power of a minus one: for a equal to two, the standard Hopfield network, it is proportional to d, we know this, but for a equal to ten it is d to the power of nine, and this can be a huge number if the dimension d of the patterns is large. There is still a division by log d, but log d does not hurt as much. If you allow small retrieval errors you get some constant which depends on a but not on the dimension, and the capacity is again polynomial in d. That was the huge achievement of Krotov and Hopfield: with this new energy function, and the update that goes with it, many, many more patterns can be stored.

These modern Hopfield networks of Krotov and Hopfield were generalized by Demircigil et al. in 2017. We again have this function F, but for F, Demircigil et al. did not use a polynomial but an exponential function, which grows even faster. Now the energy is expressed by the exponential function, and it turns out that the storage capacity is really exponential in the dimension of the space where the patterns live; the storage capacity is proportional to two to the power of d over two. Another very important fact is that convergence, the retrieval of a pattern, can be done with one update; this is a theorem from the paper of Demircigil et al. We have this x tilde, it is the same as our ξ. If ξ is close to a stored pattern, within some radius, then with high probability, no matter which pattern you look at and no matter which component of ξ you want to update, the probability that the updated component is not already the corresponding component of the stored pattern goes to zero. Meaning: it is very likely that one update of one component gives the right value, you do not miss. So if you touch every component of ξ, or here x tilde, once, then you have already retrieved the pattern. The convergence is super, super fast: you only have to touch each component once. Wow.
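As a rough illustration of the binary modern Hopfield network just described, here is a small sketch assuming the Demircigil-style exponential energy; it is my own toy code with made-up sizes, not the speaker's implementation, and a polynomial F would give the Krotov and Hopfield variant instead.

```python
# Modern binary Hopfield network, sketched with the exponential energy
# E(xi) = -sum_i F(x_i^T xi) with F = exp; F(z) = z**a would give the polynomial variant.
# Toy code under my own assumptions.
import numpy as np

def energy(xi, X, F=np.exp):
    return -np.sum(F(X @ xi))                  # X holds the stored patterns row-wise

def async_update(xi, X, F=np.exp, sweeps=2, seed=0):
    """Set one component at a time to whichever sign gives the lower energy."""
    rng = np.random.default_rng(seed)
    xi = xi.copy()
    for _ in range(sweeps):
        for j in rng.permutation(len(xi)):
            plus, minus = xi.copy(), xi.copy()
            plus[j], minus[j] = 1.0, -1.0
            xi = plus if energy(plus, X, F) <= energy(minus, X, F) else minus
    return xi

rng = np.random.default_rng(0)
d, N = 32, 50                                  # far more patterns than the classical ~0.14*d
X = rng.choice([-1.0, 1.0], size=(N, d))
noisy = X[7] * rng.choice([1.0, -1.0], size=d, p=[0.9, 0.1])   # flip roughly 10% of the bits
print(np.array_equal(async_update(noisy, X), X[7]))            # should typically print True
```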
Here we see again what we already know from the standard binary Hopfield network, and this is horrible: the retrieval is an average of everything. But if we now take the modern, still binary, Hopfield network and store the same patterns, we give in half of Homer and we can perfectly retrieve Homer. And even if we store many more patterns, which are highly correlated because nearby pixels tend to have the same color, we give in half of Homer and again we can retrieve Homer, even with many more images stored in this Hopfield network. That is the capacity of these modern Hopfield networks.

But now to our application. These modern Hopfield networks are binary, and I said I want to integrate them into deep learning architectures, which I still want to train by gradient descent, by backpropagation. So what I need is a continuous version of this modern Hopfield network, and the continuous version should be differentiable to allow for gradient descent, to allow for backpropagation. What we also want to keep is retrieval with one update, because then retrieval and update are the same thing: activating a layer and retrieving a pattern from a memory become the same procedure, we do not see a difference. And I still want a high storage capacity for complex tasks; in medicine and bioinformatics you perhaps want to store many data items that you want to access, so the capacity should be very large.

To construct continuous Hopfield networks, we introduced a new energy function. We still have N patterns, still d-dimensional, but now they are continuous patterns, vectors with real values. We stack them into a pattern matrix, and we need M, the norm of the largest pattern. And we have a query pattern ξ, also a real-valued d-dimensional vector. For the new energy function we use the log-sum-exponential. It has this form: the sum over all components of the exponential of beta, some parameter, times the component, then the logarithm, and a normalizing factor beta to the minus one. That is the log-sum-exp. With it we define the new energy function: you can see the log-sum-exp enters as minus lse, then there is a quadratic part in the state ξ, and the rest are only constant terms; the constants make the energy non-negative, so it is easier for us to prove things, and the energy is bounded. This energy function is for real-valued patterns, and it generalizes the energy we know from binary patterns: the energy function of Demircigil et al. can be rewritten as minus the exponential of the lse with beta equal to one. So the new energy function is similar to the energy function introduced by Demircigil et al., but for continuous patterns we additionally need the quadratic term, otherwise the state pattern could grow to infinity, its norm is not bounded. For binary patterns we do not have this problem because the length is fixed; here the length is not fixed and this can cause problems, but with the quadratic term it works well as an energy function. And we can show that we keep properties like exponential storage capacity and even retrieval with one update, the important features we wanted to keep.
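For reference, this is the continuous energy function just described, written out as I read it from the talk; the notation follows the "Hopfield Networks is All You Need" formulation and is my own addition, not a slide from the talk.

```latex
% X = (x_1, ..., x_N) is the matrix of stored patterns, \xi the state or query,
% M = \max_i \|x_i\| the norm of the largest pattern, \beta the inverse temperature.
\[
\operatorname{lse}(\beta, z) = \beta^{-1} \log \sum_{i=1}^{N} \exp(\beta z_i),
\qquad
E(\xi) = -\operatorname{lse}\bigl(\beta, X^{\top}\xi\bigr)
         + \tfrac{1}{2}\,\xi^{\top}\xi
         + \beta^{-1}\log N
         + \tfrac{1}{2}\,M^{2}.
\]
```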
And we have a new update rule for these continuous modern Hopfield networks. The new ξ is computed like this: we take the old ξ times the pattern matrix, that is the vector of dot products of ξ with each stored pattern, these dot products are put into a vector; we take the softmax of this vector of dot products and multiply the pattern matrix with this softmax. If ξ is very similar to one pattern, the softmax looks like zero, zero, then a one at some position, and many zeros again, and then you extract exactly this pattern. So if ξ has a high dot product with one specific pattern, you retrieve the corresponding pattern; otherwise you get an average of patterns. With this update rule you minimize the energy. I will skip the details, it would go too far, but we can prove that this update rule converges, in general to stationary points, which could also be saddle points, but until now we only observed local minima; a saddle point we never saw, it was always a minimum.

We also have to think about what storing and retrieving means for continuous values. For binary vectors it is clear: where there is a plus one we want a plus one, where there is a minus one we want a minus one. But for continuous vectors, what does it mean if you end up very, very close to the pattern, is it still retrieved? I would say yes. We define storing and retrieving like this: we assume that around each pattern x_i we want to store there is a sphere. In this sphere there is a fixed point of the update rule, and all points in the sphere converge to this fixed point; the stored pattern x_i lies in the sphere, so even x_i converges to this fixed point x_i star. Importantly, the spheres do not have any point in common, they are separated, their intersection is the empty set. And now we say a pattern x_i is retrieved if the iteration gives a point x_i tilde which is at least epsilon-close to the fixed point; the iteration would go all the way to the fixed point, but we stop it once we are epsilon-close. Then the retrieval error is the distance between the point where we stop the iteration and the stored pattern. So we have this sphere, inside the sphere we always converge to the fixed point, each fixed point identifies one of the stored patterns, and the retrieval error is the norm of the difference to the stored pattern.

First of all, we can show that this network has exponential storage capacity, as shown here; that is a property we want to have, but I will go over this quickly. The second property is that retrieval works with one update only. What does retrieval with one update mean? You have a query pattern ξ, and f(ξ) is the update; after one update step the difference between the updated ξ and the fixed point is smaller than epsilon. If it is smaller than epsilon we say the pattern has been retrieved, and we can show that this is typically the case. The only thing we have to assume is that the patterns are separated. What does separation mean? We have this Delta_i: the dot product of a pattern with itself, the squared length of the pattern, minus the dot product of this pattern with any other pattern should be comparably large. If two patterns are very, very similar to each other, say pattern j is very similar to pattern i, then the dot product of x_i with x_j is almost the same as the dot product of x_i with itself, and they are not well separated. So x_i multiplied by itself should be clearly larger than x_i multiplied by any x_j; that is separation in terms of dot products.
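Here is a minimal sketch of this continuous update rule, under my own toy setup (random patterns, a hand-picked beta), just to show that one step typically lands on the closest stored pattern.

```python
# Continuous modern Hopfield update: xi_new = X^T softmax(beta * X xi),
# with the stored patterns as the rows of X. One step usually suffices for retrieval.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def hopfield_update(xi, X, beta=8.0, steps=1):
    for _ in range(steps):
        xi = X.T @ softmax(beta * (X @ xi))    # softmax-weighted average of stored patterns
    return xi

rng = np.random.default_rng(1)
d, N = 32, 10
X = rng.normal(size=(N, d))                    # continuous stored patterns
query = X[3] + 0.3 * rng.normal(size=d)        # noisy version of pattern 3
out = hopfield_update(query, X, beta=8.0)
print(np.argmin(np.linalg.norm(X - out, axis=1)))   # should typically print 3
```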
So retrieval after one update is possible if the separation is large enough. We show this with an approximation using the Jacobian, which comes from the mean value theorem, and in the resulting bound the separation appears multiplied by beta, together with terms involving the length of the vectors. The remaining terms are small because ξ is close to x_i, otherwise we would not retrieve x_i, and the fixed point is also close to x_i. If this is sufficiently large, we have retrieval with one update, but we need the separation: if two patterns are very, very close to each other, we cannot pull them apart, and we will always retrieve a mixture of them. With the same kind of formula you also get a bound on the retrieval error, which shrinks with beta times the separation, and normally the retrieval error is very, very small.

But what happens if some patterns are not well separated? Perhaps three or four patterns are very similar to one another but well separated from all the others, so there is a cluster of similar patterns. Then we get a fixed point near these similar vectors: the fixed point is some average of the similar patterns, a metastable state of the energy function. If we start near this fixed point or near these similar patterns, even if we start exactly at one of the similar patterns, we converge to this metastable state, to this fixed point, and it always gives us an average of those patterns. That is what we already saw with the standard Hopfield networks, where we did not want it, but here we can use it for some kind of clustering or averaging of similar patterns.

Here you see the continuous modern Hopfield network working. We store many patterns, here already as gray values, we put half of Homer in and perfectly retrieve Homer, which we already know from the theory should work, even if the patterns are highly correlated. But if beta is too small (beta is something like an inverse temperature) we again get some kind of average of everything; we need a high enough beta. The separation is always multiplied by beta, so if the separation is small, you need a higher beta to pull the patterns apart again. If you do this for different beta, you first see an average, but as beta becomes larger and larger, the image of Homer becomes more and more visible, like fog lifting; the retrieval becomes better and better because the softmax pulls this one image out more strongly. Another case, in two dimensions with the stored patterns as black dots: for small beta you get one fixed point in the middle, the average of all stored patterns. As you increase beta you get a fixed point here and a fixed point there, and for large enough beta you already get fixed points sitting almost on top of the stored patterns; increase it further and you have a fixed point per quadrant, with the fixed points sitting on top of the stored patterns.
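A tiny numerical illustration of this temperature effect, with made-up two-dimensional patterns of my own choosing: a small beta yields an average over everything, a medium beta a metastable average over the similar cluster, and a large beta snaps onto the single closest pattern.

```python
# Effect of beta on continuous Hopfield retrieval (toy example, my own numbers).
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

X = np.array([[1.0, 0.0],      # two similar patterns ...
              [0.9, 0.1],
              [0.0, 1.0]])     # ... and one well-separated pattern
xi = np.array([0.98, 0.02])    # query near the similar cluster
for beta in (0.5, 5.0, 50.0):
    out = X.T @ softmax(beta * (X @ xi))
    print(f"beta={beta:5.1f} -> {np.round(out, 3)}")
# small beta: roughly an average of all patterns; medium beta: average of the two
# similar patterns (metastable state); large beta: essentially the closest pattern.
```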
A very interesting fact is that our new update rule, which we derived for the continuous modern Hopfield networks, is the same as the transformer attention mechanism; it is not different at all. We have a new energy function, we have this update rule, and after some reformulation, because we write things column-wise and the transformer writes them row-wise, we get transformer attention. It is the same.

The question is: how can we integrate these new modern Hopfield networks into architectures, how do we do that? There are several possibilities. The first possibility is the layer we call Hopfield: here R and Y come from the previous layer or directly from the input; R gives the queries and Y gives the stored patterns. Both the queries and the stored patterns come from the network, from the input; they might be intermediate results of the previous layer. The idea is that you propagate both the queries and the stored patterns through the network. But you can do different things. For Hopfield pooling you only propagate the Y through the network: the Y comes from the previous layer, so what you store are intermediate results or the input, but the queries are fixed, you always use the same queries, and these queries are learned. That is something like multiple instance learning: Y holds the instances, or a sequence, in each layer you present the sequence, and you probe it with queries which are always the same; that is learning to analyze a sequence. In the Hopfield layer with a fixed memory you propagate R, the queries: what you store is fixed, but you generate new queries and probe the same stored memory again and again with different queries. That is something like SVMs or k-nearest neighbors, because if you store the training data, you probe the training data again and again; you want to see which stored items are closest or most similar to the query you generated.

So the first type is this attention thing: you can do transformer attention, you can do sequence-to-sequence learning and point set operations, because you actually compare two sets; the two sets are computed in the previous layer, one set gives the queries, the other set gives what you store. You can do all kinds of retrieval-based methods, or encoder-decoder attention, attention to something else, which is known from the attention mechanism, and more: you generate two sets and associate the two sets with each other. In the second case it is different: the Y is generated, you always generate a new set to store, and then you apply fixed queries. In the third case you store fixed data and generate different queries at every layer. With this you can do things like convolutions, and even LSTM-like or GRU-like gated recurrent processing is possible, because you can always store the sequence, or a modified sequence, and always do the same fixed probing: is there a certain pattern in the sequence, is there a certain signal in the sequence?
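To make the correspondence to attention concrete, here is a small sketch of the Hopfield update applied to a whole matrix of queries at once; the projection matrices, shapes and random inputs are my own illustrative assumptions, and with beta equal to 1/sqrt(d_k) this is exactly the usual scaled dot-product attention.

```python
# Hopfield update for many queries at once == transformer attention:
# Z = softmax(beta * Q K^T) V with Q = R W_q, K = Y W_k, V = Y W_v.
# Shapes and projections below are illustrative assumptions, not from the talk.
import numpy as np

def softmax(Z, axis=-1):
    Z = Z - Z.max(axis=axis, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(2)
d_model, d_k, n_q, n_kv = 16, 8, 4, 6
R = rng.normal(size=(n_q, d_model))            # "query side" input (e.g. previous layer)
Y = rng.normal(size=(n_kv, d_model))           # "stored pattern side" input
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

Q, K, V = R @ W_q, Y @ W_k, Y @ W_v
beta = 1.0 / np.sqrt(d_k)                      # the transformer's 1/sqrt(d_k) scaling
Z = softmax(beta * Q @ K.T) @ V                # one Hopfield update per query
print(Z.shape)                                 # (n_q, d_k)
```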
So the queries can be static, or they can be learned and adapted. This works for multiple instance learning, where you have instances or modifications of instances, and you can do pattern search. In the other case, where you generate the queries but what is in the memory is fixed, you can put cluster centers, templates or references into the memory: k-nearest neighbor, similarity learning, learning vector quantization, or simply a discretization. What we often store there is the training data, but it can also be reactions, pathways, genes, gene sets; all these things we can put into it. Therefore, at every layer the network has access to external knowledge: which genes belong to which pathway, which pathways belong to which disease names. We already did this and it works well. The network does not have to learn this knowledge, it is memorized because you explicitly provide it.

Coming to the experiments: we applied deep learning architectures with modern Hopfield networks to many tasks, multiple instance learning, classification on small data sets, drug design problems. We already know that the attention part works, because we said it is the transformer, and transformers and BERT-like models work well for natural language processing, that is known, so let us not repeat that. It also works well for multiple instance learning problems, and I want to show you one specific multiple instance learning problem because it is from medicine and bioinformatics: immune repertoire classification. The idea of multiple instance learning is that there are a few indicative patterns within a large set of patterns, and you have to identify these indicative patterns. Here it is about extracting immune cells from the human body and sequencing them. The goal is to know: is this person immune against a certain disease, for example, is this person immune against COVID? You extract the immune cells and sequence them, because these are the only cells where the DNA can be modified, and you look whether there is some pattern indicating immunity. But the problem is that there are so many immune cells, it is a huge repertoire: you extract about 300,000 cells, you sequence these 300,000 cells, and then you have the immune repertoire. Now I have 300,000 immune cell sequences, and I should figure out: hey, is this person immune against a certain disease? This is one of the largest multiple instance learning problems there are, because each cell is an instance, and you describe the patient, the person, by the immune cells you have extracted and sequenced. Most multiple instance learning methods fail here because there are so many instances; normally you have hundreds, but here you have 300,000 instances. And here the modern Hopfield network works. What we did is use the Hopfield pooling idea: we embed the sequences, push them through a neural network, and get an embedding space with the 300,000 patterns as the stored memory. And we learn fixed queries; the queries are specific for the disease, so for this disease you ask: is this sequence there, is this sequence there, is this sequence there?
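A minimal sketch of this pooling idea follows; it is my own toy code with invented shapes and a random "bag", not the actual repertoire classification system. The embedded instances of one bag are the stored patterns, and a few learned, fixed queries read out a bag-level representation.

```python
# Hopfield pooling for multiple instance learning (toy sketch, illustrative shapes).
import numpy as np

def softmax(Z, axis=-1):
    Z = Z - Z.max(axis=axis, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=axis, keepdims=True)

def hopfield_pooling(instances, queries, beta=2.0):
    """instances: (n_instances, d) embedded instances of one bag; queries: (n_q, d)."""
    A = softmax(beta * queries @ instances.T)   # attention of each fixed query over the bag
    return A @ instances                        # (n_q, d) pooled bag representation

rng = np.random.default_rng(3)
bag = rng.normal(size=(300, 32))                # e.g. 300 embedded sequences of one repertoire
learned_queries = rng.normal(size=(2, 32))      # in the real system these are trained by backprop
print(hopfield_pooling(bag, learned_queries).shape)   # (2, 32), fed to a classifier head
```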
For the real-world data, these modern Hopfield networks work very well; they give the best performance, also on real-world data where signals were implanted. We worked here with people from Oslo, from the immune repertoire field, and we also looked at LSTM-generated data and simulated data. Most important is the real-world data, because you want to know whether it really works, and it really works much better than the previous methods, methods which were specifically designed for this task. The idea is: hey, put all the cells into the memory and learn probes that look for specific patterns; is such a pattern in the memory or not? If it is in the memory, there was an immune response and I know this person is immune against this disease; if it is not in, the person is not. And it is not a clear, fixed pattern, not a fixed motif; there is some variability, meaning you need something like similarities, you need these neural networks, it is not only a discrete search for patterns. But I do not want to go deeper into this.

We also did this for small data sets; I will mostly skip it. We will soon come out with a new method based on this idea: for small data sets we can keep the whole training set in the memory, and in every layer we can probe against the training set. I have a new data item: is it similar, does it have this feature, is this related to that? So we somehow learn a learning algorithm; the training set is presented to the processing in every layer. With the new results that will come out soon, we see that on small data sets we consistently beat XGBoost and methods like this. It is not so common that deep learning beats XGBoost on tabular and small data, but with this method it works, because we give the training set to the network again and again.

And here is something interesting also in terms of bioinformatics and medicine: we did it on drug design data sets. The main areas: one was antivirals; one was inhibitors, specifically of human beta-secretase; then a data set about metabolic effects, the blood-brain barrier penetration probability, which you know from your brain; and very general side effects, because if you do drug design you want to foresee side effects. These are standard data sets in drug design. We used Hopfield layers and compared against support vector machines, XGBoost, random forests and deep neural networks, but also against different graph neural networks, because those are very popular here. It turns out that on two of the four data sets we got a new state of the art with Hopfield networks, and on the others we did well, competitive but not the best, which is still impressive because these are very strong competitors. And storing all the data, storing all the subgraphs, all the side chains of the drug, this access to the original input all the time really helps to get these new results.
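The small-data idea mentioned above, keeping the labeled training set itself in the memory and probing it with each new sample, could look roughly like this; since the method he refers to was not yet published, this is only my own guess at a minimal version, which for large beta behaves like a soft nearest-neighbor vote.

```python
# Sketch of a "training set in the memory" readout (my own assumption of a minimal
# version): a query retrieves a similarity-weighted average of the stored labels.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def memory_predict(query, X_train, y_onehot, beta=5.0):
    weights = softmax(beta * (X_train @ query))    # similarity of the query to each stored sample
    return weights @ y_onehot                      # soft class probabilities

rng = np.random.default_rng(4)
X_train = rng.normal(size=(100, 16))               # stored (embedded) training samples
y_onehot = np.eye(2)[rng.integers(0, 2, size=100)] # their labels, one-hot encoded
print(memory_predict(rng.normal(size=16), X_train, y_onehot))
```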
To summarize: we want to introduce a paradigm shift by integrating associative memories into deep learning architectures, a paradigm shift in deep learning. Do deep learning by integrating context, by using external knowledge. It could be a knowledge graph, it could be a database, it could be terminologies from medicine or bioinformatics. You do not have to learn it, you can present all this to the network directly, and you are much better and much faster, believe me. You can learn attention, memory, vector quantization and set association. In every layer you can build something like an SVM model, k-nearest neighbor, learning vector quantization; in every layer you can do pattern search, you can search a database you want to access; in every layer you can present templates. We did this with templates of chemical reactions, and in drug design, for retrosynthesis, figuring out how to build a molecule, we got the best results, even in zero-shot learning, and the pharma companies are completely excited about what is going on here. And it is simple: we take the reaction templates, put them into the memory and give them to the network. There are some blog entries where you also get the software, there are the papers, and there is also a video by Yannic Kilcher you can check out. I will stop here, and thanks for your attention.

Thank you for this fascinating talk, it was really exciting, and it was actually easy to follow you because of the clarity with which you presented this very complicated material; thank you very much for that. I am sure there will be questions. I see virtual applause coming in for you, and I also see raised hands; Giovanni goes first.

First of all, thank you, it is really an amazing talk. I have a few questions, but I will start with one on the use of Hopfield layers for interpretability. For biomedical applications it is often very important to give a good explanation of the results of a machine learning model, and I think these Hopfield layers could be used to provide that in some way. For example, I could probe with different sets of patterns that I want to compare my hidden layers or my results to, and keep that as an explanation, say the closest expression profile or the closest affected pathways. Do you think there are possibilities for interpretability with these Hopfield layers, or is that not really feasible?

The answer is yes, and I will elaborate on it; the second thing I have to say is that we never thought about it, it is a good idea. Yes, because what you can do is this: if you provide something in your memory, you can later check what the network has used. Did it use this item, or did it use that item? You can store something like pathways and see which pathway it used, or a knowledge graph and see which items of the knowledge graph it used. Then you know at least the parts it used to compile its final answer, its outcome, because the stuff it did not access it cannot have used. And this gives a little bit of interpretability. Another thing: because you are the boss of the external memory, you can put human concepts into it, because who tells you that a neural network will by itself learn something like a human concept? If you take concepts we know from mechanics, from physics, something like speed and acceleration, a linear combination of these items can work very well for a neural network; a linear transformation that is a little bit of acceleration, a little bit of speed and a little bit of something else. For humans it is no longer clear what this combination means, but for the neural network it is just as good. So how can you guarantee that the neural network gives you something that you as a human can understand? Here, with the modern Hopfield networks, you are in control: you only give it concepts, and these are things a human has an idea about, what they mean.
Of course the network can still do everything with it, it can also take an average of concepts, but if you clearly see that the network accessed my concept number 17, and 17 was such-and-such, then I have a connection to human understanding, to human concepts. For normal networks this is hard, because do we really understand what these neurons code for? I see, that sounds actually really exciting, thank you. Thank you for your question. Volker is next in line.

Hi, welcome. Very exciting and very interesting to read up on. In fact, my very first NIPS paper was a Hopfield paper: I applied Hopfield networks to optimization problems. This was, I think, sort of dormant for many, many years, and now it is coming back in quantum computing, because for example the D-Wave style of quantum computing is pretty similar to these old ideas of using Hopfield networks for optimization. So I see an application of these modern networks in optimization and maybe even a path to quantum computing, though it is not very clear.

I see some similarities, like energy minimization or fast energy minimization; sometimes one talks about energy-based models, and there could be a connection to quantum computing. I think there are different ways to do it; one is quantum annealing, and with quantum annealing, if you have huge networks, you can perhaps use these techniques. To be honest, I am not so deep in this topic, perhaps I am an amateur here, I only see the surface. There could be a connection in terms of energies and energy minima and things like this, also in terms of discretization, if you go to an energy minimum. I see connections, but I do not have a clear picture. I hope there is something, but I am not good enough in the other field to really grasp it.

Yeah, maybe these nonlinearities cannot be matched very easily, because I guess in some sense quantum computing is sort of linear, so the nonlinear stuff you need in your modern Hopfield networks might not easily be mapped to this type of quantum computing. I would love it, because then it could be a very big thing: you could put a whole knowledge graph, the whole Google knowledge, into a memory and then access it somehow by associations. That is something very interesting, because as we know, quantum computing is not really good on big data, so if you can keep the big data somewhere else, as a sort of hybrid solution, that could be attractive. Yeah, but I am not good at quantum computing, it is not my field, I may be saying stupid things here, but it sounds very, very cool. Okay, thanks.

Thanks to both of you. We also have a tool that allows questions from the YouTube audience, and I will read out one question from there. Christian Bock is asking: storing training data makes me think of overfitting. Can Hopfield layers learn patterns that do not exist in the data, for instance a mean over observed patterns? That is an interesting question. Yes and no; of course it is possible. First of all, we used it in a symbolic setting with chemical reactions, and a mean of chemical reactions is not a chemical reaction. There we connect symbolic to sub-symbolic computing, because applying a chemical reaction to a compound is completely symbolic. But in other cases, yes, you get these metastable states, you get an average, and the average can make much more sense; it comes up with something.
Yes, a new prototype, so to say a new cluster center, appears automatically, and we observe this in applications; then we say, hey, it makes absolute sense to use this as a prototype. So it depends on the application: in some applications you cannot average, because it is symbolic and an average of symbolic things might not mean anything; in other cases it makes perfect sense and completely new things come up.

Thank you for that answer. I have a related question: you showed these applications with molecules and with graphs and graph convolutional networks, and you mentioned that you are looking at subgraphs there in different layers. I wonder, if you look at a graph and all of its subgraphs, then I feel that this separability condition you described may very often be violated, because a graph is very similar to its subgraphs. How do you deal with that? Is it just that your theoretical guarantees do not hold there but it still works in practice, or how do you deal with this phenomenon?

We do not really deal with it. First of all, we have an embedding of the graphs into a sub-symbolic representation, into a vector space. But the nice thing is that we can generalize over subgraphs: we only look at, say, the backbone, and not at which specific molecule sits on it. You have a couple of different subgraphs stored in your database, and now you make an average, but the average is essentially the backbone, because everything else is just some modification; what they have in common is the backbone, and therefore we get a metastable state which represents the backbone. And to put something on top of the backbone, you go to the specific patterns, which are similar to each other just because they have the same backbone. Sometimes it helps, but I do not know whether it always helps, because you also get similarities you do not want to have. In this case it sometimes helps because you can generalize: if you have different backbones, and a new structure comes along that you never saw, it is still similar through this metastable state, through this fixed point, because it is similar to the five or so patterns you perhaps already have. Okay, very interesting. Thank you for that, and Giovanni has another question; please, Giovanni.

I have a quick question. When you define how the patterns are stored and retrieved, you define the retrieval distance as the norm of the difference between the updated query and the pattern. Is that the Euclidean norm, and is that an issue if your patterns are really high-dimensional? Again, I define a norm on what? On the patterns. Okay, yes, we used the Euclidean norm, for simplicity. Norms can, let us say, be compared to each other, they differ by constants, but the Euclidean norm is the natural one here, because in the energy function we use dot products, and the Euclidean norm is the one induced by the dot product. We are working with dot products everywhere, also in this energy function with the ξ-transpose-ξ term, so the Euclidean norm and dot products go very well hand in hand.
The Euclidean norm is the norm implied by this Hilbert space, by this dot product. We did not really use this fact, but for me it is natural to use the Euclidean norm if you work with dot products, if you work in a Hilbert space. So yeah, my question is basically: does it not become less informative if your patterns are thousands- or tens-of-thousands-dimensional? That is interesting, but we did not exploit anything there. What we did just now is combine this Hopfield network with the CLIP model, the CLIP model from OpenAI, where we have very high-dimensional embedding vectors, like 1024 or 2048, and it is still working. My dream would be this: what I store in the Hopfield network is not the original vector but something that has gone through the neural network, an embedding vector representation of the input. And what I would love to have is a sparse vector, a very high-dimensional but sparse vector. If I have sparse vectors, the vectors are hopefully almost orthogonal to each other, because if only a few components are on and you multiply with another sparse vector, you typically get zero, since otherwise the components which are non-zero would have to match; so you increase the orthogonality. I think humans do something like this: there are many concepts, but in a specific situation you only use a few of them. We have not done this sparseness yet, but sparseness would help to retrieve better, it would help to make the vectors I store in the network more orthogonal to each other. It would help. And then I need a very high-dimensional space to code my whole input into sparse components, so I would want to go to, say, 100,000-dimensional spaces with only a handful or a few hundred components switched on. That would be my dream. But yeah.

That is a good end to the public part of your presentation. Thank you so much; we were all fascinated, I can say that for the whole network. And now it is time for the session with our doctoral students, they will take you to a breakout room. Yes, and we say goodbye here. Thank you again, I really enjoyed that. And now I wish you a good discussion round with our doctoral students in the network.