And it's my pleasure to introduce Johannes Kofler, who is going to talk about deep learning for quantum physics and astronomy. Please go ahead.

Thank you very much for the introduction. Good morning, everyone. Welcome. It's a great pleasure for me to tell you a little bit about the research I've done in the last couple of years at the Institute of Machine Learning in Linz. Now, after a parental leave, I'm back at Quantum Information in the Institute for Integrated Circuits.

The task for the first 10 minutes or so will be to give you an overview of some fundamental neural network architectures. You all, of course, know feed-forward neural networks, and probably most of you know convolutional neural nets. Residual neural networks are maybe not so well known outside the field of machine learning. Then I will talk a little bit about recurrent neural nets, in particular the long short-term memory, and then two slides on graph neural networks. And it's actually the residual networks, the long short-term memory, and the graph neural networks that I will later use in showing you three applications. First is modeling multi-particle high-dimensional quantum experiments. Second is predicting outcomes of pairwise planetary collisions. And the third is learning the ground state properties of quantum Hamiltonians.

Okay, so without much further ado: you are all probably aware of this picture in one way or another. About 10 years ago, the deep learning revolution started because we had fast enough computers and large enough data sets. Deep learning is a subfield of machine learning, which dates back to the '80s. And machine learning is one pillar and one subfield of the big field of artificial intelligence, which dates back to the perceptron and ELIZA in the 1950s and '60s.

Feed-forward neural networks are, so to say, the vanilla standard setup. You have an input layer, then some hidden layers, and then the output. It's known since Kurt Hornik's work around 1990 that feed-forward neural networks are universal function approximators. So even one hidden layer that is arbitrarily wide is enough for such a feed-forward neural network, or multi-layer perceptron (one hidden layer already makes it multi-layer), to approximate any function you wish, in principle. This is just an existence proof. It is, of course, an art and not easy at all to run the learning algorithm and actually obtain this universal approximator for your data.

The learning of a neural network usually works via backpropagation. This is an algorithm dating back to the '80s; Geoffrey Hinton, I think, and others invented it. The weights are updated via gradient descent with an error signal. Very early on, and this led to a somewhat dying out of deep learning in the 1990s, came the so-called vanishing gradient problem: if you have a deep neural net and there is an error in the output layer, how does the error signal propagate back to the input layer? It usually gets weaker and weaker, and you have to fight against this vanishing gradient problem. There are many solutions, many of them developed only in the last 10 years or so: batch normalization, good choices of activation functions, but also LSTMs, which date back to the '90s, and residual nets.
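As a minimal illustration of these building blocks, here is a sketch of a one-hidden-layer feed-forward net trained by backpropagation (Python with PyTorch; the target function and all hyperparameters are illustrative, not from the talk):

```python
# Minimal sketch (illustrative): a one-hidden-layer MLP trained by
# backpropagation to approximate a 1D function, in the spirit of the
# universal-approximation result mentioned above.
import torch
import torch.nn as nn

x = torch.linspace(-3, 3, 256).unsqueeze(1)   # inputs
y = torch.sin(x) + 0.5 * x**2                 # target function to approximate

model = nn.Sequential(
    nn.Linear(1, 64),   # one (wide) hidden layer already suffices in principle
    nn.Tanh(),
    nn.Linear(64, 1),
)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

for step in range(2000):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()   # backpropagation: error signal flows from output to input
    opt.step()        # gradient-descent weight update
```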
Convolutional neural nets are your networks of choice if you deal with picture data. Even a low-resolution picture with 250 by 250 pixels and three color channels has an extremely high input dimension, almost 200,000 input nodes. But the good thing about pictures is that pixels that are close to each other are correlated, and some basic patches like edges and corners appear again and again. It doesn't matter whether you train on cats and dogs and horses or on houses and airplanes. Even in our brain, the very first thing that happens after the visual signal arrives is the extraction of corners and contrast changes. And these would be the low-level kernels that you train.

The miracle, or the magic, of convolutional neural nets is that you apply small weight matrices, also called kernels or filters, usually just three by three or at most five by five pixels, and you slide them over the whole picture to get the next feature map of the next layer. So you reuse only a very small number of weights; in this picture, it's only nine weights, reused again and again. This is in very stark contrast to a fully connected network, where every pixel of the source image would be connected to, so to say, every pixel of the first feature map, a quadratic number of connections. This weight sharing is the magic of CNNs. Since about 2012 or so, all vision challenges have been won by convolutional nets. This has completely changed the decades-old game of vision tasks, and everything today, facial recognition, segmentation, is done by convolutional nets.

Another very powerful architecture are so-called residual networks, or ResNets. The trick here is that you have so-called skip connections, or highway connections, or shortcuts, which jump over one or multiple layers. So you not only have connections from layer to layer, but also these highway connections that jump over multiple other layers and provide, so to say, a fast path for the error signal to the lower layers. In the initial training phase, there is a fast adjustment of just those few layers with skip connections, and only later an adjustment of the skipped layers. This leads to much more stable gradients and then to faster and better learning. Residual networks have opened up the path to deep learning architectures with a thousand or more layers, which is really impressive. And especially the combination of residual nets and convolutional nets is extremely powerful.
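As a sketch of how the two ideas combine, here is a residual block of 3-by-3 convolutions with a skip connection (illustrative, assuming PyTorch; channel counts are arbitrary):

```python
# Minimal sketch (illustrative): a convolutional residual block. The 3x3
# kernels are slid over the whole image (weight sharing), and the skip
# connection adds the input straight to the block's output.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.act = nn.ReLU()

    def forward(self, x):
        h = self.act(self.conv1(x))
        h = self.conv2(h)
        return self.act(x + h)   # skip connection: fast path for the error signal

block = ResidualBlock(16)
img = torch.randn(1, 16, 250, 250)   # a stack of 250x250 feature maps
out = block(img)                     # same shape, thanks to padding
```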
Then another very important architecture are so-called recurrent neural nets, RNNs. On the top right here we have the picture of a feed-forward net: x1, x2 as input, then a hidden layer with activations a1, a2, a3, and then the output. In a recurrent neural net, in every time step you allow feedback from every node of the hidden layer to itself and to every other node in the hidden layer. So the nodes of the hidden layer are now connected to each other, and this allows for some kind of internal state, a kind of memory. You can also now process input of arbitrary, not predefined, length: I can input one input vector, in this case just x1, x2, so two words for instance, and in the next time step another x1, x2 entry, and another one. With every tick, the recurrent net sees the new input but also remembers the past inputs through the internal connections in its hidden layer.

This is, in general, a very promising architecture for sequence classification, sequence generation, meta-learning, and so on. It is still a standard architecture for translation: you enter word by word, here in red for instance, the recurrent neural net ticks along, and then it outputs the words in the other language. Or many-to-one: you input some words, and it outputs only which type of language it is, English, Italian, French. So there are many different ways to use these networks, and they are usually trained via backpropagation through time: very similar to the standard backpropagation algorithm, you just have to be a bit more careful to really unroll it back in time to the first input.

One particular variant of the recurrent neural network is the so-called long short-term memory, LSTM, which was invented in the 1990s. It is a very powerful architecture because it can store information indefinitely via an integrator, while being selective about what to store via gating. This selectivity is actually one of the holy grails of RNNs. The integrator is just a cell state being added onto its own past, so it stores information over time by simply adding up, and there are no vanishing gradients; you have this stability in LSTM learning because of the integrator property. Then you have gates, an input gate, an output gate, and also a forget gate, and they serve as an attention mechanism. There is feedback through the recurrent hidden state, and the output gate can influence the values of the next hidden states. All these gates, the input gate, the output gate, and the forget gate, are full neural networks themselves that have to be learned. So the LSTM, in some sense, is a recurrent neural net where other networks learn what to look at, what to forget, what to memorize, and how to weigh the importance for the future.

Until the transformers came up a couple of years ago, LSTMs were the state of the art in speech and text generation and recognition and in time series prediction. All the voice control in mobile phones, the translation that happened in the decade from 2010 to 2020, your Alexa, your Siri, whatever you had: this was all based on the LSTM architecture.
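To make the gating explicit, here is a minimal sketch of a single LSTM time step (illustrative; biases are omitted and the weights are random stand-ins for learned parameters):

```python
# Minimal sketch (illustrative) of one LSTM time step: the cell state c is an
# integrator (additive update, so gradients do not vanish), and the input,
# forget, and output gates are small learned networks acting as attention.
import torch

def lstm_step(x, h, c, W):
    z = torch.cat([x, h], dim=-1)
    i = torch.sigmoid(z @ W["i"])   # input gate: what to write
    f = torch.sigmoid(z @ W["f"])   # forget gate: what to erase
    o = torch.sigmoid(z @ W["o"])   # output gate: what to expose
    g = torch.tanh(z @ W["g"])      # candidate cell input
    c = f * c + i * g               # integrator: cell state adds to its own past
    h = o * torch.tanh(c)           # recurrent hidden state, fed back next tick
    return h, c

d_in, d_h = 4, 8
W = {k: torch.randn(d_in + d_h, d_h) * 0.1 for k in "ifog"}
h = c = torch.zeros(d_h)
for x in torch.randn(5, d_in):      # feed a length-5 sequence, tick by tick
    h, c = lstm_step(x, h, c, W)
```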
Now I want to say a couple of words on graph neural networks. They are particularly interesting for physicists, I think, because they have a very good inductive bias and are very well suited if your data can be represented on, or as, graphs. For instance, you might want to learn the toxicity of molecules. Your molecule then has an adjacency matrix: you can put it mathematically into a graph, with the adjacency matrix telling you who is connected to whom. And there are three kinds of things you can learn with such graph neural networks.

One is a global graph property. That could be, for instance, the toxicity of the molecule: it is just toxic or not toxic. Or it could be the phase of a many-body system: you enter a many-body quantum Hamiltonian and just want to learn whether it is solid, fluid, or whatever. That would be a global property of the whole thing, like a phase or a toxicity.

Another way to use graph neural networks is in node-level tasks. Every graph consists of nodes and edges, and in node-level tasks you want to predict the nodes, for instance the spin values in a spin lattice; the spin value would then be represented, or embedded, in the node. And the last way to use them is edge-level tasks, where you want to predict the edges; that could be the links in a knowledge graph or the couplings in a Hamiltonian lattice, for instance.

So, in general, graph neural networks are optimizable transformations on all these attributes, the globals, the nodes, and the edges, preserving the graph symmetry. In every layer you keep this graph symmetry, and typically you do not change the connectivity across layers; you could, but it is typically not done.

Now one more slide about how these graph neural networks work. This is actually a young architecture, not older than 10 years. The main idea is pairwise message passing between neighboring nodes or edges. Let's look at this simple graph here: we have just five nodes, x0, x1, x2, x3, x4, and everybody is connected to x0, with no other connections. So we have a node set and an edge set. How do we update, how does the learning algorithm work? For every node, we look at its neighborhood. Take node x0: who is connected to x0? Well, x1, x2, x3, x4, so we have to look at this whole neighborhood of all four others. To compute the next internal hidden state, the node representation of x0, we do what? We take its representation at the given time step. Then, for each neighbor, we take the neighbor and the edge to that neighbor and put them into a message function. So there is a message, so to say, from x2 to x0 that takes its values and how strong the edge is, and likewise messages from x3, from x4, and from x1. These messages are aggregated with this operator here, and then, together with the node value itself, another function, the update function, computes the next node representation. And again, this update function phi and this message function psi are neural nets that need to be trained.

The advantages of such graph neural networks are that many problems in physics, chemistry, and combinatorial optimization, like the traveling salesman and so on, have a natural graph structure, so GNNs are very often a natural approach for such problems, and they allow you to adaptively learn the importance of each neighbor. So there is a very close connection to the real-world problem at hand. The disadvantages compared to other approaches are that they are computationally quite expensive; typically they can only be implemented in a shallow way, like three, four, five layers, which somewhat limits the performance on large datasets. And GNNs are fundamentally limited by some graph-theoretical theorems: in particular, they can't be more expressive than the so-called Weisfeiler-Lehman graph isomorphism test. But there is a way out: people have developed, and are developing, so-called topological nets that allow higher expressivity. Again, no free lunch: some advantages, some disadvantages.
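As a minimal sketch of one such update round on the five-node star graph (illustrative; the message function psi and update function phi are fixed stand-ins for the trained networks):

```python
# Minimal sketch (illustrative) of one round of pairwise message passing on
# the star graph from the slide: every node is connected to x0 only.
import numpy as np

edges = [(0, 1), (0, 2), (0, 3), (0, 4)]        # edge set of the star graph
h = {v: np.random.randn(8) for v in range(5)}    # node representations
e = {ed: np.random.randn(8) for ed in edges}     # edge features

def psi(h_v, h_u, e_uv):                         # message from neighbor u to v
    return np.tanh(h_u + e_uv)

def phi(h_v, m):                                 # update from aggregated messages
    return np.tanh(h_v + m)

def neighbors(v):
    return [u for a, b in edges for u, w in ((a, b), (b, a)) if w == v]

h_next = {}
for v in h:
    msgs = [psi(h[v], h[u], e.get((v, u), e.get((u, v)))) for u in neighbors(v)]
    agg = np.sum(msgs, axis=0) if msgs else np.zeros_like(h[v])   # aggregation
    h_next[v] = phi(h[v], agg)                   # next node representation
```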
Yeah, and with this I am almost done with the introduction. Ah, sorry, I forgot, of course, one last architecture I just want to mention, which by now everybody knows: transformers. I think they were put forward in 2017 or so, and they have completely revolutionized the field in the last year or two; now there is something in the news almost every day about large language models, ChatGPT, and so on. This is yet another architecture, closely related to recurrent neural nets, with the difference that you don't feed in the input sequentially, one element at a time, but massively, all at once. The self-attention mechanism deals with this massive input at once: it can learn how to weigh the input according to already learned measures of relevance.

To close this introduction: there are many more architectures that I have not shown today, of course, and you can pick your favorite network from a big list of options. But the ones I have just shown in the last 15 minutes certainly make up the bulk of the workforce, so to say.

Okay, that brings me to the first project I want to talk about. It's called Quantum Optical Experiments Modeled by a Long Short-Term Memory, by an LSTM. Here we were motivated by the question: can neural networks design quantum optical experiments?

The physical scenario we were dealing with is that we are interested in three-photon states with orbital angular momentum. Experimentally it's actually a four-photon state, but the fourth photon is used as a trigger to know that the other three photons are there. Orbital angular momentum, as probably most of you know, is a degree of freedom of a photon that is measured in multiples of h-bar, and in principle it can take any integer value. So it's not a qubit like polarization; the orbital angular momentum is a qudit, where d is in principle arbitrarily large, and experimentally d = 300 or more has been reached. So this is a place to find interesting high-dimensional entanglement.

Every experimental setup consists of a sequence of beam splitters, wave plates, holograms, and so on. Given a setup, if I give you a sequence of these experimental building blocks, it's very easy to compute the final quantum state: you put it in a computer, apply the unitaries, do your matrix-vector multiplications, and out comes the final quantum state. The reverse task is extremely hard. If I give you a high-dimensional entangled quantum state and ask you how to build it, that is close to impossible. What Mario Krenn had done for a couple of years already was to implement a random search. The algorithm was called Melvin, and it just tried out random setups, looked at the final state, and classified how interesting it was: is it a maximally entangled state or not, what is its Schmidt rank, and so on. This is very costly, about one second per simulation. So it is straightforward, it is easy, but it takes time. Our goal was to improve on this through machine learning, because inference times in a trained neural net are very fast, milliseconds or less.

Our quantity of interest was the Schmidt rank vector. It represents the dimensionality of the states of the photons. Let's look at this example state: (|0,0,0⟩ + |1,0,1⟩ + |2,1,0⟩ + |3,1,1⟩)/2, a three-photon maximally entangled state. Its Schmidt rank vector is (4,2,2), because Alice's photon really uses up the four-dimensional space 0, 1, 2, 3, that's four dimensions, while Bob's photon lives only in 0 and 1, and Charlie's photon also uses only 0 and 1. So that is what we call a 422 state. And our goal was to find maximally entangled states with high Schmidt rank vectors: 7, 8, 9, 10, and so on.
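Since the four terms of the example state above are orthogonal product terms, the Schmidt rank vector can be read off by counting how many local levels each photon actually uses; a minimal sketch (illustrative, not the Melvin code):

```python
# Minimal sketch (illustrative): reading off the Schmidt rank vector of
# (|0,0,0> + |1,0,1> + |2,1,0> + |3,1,1>)/2 by counting the local dimensions
# each photon uses. This shortcut works here because the terms are orthogonal.
terms = [(0, 0, 0), (1, 0, 1), (2, 1, 0), (3, 1, 1)]   # OAM values per photon

def schmidt_rank_vector(terms):
    n_parties = len(terms[0])
    return tuple(len({t[p] for t in terms}) for p in range(n_parties))

print(schmidt_rank_vector(terms))   # (4, 2, 2): Alice spans 4 dims, Bob and Charlie 2
```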
We had some millions of examples, and each sample consisted of the experimental setup as the input, that is, a sequence of these components, wave plates, holograms, and so on. At the end it had a target value, a supervisory signal: a flag, positive or negative, indicating whether or not the state is maximally entangled, and a 3-tuple, the Schmidt rank vector (n, m, k). The leading Schmidt rank n went from 0 to 12, but you see that for the higher samples you don't find maximally entangled states so easily anymore.

We clustered the data by the leading Schmidt rank and did cluster cross-validation on the folds 0 to 8. Cluster cross-validation means you train a network on the folds with Schmidt rank 0 to 7 and test it on 8; then you train on 0 to 6 plus 8 and test on 7; then you train on 0 to 5 plus 7 and 8, but not 6, and test on 6. So you always leave one fold out and train on the others. This gives you very good generalization characteristics. Moreover, to probe not only generalization but real out-of-distribution capability, we did not train at all on leading Schmidt ranks greater than or equal to 9; we used those as an extrapolation set, a so-called out-of-distribution set.

As the architecture we took an LSTM. The x now is the sequence, wave plate, beam splitter, hologram, and so on, and these sequences have variable length: some have 10 elements, some have 20. And ŷ is the output signal: whether or not the state is maximally entangled, plus the 3-tuple of the Schmidt rank vector. After training, setups are proposed randomly, and we characterize a setup as interesting when it is classified as maximally entangled and when its Schmidt rank vector is close to uncharted territory, because we were interested in finding new states. These candidates are then checked by the simulation to see what they really are.

Here I show some results. There is a trade-off between the rediscovery ratio of what we knew already and the precision, but in general the model performed very well over a wide range of parameters: true negative and true positive rates and precision were all very good, especially when allowing some distance to uncharted territory. And this here again is the cluster cross-validation: the true negative rates are high for all validation folds, and in general all metrics are also very good on the extrapolation set. So it is quite remarkable that this LSTM performs very well on data in a part of the Hilbert space, so to say, that it has never seen in training. We thus demonstrated that an LSTM can learn to predict the characteristics of such high-dimensional quantum states without any explicit knowledge about quantum physics.

Yes, to the question about the training data: the training set was calculated by Melvin over many months; it was just randomly selected sequences for which the target values were then computed. The training set is very important, and there is a huge amount of computational power in it. But once you have it, inference is very fast. And you can build a generative model out of the discriminative model: I give you a certain number of elements, and the LSTM predicts the next best element, whether you should place a hologram or a wave plate or whatever next.
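A minimal sketch of that generative loop (illustrative; the component vocabulary and the scoring function are hypothetical stand-ins for the trained LSTM classifier):

```python
# Minimal sketch (illustrative): turning the discriminative LSTM into a
# generative one by scoring every candidate next component with the trained
# classifier and greedily keeping the most promising one.
COMPONENTS = ["beam_splitter", "wave_plate", "hologram", "mirror"]  # toy vocabulary

def score(setup):
    """Hypothetical stand-in for the trained LSTM: P('maximally entangled')."""
    return len(set(setup)) / len(COMPONENTS)   # dummy heuristic, illustration only

setup = ["beam_splitter"]                      # a partial experiment
for _ in range(5):                             # grow the setup element by element
    best = max(COMPONENTS, key=lambda c: score(setup + [c]))
    setup.append(best)
print(setup)
```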
Okay, this closes the first application. Now to the second one: a ResNet for the prediction of planetary collision outcomes. This was work mainly pushed forward by Philipp Winter, who has an astrophysics background and then switched to machine learning.

The motivation is that we want a fast and accurate treatment of collisions in n-body planet formation. This is challenging because the fast way to do it, so-called perfect inelastic merging, is also very inaccurate. Yes? Okay, so I understood the question: how does this compare to genetic algorithms? I simply have to admit I do not know; we have not tested it. To my general knowledge, deep learning usually outperforms genetic algorithms in tasks that require large datasets, but we have not looked into it, so I can't comment further.

Okay, so there is this fast but inaccurate way, and then there is a very accurate but slow way called smoothed particle hydrodynamics, SPH, a very expensive way to run these simulations. Our goal was to tackle the problem with machine learning, in particular ResNets, and achieve a good trade-off: you invest some time in SPH simulations, and on that dataset you train a network that then makes good and fast predictions for new data. Here you see one collision where both planets, so to say, survive, and here they actually merge: the blue and the red one collide and merge. So there are these hit-and-run scenarios and mergers, and such a planetary collision simulation takes a couple of hours.

Similar things had been tried before in the astrophysical community, but previous datasets did not take into account the water content of the planets, which is quite important for good simulations, and they did not take into account the initial rotations of the two bodies. We did all this. We created a dataset of about 10,000 SPH simulations, considering water fractions and rotations of target and projectile, and we randomly sampled over a wide variety of initial conditions: What is the initial rotation? The impact angle? The impact velocity? The mass ratio, something between 5% and 100% from the smaller projectile to the larger target? The total masses lie between a Ceres mass and an Earth mass. We also had algorithms that identified the fragments: after such a collision all hell breaks loose, but sometimes small fragments can be identified, and sometimes they are not even touching each other but only gravitationally bound; we treated that correctly as well.

In general, there are three major outcome regimes. There is erosion, where the target loses mass: the projectile hits the target and takes away some mass; this is probably how our moon was created, I guess. Then there is accretion, meaning the target gains mass from the projectile. And in hit-and-run there are two remnants. Where you end up depends on the impact velocity, the impact angle, and the mass ratio. Previous approaches in this community used classical machine learning algorithms, like support vector machines (SVM), k-nearest neighbors (KNN), and random forests (RF); some feed-forward neural networks have been tried, and for regression, gradient boosting has been used. Yes? Yes, angular momentum, the rotation, is important. Yeah, here it's hard to draw three-, four-, five-dimensional graphs; here it's only the impact velocity and the impact angle over the masses.

Our approach was a genuine deep learning approach. We built a residual net architecture with an autoregressive model that ticks through time steps, so we have multiple iterative steps and thereby an inductive bias from the physics. We have a weight-tied residual net with, actually, three shared feed-forward networks, and we do adaptive updates, which effectively amount to something like an Euler integration. So this is one of the few examples where the network works in a very physics-inspired way, and we therefore also get interpretability of the result, which is quite nice.
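A minimal sketch of that idea (illustrative; the real model uses three shared feed-forward networks and physical state encodings, collapsed here into one toy network):

```python
# Minimal sketch (illustrative) of a weight-tied, autoregressive ResNet: the
# same shared network f is applied at every time step, and the residual update
# x <- x + dt * f(x) is effectively an Euler integration step, so the
# intermediate states can remain interpretable physical values.
import torch
import torch.nn as nn

class WeightTiedResNet(nn.Module):
    def __init__(self, dim: int, steps: int, dt: float = 0.1):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, dim))
        self.steps, self.dt = steps, dt

    def forward(self, x):
        trajectory = [x]
        for _ in range(self.steps):       # weights are shared across all steps
            x = x + self.dt * self.f(x)   # residual update ~ Euler integration
            trajectory.append(x)          # interpretable intermediate states
        return x, trajectory

model = WeightTiedResNet(dim=12, steps=20)
state0 = torch.randn(1, 12)               # e.g., encoded collision parameters
final, traj = model(state0)
```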
So let me quickly report on the results. This ResNet significantly outperformed the feed-forward nets and all other classical approaches, like support vector machines and so on, on most of the tasks, including an out-of-distribution set. It is always particularly interesting to researchers how well a trained neural net generalizes to out-of-distribution data that it has really never seen. Our ResNets are slightly less parameter efficient, meaning they need more parameters than the compared FNN, but they are significantly more data efficient: we can train them much better with a small amount of data. This advantage is due to the inductive bias; the network itself has an architecture that is close to the problem at hand. And the intermediate states of the ResNet are interpretable, because they store the physical values themselves. This is typically not at all the case in a neural network: it does whatever it wants and in the end converges to some output, but whatever is in between is typically a black box. Here, that is not the case.

Okay, that brings me to the last part of the talk. In the last application, we want to use classical shadow measurements to learn ground state properties of quantum Hamiltonians. Predicting ground state properties of large-scale quantum systems is very interesting from a fundamental research perspective, but also very important for developing quantum technologies. As you all know here, of course, the exact description requires exponentially many measurements, which would be a full tomography. So this is a big overarching problem, but there are some ways to tackle it. One is the so-called classical shadows, developed in the last five years or so, first by Aaronson and then by Huang, Kueng, and Preskill; Preskill has now won the Bell prize, I think, in part for the classical shadows. These classical shadows are a way to randomize the measurements and then, with only a number of measurements that scales logarithmically in the number of qubits, obtain an approximation to the density matrix that allows you to predict certain properties, like, for instance, two-point correlations.

Obtaining data is still expensive, so we were motivated by the question: can we have a sample-efficient machine learning model? For some tasks it is proven that machine learning should have an advantage, for instance in predicting these correlations of lattice Hamiltonians. Our goal was to leverage the graph structure of these Hamiltonians. We were really motivated by taking the physical situation at hand and using graph neural networks, and we wanted to show that these GNNs exhibit better sample efficiency, so that they need less data presented to them to already make good predictions.
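As an aside, here is a minimal sketch of what a single classical-shadow snapshot looks like for random Pauli-basis measurements, following the standard construction (illustrative; bases and outcomes are hard-coded here rather than actually measured):

```python
# Minimal sketch (illustrative) of one classical-shadow snapshot: each qubit is
# measured in a random X/Y/Z basis, and the inverted measurement channel gives
# rho_hat = tensor_i (3 |s_i><s_i| - I); averaging many such snapshots
# approximates the true density matrix.
import numpy as np

I2 = np.eye(2)
# Eigenstates |s> for outcomes 0/1 in the X, Y, Z bases
eigvecs = {
    "X": [np.array([1, 1]) / np.sqrt(2), np.array([1, -1]) / np.sqrt(2)],
    "Y": [np.array([1, 1j]) / np.sqrt(2), np.array([1, -1j]) / np.sqrt(2)],
    "Z": [np.array([1, 0]), np.array([0, 1])],
}

def snapshot(bases, outcomes):
    """One snapshot from per-qubit measurement bases and 0/1 outcomes."""
    rho_hat = np.array([[1.0]])
    for basis, out in zip(bases, outcomes):
        s = eigvecs[basis][out].reshape(2, 1)
        local = 3 * (s @ s.conj().T) - I2   # inverse of the Pauli channel
        rho_hat = np.kron(rho_hat, local)
    return rho_hat

# e.g., a 3-qubit snapshot: measured in bases X, Z, Y with outcomes 0, 1, 0
rho_hat = snapshot(["X", "Z", "Y"], [0, 1, 0])
```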
Our problem Hamiltonian at hand was the 2D antiferromagnetic random Heisenberg Hamiltonian. Random means the couplings J_ij are randomly chosen between 0 and 2, so they are all positive, which makes it antiferromagnetic; and it's a Heisenberg model because the coupling really involves X, Y, and Z. In principle this has couplings between all spins, but in the end we took nearest neighbors only.

The ML task starts from the observation that the J_ij are an implicit representation of the ground state, implicit in the sense that the J_ij make up the Hamiltonian and the Hamiltonian determines the ground state. Based on the couplings J_ij, which are really given to us, they are, so to say, the input to the graph neural net, we want to predict expectation values of the ground state two-point correlators. So not the ground state itself, not even the ground state energy, only the two-point correlators: Tr(Cρ), where C is (XX + YY + ZZ)/3. Yeah, you're right: the structure of the Hamiltonian is, so to say, given, and then the J_ij tell you the rest. You're very right that if I only gave you the J's and didn't tell you whether it was a Heisenberg Hamiltonian or an Ising Hamiltonian or whatever, then you would know nothing.

On the left here we see the illustration of such a 2D Heisenberg Hamiltonian with 20 spins and now nearest-neighbor couplings only; this is quite important for the next slide. So we restrict ourselves to nearest neighbors and choose some random couplings, whose strengths are written here as numbers and illustrated by the width of these gray bars. For such a Hamiltonian you can of course compute the so-called ground truth, this is machine learning language, so the actual values of the two-point correlators, computed here with DMRG for one Hamiltonian. You can then also look at the average correlation over all possible pairs at a given distance: 0 with 1, 0 with 2, 0 with 3, 1 with 6, 1 with 7, and so on, so all pair combinations at distance 1, and so forth. At distance 0, well, that's the spin with itself, which of course is perfectly correlated. Whenever you look at nearest-neighbor correlations you get a negative value, because it's an antiferromagnetic system; for d = 2 they are positively correlated, and so on. So you can look at these average correlations over the distance d, and this is what we will also look at later.
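To fix the notation, a sketch consistent with the definitions just given (ρ denotes the ground state):

```latex
% The Hamiltonian with random antiferromagnetic couplings on nearest-neighbor
% pairs <i,j>:
H = \sum_{\langle i,j\rangle} J_{ij}\,\bigl(X_i X_j + Y_i Y_j + Z_i Z_j\bigr),
\qquad J_{ij} \in [0,2],
% and the learning targets, the ground-state two-point correlators:
C_{ij} = \operatorname{Tr}\!\Bigl(\rho\,\tfrac{1}{3}\bigl(X_i X_j + Y_i Y_j + Z_i Z_j\bigr)\Bigr).
```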
We compared the following methods, but actually I'm only interested now in showing you the difference between the first two. One is a fully connected graph neural network, where the prediction of the correlation is directly calculated from an edge embedding. We have a graph with these 20 nodes, and every node, so to say, corresponds to a spin of the real system. But in reality only nearest-neighbor spins are coupled, while in the graph we also want to predict a correlation between, say, spins 5 and 10, which may be a distance of 5 apart. So we need an edge in the graph, and on that edge we can learn the correlation. The graph therefore doesn't look at all like the system, because in the graph everybody is connected with everybody, and we have an edge-level learning task. You might think that this is not ideal, and indeed it is not: it requires edges between all the nodes; on each edge you learn a hidden vector, and that is fed into a feed-forward network to obtain the correlation.

Then we took the very physically motivated, so-called pairwise GNN, where edges are only present for the nearest neighbors. Now our graph neural network looks exactly like the spin lattice, 5 by 4, with only the nearest neighbors coupled. But now how do we get correlations? How do I get the correlation from node 5 to node 10 if they are not connected in the graph? The answer is the following: you compute the correlation prediction from the inner product of the node embeddings. In every node there is a big embedding space, a huge "spin", so to say, larger than the real physical spin, and these embeddings are learned such that the inner products lead to the two-point correlations.

We compare this to an MLP, which had been done before, and to a neural tangent kernel, which had also been done before; these are the last two methods. Our quality measure is the mean squared error on the test set: the squared difference between the prediction Ĉ_ij, the predicted correlation value between spins i and j, and the ground truth C_ij, averaged over the pairs. In the test set you of course know all these values; this is how you evaluate.

And this is the result. On the left we see the mean squared error on the test set versus the number of training Hamiltonians, 10 to 300, and we can clearly see that this pairwise GNN, the graph neural network that respects the physical graph structure, is the best. Similarly, we can look at the error versus the system size: we now take 80 different Hamiltonians and vary the system size, 4 by 5, 5 by 5, 6 by 5, and 7 by 5, so up to 35 qubits, and there too its error is the best. Then we calculate the average correlation for nearest neighbors, and again the pairwise GNN is winning, and then we do it for d = 5, so for a distance of 5 steps, to the next-next-next neighbor, so to say. The conclusion is that the graph neural network which has the same graph structure as the Hamiltonian outperforms all other methods, so this inductive bias seems to have a good effect here.

Yes: in the pairwise GNN we learn the nodes; in every spin there is an embedding space with a big vector, and to compute the correlation between two nodes, no matter whether they are neighboring or not, we take the scalar product, plus some processing of that; it's not directly the scalar product, but almost. And the outlook here is to use GNNs in larger experimental setups.
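A minimal sketch of that pairwise readout (illustrative; the tiny trainable rescaling stands in for the "plus some processing" mentioned above):

```python
# Minimal sketch (illustrative) of the pairwise-GNN readout: each node v
# carries a learned embedding h_v, and the correlation between any two spins,
# neighboring or not, is predicted from the inner product of their embeddings.
import torch
import torch.nn as nn

class InnerProductReadout(nn.Module):
    def __init__(self):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(1))   # minimal learned processing
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, h, i, j):
        dot = (h[i] * h[j]).sum(dim=-1)            # inner product of embeddings
        return self.scale * dot + self.bias        # predicted correlator C_hat_ij

h = torch.randn(20, 32)          # embeddings for 20 spins (output of the GNN)
readout = InnerProductReadout()
c_hat_5_10 = readout(h, 5, 10)   # spins 5 and 10 need no edge in the graph
```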
If I have one minute left, I want to show you a follow-up work, which was done by the same consortium while I was on parental leave. So now I'm talking about work not done by me, but it is a direct follow-up to what I just presented. This is basically theoretical work, but also partially experimental. It had already been demonstrated previously that learnability of observables can be achieved when the number of data points scales polynomially in the system size N; this is a result from PAC learning from a couple of years ago. This work improved that bound to a logarithmic scaling by introducing a locality inductive bias: the number of training data needed to learn properties should scale only logarithmically in the number of qubits, which is extremely powerful. The trick is that the feature vector is formed not for one node by itself but from a locality region; in this hexagonal net, for instance, it would always be a full plaquette that is written into an embedding vector.

And, just very briefly, it really turns out that this new model requires fewer shadow measurements, so fewer of these classical shadow measurements I talked about earlier; it has a better sample complexity and a better system-size scaling. So the theoretical results now somehow merge with the experimental endeavors, and the holy grail, of course, would be to need only a logarithmic number of measurements and samples, from maybe classical shadow techniques, and then have good predictions for certain properties, for instance two-point correlators, of physical systems.

And that, in time, brings me to the conclusion of this talk. I have tried in the first 10 or 15 minutes to show you that deep learning is an extremely powerful technique and that there exist many different neural network architectures. Specific problems require specific architectures; there is no one-size-fits-all solution in deep learning. And very often, depending of course on the problem at hand, but especially in deep learning, you do need good and large data sets and also large computational power; otherwise classical machine learning techniques like support vector machines, gradient boosting, random forests, and so on do much better. So if you have only small or medium-sized data sets, deep learning might not be the way to go. But deep learning can be applied in a plethora of fields and subfields of physics and science in general, namely whenever you have enough data to train your models. The drawback is that typically, I showed you the one exception in our astrophysics project, but typically, you get a black-box solution and you lose interpretability. That is a drawback we have to live with somehow. And with this, I thank you very much for your attention.

Thanks so much for this very nice talk. We already had a lot of questions, but there are more.

Thank you for the very nice talk. I have two quick questions. The first one concerns the result you had with the graph neural network, where you basically ran on your Heisenberg disordered Hamiltonian with this locality in the interactions. Did you try to see whether, if you have a fully connected graph neural network and an all-to-all interaction, the results are better?

The previous one? You are referring here to the local regions of parameters?

No, the previous slides.

Okay. So here the pairwise GNN is outperforming the fully connected GNN: if we have a graph where everybody is connected with everybody, such that on the edges we can try to learn the correlations, this is poorer than the pairwise GNN, where we respect the physical coupling structure.

What happens if you have a fully connected Hamiltonian, not a pairwise one, on a fully connected graph?

Sorry, now I got the question: what if the Hamiltonian is fully connected? We have not tried that. I reckon that if the Hamiltonian is fully connected, then the pairwise-connected graph will probably perform more poorly, as it will not respect the structure anymore. We have not done it, but that would be my guess: for a fully connected Hamiltonian, I expect the fully connected graph network to do better.

I see, so you need some knowledge of the Hamiltonian in your answer. My second question is about the other slides, where you have this embedding of the plaquettes and you see this logarithmic dependence on the system size. It seems to me that your system has translational invariance. Suppose that you have, say, disorder between the nodes: do you think that
you still have this very nice logarithmic scaling?

That's also a good question that has not been looked at yet. I guess that even then some advantage should remain, because this really is about the efficiency of the embedding space in machine learning. So I don't know, this is a good question, it has not been looked at yet, but my guess would be that even without the translational symmetry, while it would get a bit messier in the implementation, the advantage should remain, because I feel that this is part of the embedding in the machine learning protocol.

One more question about the correlations in the Heisenberg model. If I understood correctly, you're basically learning the pairwise correlations here. Is it possible to use your setup to learn, beyond pairwise correlations, maybe three-body or higher-body correlations?

Yeah, good question. Well, the thing is that the pairwise correlations, we know, behave well for data that has been acquired from classical shadows. But it is known that the higher the correlations, the worse the classical shadow technique is at predicting them. So it is already the first step, the data generation, where we lose information, and that cannot be repaired anymore by the later machine learning, of course. But yeah, these higher-order correlations are an open research question.

Maybe you can use this to denoise the shadow in some sense.

Thanks.

Thank you. One more question: do you have any sort of estimate of how resilient your predictions are for the setup if there were small terms in the Hamiltonian that you did not know about?

Oh, you mean if in the Hamiltonian itself there is some other term?

Like a small magnetic field in the lab.

No, we have not made any attempts or studies in that direction. Also a very good question; I can't say anything to that.

Let's thank the speaker again.