We're going to be talking about representation learning for quantum systems. Thank you, Agnes, and well done on the pronunciation. Good morning everyone. How many of you were at the school? I hope I've kept the talk a bit more accessible, so please feel free to interrupt and ask questions, basic ones too; and if the experts want to ask more detailed questions, please do that as well.

Representation learning for quantum systems is what I want to talk about, and let me start by checking that you're all awake. I'm very proud of this animation: I'm going to give you ten seconds to solve this question, starting now. Anyone want to take a guess, or know the answer? What did you do in your mind? You transformed the Roman numerals into Arabic numbers, exactly, and then the answer should have been this. Okay, this is my mistake; I took the example literally from this book, I did not make it up. 221, you're right, so 221 divided by 9 was probably the answer you had. Well done. That's the importance of having representations, and of having correct representations.

Here's another one; this is not a question anymore, but an important practical representation: how you represent lists in memory. If I ask you to insert the number 6 into the list in the bottom left here, and I give you a data structure (this is for the computer scientists) that is a so-called linked list, where you can only ask for the next and the previous element in memory, then to find where to insert the 6 you have to start somewhere and do on the order of N operations. Whereas if I had given you a binary tree, where at each node you can ask larger or smaller, left or right, you can do this in log N operations. It's the same data and the same task; I'm listing these only to highlight the importance of representations. I'm sure you're all aware of this, but I want to drive the point home: here, very practically, the data-structure representation matters.

For those of you who were at the school, here's an example you have also seen: the same is true for things we do in physics with tensor networks (a small code sketch follows below). You have a representation of your wave function that is exponentially large, which is a problem if you want to go to large systems. But it turns out that, at least for some types of systems, and in fact you can do this for all systems, you can always decompose this exponentially large tensor into something with fewer elements. Did you all see this in Miles' talk, where D is the bond dimension, the dimension of these matrices? You can always do that; in the worst case the matrices again have to be exponentially large and you don't win anything, but it's a good representation because for some states, those without too much entanglement, this representation is not only different but also genuinely useful.

In general it's perhaps a vague question, but finding a good representation depends on what you want to do with it, so we need to define what a good representation is.
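As a concrete illustration of that decomposition (my own sketch, not code from the talk or from Miles' lecture), here is how one can turn a generic state vector into a matrix product state by repeated singular value decompositions; the bond dimension D is whatever the SVDs produce, and it only stays small for weakly entangled states.

```python
import numpy as np

def to_mps(psi, n_sites, d=2):
    """Split a length-d**n state vector into MPS tensors via successive SVDs."""
    tensors = []
    rest = psi.reshape(1, -1)                        # dummy left bond of size 1
    for _ in range(n_sites - 1):
        chi_left = rest.shape[0]
        u, s, vh = np.linalg.svd(rest.reshape(chi_left * d, -1), full_matrices=False)
        keep = s > 1e-12                             # drop numerically zero singular values
        u, s, vh = u[:, keep], s[keep], vh[keep]
        tensors.append(u.reshape(chi_left, d, -1))   # (left bond, physical, right bond)
        rest = np.diag(s) @ vh
    tensors.append(rest.reshape(rest.shape[0], d, 1))
    return tensors

# A product state of 4 qubits: every bond dimension comes out as 1.
psi = np.zeros(2**4); psi[0] = 1.0
print([t.shape for t in to_mps(psi, n_sites=4)])     # [(1, 2, 1), (1, 2, 1), (1, 2, 1), (1, 2, 1)]
```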
Many people have asked this question, but a useful definition for me is: a representation that I can use to solve a task. I'm not just asking how to represent this data; I want to do x, y, or z with it, and I ask how to find a good representation for that, one that makes a subsequent task easier. There are many things you can mean here. Maybe it's easier to interpret; that's what I will focus on in the next example. Maybe it uses less memory, like tensor networks. Maybe it has other properties, like being more symmetric. Maybe, also for tensor networks, it's easier to manipulate. And you can want any or all of the above, for example a representation that is both more memory-efficient and easier to manipulate at the same time.

A very standard thing people do nowadays is semi-supervised learning, which is relevant if you have data, for example from experiments, where some samples have labels and many do not, perhaps because labels are expensive: a human has to label the data, or it's easy to generate a whole batch of samples but only some of them come with class labels. What you do in such cases, or at least one thing to do when you have this split of few labeled samples and many unlabeled ones, is to take the unlabeled samples and learn a representation from them; I'll come back to this in a moment. Then, with that representation, you solve the supervised task on the samples where you do have labels. This also allows you to bootstrap: once you have trained on that representation, you can assign labels to the unlabeled data set. Semi-supervised learning is a very interesting topic; that's what this is called, and there is also self-supervised learning in this setting. If this applies to you, if you have experimental data, it is worth trying.

Now, I said learning a representation. So far, if you think about tensor networks, that is a representation we have come up with as humans, and the same holds for the data structures. But for a lot of deep-learning tasks, the idea is that we would like to extract such a representation from the data. That's what representation learning is all about, and if you have been doing machine learning you have all been doing representation learning, maybe just not very explicitly. I took this quote, because I like it, from the website shown here. The idea is to learn a representation but make it explicit, embedding it as a separate step in the learning routine. It says: representation learning "is a process in machine learning where algorithms extract meaningful patterns from raw data to create representations that are easier to understand and process." We'll come back to what "meaningful" means. Maybe even easier to interpret; that's what the next sentence says: representations can be designed for interpretability if you care, they can reveal hidden features, or they can be used for transfer learning. They can be used for many more things, like one-shot learning; again, if you have very costly experiments that give you few data samples, that might be interesting. The idea really is to make learning the representation an explicit step, and one way of doing that is dimensionality reduction, using, in this case, an autoencoder.
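To make the few-labels pipeline concrete, here is a minimal, hedged sketch with synthetic data, where PCA stands in for whatever representation learner you actually use (an autoencoder in this talk); the array sizes and the threshold label are made up purely for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Synthetic stand-ins: many unlabeled samples, and a small labeled subset.
rng = np.random.default_rng(0)
x_all = rng.normal(size=(10_000, 64))                  # unlabeled pool
x_few = x_all[:100]                                    # the few samples we can afford to label
y_few = (x_few[:, 0] > 0).astype(int)                  # toy labels

# Step 1: learn a representation from the unlabeled pool (PCA as a stand-in encoder).
encoder = PCA(n_components=8).fit(x_all)

# Step 2: solve the supervised task in that representation using only the labeled samples.
clf = LogisticRegression().fit(encoder.transform(x_few), y_few)

# Step 3 (bootstrapping): assign pseudo-labels to the rest of the unlabeled data.
pseudo_labels = clf.predict(encoder.transform(x_all))
```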
Does everyone know what an autoencoder is? Nice, okay, good. Either no one, or those of you who don't know don't dare say it. An autoencoder is a very simple thing: some data goes in, you compress it and then decompress it, and you compare the output with the input, tweaking the weights of the network so that the output resembles the input as closely as possible. That comparison is just a reconstruction loss. The narrowing and widening back out is called an information bottleneck, and the central layer is what I will call the latent space; many people call it feature space, and I'll probably use both, but typically I say latent space. The variables, the numbers in this layer, are called latent variables. (A minimal code sketch of such an autoencoder follows below.)

We can then start playing with this type of network, where there is typically an inherent trade-off. If I make the network large enough, or expressive enough, in some sense, and there are all kinds of practical things I'm sweeping under the rug here: I'm assuming these neurons have some fixed precision. They are not infinitely precise, because otherwise, and this is an important point, I could stuff everything into one neuron, just by taking each input sample, encoding it as a set of digits, and having that single neuron learn all the samples by heart as digits behind the decimal point. If you know about space-filling curves, that is what it would be doing. But that requires one number with infinite precision, and we don't have that. So when I say large enough, I just mean a large enough network with many neurons; I'm only showing a small example here.

I can then try to make the information bottleneck smaller, constrain it, force it to represent whatever I give as input in fewer and fewer numbers. Typically there is a good middle point, or you can go all the way down to a single number in some cases, where "nice properties" could mean it's interpretable, or that I can use the same representation for several tasks, so it's interoperable. There are different properties you might call nice, which again depends on the task you're considering.

So here's a task that we considered. I give you a very simple two-qubit density matrix, which I generate according to some simple circuit, and I only allow you to represent it internally, in your brain, with one single number. I give you many samples to learn from, I then ask you for that single number, and based on that single number alone you need to reconstruct, as well as you can, the density matrix it corresponded to. Does anyone want to take a stab at what you would do in this particular case? The angle? What were you going to say? Quaternions, okay; that's not a bad representation for matrices either. Anyone else who agrees with the angle? In this case, that's exactly it. It's a little bit of a silly question.
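For reference, a minimal autoencoder with an explicit information bottleneck might look like the following PyTorch sketch; the input size of 32 is my assumption for a flattened 4x4 complex density matrix (real and imaginary parts stacked), not the exact architecture used in the talk.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Compress the input to a few latent variables, then reconstruct it."""
    def __init__(self, input_dim=32, latent_dim=1):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 64), nn.ReLU(),
            nn.Linear(64, latent_dim),            # the information bottleneck
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64), nn.ReLU(),
            nn.Linear(64, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)                       # latent variables ("feature/latent space")
        return self.decoder(z), z

model = Autoencoder()
x = torch.randn(16, 32)                           # a batch of flattened inputs
x_hat, z = model(x)
loss = nn.functional.mse_loss(x_hat, x)           # reconstruction loss only
```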
If I give you this matrix, or rather many of these matrices with the alphas as actual floating-point numbers, you could hopefully learn that there is one generative factor, as it's called, which is the angle alpha. If you take many such matrices and squeeze them through an autoencoder with one latent variable, which we did (I'm not showing that result here), it does exactly that: it learns to extract the angle. That's not very interesting. So instead we scramble the data with Haar-random local unitary operators, so that the numbers in the matrix are no longer recognizable as that angle. What do you think happens then? Yes, exactly: we are not scrambling away something that is still left in this representation, and if you figure out what that is, you have the answer. And yes, for a given alpha, for each shot we draw different random unitaries.

The answer is what I'm going to go through now. Of course we had some expectations, maybe some hopes, about what would come out. The bottom line is that we as humans are already doing something useful; I'll show you that in a second.

Here is how it works, with the same model as before, just to show you the whole pipeline. If we take the MNIST digits as input, we compress and then decompress them, and we get a representation in the middle, in this case a two-dimensional one shown here. We have colored the different classes, the different digits, with different colors. Without those colors this result is not particularly useful, because it doesn't cluster, but it is intuitive: even without the labels you can see, although it's a bit small here, that digits which look alike end up close together in this space. I'm showing this so you have an idea of what we're doing.

Now we replace this data set of digits with those density matrices and see what the latent space looks like. We generate many of them, scrambled, as a function of alpha (a hedged sketch of generating such samples follows below). What comes out of the circuit is a pure state; we turn it into a density matrix because later we will also want to include mixed states. Then we embed this into a two-dimensional latent space, and what comes out is a structure like this. It's not always the same, but it looks quite ordered and structured. I'm already showing some colors here, but even without them it's a compact shape that seems to have some symmetry, maybe two lines going through it. So even the simple autoencoder on this simple data set gives you structure. Is the pipeline clear? Great. So there is structure, but we don't have a handle on it.
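Here is a hedged sketch of how such a scrambled data set could be generated. The exact circuit was not spelled out in the talk, so the state below is my guess at one consistent with what was said (alpha = 0 gives a product state, alpha = pi gives a Bell state); the Haar-random unitaries are drawn locally, one per qubit, so only local information gets scrambled.

```python
import numpy as np

def haar_unitary(dim, rng):
    """Haar-random unitary via QR of a complex Gaussian matrix (phases fixed)."""
    z = (rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim))) / np.sqrt(2)
    q, r = np.linalg.qr(z)
    return q * (np.diag(r) / np.abs(np.diag(r)))

def scrambled_sample(alpha, rng):
    # Guessed circuit: H on qubit 0, then a controlled-RY(alpha) on qubit 1, so that
    # alpha = 0 gives |+>|0> (separable) and alpha = pi gives a Bell state.
    psi = np.zeros(4, dtype=complex)
    psi[0] = 1 / np.sqrt(2)                                   # |00>
    psi[2] = np.cos(alpha / 2) / np.sqrt(2)                   # |10>
    psi[3] = np.sin(alpha / 2) / np.sqrt(2)                   # |11>
    rho = np.outer(psi, psi.conj())
    u = np.kron(haar_unitary(2, rng), haar_unitary(2, rng))   # local scrambling only
    return u @ rho @ u.conj().T

rng = np.random.default_rng(0)
rho = scrambled_sample(np.pi / 2, rng)    # one training sample; a fresh unitary per shot
```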
We would like a way of tuning, of playing with, this information bottleneck, beyond just asking for one, two, or three numbers; even for a given number of latent variables we want a handle on the representation it learns. The way we did that so far (I'll come back later to what the state of the art is now) is a variational autoencoder. Have you seen variational autoencoders? Also not many, great. In a variational autoencoder, for each latent variable, think of doubling it: I no longer predict the latent variable directly, but interpret the prediction as the mean and standard deviation of a normal distribution. So for each input that goes in, I compress it with an encoder network and interpret the middle layer as means and standard deviations of normal distributions. I'm showing one here; if I had two latent variables, like in the previous slides, I would have four numbers here, two means and two sigmas.

I didn't show you this before, but in words this is exactly what I did: a density matrix goes in, gets compressed down to some numbers, gets decompressed, and something comes out that we again interpret as a density matrix, and we take the element-wise mean squared error between the two. Minimizing that means the network tries to reconstruct, as well as it can, what goes in. But because we now have this variational middle layer, we also have an extra loss term, which I will call a regularization term. Forget about the beta for now: what this term does is try to keep the distribution in the middle layer as close as it can to a standard Gaussian, mean zero and standard deviation one. I'm not going to go into why this makes sense or how it leads to more interpretable, actually disentangled, latent representations; it's a very standard thing you do with variational autoencoders.

There is an extension of this, the beta variational autoencoder, and beta is the thing we are going to tune (written out in the sketch below). It sets how strongly we want the representation to be regularized. If beta is zero, we're back to a plain autoencoder; if it's one or larger, we are trying to enforce some statistical independence between the different latent variables. I'm not deriving this, but that is what it does: roughly, if this latent variable changes property X in my data set, then the other latent variable, if I have two, had better change something different from whatever the first one is changing.

Go for it. That's right: the question was indeed, if you have more latent variables, what I'm trying to do is have each of them, each of those pairs, be a separate, uncorrelated standard normal. More questions? In the back first. Yes, exactly: your question is whether the number of input nodes is the number of elements in the density matrix, and that's what we did. Yes.
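In code, the beta-VAE objective the speaker describes amounts to the following sketch: a reconstruction term plus beta times the closed-form KL divergence of each latent Gaussian from the standard normal, with the reparameterization trick used to sample the latent.

```python
import torch

def sample_latent(mu, log_var):
    """Reparameterization trick: draw z ~ N(mu, sigma^2) in a differentiable way."""
    return mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)

def beta_vae_loss(x, x_hat, mu, log_var, beta=1.0):
    """Element-wise reconstruction error plus a beta-weighted pull towards N(0, 1)."""
    recon = torch.nn.functional.mse_loss(x_hat, x)
    # Closed-form KL divergence of N(mu, sigma^2) from the standard normal, per latent variable.
    kl_per_dim = -0.5 * (1.0 + log_var - mu.pow(2) - log_var.exp())
    return recon + beta * kl_per_dim.sum(dim=-1).mean()
```

With beta = 0 this reduces to the plain autoencoder objective; increasing beta tightens the bottleneck and pushes the latent variables towards being independent.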
What Manin talked about, for example, would be a different representation, where from these density matrices you first measure two-body reduced density matrices and then use those as input. What we do here obviously will not scale to large qubit systems, because we feed in the whole density matrix, but you could also first do representation learning on that and feed the result in as input; there are many ways to reduce it. Here we focus on interpretability, not on scaling to very large qubit systems. Does that answer your question?

Another question: what happens if they are overparameterized, if you have more latent variables than generative factors? Ideally they become redundant: they end up being copies of each other, and you can detect that. In the worst case, if you don't regularize properly, they might learn, depending on your activation functions, some weird nonlinear combinations of the generative factor, and then you're in a bad situation because you have difficulty keeping them apart. Anything else? Keep the questions coming, that's great.

So, a recap plus a small demonstration. Scrambled density matrices go in; we give the latent space eight dimensions; and then we start tuning beta. We train several models with different betas on the same data set and check the result. On the x-axis here are the latent variables, actually the means of the latent variables; the y-axis is beta; and the color says, for each latent variable, how much it contributed to the Kullback-Leibler loss (a short sketch of how to measure this follows below). We've sorted them from left to right by decreasing contribution. You see that for this data set with the scrambled rho, it's faint, but it almost never uses more than three latent variables. More interestingly, there is a range in beta space, let's call it, where it finds a one-dimensional representation, which is what I asked you for before: represent these density matrices with just one number.

So let's take a look at one of them, at beta equal to 0.75, I think. The model still has eight latent variables, and we're still plotting the first two, just for visualization. Clearly something happened, and that is also the right interpretation of this plot: there is clearly some structure here that we can still exploit. Let's investigate a bit more. Take that one latent variable, the first one, call it z0, and plot it against the alpha we know for each sample from the data; remember, this was the circuit parametrized by alpha. We see these two branches, so there is clearly a symmetry that did not get removed from the system, but that's okay. Does anyone, and this is a question for the information theorists, recognize what such a curve as a function of alpha would be? The answer that Benoit gave was entropy.
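Before moving on to what the latent variable means, here is my reading of how the per-latent-variable contribution to the Kullback-Leibler loss in that beta-scan plot can be estimated; this is a plausible diagnostic, not the exact script behind the figure.

```python
import torch

def latent_usage(mu, log_var):
    """Batch-averaged KL contribution per latent dimension: values near zero mean
    that latent variable has collapsed to the prior and is effectively unused."""
    kl_per_dim = -0.5 * (1.0 + log_var - mu.pow(2) - log_var.exp())
    return kl_per_dim.mean(dim=0)

# Hypothetical usage for one trained model at one value of beta:
# mu, log_var = model.encode(validation_batch)          # shapes (batch, 8)
# print(latent_usage(mu, log_var))                      # e.g. only ~3 entries clearly above zero
```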
What alpha does, of course, is this; I never explained the circuit properly, apologies. If alpha is 0, I never do the rotation, and I just have two unentangled qubits. If alpha is such that I do exactly a bit flip on the second qubit, then this circuit is exactly the circuit that generates a Bell state, so I have a maximally entangled state. Between here and pi, I go from an unentangled state to a maximally entangled state. And a measure we already knew, specifically for two qubits, is for example the concurrence. Concurrence is not easy to calculate per se: you first have to compute this matrix R, for which you take the square root of the density matrix. Sorry, I just realized you don't see my cursor on this screen; I've been pointing at all kinds of things. You first transform the density matrix with sigma-y, take square roots, then compute the eigenvalues of this R, and the concurrence is the max of zero and the largest eigenvalue minus the other ones (a code version is given below). And there is also, yes, go for it. I'll come to that in a second: concurrence actually turns out to be the same thing as negativity for two qubits, thank you.

I cannot help telling the joke a science teacher once told me: how do you distinguish experimentalists from theorists? You give them a laser pointer. Exactly: the theorist looks into it to see if it works, and the experimentalist points it away. I am more of a theorist, but I've learned not to do that anymore.

So concurrence is not easy to compute, but here is how we found this curve: because we are scrambling only locally, we expected that non-local information between the two qubits, things like entanglement entropy, would be preserved. So we intuitively knew we should be looking for entanglement properties, and there are not that many you compute for two qubits, so this is the one we knew to look for. Other questions? Okay.

This curve is not exactly the same: you see there is a linear shift in the latent variable. The latent variable starts at zero and is symmetric, so there is a linear transformation you have to apply, but it is monotone in both cases. If we now take the value of the concurrence as a color for each representation, you again see the structure from the very first autoencoder plot: moving towards larger z0 you get more concurrence, and around z0 in the middle, where I don't have labels here, you see the corresponding behavior. So this representation clearly is entanglement information. If it had been something else, I would perhaps have been even more excited, because it would mean that what we have been using as a single-number representation for two qubits should maybe not have been entanglement information; maybe this compression method, this quote-unquote AI, would have come up with some other number we should have been looking at. Here, at least, it confirms that concurrence, or negativity, which is the same for two qubits, works.
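For completeness, here is the standard Wootters concurrence the speaker describes, written out as a short function; note that local, single-qubit unitaries leave this number unchanged, which is why the scrambling in the earlier sketch does not destroy it.

```python
import numpy as np
from scipy.linalg import sqrtm

SIGMA_Y = np.array([[0, -1j], [1j, 0]])

def concurrence(rho):
    """Wootters concurrence of a two-qubit density matrix (pure or mixed)."""
    yy = np.kron(SIGMA_Y, SIGMA_Y)
    rho_tilde = yy @ rho.conj() @ yy                      # spin-flipped density matrix
    r = sqrtm(sqrtm(rho) @ rho_tilde @ sqrtm(rho))        # the matrix R from the talk
    lam = np.sort(np.real(np.linalg.eigvals(r)))[::-1]    # eigenvalues, largest first
    return max(0.0, lam[0] - lam[1] - lam[2] - lam[3])
```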
And what I'm not showing you here is that this also works for density matrices of mixed states, which is the nice thing about concurrence. For the experts: concurrence works for two-qubit mixed states, the same as negativity, but there are cases where it no longer works if you go beyond two qubits, to qudits for example. I see someone maybe disagrees; did I say it the other way around? Okay, yes, concurrence is only for two qubits, that's correct, sorry, and negativity works for others (see the short negativity sketch below). But if I'm not mistaken, if I go to larger systems, maybe qudits or qudits coupled to qubits, does negativity still work there? That's a good question, so ideally the next thing we would do, which we haven't done yet, is try it and see what happens: if you have more qubits you can make more bipartitionings, so what set of numbers would this method come up with as representations?

Just to show what I already flashed before, it is of course also interesting to look at what happens at other points in beta: could we have seen this just from looking at those plots? The answer is yes. Here you see the representation it comes up with when it uses three latent variables (we're just showing two). If you didn't have the colors this would not be informative at all, but there is this point at 0.75 where you have these two latent variables and a very clear one-dimensional representation, also separated in the color scale.

We did try some things just to see if this two-qubit picture works for subsystems of three qubits, as a check. We also know how to make three-qubit entangled states, in this case W-like entanglement; I'm not going into detail, this is a W state, and you can again parametrize it with an angle so that at alpha equal to zero you have three separable qubits and, as you increase alpha, you get something more and more like a maximally entangled state. With three qubits you can now start thinking about bipartitions: call them A, B, C, and ask what the entanglement is between the group B, C and the first qubit, and all the other subsets. For that, in this case maybe this should have been called negativity, but it again works: there you see the correlation of the latent variable, with absolute values taken to remove the symmetry you saw in both directions, and it is again very nicely, linearly correlated with the concurrence here.

So, yes, you're asking if I can say a bit more about this linear correlation. Go back to the two-qubit case with concurrence: if you make this plot of the latent variable against concurrence and there is a linear relation between the two, then they are clearly linearly correlated, so one is a linear rescaling of the other. That is how we checked whether that latent variable is indeed doing something like concurrence, and if the relation is linear, then yes. I think you should turn the question around: this is a confirmation that whatever z is doing is linearly related to the concurrence. If this were not a linear relation, then z would not be concurrence; it would be something else. Does that make sense?
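Negativity, the measure that does generalize beyond two qubits, can be computed from the partial transpose; here is a small sketch for an arbitrary bipartition, where the subsystem split is chosen by the caller (for the three-qubit A versus BC split mentioned above, pass dims=[2, 2, 2] and subsystem=0).

```python
import numpy as np

def partial_transpose(rho, dims, subsystem):
    """Partial transpose of rho over one subsystem; dims lists the local dimensions."""
    n = len(dims)
    rho_t = rho.reshape(dims + dims)                      # indices (i_1..i_n, j_1..j_n)
    rho_t = np.swapaxes(rho_t, subsystem, n + subsystem)  # transpose only that subsystem
    return rho_t.reshape(rho.shape)

def negativity(rho, dims, subsystem=0):
    """Sum of |negative eigenvalues| of the partial transpose (0 for separable states)."""
    eigs = np.linalg.eigvalsh(partial_transpose(rho, dims, subsystem))
    return float(np.sum(np.abs(eigs[eigs < 0])))

# Example: a two-qubit Bell state has negativity 0.5.
bell = np.zeros(4, dtype=complex); bell[0] = bell[3] = 1 / np.sqrt(2)
print(negativity(np.outer(bell, bell.conj()), dims=[2, 2], subsystem=0))
```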
Yes, that's right: I cannot prove that it is exactly computing concurrence, but whatever it is computing is linearly correlated with it. You're right, it could in principle be a coincidence, but the interpretation is that concurrence is then probably the number you as a human should be computing too for this representation.

Yes, that is true, and that's why we have eight latent variables here: for this four-by-four matrix you would need eight if you want to fully capture the whole density matrix. And if I turned beta off entirely, set it to zero, then with or without the scrambling you're right, you would need eight latent variables, and you could then have each latent variable simply learning elements of the density matrix; I think that is what you mean. You can see exactly that here: in the cases where you allow a bit more capacity, if I'm not mistaken (I should also look at Felix up there, he did all of this), you will find that it starts learning exactly this redundant information. So I think you're right: if you remove the beta and really just force it to do a perfect reconstruction, you get that result, but as soon as you include beta, you introduce the bottleneck and you see that it starts finding other combinations. Does that answer your question? Whether it learns those features? I don't think we ever truly looked at all of them to see whether, for example, they would map onto other preserved information; we did not check that, but I'm happy to try.

I want to move on to a few other quick things. What has emerged as the state of the art for representation learning nowadays is contrastive learning. Contrastive learning doesn't have to be about representation learning per se, but it turns out to give you good representations, useful ones to do other things with. The idea, capturing the same statement as with the variational autoencoder, is that you map all of the inputs into a latent space, but now you make sure in a different way that similar inputs end up close together in that latent space and dissimilar inputs end up far apart. Contrastive learning does this explicitly: instead of trying to get uncorrelated Gaussians, I introduce a distance measure in the latent space and try to maximize the distance between dissimilar samples and minimize it between similar ones. Of course I then need a way of saying which inputs are similar or dissimilar, and you do that by picking an example from your data set, which we call the anchor. Compared to the anchor, I pick two other samples: a positive sample, one that I know is similar, which could be the same picture with noise, or one that I know is supposed to be another dog; and a negative sample, one that is dissimilar, maybe because I take it from a different distribution, or because it's a different animal. I use the same encoder
to bring them into the latent space, so each of these pictures is now represented as a feature vector, a latent vector: one for the positive sample, one for the anchor, and one for the negative sample. Then I try (maybe I should have flipped these, it depends on whether this is a distance or not) to push the anchor and the negative sample as far apart as I can, and pull the anchor and the positive sample as close together as I can. This turns out to work really well.

So what we're trying now is to do exactly this on these systems, on circuits, now not with two qubits but with six, maybe more or fewer, for measurement-induced phase transitions. I think you've seen a little bit about those in previous talks, but the intuitive idea is simple: if I have a circuit where I do no measurements, I likely end up with some state that has some measure of entanglement; if I build it from random unitaries, Haar-random, I probably get a volume-law entangled state. Intuitively, if I now start measuring everywhere, in particular at the output, I get a product state, so lower entanglement. Somewhere in the middle, between never measuring and always measuring, there is a transition in whether the average circuit with measurements is area law or volume law. If you're not familiar with that, don't worry about it; it just says how the entanglement scales with the number of qubits. So we do the same thing: we pick a density matrix, the negative sample is the one with measurements, the other is from the same circuit but scrambled, and we encode them. A very preliminary but also intuitive result, sorry, is that it immediately starts distinguishing those two classes. What I didn't tell you is that the measurements happen with some probability that I'm not showing here, just because this is preliminary, but the one-dimensional representation you find already separates the strongly entangled states from the area-law entangled ones, and I hope this is also intuitive. There is then a way of doing explainability on this, looking at individual pixels: what you see here is a density matrix of six qubits and the dominant features it looks at when deciding between area law and volume law. It's a simple but important point, and I hope no one disagrees that for volume law you probably have to look at many more entries.

My time is up, so the last thing I want to show quickly, because I cannot help it, is quantum games. My favorite playground for thinking about representation learning is quantum control problems and quantum error correction. You all hopefully know the CartPole problem in reinforcement learning, where you have a stick and try to balance it. You can make a quantum version of that and have a reinforcement learning agent try to control the system. I'm not going to explain exactly how that works, with all the weak measurements and the control, but here too you can ask: given, for example, the wave function or weak measurements, what representation does the agent use to do that? The same holds for quantum error correction, which, on the surface code, the toric code, or the color code, can be
interpreted as a single-player game that you can also have a reinforcement learning agent play, and again there is a neural network in the brain of this agent and you can ask what the representation is there. There are also a few other games I would encourage you to check out. Those are more entertainment-type games, but still: making an AI for chess is already a difficult problem; we all know AlphaZero now, or MuZero even. There is an extension of chess with a set of quantum rules, and there is something called quantum game theory, combinatorial quantum game theory, so you can embed this with a lot of mathematical rigor. But an AI for such games had better also find a representation for the states of the system, and it's interesting, for me at least, to think about what that representation would be. Any further questions? Please feel free to find me here or reach out. Thank you for your attention.

Thanks a lot for this great talk. Are there any more questions? For the contrastive learning: I know there is self-supervised contrastive learning where we augment the data by cropping, adding noise, or rotating, in order not to need labels. My question is how little similar data we can get away with, because there is a limit on how much we can augment the data before we crop out all the information or transform it too far. And the other question: if it's maybe hundreds of variations of the same data point, is it possible, for example, to train a hundred autoencoders on the density matrix, so that we have a hundred different embeddings, maybe because the autoencoders have different kernels or just learn different embeddings at random, and then use these embeddings for contrastive learning?

Okay, two questions. The first one really probably depends on the exact problem: how low can you go in the number of samples per data point? I don't think I have a good rule of thumb for this; there is probably a bit of trial and error. Do I understand that correctly? Yes: you've seen papers with a few tens to hundreds and papers with a hundred thousand per data point, so it varies a lot. I think a good approach is that if you can get more data, get more data; one thing we clearly see is that more is better, even for these small systems. For this particular case I don't remember exactly, but ten is not enough, a million is probably more than you need, and somewhere in between is a good point where you find stable representations.

For the second question there is more to add, because a lot of it, for us here for example, depends on the specific loss function. I didn't spend time on this, but I'll keep it short: we took the element-wise mean squared error on the density matrices. Now, there are more off-diagonal elements than diagonal ones, so in this mean squared error we are already emphasizing the off-diagonals just through the reconstruction loss; with that choice we are already biasing it a little towards looking at off-diagonals. And so what we're also trying to investigate now is what
happens if we feed in different input representations: instead of the density matrix, do what Manin did with the two-body reduced density matrices, or some other set of POVM measurements. I think that is a little bit what you're getting at: that already means we have a different representation that we're feeding into this autoencoder. So I think it's definitely an interesting idea to train many different autoencoders and use those as inputs to the representation learning scheme. Thank you.

Are there any more questions? You can also shout and I can repeat it. Yes: in our case, for the contrastive learning, were the negative samples those with measurements? We just say whether a sample is similar to the anchor or not; the anchor did not have the measurements and the other one did have them, and we are not telling it explicitly that this is volume law or area law, even though very likely with measurements you end up in the area-law phase. It's a very simple first step, I think, but if this hadn't worked, something would have been wrong.

Are there any more questions? On that note I also wanted to ask, regarding the question just now: if you want to learn something that goes beyond entanglement, is there maybe a way to set it up so that between the anchor and the positive sample you already destroy the entanglement in a specific way, and then it learns something else that is stored in the circuit? I'm not sure I fully understand: you're asking whether it might learn something that is not related to area law versus volume law? Exactly, and whether you maybe have to modify the setup so that between the anchor and the positive sample the entanglement is already destroyed, so that it does not learn that. Yes, I think, well, we're just starting on that, so that's an excellent suggestion; we'll try it and talk more. That's good. Well then, let's think it over again.