Next up will be Emile Mathieu, who is currently a postdoc in Cambridge. Take it away.

Thanks for the introduction and, I guess, the invite. Indeed, I'm a postdoc in Cambridge, and I'll talk about geometric neural diffusion processes, where we basically propose a probabilistic model for modeling tensor fields, or feature fields; I'll give more motivation and details in what follows. This is joint work with Vincent Dutordoir, Michael Hutchinson, Valentin De Bortoli, Yee Whye Teh and Richard Turner, who are all collaborators either in Cambridge, Oxford or Paris.

First I'll give a very brief intro to generative modeling, where I'll try to motivate why we care about this probabilistic setting. Then some background on continuous diffusion models, although quite a lot of that material has already been introduced in the previous talk. Then I'll show how we can extend that setting to function space in order to model stochastic processes, and eventually how to model feature fields, that is, how to incorporate geometric constraints into the model.

So first, on generative modeling: there are quite a few settings where we truly care about this kind of probabilistic modeling and not just a deterministic guess. One is molecular conformation generation, where, say, you have access to a molecular graph and you'd like to predict the 3D structure of that molecule. You could just have a graph neural network that takes the graph as input and outputs a 3D structure, but because of temperature fluctuations this 3D structure is actually not fully frozen, it fluctuates; and depending on the specific molecule there might be different metastable structures in which the molecule could sit. So it doesn't really make sense to make a single guess, because there is actually a full distribution over the 3D positions of the atoms. Here the object is intrinsically random, so you'd like your model to incorporate that randomness.

Another example is meteorology: say you'd like to predict precipitation over the following hours, days or even weeks, which is actually extremely tricky. I'm not a physicist myself, but from talking to atmospheric physicists it seems the physics is still not fully understood, and it's also quite chaotic: a small change in the initial conditions might mean it rains here but not 500 metres away. So here it's more a lack of knowledge, due to our limited understanding of the physics but also to limited data, that makes us want to incorporate uncertainty into the modeling, so that we know how much we actually don't know. Instead of predicting how much precipitation there will be, we'll have a full distribution over the precipitation. So these are two different perspectives: in the first example there is some intrinsic randomness, and in the second we want to introduce uncertainty into the modeling to capture our limited knowledge about the problem.
Deep generative models are basically neural-network-based models that parameterize a probability distribution, and the idea is that we'd like to be able to draw samples from that distribution, or to evaluate the likelihood of samples under the model, or both. We typically assume we have access to samples from the data distribution, some kind of dataset; the setting where we instead have access to an unnormalized density or an energy is usually referred to as a sampling problem. Following the previous talk, most probabilistic models fall into a fairly general framework where one starts with some base distribution, for instance a normal, or even something like a checkerboard, then takes a parametric transformation and pushes that density forward along it, which induces another, potentially much more complex, distribution. The idea is then to tune the parameters of that transformation so that the induced probability distribution is close to the target, the data distribution that you care about. Once you've done that, you can generate new samples, or evaluate the likelihood of a new sample under your model. That's the general framework, and diffusion models do fit into this setting; as mentioned previously, they've been shown to work extremely well in practice, and that's why in this talk we build on this class of models.

So let me give a brief background on continuous diffusion models. The key idea is to start with data samples; a slight difference with the previous talk is in how the time index is set up: the initial data distribution, the samples, correspond to y_0 at t = 0, and we then noise them until we reach some base or limiting distribution at time t = T. So the idea is to destroy the data with a continuous injection of noise, and this is formalized by a stochastic differential equation, which is introduced as the forward noising process because it continuously noises and destroys the data samples up until time T. The point is that this converges to a known distribution, such as a Gaussian. The core idea of diffusion models is then: since we have this forward noising process going from the data distribution to a known distribution, can we simulate its time reversal, starting from this known distribution, to get samples distributed according to the data distribution that we care about?
In equation one we have a fairly wide class of SDEs, with a drift term b and a volatility or diffusion coefficient sigma, and in equation two we have a specific instance referred to as Langevin dynamics. What's nice here is that Langevin dynamics admits as invariant measure something proportional to exp(-u), with u being some kind of energy potential, so you can choose u to control what this stochastic process converges towards. If you take the specific example where u is constant, then b is zero, you simply have the Brownian motion term, and the invariant measure is just the Lebesgue measure. Another very well studied example is the Ornstein-Uhlenbeck process, where the drift is linear in the variable y_t; if you choose the coefficients wisely, the invariant measure is a standard normal with mean zero and variance one. What's also quite nice is that, conditionally on a data sample y_0, for any time t we know the form of the conditional of y_t given y_0 explicitly: it is Gaussian, with mean and variance parameters depending on y_0 and on the noising time t. I didn't include that slide, but for people who are more familiar with discrete-time diffusion models, you can see those as discretizations of these SDEs, and vice versa, so the two views are connected in this way.

Is there any question, or something that's unclear, or any remark? I'll also leave some time at the end for that. Is that roughly clear? Okay.

So then, what we care about is this time reversal: when we reverse the time index, what is the dynamics of that process? What's nice is that, under very mild assumptions, it has been shown that this time reversal also satisfies an SDE of a pretty similar form, given here: the diffusion term sigma is the same, and the only difference is in the drift, where you have the opposite of the forward noising drift. You can see that if sigma were zero this would just be the time reversal of an ODE, but since we have some noise sigma, there is this extra term in yellow, which is referred to as the Stein score. Intuitively, with p_t being the density of the forward process at time t, the score tells you the direction in which that density increases, so in some sense where you would like to go in order to get closer to the data manifold, especially as t gets close to zero. That gives us a recipe for building a generative model: start with ȳ_0, the initial value of the time reversal, which is the end value of the forward process, and if you were able to simulate that reversal, then at time T you would get samples distributed according to the data distribution, which is the aim.
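To make the Ornstein-Uhlenbeck example concrete, here is a minimal sketch of that closed-form conditional, assuming a variance-preserving parameterization with a constant beta; the schedule is an illustrative choice rather than the exact one from the talk.

```python
import numpy as np

def ou_conditional(y0, t, beta=1.0):
    """Conditional y_t | y_0 for the OU noising process
    dY_t = -0.5 * beta * Y_t dt + sqrt(beta) dW_t,
    whose invariant measure is the standard normal."""
    mean = np.exp(-0.5 * beta * t) * y0        # shrinks the data sample towards 0
    std = np.sqrt(1.0 - np.exp(-beta * t))     # noise level grows towards 1
    return mean, std

# noising a data sample: for large t this is close to N(0, 1), the limiting distribution
y0 = np.array([2.0, -1.0])
for t in [0.01, 0.1, 0.5, 2.0]:
    mean, std = ou_conditional(y0, t)
    yt = mean + std * np.random.randn(*y0.shape)
    print(t, yt)
```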
There are obviously a few issues which prevent us from doing that. The first one, as mentioned earlier, is that we don't actually have access to y_T, the end point of the diffusion: we have access to y_T given some specific y_0, but not to the marginal. The idea is that if you run the process for long enough, and specifically if you use Langevin dynamics, it converges at a geometric rate, so actually pretty quickly; it's approximate, but it can be relatively close, so we can simply start from this invariant, limiting distribution. The second issue, maybe the main one, is that we don't have access to the Stein score, because that would involve solving the Fokker-Planck equation, and we don't want to go down that avenue; the idea instead is to learn it, and that's what I'm going to show on the next slide. The last issue is that even if we knew the Stein score, and even if it had a very simple functional form, say linear in y_t, there's no way we could solve that SDE in closed form, so we need to discretize it.

So, how do we approximate the score? There is a pretty simple score matching identity, which just involves the derivative of the log, the fact that the marginal is the integral of the joint, and Bayes' rule, and with those you can show equation three. What's nice is that the right-hand side of equation three is a conditional expectation, and that tells you that the left-hand side is the minimizer of the following quadratic optimization problem over functions s. So that readily gives us a loss to use in order to approximate the Stein score: the idea is to use a neural network with a finite number of parameters theta and to plug our score network into equation four, which is called the denoising score matching loss. Since we know y_t given y_0 in closed form, we can compute this conditional score, and we can also approximate the expectation with a Monte Carlo estimator. Finally, we plug the score network into the reverse SDE in place of the true Stein score and rely on a first-order discretization scheme; starting, for instance, from white noise, Gaussian noise on each pixel, and using this Euler-Maruyama scheme to approximately simulate the time reversal, one obtains samples that are approximately distributed according to the data distribution.

Okay, so as a quick recap, the idea of continuous diffusion models is really two things: one, construct a noising process, formalized with an SDE, in order to continually destroy the data samples, and two, approximate the time reversal of that process.
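As a rough sketch of those two learning and sampling ingredients, here is what the denoising score matching loss and the Euler-Maruyama simulation of the time reversal might look like for the OU process above; `score_net` is a stand-in for whatever trained network you would actually use, and the schedule and step sizes are illustrative.

```python
import numpy as np

def dsm_loss(score_net, y0_batch, beta=1.0):
    """Denoising score matching: regress score_net(y_t, t) onto the
    conditional score grad log p(y_t | y_0) = -(y_t - mean) / var."""
    t = np.random.uniform(1e-3, 1.0, size=(len(y0_batch), 1))
    mean = np.exp(-0.5 * beta * t) * y0_batch
    var = 1.0 - np.exp(-beta * t)
    yt = mean + np.sqrt(var) * np.random.randn(*y0_batch.shape)
    target = -(yt - mean) / var                  # known in closed form
    return np.mean((score_net(yt, t) - target) ** 2)

def euler_maruyama_reverse(score_net, shape, n_steps=1000, beta=1.0):
    """First-order simulation of the reverse SDE, from the limiting N(0, I) back to t = 0."""
    dt = 1.0 / n_steps
    y = np.random.randn(*shape)                  # start from the limiting distribution
    for i in range(n_steps, 0, -1):
        t = i * dt
        drift = -0.5 * beta * y - beta * score_net(y, t)   # forward drift minus sigma^2 * score
        y = y - drift * dt + np.sqrt(beta * dt) * np.random.randn(*shape)
    return y
```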
Is there any question on that part? Because I'm going to build on it. Okay.

So now I'd like to introduce the method we propose, a probabilistic model over feature fields; that's the aim of this talk. Briefly, and I'll formalize this just after: tensor fields are mappings from an input space to a tensor, and tensors are geometric objects that transform in a particular way when the coordinates change, so think of a rotation or translation of your frame in Euclidean space. We'd like to enforce invariance with respect to group transformations for such objects, and we achieve this in two ways: first by extending diffusion models to function space, by correlating the finite marginals at different inputs, and then by enforcing invariance through a kernel and a score network which are group equivariant. I'll go into more detail on how we do so in practice.

First a few examples to motivate why we care about feature fields: in the natural sciences there are many problems where the objects of interest are feature fields. One can think of oceanography, where the salinity is a scalar field and the current is a 2D or 3D vector field, but also of the vorticity in the atmosphere, which is a pseudo-vector field, which is slightly different again, or of mechanics, with for instance the stress or strain tensor, which is also a tensor, just a higher-dimensional object.

More rigorously, a feature field can be seen as a tuple: first a mapping f from some input space to some value space, which we assume to be Euclidean here, and second a representation rho, which tells you the type of this tensor and maps a group element to a linear transformation over the values. Focusing on the Euclidean setting, with the group being the Euclidean group, that is, the translations and rotations of Euclidean space, transforming a feature field f by a group element g, with g being a translation u composed with a rotation h, transforms the field as given by the right-hand side of equation five. If the top-left picture is the original feature field, you first look at the value of your feature field at g^{-1}(x), and then you transform that value according to rho(h). Here rho depends on the specific type of field you're interested in modeling: for a scalar field rho is just one, so the values don't change at all, which is why the middle and right-hand plots have exactly the same colors; but for the black arrows, which form a 2D vector field, rho(h) is h itself, the identity mapping on the group, so you additionally need to rotate the arrows. This concept of representation generalizes to tensors of higher dimension than vector fields, for instance the stress tensor in mechanics. And so what we want to do is to build a probabilistic model over such objects.
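Here is a small sketch of the transformation law in equation five for fields on the plane; the field f, the group element and the representation are toy choices for illustration, not the examples from the paper.

```python
import numpy as np

def transform_field(f, h, u, rho):
    """Return the transformed field (g . f)(x) = rho(h) f(g^{-1} x),
    for g = (translation u, rotation h), as in equation five."""
    h_inv = h.T                                     # inverse of a rotation matrix
    def gf(x):
        x_back = h_inv @ (x - u)                    # g^{-1} x
        return rho(h) @ f(x_back)
    return gf

# toy 2D vector field f(x) = A x, transformed by a 90-degree rotation
A = np.array([[0.0, -1.0], [1.0, 0.0]])
f = lambda x: A @ x
h = np.array([[0.0, -1.0], [1.0, 0.0]])             # 90-degree rotation
u = np.zeros(2)

rho_vector = lambda hh: hh                          # vector field: rho(h) = h, rotate the arrows
# for a scalar-valued field one would instead take rho(h) = 1 and leave the values untouched

gf = transform_field(f, h, u, rho_vector)
print(gf(np.array([1.0, 0.5])))                     # equals h @ f(h^{-1} x)
```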
So first, how do we extend diffusion models to function space? The idea is basically that if we look at the values of our function at different inputs x_1 to x_n, so n values, we noise all of these values jointly with a kernel k. You then have, for instance, a multivariate Ornstein-Uhlenbeck process noising each of these values with this kernel k, and that's why you have this Gram matrix K(x, x) in front of the Brownian motion. What's nice, and this is similar to the finite-dimensional setting because you're only looking at a finite number of inputs x_1 to x_n, is that this process converges to a multivariate normal whose mean and covariance are given by m and K. And we show that, since this holds for any finite marginal, this in practice builds a diffusion model over function space, and the process converges to a Gaussian process with mean and kernel given by m and k. So you can see y_t as an interpolation between stochastic processes, y_0 being the data process you're trying to model and y_infinity being this GP at the infinite-time limit. And as before, everything is available in closed form, everything is Gaussian, which is very convenient for training purposes. Depending on the specific choice of kernel, the process will converge to different GPs, here with a smaller length scale, and in the limit of an infinitely small length scale you would converge to white noise, so there you would actually noise each of the y_{x_k} independently. Yeah, that's a modeling choice, depending on what you may know about the data process, and the same goes for the mean, although you would typically assume a zero constant mean; that's up to you.

Similarly, we then have a time-reversal process, and the only difference is that we have this Gram matrix K in front of the score; otherwise it's the same idea as previously. For that time reversal we start with the GP, so ȳ_0 is the GP, and the idea is to simulate the reversal so that at ȳ_T we have sample paths distributed according to our data process y_0. Similarly, we don't have access to the Stein score, so we have to approximate it, and since the Stein score is preconditioned by this Gram matrix, we suggest to directly learn the preconditioned score with the neural network. Again, the conditional of y_t given y_0 is available in closed form, so we also use the denoising score matching loss, with a weighting that involves this Gram matrix: both terms inside the quadratic are multiplied by K, so it is K times the score that appears, which we can directly approximate with the neural network, and on the right-hand side it's nice because there is no K inverse, we just have, I guess, a Cholesky decomposition to do. That's the training loss that we actually use. Okay, so, is there any question on this?
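A minimal sketch of this kernel-correlated noising of the finite marginals, assuming a zero-mean limiting GP with an RBF kernel and a constant beta; these are illustrative choices rather than the exact setup of the talk.

```python
import numpy as np

def rbf_gram(X, lengthscale=1.0, jitter=1e-6):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2) + jitter * np.eye(len(X))

def noise_marginals(y0, X, t, beta=1.0):
    """Closed-form sample of y_t | y_0 at inputs X for the kernel-correlated OU
    process: the finite marginals converge to N(0, K(X, X)), i.e. to a GP."""
    K = rbf_gram(X)
    L = np.linalg.cholesky(K)                       # K^{1/2}, used to draw correlated noise
    mean = np.exp(-0.5 * beta * t) * y0             # shrink towards the (zero) GP mean
    scale = np.sqrt(1.0 - np.exp(-beta * t))
    return mean + scale * (L @ np.random.randn(len(X)))

X = np.linspace(-1.0, 1.0, 50)[:, None]
y0 = np.sin(3.0 * X[:, 0])                          # a "data" function observed at X
y_half_noised = noise_marginals(y0, X, t=0.5)       # partially destroyed sample path
y_limit = noise_marginals(y0, X, t=50.0)            # essentially a sample from the limiting GP
```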
So now I'll go into the invariance constraints we'd like to enforce in our neural diffusion processes. A stochastic process f drawn from a probability measure mu is said to be G-invariant if the measure itself is G-invariant. Maybe it's easier to understand in terms of samples: if you take a set of input-output pairs (x, f(x)) and you transform that set with a group element g, then the two sets have the same distribution whenever the stochastic process is G-invariant. That's the definition. You can think of the 1D case: if you take sample paths and translate them, then if f is translation invariant, which is often referred to as being stationary, the translated sample paths will have the same distribution. That's the property we'd like to enforce in our model.

One specific and rather simple example is an invariant GP, and it has been shown that a GP is G-invariant if and only if the mean and the kernel are G-equivariant, as defined in equation 9. That implies that the mean is constant; for the kernel it's a bit trickier, and I give some examples in the Euclidean setting. The simplest idea is just to take k to be k0 times the identity matrix, with k0 group invariant, so for the Euclidean group think of the RBF or the Matern kernel, for instance. But more interesting choices have been suggested: with this diagonal kernel the x and y components, on the left-hand side, are independent GPs, so you essentially have two scalar GPs, but you can also use richer kernels, which lead to the sample vector fields in the middle and on the right-hand side, which have zero divergence and zero curl respectively. These satisfy the G-equivariance constraint and therefore lead to G-invariant GPs.

What we would like, then, is to enforce that group invariance constraint in our generative model over function space, and what we show is that you need two things. First, the limiting process should be a group-invariant GP, which, as just shown, you can get with a group-equivariant mean and kernel. Second, the score network should also be group equivariant, and since we parameterize the score network with a neural network, we can simply choose a neural architecture that enforces this: depending on the type of field we're modeling, a scalar field that we want to be translation invariant, a vector field, or even higher-dimensional objects, we choose an architecture that enforces group equivariance with respect to that type of field.

Here's an illustration of what that means in practice in terms of samples. In the leftmost column we have samples from this invariant GP; this is what we start from, this is our 'white noise', and here we use an RBF kernel. In the bottom row is the sample from above transformed by a 90-degree rotation. The generative model then applies the denoising scheme, approximating the time reversal with a score network that is here E(2)-equivariant, and what you see when it is fully denoised, at time one, is that the sample at the bottom is still the 90-degree rotation of the sample above. That's because both samples have the same distribution; they are the same apart from this 90-degree transformation.
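As a small sanity check of that equivariance condition for the simplest diagonal kernel, here is a numerical test; the kernel, the lengthscale and the group elements are toy choices.

```python
import numpy as np

def k0(x, y, lengthscale=1.0):
    """E(2)-invariant scalar kernel (RBF): depends only on |x - y|."""
    return np.exp(-0.5 * np.sum((x - y) ** 2) / lengthscale**2)

def k_diag(x, y):
    """Simplest matrix-valued kernel for 2D vector fields: k0(x, y) * I."""
    return k0(x, y) * np.eye(2)

def is_equivariant(k, R, u, n_trials=10, tol=1e-8):
    """Check k(g x, g y) == R k(x, y) R^T for g = (u, R), as in equation 9."""
    for _ in range(n_trials):
        x, y = np.random.randn(2), np.random.randn(2)
        lhs = k(R @ x + u, R @ y + u)
        rhs = R @ k(x, y) @ R.T
        if not np.allclose(lhs, rhs, atol=tol):
            return False
    return True

theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
u = np.array([2.0, -1.0])
print(is_equivariant(k_diag, R, u))                 # True: such a kernel gives a G-invariant GP
```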
So this gives us group invariance over stochastic processes, but in practice what we're often interested in is the conditional setting: if we have access to some set of observations, we'd like to be able, conditioning on those observations, to sample from the conditional process. For instance, in the earlier weather forecasting example, you have data from some weather stations, but you'd like to know the precipitation at different space-time points. The property of interest is then more a group equivariance, given here: if you transform your context set, the set you're conditioning on, you'd like the conditional process to be transformed in the same way. There's an illustration in figure five for a 1D scalar field example: if you translate those three red dots we are conditioning on by two, the conditional process is transformed in the same way, so all the sample paths in blue are also translated from left to right. There is another example on the right with a 2D vector field, where the transformation is a 90-degree rotation: the context set here is these red arrows, and when we transform the context set by a 90-degree rotation, the conditional process is similarly transformed.

So the question is how to enforce this, similarly to the G-invariance of the unconditional process, and what can be shown is that these are two sides of the same coin, they are actually the same thing: if your stochastic process is G-invariant, then the conditional process is G-equivariant, so it comes for free. And we've already shown how to enforce our stochastic process to be G-invariant, by having a G-equivariant kernel and a G-equivariant score network.

In practice we'd then like to be able to get samples given a context set: calling the context set C, and given some query points x* that we're interested in, we'd like to sample those y*. There has been proposed in the literature, in this paper by Brian Trippe et al., and I guess also by Yang Song, a relatively simple scheme which involves two things: one is to noise the context, the y_C we want to condition on, up to the noise level t we're currently at; then we apply a denoising step, with some discretization step gamma, to both the context y_C and the variable we're interested in, y*. The idea is just to repeat this until you've fully denoised your samples, and what can be shown is that, as gamma tends to zero, or with an SMC corrective step, you get samples that are exactly distributed according to the conditional process. What we've observed in practice, though, is that you really need gamma to be extremely small, and the smaller gamma is, the more steps you take, so the more calls to your network you make, which is the computational cost, and that scales linearly; so you'd like to keep gamma as large as you can.
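Here is a rough sketch of that conditional sampling loop, including the Langevin-style corrector steps discussed just below; `score_net`, the noise schedule and the step sizes are placeholders, and for brevity this uses the simple OU process from earlier rather than the kernel-correlated one.

```python
import numpy as np

def conditional_sample(score_net, yC, n_query, n_steps=500, n_corrector=3,
                       corrector_step=1e-3, beta=1.0):
    """Replacement-style conditional sampler with Langevin corrector steps.
    score_net(y, t) stands in for a trained (equivariant) score network over the
    joint vector y = [context values, query values]."""
    dt = 1.0 / n_steps
    y_star = np.random.randn(n_query)              # query values start from the prior
    for i in range(n_steps, 0, -1):
        t = i * dt
        # 1) re-noise the observed context up to the current noise level (closed form)
        yC_t = np.exp(-0.5 * beta * t) * yC \
               + np.sqrt(1.0 - np.exp(-beta * t)) * np.random.randn(*yC.shape)
        y = np.concatenate([yC_t, y_star])
        # 2) predictor: one reverse-SDE (Euler-Maruyama) step on the joint vector
        drift = -0.5 * beta * y - beta * score_net(y, t)
        y = y - drift * dt + np.sqrt(beta * dt) * np.random.randn(*y.shape)
        # 3) corrector: a few Langevin steps at this noise level, which pull the
        #    query values back towards consistency with the context
        for _ in range(n_corrector):
            y = y + corrector_step * score_net(y, t) \
                  + np.sqrt(2.0 * corrector_step) * np.random.randn(*y.shape)
        y_star = y[len(yC):]                       # keep only the query part
    return y_star
```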
The problem is that if gamma is not small enough, then empirically this tends to dismiss the context and sample from the unconditional process. In the literature, Lugmayr et al., with the RePaint algorithm, which is called that because it was looking at image inpainting, where you hide part of an image and sample the missing part, suggested a simple yet nice idea: for a given noise level t, add some re-noising and denoising steps in order to increase the correlation between y* and y_C, and that indeed works well in practice. What we show in this work is that this is actually a discretization scheme of Langevin dynamics, and, although I don't have too much time to dive into it, you can simply run a larger number of these RePaint corrector steps in order to mix towards the target.

In the figure we have an illustration of this. The predictor step, the one from the previous slide, is the blue arrow, where you make a guess, trying to remove one level of noise; but because it's approximate and you're discretizing things, it won't be exact, so you end up overshooting the true trajectory of your denoising process. The Langevin corrector is the red arrow, which in some sense projects you back onto the true conditional density of your stochastic process at that noise level. And what we observe empirically is that you really do need some Langevin steps: the y-axis is a measure of how well the conditional samples are fitted, with zero corrector steps it's poor, and you need at least a couple of Langevin steps to match it properly.

Okay, is there any question? I'll take a few minutes to go through some of the experimental results we have. First we looked at 1D datasets, which you can see in the first column: the first three are Gaussian processes and the last two are non-Gaussian. We trained our model along with other baselines that I'll show on the next slide; in the middle column you see samples from the unconditional process, in the right-hand column samples from the conditional process, and we measure the predictive log-likelihood. What we see is that this model is able to fit both Gaussian and non-Gaussian processes, as opposed, for instance, to some classes of neural processes, which work relatively well on Gaussian processes but maybe not as well on non-Gaussian ones, for instance on this sawtooth dataset. In the bottom row we do the same thing, but instead of evaluating within the training range we evaluate outside the training range.

Yes; I skipped this because of time constraints, but we're basically computing the log-likelihood of the joint, over both y* and y_C, and the log-likelihood of the context set, and subtracting the two; you can compute both with the probability flow of the joint, and otherwise it's pretty similar to the regular setting.
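In other words, the conditional log-likelihood is a difference of two model log-likelihoods, each of which would be evaluated with the probability-flow ODE; a tiny sketch, with `log_prob_joint` a placeholder for that evaluation and the input locations suppressed for brevity:

```python
import numpy as np

def conditional_log_likelihood(log_prob_joint, y_star, yC):
    """log p(y* | y_C) = log p(y*, y_C) - log p(y_C)."""
    return log_prob_joint(np.concatenate([yC, y_star])) - log_prob_joint(yC)
```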
And so here, models that have no translation invariance constraint at all wouldn't be able to perform well in any way outside the training range, but stationary neural processes, such as ConvGNP or GNP, perform just as well out there, and similarly the model in blue performs just as well outside the training range. This NDP*, though, is exactly the same model as the one in blue, just without the translation invariance constraint in its score network, and it performs really badly; that's why we didn't even report the results, because it has never been trained on that range, so there's no reason it should work there. Anyway, this shows that the model works well in both Gaussian and non-Gaussian settings and can generalize thanks to this stationarity assumption.

Then we looked at 2D Gaussian vector fields, with the kernels I mentioned earlier. Similarly, we can see that this NDP*, the neural diffusion process without the group equivariance constraint, works pretty well nonetheless, but not as well as the one in blue, where we use an E(2)-equivariant parameterization for the score network and which matches the predictive likelihood of the dataset exactly. Maybe most interestingly, although they perform relatively similarly overall, the equivariant stochastic process in orange performs very well with extremely few data points, as opposed to the other one, which requires way more data points to learn this invariance property in some way, because the orange one works on a state space of much lower dimensionality, so you're giving it a massive head start. Yeah, no, that is true; no, that's a fair point, we should really do that. I mean, it's true, maybe this is a bit unfair, you're right; typically you would add some data augmentation, that's a good point, so I guess it wouldn't be as dramatic.

And lastly, I'll go over this very quickly, but here we look at cyclone trajectories, which we model as sample paths, functions from R to S2, with R being the time index and S2 being a position on Earth. We model this dataset; on the bottom left you can see samples from the dataset, and on the bottom right samples from our trained model, and we basically build on prior work where we extended diffusion models to the manifold setting.

Anyway, I'll skip the rest, but as a quick recap, to make sure everyone has the key ideas: the aim was to build a probabilistic model over tensor fields, or rather collections of tensors, feature fields. First we extended diffusion models to function spaces by correlating all the finite marginals with some kernel, and then we showed how to incorporate group invariance, and that rests on two things: first, targeting an invariant Gaussian process, by using a group-equivariant mean
and kernel, and secondly, parameterizing the score network with a group-equivariant neural network. I also showed that the Langevin corrector is really key to being able to effectively sample from these functional processes conditionally. And then I showed a few examples of how this class of models, which, as opposed to neural processes, does not have a diagonal predictive covariance structure, has really good modeling capacity on both scalar and vector fields, and on processes with Euclidean and spherical outputs. Thanks for your attention; I guess we have a bit of time for remarks, which is lovely, and I'm definitely looking forward to chatting with you during this week and to meeting you.

Which one, this one or this one? Yeah. I see, so you would still noise the context, but you would simulate the reverse probability flow ODE in this step? Yeah, that's a good point. I guess, yes; that would be kind of too big, because in practice, for the problems I showed, even a couple of thousand steps didn't work so well, so that doesn't sound like a massive amount of noise injected each time, but maybe it is still too much, indeed; that's a good point.

Yeah, so, no, we did actually try both, that's a good point: reversing the probability flow ODE or the time-reversal SDE, and indeed the probability flow works a bit better, but still, with no corrector, it's not enough; I think it does help, but with no corrector you basically get samples that are not really conditioned on the context set, they look good but they tend to slightly ignore the context set.

More questions? Yes, on the score network. So yeah, I guess the literature shows that although equivariance in some way already restricts you to the right function class, it can have an impact on the optimization, so it may be harder to train. For the 1D setting we didn't really change the architecture, we just remove the center of mass of the data, which is a very simple trick to get a translation-invariant network; for the second one we use a kind of equivariant transformer architecture, so it's graph-based with an attention idea. It's a bit hard to say, because I'm not sure I can specifically single out that it's harder to train because of the equivariance; for me the issue was more that, because it's a graph-based architecture, it doesn't scale so well with the number of inputs at which you want to model your stochastic process. So I didn't really notice instabilities or things like that, it was more memory and scalability, but I guess that's down to the specific choice of graph architecture, and maybe there's a smarter choice. I mean, if you work on a grid then you wouldn't need to do that, but the whole idea of these stochastic processes is that you can evaluate them anywhere. Yes, and the follow-up question would be:
do you think the extra work, and that's the question we always ask about these, is actually worth it? And in your case, maybe you can comment on how much extra work you have to do to make it equivariant, compared to training it on augmented data. I see, and I guess then: do you see that you are actually doing better? Yeah, so honestly I didn't try that, which is maybe a bit unfair and silly. But you did, no, in your slide you compared to the non-equivariant version. Oh yeah, but I guess we didn't try with data augmentation, which maybe would be a fairer comparison, that is true. But it definitely does help; I mean, here, obviously, since you're not training that kind of network on this kind of border region, outside of this minus-one-to-one range, there's no way it can work if you don't enforce stationarity. Other neural processes can also do this, like the convolution-based neural processes, and also a GP that is stationary, or something like that; but here it really just wouldn't work at all otherwise, and it doesn't really require anything, just removing the center of mass, which doesn't really change anything, so that's very easy. For the second case, yes, it does require some knowledge about how to work with equivariant neural networks, which are a bit trickier, because you need to know a bit more about representations and how to implement them and such things. So it's hard to tell; it's true I didn't evaluate against data augmentation, but at least with all of that you're quotienting out a big part of the state space.

These are a bit toy; something I'm working on is more like weather data, with wind direction and temperature and pressure fields, so I'd be interested to see how well that works there, and maybe this invariance constraint would be too strong, because if you're modeling a specific region, in some way you discard information like 'this is land, this is sea'. But I think you could soften the equivariance by amortizing the score network over such geographical data, so that you have a soft equivariance; that's something I'd like to try but haven't yet. So no, I don't think it's necessarily worth it for every problem, because it could be a pretty strong constraint in some instances, but in others you know that the structure is there, at least partially, and then I think it does make sense, because you can generalize better and train with less data.

Any more questions? If not, I suggest we thank again the speakers of this session.