insights from precise asymptotics, by Takashi Takahashi from Tokyo. Please, the floor is yours and welcome.

Okay, thank you for the introduction. I'm Takashi Takahashi from the University of Tokyo. As already introduced, today I'd like to talk about a mean-field analysis of ensemble learning. Okay, let me start. This is the outline. First, I'll describe some background and motivations for this talk. Then I'll introduce a framework for analyzing bagging by using the replica method of statistical physics. After that, I'll talk about sparse regression and imbalanced data as applications. I should put one disclaimer here: I'm not a good English speaker, so I'd be happy if you could speak slowly when asking questions.

Okay, let me start by explaining the motivations and settings. This talk is basically about statistical learning. We are given data consisting of pairs of input x and output y. Usually, these data points are independently generated from some unknown distribution. The aim is to learn a model that makes good predictions for a new input x. In this talk, we consider the learning of linear models. I mean that the output for input x is a linear combination with the weights, plus a bias. To learn the parameters, I'd like to consider bagging, which is a classical method for ensemble learning. In bagging, we consider minimizing this kind of randomized cost function. This is the empirical risk measured on a resampled data set, which is obtained by randomly resampling each data point from the original data set. So the resampled data set D* looks like this. Here, this red-colored integer, c_mu, represents the number of times the data point (x_mu, y_mu) appears in the resampled data set.

There are two popular ways to construct a resampled data set. One method is the bootstrap, which is a very traditional method of computational statistics. In this case, each data point is obtained by sampling with replacement from the data set, so each data point may appear several times, like this. In this case, c_mu basically follows an independent Poisson distribution with mean mu_B. This parameter specifies the size of the resampled data set D*. The other one is subsampling. In this case, each data point is sampled without replacement, so the size of the resampled data set is strictly smaller than the original data set. In this case, obviously, c_mu follows a simple Bernoulli distribution. In bagging, the final estimator is obtained by taking the average with respect to the resampling variable C. It is usually used to reduce the variance, so it is often combined with low-bias but high-variance models. For example, decision trees are combined with bagging, which gives the famous random forest algorithm.

What I would like to know is what happens when bagging is used with linear models, which are the simplest variant of neural networks. Especially, I'm wondering about the difference between standard ridge regularization and the randomization introduced by bagging. Typically, in learning with linear models, we can use L2 regularization or that kind of regularization to reduce the variance. Are these two different? This is the question I basically want to consider.
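To make the two resampling schemes concrete, here is a minimal sketch (my own illustrative code, not the speaker's; the toy data and the helper `fit_weighted_ls` are assumptions) of how the counts c_mu can be generated and how the bagged estimator is formed:

```python
# Minimal sketch: bootstrap vs. subsampling counts, and the bagged estimator.
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 50
X, y = rng.standard_normal((n, d)), rng.standard_normal(n)
mu_B = 1.0  # mean of the Poisson counts; controls the resampled-set size

def fit_weighted_ls(X, y, c, lam=1e-6):
    """Minimizer of sum_mu c_mu (y_mu - x_mu^T w)^2 + lam * ||w||^2."""
    Xc = X * c[:, None]
    return np.linalg.solve(Xc.T @ X + lam * np.eye(X.shape[1]), Xc.T @ y)

# Bootstrap: sampling with replacement; c_mu is (approximately) Poisson(mu_B).
c_boot = rng.poisson(mu_B, size=n)
w_one = fit_weighted_ls(X, y, c_boot)  # a single "weak learner"

# Subsampling: sampling without replacement; c_mu is Bernoulli, and the
# resampled set is strictly smaller than the original data set.
c_sub = np.zeros(n)
c_sub[rng.choice(n, size=n // 2, replace=False)] = 1.0

# Bagging: average the minimizers over independent draws of C.
w_bag = np.mean(
    [fit_weighted_ls(X, y, rng.poisson(mu_B, size=n)) for _ in range(100)],
    axis=0,
)
```

As the talk notes later, for large `mu_B` the counts concentrate around their mean and the randomization disappears.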
Actually, there are several research results regarding this question. In this paper by Brezhne and his co-authors, they consider ordinary least squares with subsampled examples, subsampling of rows, and they also consider subsampling of features, subsampling of columns. What they showed is that the average of the minimizer of this randomized cost function is equivalent to the ridge estimator with an optimally tuned regularization parameter. Similar results were found in other settings. For example, in ridge regression, Zori and Kuroho considered only the subsampling of examples, and they showed that ridge regression with subsampled examples is equivalent to ridge regression with another regularization parameter. Similar empirical results were found for other maximum likelihood estimators.

So, this is a short summary of the story so far. Bagging is very popular and theoretically well motivated: it can reduce the variance of the estimators. However, for ridge regression and unregularized maximum likelihood estimation, bagging seems to be very similar to L2 regularization, at least if we focus on the generalization performance. This is interesting in the sense that it provides an alternative way to introduce regularization. But on the other hand, I think it's a bit disappointing, because it implies that we may not need to consider ensembling in the context of learning linear models. So, I wondered if we could find any interesting features, or some fruitful gain, in more structured settings. For this, we consider bagging in slightly more structured data settings. I mean sparse regression, where the teacher has some sparse structure, and learning a classifier from label-imbalanced data, where the data has cluster structure and the number of samples in each cluster is different. Oh, okay, okay, thank you. Sorry. I will also describe the replica method for analyzing such ensemble learning. This is the basic motivation and background, okay?

Okay, then let's move on to the replica method for analyzing bagging. This is the optimization problem used in ensemble learning of linear models. Here, the estimator W-hat for one randomized cost function depends on both C, the resampling variable, and the obtained data set D. The goal is to clarify how this estimator depends on C and D separately. Why separately? Because the resampling average in bagging is taken on a fixed data set D, so the average estimator is obtained by conditioning on D. But of course, because the average estimator still depends on D, we also want to know the dependence of this average estimator on the data. So we need to develop an analytical framework for this kind of problem.

To describe the replica analysis for bagging, I will first very briefly review the classical replica analysis for empirical risk minimization. This is the very traditional replica analysis. Here there is no additional randomness C, so the goal is just to clarify how the estimator depends on the obtained data set. To use the replica method, we introduce a Boltzmann distribution like this. This distribution concentrates on the minimizer of the loss function in the zero-temperature limit, where beta goes to infinity, so the analysis of the minimizer is replaced by the analysis of this Boltzmann distribution in the zero-temperature limit. To evaluate the fluctuation of the estimator with respect to D, we use a replicated system of this form: we duplicate the Boltzmann factors n times, multiply them, and finally take the average over the data set. The result is then a deterministic quantity that does not depend on a particular data set.
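In formulas, the construction just described is, schematically (my notation; the loss and regularizer are generic placeholders):

```latex
% Boltzmann distribution concentrating on the ERM minimizer as \beta \to \infty
p_\beta(w \mid D) \propto e^{-\beta \mathcal{L}(w; D)},
\qquad
\mathcal{L}(w; D) = \sum_{\mu=1}^{n} \ell\big(y_\mu, x_\mu^\top w\big) + \lambda r(w)

% Replicated system: n copies of the Boltzmann factor, then the average over D
\mathbb{E}_D\!\left[ \int \prod_{a=1}^{n} \mathrm{d}w^a \;
  e^{-\beta \sum_{a=1}^{n} \mathcal{L}(w^a; D)} \right]

% Data fluctuations are read off from inter-replica correlations, e.g.
\lim_{n \to 0} \big\langle w_i^{1} w_i^{2} \big\rangle
  = \mathbb{E}_D\!\big[ \langle w_i \rangle_\beta^{\,2} \big]
```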
One salient feature of the replicated system is that the fluctuation with respect to the obtained data set can be extracted from inter-replica correlations, if we can extrapolate the computation result from integer n to the limit n goes to zero. For example, the two-point correlation in the replicated system is the second moment of the estimator averaged over D, and higher-order moments can be computed similarly. Fortunately, for simple data structures, we can obtain a very simplified marginal distribution of this replicated system. Usually, this marginal is independent of the index of the component of the original estimator. This is very mean-field-like, and the parameters of this marginal distribution can be determined by the replica-symmetric saddle point, or by the state evolution of the AMP algorithms. In this effective single-body problem, this minimization problem, the expectation with respect to D is approximated by a simple Gaussian random variable. Because this problem is simple enough, we can obtain many desired properties by numerical integrals or analytical methods, and so on.

So this is a summary of the classical replica analysis. We first define a Boltzmann distribution, and then replicate the system and average over D. We expect that the fluctuation with respect to the data can be extracted from inter-replica correlations. Then we simplify the replicated system for integer-valued n. In very simple cases, we can achieve such simplification by exact computation, but in other cases we need to use approximate inference methods, as was done in the late 2000s by Mazan and Opa. I think this kind of topic will be treated in the Wednesday morning session. Finally, we extrapolate n to 0 to obtain a simple effective problem.

How can we extend this analysis to bagging? Obviously, the key point is the second step, because in the classical case there is only one source of randomness, the obtained data set, but in the case of bagging there are two kinds of randomness, C and D. One way might be to consider a replicated system that just replicates the randomized Boltzmann factor and averages over C and also D. But this is not correct: we cannot separate the contributions of C and D from this replicated system. Actually, the correct way is to use this kind of replicated system. It looks a bit complicated, but the idea is very simple. The fluctuation with respect to C, conditioned on D, would be obtained by this replicated system: we just replicate the Boltzmann factors and average over C. From this replicated system we would be able to extract the fluctuation over C conditioned on D. But we finally need the fluctuation over D of the average estimator. To obtain such D-fluctuations, we should further replicate this already-replicated system, and then use this nested replicated system. We can then extract both kinds of fluctuations from the inter-replica correlations of this nested replicated system. This is an example. In the first term, only the replica indices inside the inner replication are changed: the two red-colored replica indices are the same, but the blue-colored replica indices are different. This gives the second moment of the estimator, averaged over C and D. In the second term, we change the outer indices a1 and b1, and then we obtain the second moment of the average estimator. So we can obtain the variance of the estimator.
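Written out, the nested replicated system and the two correlators just mentioned look schematically like this (my notation; the superscript (a, b) means inner replica a inside outer block b, and inner replicas within one block share a single draw of C):

```latex
% Nested replicated system: n_1 inner replicas share one C; the C-averaged
% factor is replicated n_2 times; finally, average over the data set D
\Xi_{n_1,n_2}
= \mathbb{E}_D\!\left[ \left( \mathbb{E}_C\!\left[ \int \prod_{a=1}^{n_1}
  \mathrm{d}w^{a}\, e^{-\beta \mathcal{L}(w^{a};\, C, D)} \right] \right)^{\!n_2} \right]

% Same outer block, different inner replicas: second moment averaged over C and D
\lim_{n_1, n_2 \to 0} \big\langle w_i^{(1,1)} w_i^{(2,1)} \big\rangle
  = \mathbb{E}_{D}\,\mathbb{E}_{C}\big[ \langle w_i \rangle^2 \big]

% Different outer blocks: second moment of the bagged (C-averaged) estimator
\lim_{n_1, n_2 \to 0} \big\langle w_i^{(1,1)} w_i^{(1,2)} \big\rangle
  = \mathbb{E}_{D}\big[ \big( \mathbb{E}_{C} \langle w_i \rangle \big)^2 \big]
```

The difference of the two correlators is the variance over C conditioned on D, which is exactly the fluctuation that bagging averages away.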
Maybe we can show this randomized nested replicated Boltzmann system visually. Inside each purple box, the Boltzmann factors share the same realization of the resampling variable C. These Boltzmann factors are multiplied and averaged over C, and the other purple boxes are the same. Then we multiply these averaged Boltzmann factors and finally take the average over D. As you can see, this figure is very similar to the tree diagram of the one-step replica symmetry breaking (1RSB) analysis. So we can associate order parameters with each hierarchy of this nested replicated system. In the first step, two Boltzmann factors are picked from different purple boxes, and this is described by an order parameter: the square of the average estimator, finally averaged over the data. In the second layer, we pick two Boltzmann factors from the same purple box, and this is described by another order parameter. The final step is just to pick the same Boltzmann factor twice. So we can depict this order-parameter structure in the form of a Parisi-like matrix. One order parameter is the second moment of the average estimator, another is the variance with respect to the resampling, and the last one is the fluctuation with respect to the Boltzmann distribution. The analysis of this nested replicated system is essentially the same as the 1RSB analysis, but here the breaking parameter, which corresponds to the Parisi breaking parameter in the 1RSB analysis, is also taken to the zero limit. So this is like a 1RSB analysis, but very close to the replica-symmetric solution. Yeah?

Could you use microphone one? Okay. I want to ask what type of matrix this is. Is it a confusion matrix?

This matrix just summarizes the order parameters. If we define this matrix Q, then each component Q_{(a1,b1),(a2,b2)} appears in the computation of the partition function of the nested replicated system, and I summarized these parameters in this matrix. So I think this is not the same as a confusion matrix. Okay?

Just a clarification on the notation. The second term is the variance of the Gibbs average of w_i over C, but conditioned on D? So the first two terms, the things on the inside, are fully conditioned on D?

Yeah, yeah, yeah. The average over C is taken conditioned on D. And the last one is over both sources of randomness simultaneously.

That's the notation. Okay, thank you.

Okay, no more questions? Okay, let me proceed.

Is it evident that you should have this kind of 1RSB structure, or could you think about other structures? Is it convenient for the computation to take this 1RSB kind of matrix, or is there another, fundamental reason?

I think I couldn't get the point.

You take this kind of structure, which is reminiscent of the Parisi approach for replica symmetry breaking. Is it because it is analytically convenient, or is there a fundamental reason for taking that structure for this specific analysis?

It's just because, in the analytically solvable cases, using this matrix is very convenient. I don't know if there is a further fundamental reason.

So is it expected to be exact? The solution obtained from this type of ansatz is expected to be exact?

Yes. Okay.

Okay, this is a summary of the replica analysis for ensembling. The first step is exactly the same as the traditional analysis. In the next step, we need to construct a nested replicated system.
We first replicate the system n1 times and average over C, then replicate the system n2 times and finally take the average over D. We expect that the fluctuations of C and D can be separately extracted by carefully adjusting the replica indices. Then, again, we simplify the replicated system by exact computation or by approximate inference. This step is basically similar to the 1RSB analysis, so if you can compute the replica-symmetric solution for a problem, extending it to the resampling problem is very easy. Finally, we need to take the limit where n1 and n2 go to zero.

Actually, this kind of analysis has been used several times. For example, the derivation of state evolution and the replica analysis of an approximate resampling algorithm: these two papers use the same kind of analysis. And recently, Bruno Loureiro and his co-authors considered a similar kind of formalism for analyzing random feature ensembles; they also provided a rigorous analysis for this kind of problem.

Okay, so let's move on to the applications, if there are no more questions. Then let's move on to the applications.

The first problem is sparse regression. This is the setup. We have pairs of input and output, and in this talk I do not assume any structure on the input; I mean that x_mu is generated from a spherical Gaussian distribution. But the output y is generated from a linear teacher with a sparse coefficient vector. It is sparse in the sense that each component of the parameter is generated from a Bernoulli-Gaussian distribution; here rho specifies the sparsity of the parameter. Of course, the goal is to learn a nice predictor for a new input x or, equivalently, to recover this sparse coefficient vector. One popular method for estimating such a sparse vector is the lasso, so let's consider minimizing this kind of randomized cost function. This is a combination of the randomized squared error and the L1 regularization. In this talk, we consider the bootstrap case, so c_mu follows the Poisson distribution with mean mu_B. Because c_mu is normalized by the mean of the Poisson distribution, this contains the non-randomized case in the limit of large mu_B: in that limit the average goes to 1 and the variance goes to 0, so there is no randomization. We consider the bagged average as the predictor for a new input x. Then, by a very simple computation, we obtain the formula for the generalization error. Except that the estimator is replaced by the average estimator, this formula is exactly the same as in the traditional linear regression problem.

There are two questions that I'd like to address. The first point is about the improvement of the generalization error. Of course, the bagging procedure will reduce the variance of the estimator to some extent, but is this significant? And is this different from L2 regularization? This is the first point. Someone is not muted. The second question, which may be a bit more interesting, is the choice of the best hyperparameters. There are two hyperparameters that we can control. The first one is the size of the resampled data set, mu_B. If mu_B is made large, there is almost no randomness, so bagging is not very effective. But when mu_B is very small, the variety of the estimators for the randomized cost functions is very large, so ensembling will be very significant. Of course, there is one more parameter, the regularization parameter.
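As a concrete illustration of this randomized lasso setup, here is a minimal sketch (my own code, with illustrative parameter values; the 1/mu_B normalization from the talk is absorbed into the regularization strength here):

```python
# Minimal sketch of the bootstrapped lasso for the sparse-regression setup.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, d, rho = 400, 200, 0.2
w0 = rng.standard_normal(d) * (rng.random(d) < rho)  # Bernoulli-Gaussian teacher
X = rng.standard_normal((n, d)) / np.sqrt(d)
y = X @ w0 + 0.1 * rng.standard_normal(n)

mu_B, lam, K = 1.0, 0.05, 50  # resampling size, L1 strength, number of draws
coefs = []
for _ in range(K):
    c = rng.poisson(mu_B, size=n)      # c_mu ~ Poisson(mu_B)
    idx = np.repeat(np.arange(n), c)   # data point mu appears c_mu times
    coefs.append(Lasso(alpha=lam).fit(X[idx], y[idx]).coef_)
w_bag = np.mean(coefs, axis=0)         # bagged (C-averaged) estimator
```

The two hyperparameters discussed next, `mu_B` and the regularization strength, are exactly the two knobs in this sketch.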
I'm interested in how the optimal parameters depend on the data, I mean the sparsity of W0 and the size of the data set. This problem is a bit simple, but I think it's a good starting point.

As already explained, we can obtain a simple effective single-body problem by using the 1RSB-like analysis. At the end of the day, we obtain this kind of effective single-body problem. But here the local field h has two sources: it can be written as the superposition of two Gaussian variables. The first one comes from the obtained data set; this is the very traditional one. And here we have another one, the red-colored eta. This Gaussian random variable effectively represents the randomness that comes from the resampling. So the macroscopic quantities that appear in the generalization error, this one and this one, can be computed from the solution of this effective single-body problem. Here the average with respect to the resampling is replaced by an average over the Gaussian random variable eta, and the empirical average, or the average over the data set, is replaced by an average over the Gaussian and W0.

And here is the result of the performance comparison. Using the saddle-point conditions of the 1RSB-like analysis, I numerically computed the generalization error for each rho and alpha, and I numerically optimized the generalization error. This figure shows the ratio of the generalization errors with and without bagging. The horizontal axis shows the sparsity of the parameter, and the vertical axis represents the size of the data set normalized by the input dimension. As you can see, the generalization error is effectively reduced when W0 is not so sparse; but when W0 is very sparse and there is a sufficiently large number of data points, the improvement is not so significant. So if we only focus on the generalization error, bagging is very similar to L2 regularization. This point becomes clearer if we consider the elastic net case, which is a combination of the L1 and L2 regularizations. In this case, almost no gain is obtained; there is a slight gain, so in the strict sense bagging seems not to be identical to L2 regularization, but it is almost similar to it.

This is the optimal choice of the hyperparameters, and here something interesting happens. When W0 is very sparse, or there is a sufficiently large number of data points (the optimal parameter is plotted on a log scale), the best choice is to use a strictly positive and relatively large regularization parameter. This is because, in such a good situation, the L1-regularized estimator without randomization is already very strong. But when W0 is not so sparse, or the data set is not very large, the best regularization parameter tends to be infinitesimally small. What's happening in the purple region? To confirm this point, I also plotted whether the number of unique data points in each resampled data set is greater than the parameter dimension or not. In the purple region, the underdetermined situation is achieved in each randomized data set. The region below the red line is trivial, because there the original data set is already smaller than the parameter dimension. What's interesting is the upper-right region: in this region, even though there are very many data points, the best choice is to use a small mu_B, so that for each randomized data set the underdetermined situation is achieved.
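Returning to the effective single-body problem described at the top of this passage, the decomposition of the local field can be written schematically as follows (chi_D and chi_C are my placeholder names for the variances, which would be fixed by the saddle-point equations):

```latex
% Local field in the effective single-body problem: a superposition of two
% independent Gaussians, one per source of randomness
h = \sqrt{\chi_D}\, z \;+\; \sqrt{\chi_C}\, \eta,
\qquad z, \eta \sim \mathcal{N}(0,1) \ \text{i.i.d.}

% z carries the data-set randomness, \eta the resampling randomness: averages
% over C become Gaussian averages over \eta, and averages over D become
% Gaussian averages over z (together with the teacher w_0)
```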
To summarize: when the parameter is not so sparse, the optimal regularization parameter tends to be infinitesimally small, and the number of unique data points in each resampled data set is smaller than the input dimension. So it says that the best choice in the not-so-good situation is to use an ensemble of interpolators, that is, minimizers of the L1 norm that keep the input-output relation exact. Schematically speaking, this is a phase transition: if sparse regularization cannot find the sparse structure of W0 sufficiently well, then the best choice is to give up finding the best sparse estimator and instead maximally increase the variety of the estimators while keeping a non-trivial regularization, which gives the ensemble of interpolators. Actually, increasing the variety of the weak learners, the estimators for each randomized data set, is a very well-known strategy in ensemble learning, but I think the appearance of a phase transition in the hyperparameter space is not so trivial.

This is a short summary of the sparse estimation part. Bagging is especially useful when W0 is not so sparse, but this property is very similar to ridge regularization, so there is almost no gain in the elastic net case. However, what happens inside is very different from ridge regularization: the optimal choice of mu_B and lambda shows a phase transition. When W0 is very sparse or alpha is large, bagging is not so efficient, but in the other case there is a region where the best choice is to use the ensemble of interpolators. Although this problem is very simple, something non-trivial appears from the sparse structure; this point is very different from just considering ridge regularization, as mentioned. These are the results for sparse estimation.

If there are no questions, let's move on to the next application: learning a classifier from an imbalanced data set. This is the setup for the classification problem. There are two components of a Gaussian mixture distribution: the first one, the blue one, is for the positive examples, and the other one is for the negative examples. The goal is to learn a classifier that generalizes well to both positive and negative samples; we want a high accuracy for inputs from either class. In particular, we focus on the imbalanced setting. The point is that the intercept of the classification plane, which is depicted by the black dashed line, is strongly affected by the imbalance. In the left case, the number of samples in each cluster is the same, so the intercept of the classification plane lies close to the origin. But in the right case, there are very few positive samples compared to the negative samples, so the intercept of the classification plane is strongly biased towards the minority class. In the limit where the majority overwhelms the minority, if we do not take the imbalance into account, the best classification plane usually lies around there, so we cannot detect positive samples.

There are many methods for treating such imbalance. The first type of method works at the data level, and among these, down-sampling is, I think, the most popular way. In this case, we minimize this kind of cost function: the data points in the majority class are down-sampled, which is represented by this c_mu. The key point is that the resampling is not applied to the minority class. Because in this randomized data set the number of samples in each class is the same, this method should have low bias; but because the total number of data points used in this randomized cost function is small, this method should have large variance. There is also another way, which is a bit more computationally demanding, usually called under-bagging: it is obtained by taking the average of this method with respect to C. This method is usually used to reduce the variance. There is also another type of method, at the cost level. Among these, I only consider the weighted cross-entropy, which uses a different weight for each class; usually the weight for the minority class is taken to be very large. In this case, all data points are used simultaneously, so the computation is very easy. You may think of it as a very naive method, but it is very popular and standard, and even implemented in scikit-learn. So many methods have been proposed, including ensemble methods; actually, there are too many methods, so choosing one good method is very difficult.

So what I want to ask is: do we really need to aggregate? Aggregation is computationally very expensive, because we need to obtain the estimator for many realizations of the resampled data set. If this bagging procedure is very similar to regularization, maybe we can obtain a nice estimator using just one realization of the resampled data set, by choosing a good regularization. I also want to know whether the resampling approach is better than the cost-based one or not. I think that when there are many data points, in the classical asymptotic regime, the resampling approach and the cost-based approach will not be so different, but things may be very different in high-dimensional settings. So I would like to consider this kind of proportional limit.

Again, we can construct an effective single-body problem for this kind of problem. In this case, we cannot obtain the generalization error from a simple L2 error of the parameters, but we can obtain the distribution of the leave-one-out logit, which is defined by the linear combination of the parameters with an excluded data point. We can show that the empirical distribution of the leave-one-out logit can be written like this. Again, this is a superposition of two Gaussian randomnesses: the mean depends on the input class, but the fluctuation is purely Gaussian. The first part comes, again, from the data, and the other one from the resampling. So the average estimator should behave like this: if we take a very large number of resampled data sets, the red-colored term vanishes, so the variance is reduced; and the down-sampling case corresponds to K equal to 1.

This is a simple check of the theoretical prediction. The empirical distribution of the leave-one-out logit is compared with the theoretical prediction, which is just a Gaussian, and the two distributions seem to be in very good agreement. The agreement is even more apparent if we use a quantile-quantile plot. Here the input dimension is about 8k, and the ratio of the minority to the majority class is about a quarter, so this is a rather imbalanced case, but the prediction is very accurate.

And here is the performance comparison. Let's first look at the fixed-realization case. The left panel shows the F-score, which is the harmonic mean of the accuracies for the minority and majority classes. As you can see, the weighted cross-entropy method, the weighting method, performs very badly in high-dimensional settings, but the resampling approaches, this is the down-sampling case and this is the under-bagging case, perform very well even in high dimensions.
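For concreteness, here is a minimal sketch (my own code with toy data; all names are illustrative) of the three methods just compared: weighted cross-entropy, down-sampling with a single draw of C, and under-bagging as the average over many draws:

```python
# Minimal sketch of the three imbalance-handling methods from the talk.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d, n_min, n_maj = 200, 100, 900

# Two-component Gaussian mixture with imbalanced classes (toy data)
X = np.vstack([rng.standard_normal((n_min, d)) + 1.0 / np.sqrt(d),
               rng.standard_normal((n_maj, d)) - 1.0 / np.sqrt(d)])
y = np.r_[np.ones(n_min), np.zeros(n_maj)].astype(int)

# Cost-level method: weighted cross-entropy (class_weight in scikit-learn)
clf_weighted = LogisticRegression(class_weight="balanced").fit(X, y)

def downsample(y, rng):
    """Keep all minority samples and an equal-size random majority subset."""
    minority, majority = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
    keep = rng.choice(majority, size=minority.size, replace=False)
    return np.concatenate([minority, keep])

# Data-level methods: down-sampling is one draw (K = 1); under-bagging
# averages the learned weights over many independent draws of C.
K = 50
params = []
for _ in range(K):
    idx = downsample(y, rng)
    clf = LogisticRegression().fit(X[idx], y[idx])
    params.append(np.r_[clf.coef_.ravel(), clf.intercept_])
w_downsample = params[0]
w_underbag = np.mean(params, axis=0)
```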
This feature right here comes from the phase transition from the linearly separable region to the linearly non-separable region. Because the bias term of the weighted method is very strongly affected by this phase transition, it cannot perform well in the high-dimensional setting. We can also see that under-bagging makes a significant improvement over down-sampling.

And what happens if we optimize the regularization parameter lambda? In this case, even with the optimized regularization parameter, the bias term of the weighting method is still not good; it still has a bias against the minority class. But, on the other hand, the down-sampling and under-bagging methods are much better than the weighting method, and the under-bagging method is still better than down-sampling. This may come as a surprise, because in the previous case the generalization error was almost the same as with the optimally regularized method; but in this case, the total amount of data used in the training is much larger than in the down-sampling case, so we do get some performance gain. We can also see that under-bagging seems very insensitive to the choice of the regularization parameter. So it seems that, in the high-dimensional case, which would be relevant to learning neural networks, the resampling methods perform very well.

This is a short summary of learning a classifier from imbalanced data. Under-bagging, as I said, is much better than down-sampling with optimized regularization; this is because the total amount of data used in the training is different, and this is a characteristic of the imbalanced-data setting. Also, under-bagging is basically insensitive to the choice of the regularization parameter and to the existence of the phase transition, while the cost-sensitive method is much more sensitive than the resampling methods, so you need special care in such cases. Designing a special cost function is a bit difficult, so I think the resampling method is very useful; the resampling method is the recommended way.

This is the summary of this talk. The motivation was that we want to know the difference between L2 or L1 regularization and ensembling, and to find interesting phenomena around the ensembling method. We introduced a replica framework for ensemble learning; this analysis can easily be carried out using the 1RSB-like analysis. And we treated two applications. In the simpler case, the generalization error cannot be improved much, but there is a phase transition, which may be interesting. In the other, more structured case, the under-bagging method is much better than down-sampling and the weighting method. These two applications are very simple, but ensemble learning on more structured data may provide very fruitful phenomena, so I think it's nice to proceed further in this kind of direction. That's all, thank you for your attention.

Thank you very much, Takashi. Is there any question from the audience?

Takashi, thank you so much for the great talk. I have a minor clarification. In the first part of your talk, when you were ensembling and adding the c_mu variables to your cost function, in that case the estimator you analyzed, y-hat, was still taking an expectation over the C variables, right? Could you say a bit more about that? I thought that in this case you would just analyze the predictor, which is x-transpose W-hat-C, without averaging over C. So could you say a bit more about why you average over C?

Why average over C?
Yeah, because it seems that in the cost function you get this additional randomness, because the randomness from C is not averaged out. So if that's the case, then I thought that even for your predictor you should just analyze x-transpose W-hat, with C being quenched disorder in the system.

So you mean: why do I not analyze the estimator for a fixed random variable C?

Exactly.

Because this estimator, the estimator for this randomized cost function, has additional randomness, so its variance should be much larger than in the original problem, I guess. So I think such an estimator is not so useful, because it just has a very large fluctuation. Of course, we can characterize its variance, but I just think it's not so useful. If you have some applications, please tell me. But it can be done; yeah, of course, it's very easy.

Thank you.

Okay, thank you for the good presentation. I'm just wondering: when we compare under-bagging and down-sampling, given the fact that more data is used in under-bagging than in down-sampling, would we be justified in saying, from this scenario, that under-bagging performs better than down-sampling simply because more data is used in training for under-bagging than for down-sampling?

So you mean that you need a clarification. In down-sampling, only one realization of C is used, right? In this realization there are many zeros, so a data point with c_mu equal to zero is not used; it never appears in the training. But if we consider many realizations of C, all of the data points may appear in the training. So the data actually used is different.

I have one question. In the first part, if you go back a few slides, where you show the simplest bootstrap technique. I think you call that bootstrap, where you just resample the data set: you have these c_mu's, these Poisson variables, you are sampling each time, you get a different estimator for a different data set, and you are averaging over these estimators. It makes me think of a kind of Bayesian estimator: you are averaging over a population of estimators. My question is: is there a kind of equivalent Bayesian estimator which would take all the data into account, but with an effective temperature? Instead of solving the minimization problem for different resampled data sets and averaging over them, you take all the data into account, you do a Monte Carlo, and you get a finite-temperature estimator from the whole data right away. Is there an equivalence between the two, or is it really different?
That's very interesting. Actually, there is a similar argument made in the last century by Efron, the creator of the bootstrap method, but the exact relation has not been obtained. Maybe we can relate the Bayesian estimator and the bootstrap one, but I haven't tried it yet. It's very similar, and I think it's worth tackling, so if I find something interesting, I will do it.

Okay, so it's an open question; we don't know if there is really a connection. Okay, thanks.

Hi, thanks for the very interesting talk. I was just wondering if you can clarify a little bit why you expect this 1RSB structure to be exact. Maybe the more specific question is: if averaging over the C is indeed equivalent to some more explicit procedure, then I guess what you're claiming is that the replica-symmetric prediction should be exact for that procedure, and I was just wondering if you can clarify.

Sorry, I couldn't get the point. Do you mean the exactness?

I just wanted to get your insights on why you believe that this 1RSB kind of analysis should be exact.

These two slides just explain the intuitive picture, but we can actually exactly compute the partition function of this randomized nested replicated system, and then exactly the same structure appears. So I just use it in the exact computation.

So maybe you expect replica symmetry because the problem is convex, or...?

Yeah, yeah, yeah, because for each resampled data set the problem is almost convex, so I think the result should be exact within this ansatz.

Okay, thank you.

In your talk, I'm a little bit confused when you were talking about bagging, because I have used bagging. For example, do you look at comparing the ensemble itself with its base learners? I don't know if you looked at that aspect as well. When you have the ensemble, you have the base learners: for example, if you choose, say, three, five, seven, and nine base learners, that is the ensemble, and each of these base learners is a weak learner, and you compare the ensemble itself with its base learners. I don't know if you looked at that aspect, in the classical tradition of machine learning, because what I can see from your presentation is logistic regression and other classical machine learning algorithms.

Actually, there is no deep reason; I chose bagging just as a starting point.

The reason I'm asking is because, from your presentation, you said that under-bagging performs better. Is this under-bagging you were talking about different from the normal bagging techniques in ensemble learning?

I have no short answer; I will discuss it later.

Okay, maybe I will talk to you later. Sorry.

Alright, is there a last question? Alright, if not, let's thank the speaker again. Thanks.