For the benefit of those in the room, as well as whoever may be watching: I am Kathleen Howell, Associate Dean for Engineering, and I'm here to welcome everyone on behalf of Dean Jameson to this event. This celebration of faculty careers actually emerged from a strategic plan about five years ago. At that time it was called Faculty 2020, and the focus was on professional development at all stages of faculty members' careers. It was also interested in aligning the hiring and the promotion and tenure processes as the college's scope was evolving and its leadership values were being brought forward. So there was a desire for a review of full professors that would include an understanding of everyone's accomplishments at this particular point in their career and an opportunity to make plans going forward. Full professors who are at least seven years past promotion present this type of colloquium on their achievements and their plans, and following this they have the opportunity for a meeting with their department head as well as with the Dean. The program was originally piloted in 2013, and we're now entering about the fourth year of actually running it. So today we have the opportunity to celebrate our speaker. He completed his PhD at MIT in electrical engineering and computer science. Prior to coming to Purdue, he was with Scientific Systems Incorporated in Cambridge as well as Bolt Beranek and Newman in Cambridge. He's been here since 1987, and, as you all know, he's currently a professor in electrical and computer engineering. His interests lie broadly in the modeling, analysis, and optimization of stochastic systems and signals, and his research is in the areas of digital communication, statistical signal processing, optimization, and pattern recognition. So with that, we'll let you tell the rest of the story. Thank you.

So I just wanted to say that I was a little bit unclear how to structure this talk. I decided to talk about the things I'm maybe most proud of, and also what I'm currently interested in, and just overview some of the other things rather briefly. We'll see how that goes.

Okay, so this is an outline of the talk. I'll first be talking about stochastic sampling and optimization. This is actually work that grew out of my PhD thesis at MIT, but was done primarily by me after I came to Purdue. So the first topic is stochastic sampling. Markov chain sampling methods, also known as Markov chain Monte Carlo or MCMC, are a basic tool for Bayesian statistical inference: things like MMSE and MAP estimation, imputation, validation, those kinds of things. The idea is to sample from a Markov chain with a desired invariant distribution. This idea was due to Metropolis, but there has been a tremendous amount of work subsequently, with many other related techniques, some suitable for parallel implementation. A related approach, which is perhaps not as well known, is sampling from diffusions, that is, stochastic differential equations (SDEs), and their discrete-time approximations. That's for sampling in Euclidean space.

Okay, so we examined the relationship between Markov chain and diffusion sampling algorithms. And what we showed is that suitably interpolated Markov chain sampling methods converge weakly to what's called a Langevin diffusion, which is like a Brownian motion or Wiener process except that there is a drift term determined by the target distribution.
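To make the diffusion idea concrete, here is a minimal sketch of the discrete-time approximation just mentioned: an Euler discretization of a Langevin diffusion whose invariant density is proportional to exp(-U(x)). The double-well potential, step size, and run length are illustrative choices, not anything from the talk.

```python
import numpy as np

# Minimal sketch of a discrete-time approximation to a Langevin diffusion,
# dX = -1/2 grad U(X) dt + dW, whose invariant density is proportional to exp(-U(x)).
# The double-well potential and step size below are illustrative, not from the talk.

def grad_U(x):
    # Gradient of U(x) = (x^2 - 1)^2, a simple double-well potential.
    return 4.0 * x * (x**2 - 1.0)

def langevin_samples(n_steps=50_000, eps=0.01, x0=0.0, seed=0):
    rng = np.random.default_rng(seed)
    x = x0
    xs = np.empty(n_steps)
    for k in range(n_steps):
        # Euler step: drift term plus a Gaussian increment of variance eps.
        x = x - 0.5 * eps * grad_U(x) + np.sqrt(eps) * rng.standard_normal()
        xs[k] = x
    return xs

if __name__ == "__main__":
    xs = langevin_samples()
    # The empirical histogram should concentrate near the two wells at x = -1 and x = +1.
    print("sample mean:", xs.mean(), "sample std:", xs.std())
```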
Furthermore, we showed that different types of Markov chains from different sampling methods, like the Metropolis and heat bath methods, converge to diffusions running at different timescales. So this is a way of comparing the different methods, something people weren't really able to do before that. Now a different approach, and one which one might naturally take, would be to look at the modulus of the second largest eigenvalue of the transition matrices, but that's actually a very difficult thing to get a handle on.

Okay, so here's a display of a Markov chain sampling method. The idea is this: it has a transition density p(x,y) which you get from a candidate transition density q(x,y). Starting at x, you get y from q(x,y) and then you accept it with probability s(x,y); if you don't accept, you stay at the same place x. So this p(x,y) expression does that. Now the acceptance probability differs for different Markov chain sampling methods: for the Metropolis method, s is equal to this s_M; for the heat bath method, s is equal to s_H; and there are others as well.

So what did we show? What is this convergence of a Markov chain sampling method to a diffusion? Well, we parameterize the Markov chain by a small parameter epsilon. It has a transition density q_epsilon, and the way this works, remember we're doing things in continuous space, is that it selects a coordinate at random and then applies a Gaussian perturbation with mean zero and variance epsilon. Then we interpolate that into continuous time to get this process x^epsilon. What we showed was that, under suitable conditions, this interpolated process x^epsilon converges to a Langevin diffusion depending upon the Markov chain sampling method: for example, for the Metropolis method it converges to this x_M, and for the heat bath method it converges to x_H. And if you know a little bit about stochastic differential equations and diffusions, it turns out that x_M is running at twice the speed of x_H. So in this sense Metropolis runs at twice the speed of the heat bath and would therefore be preferred. So this was a new result.

Okay, the next thing I want to talk about is stochastic optimization. Stochastic approximation is a basic tool for root finding and optimization under uncertainty, and it's a generalization of the classical nonlinear local search methods. For optimization, which is what we're interested in here, one typically performs a gradient or Newton type step, but the estimates of the gradient and the Hessian, or approximations to them, are noisy or imprecise measurements. So we get what are called stochastic gradient or stochastic Newton algorithms. These were initially developed by Robbins and Monro and by Kiefer and Wolfowitz, and there has been a lot of analysis and application of these methods. I should say that the decreasing step size approach is used for identifying fixed parameters; if we use a fixed step size, we can track time-varying parameters.

Okay, so what we did was develop and analyze what I'll call a globally optimal stochastic gradient algorithm, under fairly general conditions. We also developed and analyzed a continuous state Markov chain annealing algorithm, which is a different type of algorithm also used for global optimization, and we did it by writing the annealing algorithm in the form of this globally optimal stochastic gradient.
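Here is a minimal sketch of the kind of recursion involved: a noisy gradient step plus an extra, very slowly decaying Gaussian perturbation used to escape strictly local minima. The objective, noise model, gain constants, and the crude projection are illustrative only; the precise form and the conditions on the gains are described next.

```python
import numpy as np

# Sketch of a "globally optimal" stochastic gradient recursion: a standard noisy
# gradient step plus an additional perturbation b_k * w_k whose level decays very
# slowly, so the iterate can escape strictly local minima.  All constants below are
# illustrative, not the values or conditions from the analysis.

def grad_U(x):
    # U(x) = x^4 - 4 x^2 + x: global minimum near x = -1.4, a strictly local one near x = +1.4.
    return 4.0 * x**3 - 8.0 * x + 1.0

def perturbed_sgd(n_steps=200_000, x0=1.4, A=1.0, B=1.0, seed=1):
    rng = np.random.default_rng(seed)
    x = x0
    for k in range(1, n_steps + 1):
        a_k = A / k                                    # decreasing gain
        b_k = B / np.sqrt(k * np.log(np.log(k + 3)))   # very slowly decaying noise level
        noisy_grad = grad_U(x) + rng.standard_normal() # gradient observed with noise
        x = x - a_k * noisy_grad + b_k * rng.standard_normal()
        x = float(np.clip(x, -5.0, 5.0))               # crude projection, just to keep the sketch well behaved
    return x

if __name__ == "__main__":
    # With the added perturbation the iterate can escape the local well it starts in;
    # whether it settles near the global minimum (about x = -1.4) in a finite run
    # depends on the constants, which the analysis makes precise.
    print(perturbed_sgd())
```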
So this again is based on the connection we developed between Markov chain sampling methods and diffusions. And then we applied this to several global optimization problems, including edge detection and virus reconstruction. In the latter case, we actually looked at a lot of other global optimization methods as well, in comparison.

Okay, so here's the classical stochastic gradient. It looks like a steepest descent algorithm, but it has this noise, psi_k, and a_k is a sequence of positive numbers. In the modern analysis, there are usually two steps to analyzing these types of things. One is you establish some kind of global stability property, like boundedness, using a Foster-Lyapunov type criterion. Then you characterize the limit points from what's called an associated ordinary differential equation, in which this little z(t) is obtained by averaging the stochastic gradient algorithm. And under suitable conditions, with a_k going to zero like 1/k, you can show that the iterates converge, with probability one, to the set S of local minima of U.

So what we did was modify this stochastic gradient algorithm by adding in this b_k w_k term here, and this is done to escape local minima. Everything is the same as before, except the w_k's are standard independent Gaussian random variables and b_k is a sequence which goes to zero very, very slowly. Again, the analysis is done in two steps, sort of guided by the ODE method and the classical stochastic gradient algorithm. We establish a global stability property, in this case tightness, a sort of boundedness in probability. Then we characterize the limit points from what I call an associated SDE, a stochastic differential equation. And under suitable conditions, with this b_k going to zero very, very slowly, like a constant over the square root of k log log k, and also B/A being greater than some constant C0, which comes out of analyzing this diffusion using Freidlin-Wentzell large deviations theory, x_k actually converges to the set of global minima, even in the presence of strictly local minima. The convergence is in probability. The analysis here is a lot more delicate than the classical approach with the ODE because of this very slowly decreasing noise; you have to localize the approximation by the stochastic differential equation on very, very long time intervals. But it can be done.

Okay, the next thing I want to talk about is pattern recognition and machine learning. The first thing here is iterative growing and pruning of classification trees. Classification trees are an important method for non-parametric classification. The trees are usually grown top down by splitting on features in a feature vector until some termination criterion is met, and we label the terminal nodes with class labels. There are many advantages to this tree-structured approach. An important one is interpretability, meaning: how does the classifier work? What features does it select? In what order does it select them? What are the thresholds? So it's used not just for prediction but also for feature extraction. Okay, here's a classification tree. X is the feature vector, f is the splitting function, theta is the threshold, and the c-hats at the terminal nodes are the class labels. A feature vector propagates down the tree and gets labeled with the class of the terminal node it lands in.
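As a concrete illustration of that prediction step, here is a minimal sketch with a hypothetical node structure; the splits and labels are made up.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Minimal sketch of prediction with a binary classification tree of the kind just
# described: each internal node applies a splitting function f to the feature vector
# and compares against a threshold theta; terminal nodes carry class labels.
# The node structure and example splits are hypothetical, for illustration only.

@dataclass
class Node:
    f: Optional[Callable] = None       # splitting function applied to feature vector x
    theta: float = 0.0                 # threshold
    left: Optional["Node"] = None      # branch taken when f(x) <= theta
    right: Optional["Node"] = None     # branch taken when f(x) > theta
    label: Optional[int] = None        # class label, set only at terminal nodes

def classify(node: Node, x) -> int:
    # Propagate x down the tree until it lands in a terminal node.
    while node.label is None:
        node = node.left if node.f(x) <= node.theta else node.right
    return node.label

if __name__ == "__main__":
    # A toy tree: split on x[0], then on x[1] in the right subtree.
    tree = Node(f=lambda x: x[0], theta=0.5,
                left=Node(label=0),
                right=Node(f=lambda x: x[1], theta=1.0,
                           left=Node(label=1), right=Node(label=0)))
    print(classify(tree, [0.7, 2.3]))   # -> 0
```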
CART growing and pruning. So, to avoid overfitting, which is a critical issue not just in classification but also in regression, Breiman, Friedman, Olshen, and Stone, as part of their famous CART (classification and regression trees) program, suggested growing a large tree and pruning it back rather than using a stopping criterion. This is a global approach, as opposed to the local approach that a stopping criterion would be. So you grow the large tree until the terminal nodes are pure, that is, they contain members from only a single class, after which there's no point in continuing to split. Then CART minimizes what's called a complexity cost over the pruned subtrees; these are subtrees with the same root node. See, if you label all the nodes in the tree with some class, then each pruned subtree is actually a classifier itself, and selecting a pruned subtree is like selecting a less complex, less overfitted classifier. So, anyway, they minimize this complexity cost over the pruned subtrees for all possible values of the complexity parameter, and then find the complexity parameter using cross-validation. And they give a very interesting and efficient algorithm for performing the minimization over the pruned subtrees. I want to talk about it a little bit because we generalized it for the Neyman-Pearson approach.

So R(T) here is the misclassification rate of a classification tree T based on the training data set. This will be a very biased estimate, because the estimation is done on the same data that was used to grow the tree, so it needs to be penalized to eliminate the bias. So this complexity cost R_alpha(T) = R(T) + alpha|T~| is introduced, where T~ is the set of terminal nodes and alpha is the complexity parameter. As I said, CART grows the full tree and then finds the optimally pruned subtree which minimizes this complexity cost. They do this by proving the existence of, and then determining an efficient algorithm to find, thresholds alpha_k and pruned subtrees T_k such that the optimally pruned subtree T(alpha) equals the fixed pruned subtree T_k whenever alpha is in the interval [alpha_k, alpha_{k+1}). The whole problem is finite, so you would expect something like this; the issue is how you actually find the alpha_k's and T_k's, and that's what they gave an efficient algorithm to do. Okay, then you find alpha* and T*, the optimally pruned subtree, by minimizing an estimated misclassification rate based on cross-validation.

So what we did — well, I should say that CART uses this cross-validation, or a test set, to find an optimally pruned subtree amongst a parametrically generated family of pruned subtrees. We instead proposed to find an optimally pruned subtree amongst all subtrees, using all the data, in iterative growing and pruning phases, so we had no need for a complexity parameter. We did that by splitting the data set, iteratively growing and pruning based on alternating subsets, and establishing convergence. The key thing was to get things set up in a way that would actually converge. We also gave an efficient algorithm for the pruning phase.
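For reference, here is a minimal sketch of the textbook CART weakest-link computation that produces the thresholds alpha_k and the nested pruned subtrees T_k mentioned above; it is not the iterative growing-and-pruning algorithm, and the tiny tree and its node errors are made up.

```python
from dataclasses import dataclass
from typing import Optional, List, Tuple

# Sketch of CART's weakest-link (cost-complexity) pruning: repeatedly collapse the
# internal node whose removal increases training error least per terminal node
# removed.  This yields the thresholds alpha_k and the nested subtrees T_k.
# The tiny tree at the bottom and its node errors are made up for illustration.

@dataclass
class TreeNode:
    R: float                           # resubstitution error of this node if it were a leaf
    left: Optional["TreeNode"] = None
    right: Optional["TreeNode"] = None

def leaf_stats(node: TreeNode) -> Tuple[float, int]:
    # Return (sum of leaf errors, number of leaves) of the subtree rooted at node.
    if node.left is None:
        return node.R, 1
    rl, nl = leaf_stats(node.left)
    rr, nr = leaf_stats(node.right)
    return rl + rr, nl + nr

def weakest_link(node: TreeNode, best=None):
    # Find the internal node with smallest g(t) = (R(t) - R(T_t)) / (|leaves(T_t)| - 1).
    if node.left is None:
        return best
    r_sub, n_leaves = leaf_stats(node)
    g = (node.R - r_sub) / (n_leaves - 1)
    if best is None or g < best[0]:
        best = (g, node)
    best = weakest_link(node.left, best)
    best = weakest_link(node.right, best)
    return best

def prune_sequence(root: TreeNode) -> List[float]:
    # Collapse weakest links one at a time, recording the alpha_k thresholds.
    alphas = []
    while root.left is not None:
        g, node = weakest_link(root)
        alphas.append(g)
        node.left = node.right = None   # collapse the subtree into a leaf
    return alphas

if __name__ == "__main__":
    tree = TreeNode(R=0.40,
                    left=TreeNode(R=0.15, left=TreeNode(R=0.05), right=TreeNode(R=0.05)),
                    right=TreeNode(R=0.10))
    print(prune_sequence(tree))   # nondecreasing thresholds alpha_1 <= alpha_2 <= ...
```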
So here's an example. This is from the CART monograph, a problem in waveform classification; they used this problem extensively to examine their results and compare them with other methods. In the top table, we have CART and the proposed method. What you can see is that the number of terminal nodes, that's the first column, is about the same for the two methods. The estimate of the risk based on cross-validation is less for the proposed method, and the actual risk is less as well; we know the actual risk because we have a model from which it can be computed. Furthermore, the CPU requirement is dramatically less, because generating the parametric family of pruned subtrees turns out to be very computationally expensive. The lower table shows the iterations in the proposed method: it takes just three iterations, and actually, even after the first iteration, it's doing better than CART. However, it's difficult to determine which pruning method, or more generally which classification tree design, is better; results are problem dependent. This is a very active area of research that's been going on for at least 30 years. There are now newer methods, things called bagging and boosting and random forests, which use ensembles, and the goal is improved prediction over a single tree. So you can contrast ensembles with pruning, where you generate a single tree. However, single trees are still widely used for feature selection and because they can be interpreted; this is what I was talking about before: how does the classifier work? This is actually very important to people who use these things and don't just want a black box.

Okay, the next thing I'm going to talk about is Neyman-Pearson classification trees. Classification trees are usually designed using a Bayesian approach, minimizing the misclassification loss, or Bayes risk, in the various phases. The frequentist approach is handled by basically arbitrarily selecting some class prior values to generate some suboptimal and incomplete subset of the receiver operating characteristic; it's sort of a crude application of what's called the Neyman-Pearson criterion. We proposed an approach to generate the entire optimally labeled and pruned ROC, which would then yield the Neyman-Pearson design, and actually other things too, like minimax designs, as well as the area under the ROC curve, which is a popular way to compare classifiers in the machine learning and statistics community.

So we propose to minimize what we call a prior-parameterized complexity cost over the pruned subtrees, and also over the terminal labels, because they change when the priors change, for all possible values of the priors and the complexity parameter; then find the complexity parameter using cross-validation; then extract the receiver operating characteristic. By comparison, CART just minimizes the complexity cost, so it's sort of a one-dimensional problem, whereas we have a two-dimensional one. So we get a geometric aspect to the problem which wasn't there in CART. We give an efficient algorithm for doing this, and it turns out that the CART algorithm is really a special and much simpler case.

So here's a display with a little more detail of what's going on. We let P_D and P_F be the detection and false alarm probabilities. They're parameterized now by gamma, the prior of class zero in this two-class problem. The prior-parameterized complexity cost is this R_{alpha,gamma}, and we're minimizing it over all the pruned subtrees. And what we showed: we demonstrate the existence of, and then actually determine an efficient algorithm to find, what turn out to be convex polygons P_k, such that the optimally pruned subtree equals some fixed pruned subtree T_k whenever (alpha, gamma) is in one of these convex polygons.
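The extraction step at the end of this procedure can be pictured as follows: a minimal sketch that takes (P_F, P_D) operating points for a family of pruned and labeled subtrees and returns the upper boundary of their convex hull as the ROC. The operating points are made up, and this brute-force hull step is only an illustration, not the efficient polygon algorithm just described.

```python
# Sketch of the ROC-extraction step: given (false alarm, detection) operating points
# estimated for a family of pruned and labeled subtrees, the ROC is the upper
# boundary of their convex hull (randomizing between two classifiers realizes the
# segments in between).  The operating points below are made up; in the actual
# procedure they come from the subtrees associated with the (alpha, gamma) polygons.

def _turn(o, a, b):
    # Cross product of (a - o) and (b - o); negative means a clockwise (right) turn.
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def roc_upper_hull(points):
    """points: list of (P_F, P_D) pairs; returns the upper convex boundary
    anchored at (0, 0) and (1, 1)."""
    pts = sorted(list(points) + [(0.0, 0.0), (1.0, 1.0)])
    hull = []
    for p in pts:
        while len(hull) >= 2 and _turn(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    return hull

if __name__ == "__main__":
    operating_points = [(0.05, 0.40), (0.10, 0.62), (0.20, 0.70),
                        (0.30, 0.85), (0.50, 0.90), (0.25, 0.60)]
    for pf, pd in roc_upper_hull(operating_points):
        print(f"P_F = {pf:.2f}, P_D = {pd:.2f}")
```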
So we then find alpha*(gamma) and the optimally pruned subtrees as functions of the prior gamma, minimizing over alpha using cross-validation. Then we can get the ROC by varying gamma, which generates the whole ROC region, I guess you could say, and then we find the ROC curve by taking the boundary of the convex hull. It actually kind of surprised me that this whole thing could be done, but I think we've done it.

So this is an experiment, a credit assessment experiment, from the famous UCI machine learning repository. You're trying to determine from some training data whether somebody is creditworthy or not, that is, the class is plus or minus. This is the full tree I was talking about, and we want to find the ROC curve of optimally pruned subtrees, or the randomizations which yield it. Okay, so the figure on the left is the (alpha, gamma) space with the convex polygons, each representing where a particular pruned subtree is optimal. The figure on the right is a magnified view of the lower right corner; you can see these convex polygons here, and most of the action seems to be going on in that corner. There are actually 190 of these convex polygons, corresponding to 190 optimally pruned subtrees, which we then use: we estimate the probability of detection and false alarm for each of those — that's the figure on the left — and then we extract the boundary of the convex hull — that's the figure on the right. The algorithm we determined to do this is actually quite interesting; there are some linear programming subproblems that need to be solved, but it works.

Okay, the next thing I want to talk about is incremental and adaptive regression trees. Regression trees, like classification trees, are an important method for non-parametric and nonlinear regression. The trees are constructed like classification trees, except that there's a continuous response value at each of the terminal nodes. We're actually going to use a multiple linear regression — we could even use a generalized linear model — at each node, because we want to do piecewise linear regression, or piecewise linear filtering. There are various methods to define the regressions, the split points, and the prunings. Again, like classification trees, these can be used for prediction, and also for variable selection, ranking, and association.

So, incremental and adaptive regression trees; this is what we wrote down. Incrementally designed and adaptive regression trees are important when additional data becomes available or the data is not stationary, because you don't want to rebuild the whole tree; especially in some deep learning problem with a big data set, that's not practical. And of course rebuilding wouldn't work anyway if the data is not stationary and you're trying to track it. In the literature, people use heuristics for this; there's no analysis. The basic problem you have to come to grips with here is that, even with stationary independent data — strong assumptions — the data at the non-root nodes has a very complex, non-stationary, and dependent character, because the splitting at the nodes is changing as the new data comes in. So what we did was develop MMSE fixed-gain stochastic gradient algorithms, and also adaptive pruning algorithms; the whole thing was adaptive. And we were actually able to demonstrate convergence, and specifically how it's related to the tree depth, and this actually guided us in how to formulate the algorithm.
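Here is a schematic sketch of the per-node adaptation idea: each node of a piecewise-linear regression tree keeps a linear model that is updated with a fixed-gain, LMS-style stochastic gradient step as each sample arrives. The tree is a fixed two-leaf stump, split adaptation and pruning are omitted, and all constants and data are illustrative; this is not the algorithm from the papers.

```python
import numpy as np

# Schematic sketch of the per-node adaptation idea: each node of a piecewise-linear
# regression tree keeps a linear model updated with a fixed-gain (LMS-style)
# stochastic gradient step as each new sample arrives at that node.  The tree here
# is a fixed two-leaf stump with a hard-coded split; split adaptation and pruning,
# which the actual algorithms also handle, are omitted.  All constants are illustrative.

class AdaptiveLeaf:
    def __init__(self, dim, mu=0.01):
        self.w = np.zeros(dim)   # linear model weights
        self.b = 0.0             # intercept
        self.mu = mu             # fixed gain (step size)

    def update(self, x, y):
        e = y - (self.w @ x + self.b)      # prediction error
        self.w += self.mu * e * x          # LMS-style gradient step
        self.b += self.mu * e
        return e

class StumpTree:
    """Piecewise-linear model: split on x[0] at 0, one adaptive linear model per leaf."""
    def __init__(self, dim):
        self.leaves = [AdaptiveLeaf(dim), AdaptiveLeaf(dim)]

    def update(self, x, y):
        leaf = self.leaves[0] if x[0] <= 0.0 else self.leaves[1]
        return leaf.update(x, y)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tree = StumpTree(dim=2)
    for k in range(20_000):
        x = rng.uniform(-1, 1, size=2)
        # Piecewise-linear target with a kink at x[0] = 0, plus observation noise.
        y = (1.0 + 2.0 * x[0] if x[0] <= 0 else 1.0 - 3.0 * x[0]) + 0.5 * x[1] \
            + 0.05 * rng.standard_normal()
        tree.update(x, y)
    print("leaf weights:", [leaf.w for leaf in tree.leaves])
```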
And I'm actually very proud of this work. We developed some new ideas about analyzing hierarchically structured stochastic gradient algorithms. We also applied it to some nonlinear echo cancellation and equalization problems. Here's an example of equalization of a severe ISI (intersymbol interference) channel. These are learning curves. On the left plot, this is the tree-structured approach; as you move up, you get the linear equalizer, then a second-order polynomial, then a third-order polynomial. On the right plot, this is the asymptotic error rate, the probability of error. On the bottom, best performing, is the tree-structured approach, then the third-order polynomial, the second-order polynomial, and the linear equalizer; the linear equalizer actually has an error floor. So what we see is that, from the point of view of both convergence rate and asymptotic error rate, the tree-structured approach works better. Now the obvious alternative is to use a polynomial equalizer. The reason that doesn't work is that you have to use a high enough order polynomial to get enough approximation capability, but then there are so many terms that it slows down the rate of convergence. You could pick terms offline, and people do that, but that wouldn't be suitable for an adaptive implementation.

Okay, the next thing I want to talk about, still in this pattern recognition and machine learning area, is multilayer neural networks. Multilayer neural networks are a basic tool for non-parametric classification and regression, very popular in the current deep learning craze. They consist of weighted linear summations and nonlinear activation function units arranged in a feedforward network. There are other types of networks — recurrent networks, convolutional networks — but these are still popular. These feedforward networks are classically trained with the well-known back propagation algorithm, which is actually a stochastic gradient algorithm. Again, the field has progressed quite a bit since this work and there are new methods — pre-training at hidden layers, which are the layers that are not input or output layers, and unsupervised feature selection at hidden layers — but the back propagation algorithm is still the primary tool for training these types of networks, especially for fine-tuning, even in deep learning.

So here's a multilayer neural network. At the top is a neuron; the activation functions, in the circle there at the top, are shown with some examples. The sigmoid is the classically popular one; there are some others that are increasingly popular now. And at the bottom we have a two-layer, one hidden layer, feedforward neural net with multiple inputs and outputs.
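As a concrete reference point, here is a minimal sketch of such a two-layer (one hidden layer) sigmoid network trained by back propagation, i.e. a stochastic gradient step on the squared error for each sample; the network size, step size, and XOR data are illustrative.

```python
import numpy as np

# Minimal sketch of a two-layer (one hidden layer) feedforward network with sigmoid
# activations, trained by back propagation, i.e. a stochastic gradient step on the
# squared error for each sample.  Network size, step size, and data are illustrative.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class TwoLayerNet:
    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = 0.5 * rng.standard_normal((n_hidden, n_in))
        self.b1 = np.zeros(n_hidden)
        self.W2 = 0.5 * rng.standard_normal((n_out, n_hidden))
        self.b2 = np.zeros(n_out)

    def forward(self, x):
        h = sigmoid(self.W1 @ x + self.b1)   # hidden layer activations
        y = sigmoid(self.W2 @ h + self.b2)   # output layer activations
        return h, y

    def backprop_step(self, x, target, mu=0.5):
        h, y = self.forward(x)
        delta2 = (y - target) * y * (1.0 - y)          # output-layer error term
        delta1 = (self.W2.T @ delta2) * h * (1.0 - h)  # back-propagated to hidden layer
        self.W2 -= mu * np.outer(delta2, h)            # stochastic gradient updates
        self.b2 -= mu * delta2
        self.W1 -= mu * np.outer(delta1, x)
        self.b1 -= mu * delta1
        return float(np.sum((y - target) ** 2))

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    net = TwoLayerNet(n_in=2, n_hidden=4, n_out=1)
    for k in range(20_000):
        x = rng.integers(0, 2, size=2).astype(float)
        target = np.array([float(int(x[0]) ^ int(x[1]))])   # XOR, a classic test
        err = net.backprop_step(x, target)
    # The squared error should typically become small (results vary with initialization).
    print("squared error on last sample:", err)
```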
Okay, so what we did was analyze this, and I'll try to explain why what we did was more than just applying the usual theory. Back propagation is widely applied, but the analysis is difficult because it's a complex nonlinear stochastic system. The standard analysis uses averaging to determine an associated ODE, and then you linearize around an equilibrium point to get some kind of local asymptotic stability for the averaged system, and then as well for the back propagation algorithm itself. But it turns out that analysis does not explain the qualitative behavior, due to the nonlinearity, which has been observed over time with back propagation: there's a long-term dependence on the initial condition, and there's also a drifting of the weights.

So we did something different: we analyzed back propagation using a separate statistical linearization of each activation unit. This is like what's called the describing function method in nonlinear systems analysis. So, unlike the conventional approach where you linearize the whole mean vector field, the algorithm here is still nonlinear, and it reflects the mean behavior more accurately. This approach yielded an associated ODE which, it turns out, has an unbounded manifold of equilibria, and we showed that the trajectories of the ODE are bounded and converge to that manifold. We could not use Lyapunov theory; we had to use LaSalle's theory for this, and I should also point out that the convergence does not imply boundedness here, because the manifold is of infinite extent. And then, empirically, we confirmed that the back propagation mean vector field actually has such a manifold, and that there is this dependence on the initial conditions and drift along the manifold.

So what I've shown here is a very simple two-layer net, I think with just a single input and output, so there are two weights, but it illustrates the general theory, showing some equilibria and trajectories. The equilibria for the averaged back propagation algorithm are these sort of hyperbolic-looking curves. And for the quasi-linearized version, where we statistically linearized each activation unit, we have similar equilibria, and the trajectories are similar as well. And in fact, it does predict this type of weight drifting: depending upon the initial conditions, back propagation contacts or approaches this equilibrium manifold in different places and then subsequently drifts along it. As far as I know, we are the only ones to have done this kind of analysis and to actually explain this qualitative behavior.
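Here is a small sketch of how one might probe the averaged (ODE) vector field numerically for a toy two-weight network; the data model and grids are made up, and the point is only to show the kind of nearly flat, hyperbola-like valley of near-equilibria along which the weights can drift, not to reproduce the analysis.

```python
import numpy as np

# Sketch of how one might probe the averaged back propagation vector field for a
# toy two-weight network y_hat = w2 * tanh(w1 * x) fit to a linear teacher.
# The data model and grids are illustrative, not from the paper; the point is just
# to show the nearly flat valley of near-equilibria along which the weights can drift.

def mean_field(w1, w2, n_mc=5_000, seed=0):
    """Monte Carlo estimate of the averaged update direction (minus the expected
    gradient of the squared error) at weights (w1, w2)."""
    rng = np.random.default_rng(seed)               # fixed seed: same "training set" each call
    x = rng.uniform(-0.3, 0.3, n_mc)
    y = 0.5 * x + 0.02 * rng.standard_normal(n_mc)  # toy linear teacher plus a little noise
    h = np.tanh(w1 * x)
    e = y - w2 * h                                  # prediction error
    g1 = np.mean(e * w2 * (1.0 - h**2) * x)         # averaged update for w1
    g2 = np.mean(e * h)                             # averaged update for w2
    return g1, g2

if __name__ == "__main__":
    # For each w1, locate the w2 where the averaged field is smallest in norm; the
    # resulting (w1, w2) pairs trace a hyperbola-like valley (roughly w1 * w2 = 0.5
    # for moderate w1), the kind of near-equilibrium set discussed above.
    w2_grid = np.linspace(-2.0, 2.0, 401)
    for w1 in (0.5, 0.75, 1.0, 1.5, 2.0, 3.0):
        norms = [np.hypot(*mean_field(w1, w2)) for w2 in w2_grid]
        w2_star = w2_grid[int(np.argmin(norms))]
        print(f"w1 = {w1:.2f}: averaged field smallest near w2 = {w2_star:.2f}")
```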
Okay, the next thing I want to talk about is LMS algorithms. Adaptive algorithms are an approach to predicting and estimating some unknown signal, or to identifying or approximating some unknown model parameters. The identification and approximation problems are not exactly the same thing: in the approximation setting we don't really need a model, and in fact we mostly operate in that setting. Anyway, there are some subtle differences there, but the most well-known and widely used adaptive algorithm for MMSE linear estimation is the stochastic gradient algorithm called least mean squares, or LMS. There are many applications of this when low-complexity linear estimation is appropriate. What we did was investigate several fundamental algorithmic and theoretical problems, motivated by practical issues, using a constructive approach. And what do I mean by a constructive approach? A non-constructive approach is the following: you make very weak assumptions on the data and you show something like the existence of a step-size sequence, or a step size, with some desirable asymptotic property, but it doesn't say how big the step size is, and there really aren't useful details about the asymptotic property. In the constructive approach, you make strong assumptions on the data — maybe stationary data, stationary independent data, even stationary independent Gaussian data — and we can actually derive bounds on the step size and other practical information about the performance of the algorithm. Then what you can do is use simulation to validate that the algorithm works this way even when the data doesn't obey such strong assumptions. So this is a useful approach for engineering.

So here's the LMS formulation and algorithm. Y-hat is this regression on X; X is the regressor; we minimize the mean square error. There's also a tracking problem where the data is non-stationary. And I've shown the LMS algorithm at the bottom here; alpha is the step size.

So the first contribution I want to talk about is called noise constrained LMS. It turns out that in some problems the expected performance can be estimated. For example, in wireless communication systems, assuming that the actual channel is in the model set and automatic gain control is used, the achievable performance is actually just the noise power, and this is essentially known. This is actually the case in CDMA networks, where certain types of special signaling are done to monitor the signal-to-noise ratio. So what we proposed was to use this information in an adaptively constrained MMSE optimization to improve the performance of LMS. It's more general than that: this methodology can be used to incorporate components of model-based information into adaptive algorithms, and so interpolate between fully adaptive algorithms, which are typically very simple, and fully model-based algorithms. So this is sort of a different approach: instead of assuming there's a model and then, well, maybe I don't really know the model, so you scale the model back and things like that, we're starting at the bottom and adding model-based information to a fully adaptive algorithm.

So in the noise constrained — really, performance constrained — minimum mean square error estimation problem, what we do is minimize a Lagrangian formed from the mean square error and the constraint, and then, and this is the critical thing, we penalize the multiplier; the sign turns out to be important. Now why do we do that? The penalty term is added because otherwise it turns out the critical values are non-unique and unbounded, and in particular the multiplier is non-unique and unbounded, and that creates problems for an adaptive algorithm. By subtracting this penalty term — penalizing the multiplier — we get a unique critical value, which turns out to be a saddle point. Since it's a saddle point, we can't use a steepest descent stochastic gradient; we have to use a Robbins-Monro type algorithm. But we can do it, and adaptively solve for the solution of what we call the multiplier-penalized, constrained MMSE problem, which gives this algorithm. It's a type of variable step size LMS algorithm, where the step size alpha_k turns out to be data dependent. For stationary independent Gaussian data, we did a rigorous analysis and showed the NCLMS weights and multipliers are bounded in mean square. Actually, this led us to look at the general problem of analyzing LMS-type algorithms with data-dependent step sizes, which I'll say more about in a bit.
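Here is a schematic sketch of the noise-constrained idea: ordinary LMS whose step size is scaled by an adaptively updated multiplier-like quantity that grows while the squared error sits above the known noise floor and shrinks as the error approaches it. This illustrates the flavor of a data-dependent variable step size; it is not the exact recursion from the papers, and the channel and constants are made up.

```python
import numpy as np

# Schematic sketch of the noise-constrained idea: LMS with a step size scaled by an
# adaptively updated multiplier-like state that tracks how far the squared error sits
# above the known noise floor sigma_n^2.  Not the exact recursion from the papers;
# all constants and the channel are made up.

def noise_constrained_lms(x, d, n_taps, sigma_n2, alpha=0.01, beta=0.01, gamma=10.0):
    w = np.zeros(n_taps)
    lam = 0.0                                           # multiplier-like state
    errs = np.zeros(len(d))
    for k in range(n_taps, len(d)):
        u = x[k - n_taps + 1:k + 1][::-1]               # regressor, most recent sample first
        e = d[k] - w @ u                                # estimation error
        lam += beta * ((e**2 - sigma_n2) / 2.0 - lam)   # track excess error over noise floor
        step = alpha * (1.0 + gamma * max(lam, 0.0))    # larger step while error is high
        w += step * e * u
        errs[k] = e
    return w, errs

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 20_000
    h = np.array([0.5, 1.0, -0.3])                      # unknown channel (illustrative)
    x = rng.standard_normal(n)
    sigma_n2 = 0.01                                     # known noise power
    d = np.convolve(x, h)[:n] + np.sqrt(sigma_n2) * rng.standard_normal(n)
    w, _ = noise_constrained_lms(x, d, n_taps=3, sigma_n2=sigma_n2)
    print("estimated taps:", np.round(w, 3))            # should be close to h
```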
We also performed an approximate analysis that showed the NCLMS algorithm achieves a larger convergence rate and a smaller asymptotic MSE. And even when there was mismatch — when we hadn't estimated the performance correctly, the sigma squared in the constraint — there was still a performance gain; we were actually able to show that it's best to overestimate it. So here's an example of the identification of an ISI channel; these are learning curves for the third channel tap. The best performing, as you can see, is recursive least squares, which is a much more complex algorithm. Next to it is NCLMS. Then the other algorithms are various types of variable step size LMS from the literature, based mostly on heuristics rather than the kind of principled derivation we gave. The worst performing is plain LMS.

So the second topic in least mean squares type adaptive algorithms is a general analysis of variable step size LMS. The noise-constrained LMS we were just talking about is a type of variable step size LMS, but one that depends on the data sequence. There are many other types, and they're based mostly on heuristics. The idea is that you want to choose a step size that is large initially, so you get fast convergence, and small eventually, so you get a small asymptotic error. It turns out, when we looked carefully, that a rigorous analysis of the general data-dependent step size case was unknown in the literature. In fact, it was just assumed that if the variable step size satisfied the same bounds required for a fixed step size, it had the same stability as fixed step size LMS. That turns out to be true, and easily shown, if the step size is not fixed but deterministic. But all the popular versions of variable step size LMS, including NCLMS, are actually not just variable but data dependent. It's a difficult problem to analyze, because unlike the fixed or variable non-data-dependent case, you can't get a recursion for the weight error covariance.

So what we were able to do was determine some nonlinear difference equations satisfied by certain bounds on the weight error covariance — bounds which are uniform over an entire class of data-dependent step sizes. For example, a priori data-dependent step sizes, meaning they depend on the data up to the previous time, and a posteriori ones, which depend on data up to the current time. That was the key. We were then able to analyze these equations, determine stability regions, and we showed that the stability region for a data-dependent step size can actually be strictly smaller than for a fixed step size, contrary to the usual assumption in the literature.

A little bit of detail on this. Here's the general form of variable step size LMS; alpha_k is the, generally data-dependent, step size. We let script A denote a class of step size sequences of the kind I was describing: fixed, deterministic, a priori data dependent, or a posteriori data dependent. Then we let S_A be the set of step size intervals for which the weights are mean square bounded for all step size sequences in that class which lie in the interval. We call this the mean square stability region for the class of step size sequences A.
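Here is a rough numerical probe of the mean-square stability notion just defined: run an ensemble of LMS trials on stationary independent Gaussian data with a given step-size rule and watch whether the ensemble-average squared weight error settles or grows. The step sizes, filter length, and the particular data-dependent rule are made up; the exact stability thresholds are what the analysis characterizes.

```python
import numpy as np

# Empirical probe of mean-square stability: for a given step-size rule, run an
# ensemble of independent LMS trials on stationary independent Gaussian data and
# watch whether the ensemble-average squared weight error settles or grows.  This is
# only a numerical illustration of the "mean square stability region" idea; the
# weights, step sizes, and ensemble size are made up.

def avg_squared_weight_error(step_rule, n_taps=4, n_steps=300, n_trials=100, seed=0):
    rng = np.random.default_rng(seed)
    h = rng.standard_normal(n_taps)              # "true" weights (illustrative)
    msd = np.zeros(n_steps)                      # ensemble-average squared weight error
    for _ in range(n_trials):
        w = np.zeros(n_taps)
        for k in range(n_steps):
            u = rng.standard_normal(n_taps)      # independent Gaussian regressor
            e = (h - w) @ u + 0.1 * rng.standard_normal()
            w = w + step_rule(e, u) * e * u
            msd[k] += np.sum((h - w) ** 2)
    return msd / n_trials

if __name__ == "__main__":
    for alpha in (0.02, 0.05, 0.5):
        # Fixed step size: bounded for small alpha, mean-square unstable when too large.
        msd = avg_squared_weight_error(lambda e, u, a=alpha: a)
        print(f"fixed alpha = {alpha}: final average squared weight error = {msd[-1]:.3g}")
    # An a posteriori data-dependent rule (the step depends on the current error);
    # rules of this kind are the ones whose stability region can shrink.
    msd = avg_squared_weight_error(lambda e, u: 0.05 * min(e * e, 2.0))
    print(f"data-dependent rule: final average squared weight error = {msd[-1]:.3g}")
```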
Now, it's known that for stationary independent Gaussian data — strong assumptions, in the constructive spirit — the stability region for fixed step sizes is basically the set of step size intervals bounded above by some parameter alpha*, where alpha* can be characterized in terms of the eigenvalues of the covariance of the regressor. And it's also easy to show that this chain of inclusions holds. So what we showed first, for the case of a single tap, is that the stability region for a priori step sizes is the same as for fixed step sizes, but the stability region for a posteriori step sizes is strictly contained in the stability region for fixed step sizes. Then, when we looked at what turned out to be a much harder problem, multiple taps, we were able to get bounds on the a priori and a posteriori step size regions. Still, we were able to show that the a posteriori region is strictly contained in the stability region for fixed step sizes.

So here's an example of this. The stability region for fixed step sizes is just this kind of triangular region here; this point here is alpha*, and this axis is the upper bound. Any step size less than alpha* would be mean square stable as a fixed step size. The stability region for the a posteriori step size is now bounded away from the region for the fixed step size. And it's kind of interesting, because it shows that as the upper limit of the step size interval gets larger, so does the lower limit. Right? The lower limit on the step size interval for this value of the upper limit is here; if you're here, it's here. So this interval is getting worn down. And actually, as the upper limit of the step size interval approaches the maximum value, so does the lower limit. Which means that if you want to choose the maximum step size, say to get the largest rate of convergence, you essentially have to choose a fixed step size, because there's no wiggle room between the lower and upper limits once you choose such a large step size. In the multi-tap case, we were only able to get bounds; the figure on the left shows such bounds for a minimal eigenvalue spread, and the one on the right for a considerable eigenvalue spread.

Okay, so that's all I wanted to talk about in detail; I want to go over some other things briefly. I wanted to discuss some work which I've spent a lot of time on here at Purdue — realistically, at least half my career, probably more — but which I've moved on from in the last several years, to the things I was talking about previously, or maybe moved back to them: machine learning, pattern recognition, statistics, optimization. Although there is some overlap. So this other work involved a lot of work with optimal and near-optimal model-based methods: all kinds of variants of AR and ARMA models, Markov, hidden Markov, hidden semi-Markov, and state-space models, things like that. This is a list of some of the things I worked on; I'm not going to read through them. Well, maybe I'll mention the last two. One was ethanol concentration estimation and pattern recognition from an implantable biosensor. I should say something about the funding, I guess: this was funded by an NIH R01 grant. And most recently, harmonic spectral analysis and pattern recognition from passive probe devices, which was funded by an Army MURI. The other earlier work was actually funded by NSF.
And the work I was discussing on adaptive algorithms and stochastic approximation, that was also funded by NSF, but also by the Army Research Office under a core grant. I also got some high-performance computers from them under their DURIP program, and I think around the year 2000 I actually had the most powerful compute servers in the department. I remember the department head at the time asking me for accounts on those machines for some of the incoming faculty. Kind of a funny story.

Okay, another area I worked on, and mostly moved on from, although I have some regrets because this area has now become very active again: I've done a lot of work on practical problems, most of it supported by industry, involving modeling, algorithm development, analysis, and lots and lots of simulation of all kinds of different coding and modulation schemes and channels in wireless and satellite communications and broadcast systems. Again, I'm not going to read through all of these. The most recent one is nonlinear channel coding for satellite channels, with Northrop Grumman, which went on for a few years. It's kind of an interesting channel model — a peak-power-constrained channel. Oh, one thing I should say is that I had a multi-year relationship with Thomson, initially Consumer Electronics and then Multimedia, before they left, and that laid the foundation for writing a fairly large 21st Century grant, large at the time.

So I learned a lot from this work, but I'm kind of returning to the pattern recognition and machine learning area. My interest here is in what I've called modern time series analysis, using tools from statistics, machine learning, optimization, and also some model-based signal processing. Some recent work is discovering temporal drinking patterns from an implantable ethanol biosensor — I already mentioned that — and also discovering temporal dietary and physical activity patterns from surveys and accelerometry. These are both problems in the health area, and there are a lot of these kinds of problems there. We had a little seed grant from NIH for the accelerometry work, but it's been challenging to get funding. We have proposals in, though, and we're optimistic.

On the theoretical side, I'm interested in developing and analyzing classification and regression trees — both the pruning of a single tree and the averaging of ensembles of trees — for classification and prediction of time series. And also in developing and analyzing dynamic time warping for comparing and clustering sparse time series. Dynamic time warping is a method to compare time series which are running at different rates, which is the kind of thing that happens a lot when people are involved — when they speak or eat or move. And some of these time series are quite sparse. Sparseness can arise in various ways. One is, you know, a missing measurement — maybe a sensor drops out or somebody doesn't answer a questionnaire. That typically isn't a lot of sparseness; that's kind of a rare event. But the other way it can arise is that you can have lots of zeros in a time series record — periods where people aren't speaking, or maybe not eating or moving. That's actually very significant. And so that's what we're looking at: trying to understand fundamental limits and practical algorithms for dealing with that kind of sparseness while maintaining near-optimality under the dynamic time warping criterion.
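For reference, here is a minimal sketch of classical dynamic time warping, the dynamic program that aligns two sequences running at different rates; the example sequences are made up, and the sparse and near-optimal variants being studied here are not implemented.

```python
import numpy as np

# Minimal sketch of classical dynamic time warping (DTW): a dynamic program that
# aligns two sequences which may be running at different rates and returns the cost
# of the best alignment.  The example sequences are made up; the sparse and
# near-optimal variants mentioned in the talk are not implemented here.

def dtw_distance(a, b):
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Best of a match, an insertion, or a deletion step.
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[n, m]

if __name__ == "__main__":
    t = np.linspace(0, 1, 50)
    slow = np.sin(2 * np.pi * t)          # one cycle at a steady rate
    fast = np.sin(2 * np.pi * t ** 2)     # same shape, locally sped up and slowed down
    # The DTW cost should be much smaller than the point-by-point distance,
    # since the warping absorbs the difference in rates.
    print("DTW cost :", round(float(dtw_distance(slow, fast)), 3))
    print("pointwise:", round(float(np.sum(np.abs(slow - fast))), 3))
```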
So we just had a paper accepted on this subject at one of the more prestigious machine learning workshops, and we'll see how that goes. Okay, so I've got a couple of pages of references. I guess that's it. Any questions? One question: it's a very nice algorithm for building [inaudible]. Do you have code for this? Are you willing to share the code? We do. Okay, so I'll take you up on that. We do; I can't guarantee that it's in great shape, but we do. Okay, fine. Thank you very much. Thank you.