Good evening everyone, and welcome again to the third lecture of our inaugural Ihaka lecture series. Today we have another area of statistics, another area of statistical computing. Genevera Allen is the sort of person who gets described as a rising star in statistics. She is in the Department of Statistics at Rice University, and also the Department of Electrical and Computer Engineering at Rice, the Baylor College of Medicine, and the Texas Children's Hospital. This is a side effect of being a statistician who does methodology research, developing methods for studying new and complicated types of data, which then leads to working with people who have new and complicated types of data. A lot of that is sort of beyond the level we would expect a public-lecture audience to cope with, even before we've given you a glass of wine. So today she's going to be talking about visualization of clustering and biclustering algorithms for high-dimensional data. Thank you, I'm very honored to be here and excited to give this talk. So today I'm going to be talking about clustering and biclustering, hopefully convincing you that convex clustering and biclustering are interesting and worthwhile, and then we're going to take this and say, okay, there are really cool things we can do with this in terms of interactive visualizations in R, and I'll move towards that R side of things. This is joint work with one of my grad students, John Nagorski. Let's start with a little motivation and background. What is clustering? Clustering is a very fundamental exploratory analysis tool. It's very simple: you're finding groups of objects in a data set. These could be groups of features, they could be groups of observations, or they could be a little more complicated than that. But basically, when you have really, really big data, you need to make some sense of it, and clustering is one of the ways to do so and define groups. So who uses clustering? The answer is pretty much everybody. You can imagine that all of these internet companies, Amazon, Facebook, Google, use clustering all the time to find groups of users who behave similarly in terms of their preferences and profiles on the internet. You can imagine the NSA also probably uses clustering of sorts. But this is also used a lot in medicine, understanding groups of subjects that behave similarly on a genetic level or clinically, for precision medicine purposes. It's also done with new wearable devices, finding people who have similar profiles in terms of accelerometers or Fitbits. There are tons of really exciting applications of clustering. It's a really fundamental thing that we like to do with data. So I mentioned the term biclustering. You might not have heard of biclustering before. What is biclustering? Biclustering just means "two" clustering: it's basically finding groups in both observation space and feature space. What do I mean by that? Let's take a simple example where we have, say, patients, and we have measured genomic profiles on those patients. Perhaps we want to group those patients to see which sets of patients are similar to each other, and at the same time we want to understand the genetic basis for the similarity of those patients. So we're grouping two things: we're grouping patients, or observations, and we're grouping features, or variables, which are the genes.
And what you see here, if we organize our data, these observations and features, in the form of a data matrix, is something called a cluster heat map. You have probably seen these before used to visualize data. This is an actual heat map of the data matrix itself, where yellow means very high values and blue means very low values for each individual entry. So each pixel in this image is an entry in your data matrix. And this is a cluster heat map, meaning that some form of clustering was applied to the columns of the data and to the rows of the data, and the data was reorganized according to that clustering. So we can see there seems to be a big group of blue, with very low values, in this group here. So why is biclustering important? I'm going to motivate biclustering, which again is commonly used in a lot of areas, with just a couple of scenarios. The first is precision medicine, and specifically, this is used a lot in understanding cancers. People used to think that cancers were just lung cancer or breast cancer or colon cancer, that they were molecularly based on the organs in which the tumors originated. But upon looking at the genomic profiles of lots of tumors, scientists realized that there's actually a lot of heterogeneity in the tumor samples. And in fact, in breast cancer, it was discovered, around 2001, in a major success story, that breast cancer is not actually just one type of cancer. On a genetic level it really behaves like five different types of cancer, and each of these subtypes has a different clinical outcome associated with it. This was one of the first major success stories of biclustering applied in science. And the importance of this is precision medicine: scientists are now developing drugs that target those individual subtypes specifically. So you can actually test someone's genotype, understand what subtype they're in, and then give them precision, targeted drugs. So this is a big success story of biclustering. Another area where biclustering comes up a lot is text mining. Text mining can be very simple: understanding text on the internet, understanding Twitter, and so forth. In text mining there are lots of ways to organize data; I'm presenting one particular way here. This is called the bag-of-words model, which gives you a document-by-term matrix. So here I have six fictional documents, and I have some words, all of which you will probably hear in this talk today. The entries are the number of times each word appeared in each document. And the reason biclustering is really important for understanding these text mining examples is, for example, we might want to find all of the documents that are related to Shiny. It looks like we have three documents related to Shiny, and it looks like we've got two documents maybe talking about big data or something like that. So we might want to understand how the documents group together in terms of topics, and also what words differentiate those topics. So this is a really important example of biclustering. Another place you see this is recommender systems.
So this is just a silly snapshot from Netflix, but you see this all the time with Netflix, Amazon, or likes on Facebook, where, depending on your ratings, your likes, or how much you viewed a product online, you want to group users into similar categories, say, users one and two really like horror movies, but three and four love rom-coms, and so forth. And of course, you also need to group the products or the movies at the same time, to understand which categories they lie in and make recommendations to customers. So these types of methods are used all the time in tons of different scenarios, from medicine to internet advertising and lots of things in between. So what are some approaches that people use for clustering and biclustering? There are a ton of methods out there, literally a ton of methods. K-means and hierarchical clustering always seem to be the two major ones. K-means is really simple. It says: find me, say, three means, or groups, in the data. These are denoted by the three big dots here; this is a scatter plot of a two-dimensional data set, color-coded. And hierarchical clustering says: build me this dendrogram, this tree-like object. What it does is it starts with each individual data point, perhaps these two points, and says these two points appear to be very close together, so we are going to join them, or link them, via this tree, and they are now fused into one. And then you build up this tree from the bottom to the top by fusing individual objects that are similar to each other. And you can understand how both of these algorithms lead to groupings of objects, right? I've color-coded the points according to the predicted clusters, and since this was just a toy simulation, this matches the underlying truth here. So all of these lead to interesting examples of clustering, and there are some really good properties of these methods. They're really, really fast, and they're super easy to visualize, especially hierarchical clustering. We can easily see there are three main clusters here that I've color-coded differently; that's very easy to pick up right away. But there are some caveats to all of these methods, and this is true not just of k-means and hierarchical clustering but of a lot of other clustering methods. A lot of these methods give local solutions, meaning that if you start them at different initial starting points, you don't get the same solution every time; you don't find the same groups every time. That can be a problem, especially, say, in medicine, where those groupings really do matter quite a bit. There's also some instability. If you change your mind and say, no, I don't think this data set has three clusters, I think it has two, those cluster assignments can change radically. You can understand that in practice that might not be the best either. And then there's the issue of how many clusters there are in this data set, or whether there are groups at all, and these are really hard questions to answer with existing methods. With biclustering the story is very similar. In the cluster heat map I mentioned before, hierarchical clustering is applied to the columns of the data set completely separately from the rows: you do clustering of the columns completely independently of clustering of the rows.
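As a rough illustration of the standard tools just described, here is a minimal sketch in base R on a made-up toy data set; the data and the choice of three clusters are assumptions for illustration, not the simulation from the slides.

```r
# Minimal sketch of k-means, hierarchical clustering, and a cluster heat map in base R
set.seed(1)
X <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 4), ncol = 2),
           matrix(rnorm(40, mean = 8), ncol = 2))   # toy data with three groups

km <- kmeans(X, centers = 3)           # "find me three means"
hc <- hclust(dist(X))                  # build the dendrogram bottom-up
plot(hc)                               # the tree: close points fuse first
plot(X, col = km$cluster)              # scatter plot colored by k-means assignment
points(km$centers, pch = 19, cex = 2)  # the three "big dots" (cluster means)

# The classical cluster heat map: rows and columns are clustered independently
heatmap(X)
```

Note that heatmap() clusters the rows and the columns independently of each other, which is exactly the limitation discussed next.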
So you're not actually taking into account what the other view is seeing. You can imagine, if these are documents and these are terms, the document clustering is not taking advantage of the groups that are known in the terms. So there are some issues with this, and also, since these use the same techniques for clustering, they inherit all of the poor properties we just talked about. And one of those poor properties is, again, instability, by which I mean that very small changes to your data or small changes to your algorithm can completely change your results. So this is a surprising example. Here's some original data, and I added 5% random noise to this data. Not much, just 5% Gaussian noise. And look at the dendrograms and the groupings that result from adding 5% noise. They're very, very different, right? I work with scientists a lot, and if I went to scientists with one version of their data that got one grouping and then walked in with a completely different one, they probably would not trust me that much, right? It's really important that we have stability in these algorithms. So what do we want to do? Our goal here is to make clustering great again. And believe it or not, Donald Trump is going to come up a couple of times in this talk. It's going to happen. So how are we going to make clustering great? The solution is that we're going to propose convex clustering and biclustering methods. Why convex? Convexity is a really nice mathematical property that means all local solutions are global solutions. This gives us very important mathematical properties in terms of stability and consistency of our results, which means we're going to be able to find more reproducible clusters in the end and also have very data-driven ways to select the number of clusters. So there are a lot of really good mathematical properties that come from formulating this as a convex problem. Okay, so there is going to be a little bit of math here, but I've color-coded the math so that we can work through it. What you're looking at here is the optimization problem for convex clustering. Let me point you first to the data: each data point, or observation, we denote x sub i, and each observation is assigned a centroid associated with it, u sub i. All the loss term is doing is saying: make my centroids close to my data. But then there's this penalty term right here that regularizes, and this penalty is actually a convex fusion penalty that pushes those centroids together so that they fuse and form groups, or clusters. So what is this actually doing? The loss function is the exact same loss function as in k-means, which we know and use well, and the fusion is the same type of fusion as in hierarchical clustering, except it's a convex version of it. So we're actually on really solid ground here: we're basically doing k-means and hierarchical clustering combined into one mathematically nice formulation. And one really cool thing is that there is only one tuning parameter in this whole problem, lambda, and it controls both the cluster assignments and the number of clusters. When you start off with lambda equal to zero, each centroid is its own cluster, and as lambda gets bigger and bigger, those u's get pushed together.
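In symbols, the objective just described can be written roughly as follows; this is a hedged reconstruction from the verbal description, and the exact norm and weight convention on the slide may differ.

```latex
\min_{u_1,\dots,u_n} \;\; \frac{1}{2}\sum_{i=1}^{n} \lVert x_i - u_i \rVert_2^2
\;+\; \lambda \sum_{i < j} w_{ij}\, \lVert u_i - u_j \rVert_2
```

The first term is the k-means-style loss keeping each centroid u_i close to its data point x_i; the second is the convex fusion penalty, with pre-assigned weights w_ij and the single tuning parameter lambda.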
This penalty right here is saying: push u sub i and u sub j together and eventually fuse them. And they do become fused; when lambda is very large, they all fuse together. The weights can be pre-assigned. There are some nice algorithms for this, iterative optimization routines, and you can explore these in R with the cvxclustr package. But I think it's best to illustrate this with pictures. So this is a toy simulation. Here you might say there are five clusters, you might say there are two, or you might not be sure how many you would say. But this is a toy simulation, and what I'm going to be plotting here are the convex clustering solution paths. What we're looking at are the raw data points in two-dimensional space, and when I increase lambda, that one tuning parameter, what I'm plotting in blue are the centroids for each observation in our data matrix. You see the centroids start to move away from the original data points, and if we increase lambda a little more, all of a sudden those centroids have started to fuse. We see that now our centroids indicate there are actually five clusters in this data set; there are five particular points that the centroids have merged to. And as we increase lambda again, now we have three clusters, then eventually they all fuse and form one cluster. So let me just go back and play this again briefly. We start out with no clusters; each data point has its own centroid. And as we increase the regularization parameter, the centroids fuse until we get complete fusion here. So not only does this have nice mathematical properties, I think it's a really cool way to visualize and basically watch your data form clusters, which is what I'm going to be getting to in the latter portion of the talk. Okay, so that was clustering. But what about biclustering? Again, we mentioned that we really want to group the columns and the rows at the same time. For example, in text mining, we want to group the words at the same time that we're finding the topics in the documents. So here is, again with color-coded math, the optimization problem for convex biclustering. What you're looking at here is, again, a loss function. It starts out with every individual data point, each observation-feature entry x sub ij, assigned its own bicluster centroid u sub ij, and we want these centroids to be close to the data. Again, this is the same loss function as with k-means clustering. And then we want to somehow fuse the rows and the columns, and we're going to do this with two penalties: the blue penalty, which merges and fuses together the row centroids, and a green penalty, which fuses together the column centroids, with one tuning parameter that controls them both. So again, all we're doing is fusing the row and the column centroids, and lambda controls both. One of the key tricks here was actually assigning the weights in such a way that fusion happens at the same rate in rows and columns and you get proper joint fusion. So note that this is different from just doing a separate clustering on the rows and a separate clustering on the columns: these fusions take into account what the other side is doing. And this is also available in R with the cvxbiclustr package.
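Written out, the biclustering objective just described looks roughly like this; again a hedged reconstruction, where U_i. and U_.k denote the rows and columns of the centroid matrix U, and the two weight sets correspond to the blue (row) and green (column) penalties on the slide.

```latex
\min_{U} \;\; \frac{1}{2}\lVert X - U \rVert_F^2
\;+\; \lambda\left( \sum_{i < j} w_{ij}\, \lVert U_{i\cdot} - U_{j\cdot} \rVert_2
\;+\; \sum_{k < l} \tilde{w}_{kl}\, \lVert U_{\cdot k} - U_{\cdot l} \rVert_2 \right)
```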
Okay, so again, it's super easy to visualize the solution paths here with images. What you're looking at here is an image of a data matrix. Again, yellow means high values and blue means low values. When lambda equals zero, each centroid is just the original data point; what I'm plotting here are the centroids, and each one starts out as its data point. As I increase lambda, all of a sudden you see some fusions occur, but now they're occurring in the rows and the columns together. You see a little bit more fusion as you increase that one tuning parameter, and even more, you start seeing it form these checkerboard patterns, groups of rows and columns at the same time. Even more checkerboard; we've got four biclusters there pretty clearly. And eventually it's going to all merge and we'll have one big bicluster. So let me just play this again. This is a mathematical problem that basically allows your data to fuse along these paths, and you can visualize and see how your data is forming groups. So we're pretty excited about this. Eric Chi worked on some of this; Eric was a postdoc at Rice and is now an assistant professor at NC State University. We're pretty excited because this is a global solution to a problem that typically only has local solutions. There are some real mathematical advantages here; I can refer you to the papers if you're interested. But basically the big thing is that it's very stable because it's a convex problem: you've got consistency results and stability results that allow you to find very reproducible clusters and biclusters. The only caveat is that some of the algorithms used to solve those mathematical problems can be pretty slow. So this is going to be a bottleneck, and this is what I'm really going to be talking about today. So let's get on to how we interactively visualize these clusters and biclusters from this method. What do we really want to do? I've hinted at this already. We want to build tools that allow you to watch your data form clusters and biclusters; we want to allow you to interactively and dynamically watch how your data groups together. Dendrograms are very popular, so we would like to use these nested families of clustering and biclustering solutions to build dendrograms, so we can visually see how elements in our data are linked together. And we'd also like to visualize these continuous clustering paths. But there's a problem with this. All of the existing algorithms are convex optimization routines that solve this convex clustering and biclustering problem. They're iterative methods, meaning that you take one iterate, you get an update, and you keep repeating those iterates for a long time. Typically, these are run for one value of lambda, so one clustering solution at a time, and these iterative algorithms can take 10,000 or so iterations at one value of lambda to converge. Now, if you want to get the whole clustering path, think about how many lambdas you have to solve for. In particular, we want to build dendrograms and find the exact point at which two of those centroids fuse together. So imagine you have to continuously evolve along this path and run a bisection algorithm to find when these fusions occur, and then 10,000 iterations for each solve to converge. Basically, it's really, really slow. When my student first tried to do this, he came back to me a week later and said: it didn't converge in a week. I could not get this to run at all and find the fusions. We literally could not do it.
And this was an example with n equal to 50, so only 50 data points. So this was way too slow. This was not going to allow us to watch our data form clusters and biclusters in real time so that you can actually play with it. So we needed a new algorithm. How are we going to do this? Okay, so I'm going to throw out a very harebrained idea, but bear with me. This is something I call the algorithmic regularization path. The idea sounds crazy, but here it is: you start with each observation being its own cluster, with no regularization, so lambda equal to zero. You perform one iteration of the algorithm that you would typically use to solve the clustering or biclustering problem, just one iteration, and then increase the regularization level by a tiny amount. So we're not running 10,000 iterations; we're not letting it converge. You take one tiny step, you increase the regularization by a tiny amount, and you keep repeating this: one step of the algorithm, a tiny increase in regularization. You stop when everything has formed one cluster, when everything has merged together, and you take the iterates themselves as the algorithmic clustering path. This sounds completely crazy, and here's the math for it. Okay, I had a couple of slides with math; it happens, this is what I do. The main thing to pay attention to is here: we increase the regularization by a logarithmic, or multiplicative, factor, with a tiny step size t at each step, and take the iterates as the algorithmic clustering path. So look at these two plots very carefully. See any differences? Both of these are examples of clustering paths. This one over here was computed by solving the convex clustering problem on a very, very fine grid of lambda values, with tens of thousands, hundreds of thousands, even more iterations to get a path this fine. And this one is our algorithmic clustering path run with a very small t, and it actually took only 10,000 total iterations. It's very fast: we're talking days to a week to converge versus seconds to converge here. And they look pretty much identical. And there's some nice math, okay, a little more math than I anticipated, it happens, to back up that these two things are actually the same. What this result is saying is that the convex clustering solution path and our algorithmic clustering path converge to each other as the step size goes to zero; they're the same thing in a particular distance metric. Cool. So let's get to some of these interactive visualizations. Just to recap: convex clustering and biclustering seem like really cool methods that can help you understand and watch your data form clusters, but the bottleneck was that the algorithms were too slow. So we've developed these new fast algorithms that allow us to do this in real time. And of course I'm going to give lots of example demonstrations; these were built using Shiny in R.
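To make the one-step-per-increment idea concrete, here is a minimal, hedged sketch in R. It is not the speaker's implementation: it takes one ADMM sweep on the convex clustering objective per tiny multiplicative increase in lambda, and the Gaussian kernel weights over all pairs, the starting lambda, and the all-fused stopping rule are all assumptions made for illustration.

```r
# Minimal sketch of an algorithmic regularization path for convex clustering.
# One ADMM sweep per lambda, then lambda grows by a multiplicative factor (1 + t_step).
# X: n x p data matrix (rows are observations). Returns the list of centroid iterates.
algo_reg_path <- function(X, t_step = 0.05, rho = 1, max_steps = 2000) {
  n <- nrow(X); p <- ncol(X)
  pairs <- t(combn(n, 2))                             # all pairs (i, j), i < j
  d2 <- rowSums((X[pairs[, 1], , drop = FALSE] -
                 X[pairs[, 2], , drop = FALSE])^2)
  w <- exp(-0.5 * d2 / median(d2))                    # Gaussian kernel weights (assumed)
  m <- nrow(pairs)
  D <- matrix(0, m, n)                                # edge incidence matrix
  D[cbind(seq_len(m), pairs[, 1])] <- 1
  D[cbind(seq_len(m), pairs[, 2])] <- -1
  L <- crossprod(D)                                   # graph Laplacian D'D
  U <- X; V <- D %*% U; Z <- matrix(0, m, p)
  lambda <- 1e-8
  path <- list(U)
  for (s in seq_len(max_steps)) {
    ## one ADMM sweep at the current lambda
    U <- solve(diag(n) + rho * L, X + rho * t(D) %*% (V - Z))
    A <- D %*% U + Z
    shrink <- pmax(0, 1 - (lambda * w / rho) / pmax(sqrt(rowSums(A^2)), 1e-12))
    V <- A * shrink                                   # group soft-thresholding
    Z <- Z + D %*% U - V
    ## tiny multiplicative increase in the regularization level
    lambda <- lambda * (1 + t_step)
    path[[s + 1]] <- U
    if (all(sqrt(rowSums(V^2)) < 1e-8)) break         # all centroid differences fused
  }
  path
}
```

For instance, `path <- algo_reg_path(scale(iris[, 1:4]))` traces centroid iterates that can be plotted step by step, giving exactly the kind of movie shown in the demos that follow.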
Let's go to a couple of example demonstrations. So I am in RStudio, and let's start with this first demonstration, a toy clustering example, and go over exactly what's going on. What we have here is a silly toy example, so bear with me: we have a two-dimensional space, with X1 plotted against X2, and we've got points that we've labeled A, B, C, and D, because this is just a demonstration. I'm going to play this movie here, and we can watch, in red, the convex clustering paths, the centroids for each observation. As these centroids merge, you're watching these paths form clusters, and you're watching along these dendrograms; I'm going to come back and explain these dendrograms. Every time a cluster merges, you see it form a link on these dendrograms, and you see the groups form on these trees. Eventually everything will completely merge and it will be one dendrogram. So what you're seeing in this dendrogram: the scale here is the log regularization scale for these convex clustering paths, and the height at which two entities are merged is the regularization level at which those two centroids have exactly come together. You see those connected as a link in the tree. So you notice the first two to come together are actually these two D points that are very, very close to each other; they come together at this regularization level. Ish. I think if we go back... close enough. We've got the two B's merged and the two D's merged right there. And as we increase that regularization level, more and more merges are happening, and all we're doing is moving up the dendrogram. So you can actually see where you are in the dendrogram and where you are in the paths of these clusters. We also have a static version of this where, for example, if you say, I think I'm interested in five clusters, it color-codes on the solution path where you are to get exactly five clusters, and what's the lowest cut on the tree that gives you exactly five clusters. So if someone says, I think there are three clusters in this data set, you can go and look and see where on the dendrogram those three clusters are formed, like this. Cool. So this is the idea for clustering: we're showing things both in dendrogram space and in the clustering path space. I'm going to exit this and now show a quick demo of biclustering. Let's get this up and running. Okay, this is another silly toy example. You can kind of see right now where the groups, the biclusters, are in this data set. Before I play the movie, I want to highlight that these dendrograms here are again from the convex biclustering algorithm: the height of these merges is the log regularization level at which those two entities have merged together. And what's interesting is that this is not the same as the cluster heat map, because the groupings in the column space depend on the groupings in the row space and vice versa. So this is actually an improved version. And let's watch this movie of how these entities start to merge together. We see right away that this whole chunk of rows seems to be very similar, and this whole chunk of columns merges very quickly. Going up the tree, you see merges at different heights, and every time you change the regularization level, it changes in both the row and the column space, so you have these joint groupings of both the rows and the columns. And eventually the whole thing merges to a very ugly brown color, but that's the middle color. Whoops, not sure what happened with the color there.
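Since the dendrogram heights in these demos are just the log regularization levels at which centroids meet, here is a hedged sketch of how those merge heights could be read off a centroid path such as the one returned by the hypothetical algo_reg_path() above, assuming you also track the lambda value associated with each element of the path. The function and argument names are illustrative, not the API of the speaker's package.

```r
# Recover pairwise merge heights from a centroid path.
# path: list of n x p centroid matrices; lambdas: the (positive) regularization
# value associated with each element of path; tol: fusion tolerance.
fusion_heights <- function(path, lambdas, tol = 1e-6) {
  n <- nrow(path[[1]])
  height <- matrix(NA_real_, n, n)                  # (i, j): log-lambda at which i and j merge
  for (s in seq_along(path)) {
    fused <- as.matrix(dist(path[[s]])) < tol       # which centroids currently coincide
    newly <- which(fused & is.na(height), arr.ind = TRUE)
    height[newly] <- log(lambdas[s])                # first lambda at which they coincide
  }
  height
}
```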
So again, these dendrograms come from the algorithm, and this allows us to watch how our data forms these groups. Cool. Any questions on this before I... Okay, so let's look at an example next, and this is going to be kind of a fun one. I'm going to use a text mining example with U.S. presidential data. This example came about because the grad student I was working with on this, John Nagorski, and I were sitting around following the news coverage of Trump's inauguration speech, and we were talking about it and saying, wow, that was pretty far out there. And John said, I bet I can show that Trump is a true outlier here. And I said, okay, prove it, right? I'm a statistician; I say go and show me. This is a perfect example where we can use biclustering to see whether Trump's inauguration speech was truly an outlier compared to the speeches of other U.S. presidents. So what we did for this data set: John did a web scrape of a repository of all of the U.S. presidential inauguration speeches and State of the Union speeches. Trump had only given one joint address to Congress, so we actually used a couple of his speeches at his convention and a couple of others to make sure the data set was balanced across all of the 45 different U.S. presidents. Then we loaded this scraped data into R and used the text mining package to convert it, eventually, to a bag-of-words type model, where we've got documents by words. Specifically, some of the things you do in text mining are: you take all of the words that were said, you convert them to lowercase, you remove white space and symbols and so forth, and one really important step is stemming words, which removes the prefixes and suffixes of words. For example, big, bigger, and biggest would all become the same word, and so would bigly, right? They would probably all be the same word. So if you make up words, they might not show up in this stemming algorithm. 'Yuge' to huge, yeah, that's another one that might not quite show up in these stemming algorithms. But you do all of this, and stop words are removed, like 'the', 'a', 'is', and so forth. After all of this processing, we are left with about 40,000 words. You can imagine that drawing and showing examples that fit on my computer for 40,000 words was probably not going to be the best, so we did a very, very simple thing, and we wanted to be as unbiased as possible: we just took the 50 words that had the most variance across all of the 45 different presidents, and then did a log transform, which is very common when you're analyzing bag-of-words models. So what we're left with is actually a really simple data matrix: we've got the 45 U.S. presidents on the rows, and each column is one of the 50 words that are the most variable across all of the presidential speeches in U.S. history. So what were some of those words? You can look here; there are some interesting examples. There are some stemming artifacts happening: for example, this is territory or territorial, or treasury here, or this is treaties. That's the stemming removing all of those suffixes to make them the same root word, essentially. But you can definitely see some words that I think might show some differences; more on those in a moment, but first, here is a rough sketch of the preprocessing pipeline just described.
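This is roughly what that pipeline might look like with the tm and SnowballC packages in R; the `speech_texts` object is a hypothetical stand-in for the scraped speeches, and the exact cleaning steps used for the actual data set may differ.

```r
library(tm)
library(SnowballC)

# Hypothetical stand-in for the scraped speeches: one string per president
speech_texts <- c(washington = "Among the vicissitudes incident to life ...",
                  lincoln    = "With malice toward none, with charity for all ...",
                  trump      = "We will make America great again ...")

corpus <- VCorpus(VectorSource(speech_texts))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # drop "the", "a", "is", ...
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, stemDocument)                       # big, bigger, biggest -> "big"

counts <- as.matrix(DocumentTermMatrix(corpus))              # documents x terms counts

# Keep the 50 highest-variance terms across documents, then log-transform
keep <- order(apply(counts, 2, var), decreasing = TRUE)[seq_len(min(50, ncol(counts)))]
X <- log1p(counts[, keep])
```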
Soviet, for example, might pop up in different ways. And the goal here is to understand how these U.S. presidents, based on their speeches, group together based on only these 50 words. Okay, so let us see this. Let's first look at clustering on this data set. This might take a tiny bit to load and run; part of the slowness is actually the ggplot2 plotting here. Yeah, can I take a quick question first? Yeah, did the words just have the most variance? We did something super simple and said, which words have the highest variance? That's right, they're the frequencies: each word was counted by how many times it was said in that president's speeches. So it's basically the words that are said a lot by some presidents and not a lot by others, and we were thinking this would be very illustrative of the words that are going to separate presidents and group them together. Okay, so what you're looking at here is a clustering example. Again, we had 50 words here, and unfortunately we cannot plot in 50-dimensional space yet, but I'm sure Hadley will work on that soon. I got a Hadley joke in there. So instead we are projecting this down to a lower-dimensional space using principal components. We're looking at principal components one and two, and these are the positions of the presidents, in black. Here's Donald Trump right here, here's Harding, here's Harry Truman, and so forth. In red we're going to be plotting the convex clustering paths, and here's the dendrogram associated with this. Let's go ahead and start the movie. Okay, we see some points starting to come together; no fusions have occurred quite yet. A couple of fusions are happening, moving up. And you see very clearly it's starting to happen: we've got these two groups right here, and then it takes a while for these two groups to come together. So let's pause right here and look at these two groups. If we look at this group, it's, say, Kennedy, Bush, Clinton, Nixon, Roosevelt, Eisenhower, Truman, Carter. Who are all these presidents? Yeah, it's a time thing. These are all post-World War II presidents, or during World War II in the case of Roosevelt. And these over here are all pre-World War II. So even a very cursory text mining example has clearly found a split along time, and we see these splits very clearly here. Interestingly, we can also plot, for example, principal components two and three instead of one and two; you can plot different principal components, and again we're just projecting the solution paths onto these principal components so you can visualize them in many different ways. I find one and two are perhaps a little more compelling here. And let's take a gander at the static version. So starting out, Donald Trump has actually already merged with some of the modern presidents in this example, but we're going to revisit this later. Jimmy Carter and Truman have not, interestingly. Harding is off on his own, and so is Calvin Coolidge. Interesting. So let's see a couple of these others. If we go to seven clusters, you'll see Jimmy Carter is still an outlier and so is Warren Harding. Keep going, and now Wilson and Hoover are outliers, Harding is an outlier, and we've got two kind of interesting groups here. In the purple group we have Teddy Roosevelt, Taft, Coolidge, Polk, McKinley, Cleveland.
I'm not sure you would recognize these names unless you've recently taken a U.S. history class, but these are basically the presidents between the Civil War and World War I. So here's this whole group of presidents, and these over here were all pre-Civil War presidents. Eventually we're going to see that Harding is still an interesting outlier, and when I saw this I thought, oh my goodness, who was Harding? And Harding, yes, he was involved in many scandals, he was in the 20s, but the reason his speeches were such an outlier is that he didn't want all the pomp and circumstance of an inauguration speech: he got up and just gave a five-minute speech. So he basically just didn't talk that much, and this clearly comes out in this data set. We see these examples here. And of course, eventually, you see very clearly these modern presidents versus the older presidents. I shouldn't say older; earlier in history, there we go. But I think what's more interesting in this example is the biclustering of these presidents. So let's go ahead and run this biclustering. Whoops, I don't want to give away the juice here. Okay, so what you're looking at here: I'm going to give you a quick moment to orient yourself. These presidents here, from Teddy Roosevelt to John Adams, you can imagine this is our group of presidents that are pre-World War II up here. Here we're starting with Truman, Carter, Bush, and at the very bottom right here we have Donald Trump. And here are some words. Let's watch the movie and see how it plays out, and then come back and think about some of these words. Okay, so far no fusions; then a couple of fusions happen right away. So there are some words that these older presidents (older in history, sorry) are not saying much at all, and those merge. As we play, we see lots more merges happen, and suddenly we're left with those two very clear groups. If we just pause right here, this is interesting, because these are basically all the words that modern presidents say a lot more than the earlier presidents, and vice versa. So if we examine some of these words: definitely women, nuclear, Soviet, ballistic, billion, million. I mean, these are words you probably would not have heard the founding fathers say, right? So what are some of the words those earlier presidents said? They're not quite all founding fathers, they're pre-World War II. Let's see, we've got shall, territories, provisions, treasury, treaties, Indian, naval, island, vessel, Spain, Mexico, and so on. Remember, the U.S. spent a long period of time annexing a lot of land from Spain and Mexico, and this came up a lot in some of the early speeches, so those words show up. And we can finish the movie; you know what happens: eventually everybody merges into one big group at the end. Okay, cool. So if we go back here, I want you to look very closely at the very bottom. I mentioned that Donald Trump is at the very bottom here, and this is an example of why doing joint clustering, or biclustering, of the rows and the columns together is so important. Because when we do this, notice the dendrogram here: Donald Trump is the last president joined to the group of modern presidents. And what words did Donald Trump say that were different? He said 'get' a lot, and 'job'. Okay, get, job, yes.
He said America a lot, yes, we know, make America great. And the other big word he said a lot was Mexico, and I think we all know the reasons why that was said a lot. So here are the outliers again. Now I'm going to play this movie again, and perhaps just watch those modern presidents and those words and when they start merging; I think you're going to see some interesting groups there. But interestingly enough, Trump is still more similar to the modern presidents than he is to the founding fathers, right? We would expect this; we see them all grouped together there. So anyway, this is just a really fun example of text mining and of what you can do with these interactive visualizations. Actually, I don't do text mining in most of my day-to-day work; I tend to work on genetics and neuroscience. And we're going to be using these interactive visualizations, with examples like this, to work with scientists to help them understand how their genetics data is forming groups for precision medicine purposes. So this is actually how I'm going to be using this in science. A quick summary: I've introduced convex clustering and biclustering, and hopefully I've demonstrated that these do have some significant advantages, mathematically and statistically. We also developed a very fast algorithm to compute the clustering solution path, and this is really important because without that fast algorithm, there's no way we could have built those interactive visualizations on the Shiny platform. And we also developed these R plus Shiny interactive visualizations. So please check my website for, hopefully coming soon, a clustRviz package that will have these tools available and will of course also be posted on CRAN. Some quick acknowledgments: again, most of this, especially the programming and the Shiny aspects, was the work of my PhD student, John Nagorski, who really did an excellent job on all of this. And we have a couple of references for you as well. Thank you very much.