Well, thanks, Irene. It's really great to be here. As Irene hinted at, I'm going to be talking about prediction today. So the question I'll first ask myself, rhetorically, is: why am I talking about prediction? We're, of course, at a visualization conference, and if you had asked me a few years ago whether I'd be talking about prediction, I probably would have said no. When I got into visualization research 10 years ago, one of the things that excited me about it was this ability to empower humans to use our innate perceptual abilities to find patterns where machines just couldn't. We are really good at detecting patterns in data visualizations, and that's why we're here. But at the same time, with big data like Danielle was talking about and other issues, there's a growing need to start leveraging computational techniques to help us make sense of larger and larger amounts of data, with many, many more dimensions than we can even visualize and explore. The other reason comes from a recent project of mine from a few years ago. I work on a health care team with a lot of data; we partner with a bunch of different hospitals and health care institutions. We have a visualization here, kind of like the Napoleon march chart, where we're seeing a bunch of patients and the different paths they went down. Some of these patients become healthy; some of these patients die. So we're able to see that the green paths are good and the red paths are bad, and so on. When we show this to data scientists, some of them really love it and are able to explore it and find insights. But a lot of them ask: OK, well, the next time I have a patient, which path is he going to go down?
These types of questions have really motivated some of my research directions for the last year or so. But before I talk more about medical data, where I spend most of my day, I thought I'd focus on an even more important problem we have at hand right here. The organizers have done a fabulous job putting together a great talk schedule, and I thought I would use that as an illustration of how we can maybe use prediction to understand what's going on. As we've seen today, we've had seven great talks that have all been really interesting and very diverse, and I'm sure we're going to have a few more talks after me that are going to be equally great. But of course, when the organizers were putting together the schedule, they had to choose who to put in each slot, right? For some reason they chose me, which I'm delighted about. But they probably made that choice a bit naively; they had to do some guessing. Will I do a good job? I'm not sure they really used a data-driven approach to decide, right? But of course, we have lots of data about all these speakers, including myself. We know whether they spoke at OpenVis Conf before. I think I found a bug in this data: Dominikus mentioned he was here a few years ago, so we should give him a yes, too. That makes it more like real-world data, right? Lots of bugs in it. We have other data, too. Did they mention the word "research" in their bio or title? By that measure some of these people are researchers, and others are practitioners. Some of them chose to use black-and-white photographs; others chose color photographs. A mostly colorful group today. There's the first initial of their first name: some fall in the second half of the alphabet, some in the first. Most people here are early in the alphabet. And are they talking before lunch or after lunch?
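As a toy illustration, those speaker attributes could be turned into a feature vector; all of the field names and values below are made up for the example:

```python
# Hypothetical encoding of the speaker attributes mentioned above as
# binary features; every field name and value here is invented.
def encode(speaker):
    return [
        int(speaker["spoke_at_openvis_before"]),
        int("research" in speaker["bio"].lower()),
        int(speaker["photo_is_color"]),
        int(speaker["first_initial"].upper() > "M"),  # second half of alphabet
        int(speaker["slot"] == "after_lunch"),
    ]

example = {
    "spoke_at_openvis_before": False,
    "bio": "Research scientist working on visualization",
    "photo_is_color": True,
    "first_initial": "A",
    "slot": "after_lunch",
}
features = encode(example)  # → [0, 1, 1, 0, 1]
```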
We can also compute things about the temporal sequence in which they've been scheduled. So we can start to use all this data to make a prediction about whether I'll actually be a good speaker. Am I going to ruin the day? Or am I going to make sure I don't screw things up too badly? I'll use that as an illustration of how these techniques work, and then I'll get back to some real examples with real data later. I'm going to simplify the spirit of predictive modeling so I can go over it in about five minutes, so bear with some simplifications here, but you can really think of it as a five-step process. The first step is what we sometimes call cohort construction. "Cohort" is simply a fancy word for a population or group of individuals. Here we have a bunch of speakers: some of them are good speakers, and some are bad speakers. Just to make it easier, all the good speakers have green circles around them, and all the bad speakers have red circles around them. Of course, we need techniques to easily separate these into two different groups. We can write some fancy SQL, or whatever, and divide them into the cases, the good speakers we want to be able to predict, and the controls, the bad speakers we want to avoid. Later I'll talk about some techniques where we can actually use visualization to help build these cohorts, because it gets really tricky once you're doing more temporal-based querying. I'll show that a little bit later. OK, but after we have these cohorts, we now need to construct features from them. Here I'm just going back to those features I've already extracted and talked about. Ideally you'd have these features and maybe hundreds or thousands or even a million more features about these people: all the data we could grab.
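To make that first step concrete before moving on, cohort construction might look like this in code; the records and the `good_speaker` label are hypothetical stand-ins for whatever ground truth you have:

```python
# Hypothetical cohort construction: separate a population into cases
# (good speakers, the outcome we want to predict) and controls
# (bad speakers). All records here are made up for illustration.
speakers = [
    {"name": "A", "good_speaker": True},
    {"name": "B", "good_speaker": False},
    {"name": "C", "good_speaker": True},
    {"name": "D", "good_speaker": False},
]

cases = [s for s in speakers if s["good_speaker"]]
controls = [s for s in speakers if not s["good_speaker"]]
```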
Not all of these features will be meaningful, and the feature selection process later in the pipeline will deal with that. But nonetheless, we have to get our features into the system first. Danielle talked about some different ways you can aggregate data and compute features, but essentially we need to get the features constructed, and visualization can help there. I'm not going to talk too much about this today, because it's not very specific to prediction, but it is a very important challenge. The next step is cross-validation. Because we only have a finite set of speakers, we want to compute on the speakers we already have and still be able to predict whether new speakers are going to be good or not. We don't want to spend all our data on training and have no speakers left to actually evaluate on. So we're going to use cross-validation to make sure our predictions actually work. What we're going to do is divide these speakers into training sets and evaluation sets. Here's one way to do it. But if we want to make sure our prediction really works well, we want to do a bunch of different permutations: split up the training set and the evaluation set in different ways. Doing this split so that there's good coverage in both the training set and the evaluation set is called cross-validation. Here we have three cross-validation folds; usually you'd use a bigger number than that, and I'll show some examples of this later on. OK, so once we have all these different folds, we can start doing the interesting stuff: the feature selection part. Again, here we have a bunch of features about each of these people, and some of these are going to be informative for the prediction.
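That fold-splitting can be sketched with `KFold` from scikit-learn, one of the many open-source toolkits for this; three folds here to match the illustration, though, as I said, you'd usually use more:

```python
from sklearn.model_selection import KFold

# Nine speakers, indexed 0..8; each fold holds out a different third
# for evaluation and trains on the remaining two thirds.
speakers = list(range(9))

kf = KFold(n_splits=3, shuffle=True, random_state=0)
folds = [(list(train), list(evaluate)) for train, evaluate in kf.split(speakers)]
```

Every speaker lands in exactly one evaluation set, and the training and evaluation sets of a fold never overlap.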
Some of them are going to be non-informative features, where there's no difference between the people that are good and the people that are bad. We'll have some algorithms, which I'll talk a little bit about later, for getting rid of the non-informative features. Finally, once we have this reduced set of features that actually are predictive, we can go back to our evaluation set and make a prediction using that model. For each of these people, we predict whether they're going to do badly or well. And since we're using data we already know about, we can also compare it to the ground truth and see how well we did. This prediction model did OK, right? It missed one. But this is how the process works. I just want to say, before going on, that there are a lot of open source implementations of the algorithms I'm talking about today. I've put a handful of the toolkits I typically use up here, but there are tons more; I'd be happy to talk more about that later as well. OK, but so far I've just been talking about prediction. What about visualization? One of my goals is to build an ecosystem of tools that touch all the different parts of the predictive modeling pipeline and its output. But due to time limits, and just what I've been able to do so far, I'm only going to talk about two tools that tackle different areas. I'll show you how they tie together, but you'll notice there are some gaps. The first project is in collaboration with Josua Krause at NYU. The second one is also with Josua, along with Enrico Bertini at NYU. And without further ado, I'm going to talk about a system called Coquito, which is an acronym for, basically, cohort queries.
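Before getting into the tools, here's the tail of that pipeline, feature selection and then prediction against held-out ground truth, sketched end to end on synthetic data. The particular choices here (`SelectKBest`, logistic regression, the made-up data) are stand-ins for illustration, not a prescription:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# 100 people, 20 features; only feature 0 carries signal, the rest are noise.
y = rng.integers(0, 2, size=100)
X = rng.normal(size=(100, 20))
X[:, 0] += 3 * y  # the informative feature tracks the label

# A single train/evaluation split (in practice, one per fold).
X_train, X_eval = X[:80], X[80:]
y_train, y_eval = y[:80], y[80:]

# Feature selection: keep the 5 features most associated with the label.
selector = SelectKBest(f_classif, k=5).fit(X_train, y_train)

# Fit a classifier on the reduced training set, predict the held-out people.
model = LogisticRegression().fit(selector.transform(X_train), y_train)
predictions = model.predict(selector.transform(X_eval))

# Compare predictions against the ground truth we already know.
accuracy = (predictions == y_eval).mean()
```

The selector reliably keeps the one informative feature, and accuracy on the held-out set tells you how well the model generalizes.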
You can think of this system as a visual query builder. It's not really new in that sense, because there have been tons of visual query systems around for a long time. But what I thought made it necessary to build another one is that when you're dealing with prediction, you're usually dealing with temporal data, where you have a bunch of time-stamped events: if you have an event yesterday and an event today and you want to predict tomorrow, there's inherently temporal data in there. And writing temporal queries is usually very challenging and very slow. So we tried to build a system that would let us do this interactively, fast, and visually, so we can explore what's going on. I'm going to walk you through a simple scenario of how this tool works, which returns to the medical data I was alluding to earlier. The way the system works is, essentially, here's an empty query. We have a couple hundred thousand patients, and right now all of the patients are in the result; we haven't added any constraints yet, so every patient we have ends up in the results. The goal is to keep adding new constraints until we find the population that we want; here, that means defining the case cohort, the patients we want to be able to predict for. We also have tree maps to show all the different types of features, which are organized hierarchically, with a lot more features off screen. Then we have some demographic information about who these patients are: their gender breakdown, their age, and so on. Let's go ahead and start adding things to the query.
I'm going to do a search for diabetes, so we're going to be looking at diabetic patients today, and the goal will be to predict whether a patient is going to have diabetes or not. Predictive modeling is really popular in the healthcare domain; there's all this data, and the goal is to figure out whether there's anything we can do with it to help improve patient outcomes. So I searched for patients with diabetes, and a few seconds later I get back the patients that actually meet that condition. Here you can see we went from 200,000 patients to about 26,000 patients, so we've already reduced the set by almost a factor of 10. This is data from a hospital we collaborate with, a small data dump of it; usually the data is much bigger, but nonetheless we have the ability to instantly query the patients that meet the conditions we want. Then we can click on this node and start to see more information about the patients that went down this path. All the tree maps update to show the events that typically occur after this event in those patients, which gives us hints about what we might want to add to the query later, and of course the gender and other demographics update as well. You can zoom into all these different tree maps and select different types of events. But what I'm actually going to focus on today is searching for a common side effect of diabetes, proteinuria. I'm not going to get into the medical details of what that is, but it happens in a lot of diabetic patients, and the risk of kidney failure is very high once you have it. So we're going to search for patients that have diabetes and then proteinuria. And then we're going to add another constraint after that: those patients that actually ended up with renal failure after that.
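The logic behind this kind of sequential query (diabetes, then proteinuria, then renal failure) can be sketched naively in pandas on a toy event table. This is just the idea, not Coquito's optimized backend, and all the data is made up:

```python
import pandas as pd

# Toy event log: one row per (patient, event, time).
events = pd.DataFrame({
    "patient": [1, 1, 1, 2, 2, 3, 3, 3],
    "event":   ["diabetes", "proteinuria", "renal_failure",
                "diabetes", "renal_failure",
                "diabetes", "proteinuria", "renal_failure"],
    "time":    [1, 2, 3, 1, 2, 5, 4, 6],
})

def first_time(name):
    # Earliest occurrence of a given event for each patient.
    return events[events["event"] == name].groupby("patient")["time"].min()

d, p, r = first_time("diabetes"), first_time("proteinuria"), first_time("renal_failure")

# Patients whose events happened in the required order:
# diabetes, then proteinuria, then renal failure.
matches = [pt for pt in d.index
           if pt in p.index and pt in r.index and d[pt] < p[pt] < r[pt]]
```

Note that patient 3 has all three events but in the wrong order, so the sequence constraint excludes them; that ordering is exactly what makes temporal queries harder than plain filters.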
Again, you can do this interactively, search for what you want, and instantly generate these queries. I'm showing you relatively simple queries here; when we do these things in real situations, you usually have 30 or 40 constraints like this. Anyway, I'm going to click this renal failure node, the kidney failure node, and we see all the information update again. Now we're down to only 136 patients. So we're shrinking and shrinking the data set, but we're finding the patients that actually fit the profile we want, and it's keeping the sequence of events in mind. But we don't only want patients that have these problems; we also want to see whether they were treated. So I'm going to search for a few different treatments and add them to the query too, just to show you how branching works: you're not constrained to a single linear sequence, you can have branches. I'm searching for patients with hemodialysis, and now I'm searching for patients with peritoneal dialysis, and I'm going to add that to the query too. So you can make "or" statements like that as well. Again, nothing too new visually here, but we spent a lot of time optimizing the backend queries so you can actually do this in real time, which is challenging, and also just giving experts the ability to query things. OK, so we did it: we ended up with 26 patients. 25 went down this path, two went down this path, so one guy must have gone down both paths and had both types of treatments. And again, we can click and start examining who those patients are. OK, so that's half the problem, right? We defined our cases, but now we also have to define our controls: the patients that will have different features than these guys.
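As an aside, the "or" branch from the treatments step can be added to the same toy sketch: starting from a made-up event table, we look for renal failure followed by either treatment (illustrative logic only, not the real backend):

```python
import pandas as pd

# Toy event log again; all data is invented for illustration.
events = pd.DataFrame({
    "patient": [1, 1, 2, 2, 3],
    "event":   ["renal_failure", "hemodialysis",
                "renal_failure", "peritoneal_dialysis",
                "renal_failure"],
    "time":    [1, 2, 1, 3, 1],
})

def first_time(name):
    # Earliest occurrence of a given event for each patient.
    return events[events["event"] == name].groupby("patient")["time"].min()

r = first_time("renal_failure")
treatments = [first_time("hemodialysis"), first_time("peritoneal_dialysis")]

# OR branch: renal failure followed by either type of dialysis.
matches = sorted(pt for pt in r.index
                 if any(pt in t.index and r[pt] < t[pt] for t in treatments))
```

Patient 3 had renal failure but neither treatment, so the branch drops them while keeping patients treated either way.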
So we're going to add another query here and link it to our previous query, so we can make sure there's no overlap between the two groups. I'm creating a new query called "controls," and once I click OK, we again start with that same empty query over all 200,000 patients. What I'm going to do is drag the results I had before to the beginning of this query, so those two queries are linked, and now I've shrunk it down to only those 26 patients. That's good, except it's the opposite of what we want: we don't want just those patients, we want everyone but those patients. So we can drag a "not" node onto there and invert the patient set, so it's no longer the 26 but everyone except those 26. Then we start adding those same constraints again. We want these patients not to have diabetes, but we want them to be similar, so we give them the same conditions and the same types of treatments. I'm just going to quickly add all those same nodes here; you could copy and paste, but just to make it more dramatic I've re-created them all. Anyway, once we get these queries built, we can tell how many patients we have in one group versus the other, and we're going to have only 17 in this group. Now, typically when you're doing predictive modeling it's good to have balanced groups, so what we can actually do is match the smaller group to the bigger group. We'll drag the controls onto the cases to match the numbers, and there are some interesting algorithms in the back end to actually do the matching. It takes the underlying data into consideration, so you're not just taking random patients but keeping them as similar as possible. Now we have those two patient sets, and once we're ready, we can deliver this to a predictive modeling pipeline that we might have already built in the back end. We send it there, and on the right-hand side of the screen we get an accuracy score. I'll talk about that accuracy stuff later, but clicking it would take us to the next tool I'm going to show. So that's the cohort part of the process, which again is more generic than predictive modeling, but I think it's a really important step: the worse the cohorts are, the worse the predictive model is going to be, so it's really important to get that right, and being able to explore it visually really helps. The next tool I'm going to show you is about interactive feature selection. This is important because this is the part of the pipeline that removes non-informative features and tries to keep only the features that really relate to the prediction; if you do this step wrong, it's going to be very bad for the predictions. When I talk to the data scientists or medical scientists I collaborate with and ask them which feature selection algorithms they use, they start naming all these different algorithms; each one has their own favorite, and there's lots of interesting research in the data mining community producing these. But if I ask them why they use the one they do, they generally don't have a good answer. They're familiar with it, or it's worked well in the past, so they just keep using it. So the goal of this project was to help them understand the strengths and weaknesses of those algorithms, to ultimately improve the predictions. The same can be said about which classification algorithm to use: again, people usually use what they're used to or what they've heard of, but they don't actually do rigorous testing of all of them. That's what this tool does; it lets us see a lot of things at once. So here we have
another predictive modeling pipeline, again dealing with patients with diabetes. We're going to have 10 cross-validation folds, because that's a popular number in the literature; again, we could have a visualization tool to help us choose the optimal number, but let's choose 10 for now. And we're actually going to try out a bunch of different feature selection and classification algorithms: four popular feature selection algorithms and four popular classification algorithms, just to see what happens and whether there's much agreement or disagreement between the approaches, because you would think that no matter which one you use, you'd get similar results. So if we're dividing the population into 10 different folds and trying four different feature selection algorithms, that means for each feature in our dataset, which can have thousands of features, we're going to get 40 different feature rankings. How can we show that in an easy-to-digest way? We'll start with our favorite, the bar chart: the higher the black bar, the more important the feature. But of course we don't have just one ranking; we have 10 different rankings, one per fold, and instead of one feature selection algorithm we're trying four, so we have 40. Just to remind you: highly relevant features are more black; less relevant features have an absence of black. We did some experiments trying to show lots of these at once, and found an effective technique for compressing this data into a small screen. We usually have hundreds of features coming out of our feature selection algorithms, so we need to be able to show a lot. We did a simple trick where we transformed this into a circular glyph: we bent it into a circle, with the bars coming inwards, so the longer the bar coming inwards, the more important the feature. This worked well for the number four, because you can easily divide a circle, even in your head, into four quarters. Each quarter represents a different feature selection algorithm, and within each quarter, each slice represents a different cross-validation fold. You'd expect the slices to be consistent across folds if things are working correctly, and if all the algorithms think a feature is important, you get close to a solid black circle. But here, for whatever reason, the bottom two algorithms like this feature a lot better than the top two do. When you show a bunch of features at once, you get something like this, and even at the zoomed-out stage you can start to notice interesting patterns. You'll see a lot of these half-moon shapes, where some algorithms like a feature and others don't, a kind of half-moon competition over who ranks it higher. When we showed this to the data scientists, they were very surprised that different algorithms will give you totally different sets of features, and this really helped us not only debug the predictive models but also improve them. Each feature gives you a different shape, and you can start learning a little bit about them. This interactive environment lets you do lots of things, like transforming the glyphs into scatter plots, so you can see features that are picked by all of the algorithms, or ones that rank average, and you can dig into the more interesting features. But that's only half the story, right? We talked about four feature selection algorithms, but we also want to try different classification algorithms. The classification step is what's actually going to give us the score of how
accurate those predictions are, but we don't know which algorithm to use for that either. So again we're going to try out four popular ones, and we're going to use simple bar charts to present all that information. Each row is a feature selection algorithm, so the top features that came out of each algorithm get a row, and each column is a different classification algorithm; it's a compact view of a bunch of different models, if you will. We're basically showing 160 different data points here. Each bar's height shows how accurate that model is: the higher the bar, the more accurate. We get one bar for each fold, and then the overall score here. These models aren't doing very well, but nonetheless we have that information there, and this is fully interactive, so you can select any part of it and it lights up the corresponding features. The other thing that motivated us from the beginning is: OK, we're relying on all these algorithms to choose the optimal set of features, but we have domain experts who might actually know a little bit more about this data than the algorithms do. Could we inject the human into the experience as well? Yes, we can review the results these algorithms produce, but could we also let the domain expert input their expertise? We've started playing around with some new features that let us do that. If we're particularly interested in one particular model that was computed, we can select it, and it highlights all the features it selected. This is a simple example, but if I don't like certain features, I can unclick them and take them out of the model; I can also start selecting, in the visualization, more features that I care about. Once I have a set of features that I like (of course, you'd do this much more in depth than I'm showing), you click this "evaluate model" button, and you see it add this whole new blue row. The blue row is the new model we constructed, and we can compare whether it does better or worse than the models that were built automatically by the algorithms. So that really brings us to the end of what I wanted to talk about. I just wanted to bring to this community that a lot of data scientists really love to visually explore their data, but I've been noticing more and more that a lot of them also want to do something with prediction on their data, and then they don't know how to correctly interpret these predictive models. Visualization can really help. I think it's a really interesting, active research area, and I hope to see this visualization community take it on even more. But with that, I'll take any questions. Thank you.