So, I'm pleased to introduce next Trang Le, a postdoc here at the University of Pennsylvania, and Trang will be speaking about a package to build interpretable decision tree visualizations. Thank you, Trang.

Thank you, Stefan. Great. So, my name is Trang. I'm a postdoc with Jason Moore at Penn, and I'm excited to share with everyone the new package treeheatr for visualizing decision trees. This is the first package I've ever written for visualization, so I'm really looking forward to a lot of feedback from everyone. I just tweeted out a link to the slide deck, so if you want to follow along, feel free to do it there.

Okay, so the tagline for this package reads: your decision tree may be cool, but what if I told you you can make it hot? And you do that by incorporating a heatmap into your decision tree. The idea really comes from looking at heatmaps, which I've been doing a lot, and I thought: instead of grouping the samples and the features with hierarchical clustering, what if we do that grouping using a decision tree? So really the idea comes from the heatmap side, not the decision tree side. But here we are.

Before we go any further, most of you probably already know this about decision trees, especially if you attended the tidymodels workshop yesterday by Alison Hill: decision trees are essentially very simple tree-based machine learning models that predict labels from features.

Okay, so before we get into what treeheatr actually does, let me walk you through a few tools we currently have in R for drawing decision trees. The first is rpart.plot.
You have probably seen this earlier in Daniela Witten's talk, where she showed a decision tree where each node carries a condition: if your sample satisfies the condition, you go left; if not, you go right; and eventually you arrive at a terminal node, or leaf node, that assigns a label to your sample. I think the shading here also shows how confident the model is in predicting that label.

The visNetwork package is also really cool for drawing all kinds of network structures, but especially trees, and there are some decision trees in this case. On the left you see what is actually a regression tree; in this case, I think they're trying to predict petal length from the iris dataset.

The plot function from partykit is great because not only does it show you the feature name, it also shows the corresponding p-value, which indicates how important that feature is in predicting the final outcome. At the leaf nodes here, it's a bit hard to see, but it shows you the number of samples in that leaf and the corresponding error. You can even draw histograms at the leaf nodes. And ggparty is another package that goes one step further and allows you to draw, say, histograms or density plots at the inner nodes and scatter plots at the leaf nodes.

I also want to give a mention to a Python library called dtreeviz, which makes really nice plots. I really like the idea here of stacked histograms at the inner nodes and pie charts at the leaf nodes. What's important is that with these pie charts you can see roughly how many samples are in each leaf node and how accurate the predictions are.

Okay, so what's different in treeheatr? This is a decision tree model predicting penguin species from everyone's most beloved palmerpenguins dataset.
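The existing workflows mentioned above might look roughly like the following sketch. This is not code from the talk; it assumes the `rpart`, `rpart.plot`, and `partykit` packages are installed, and uses the built-in iris data as a stand-in example.

```r
library(rpart)       # classic recursive-partitioning trees
library(rpart.plot)  # shaded node plots
library(partykit)    # party objects and their plot method

# Fit a simple classification tree on iris
fit <- rpart(Species ~ ., data = iris)

# rpart.plot: nodes shaded by predicted class and purity
rpart.plot(fit)

# partykit: convert to a party object, then plot with
# sample counts at the terminal nodes
plot(as.party(fit))
```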
At the leaf nodes here, we see the heatmap: on the rows we have the features, and each of the very, very thin columns you see is one sample, one penguin. Some of the most important information in this visualization can be seen immediately. For example, you can see at a glance how big each of these leaf nodes is and where the misclassifications are. And some more specific things too: the Gentoo penguins in general look like they have a bigger flipper length and a bigger culmen, or bill, and they live on Biscoe Island. Those are the kinds of things you can draw from this type of visualization.

The core function of treeheatr is called heat_tree. The main argument can be a party object or a partynode object. What that means is that you compute your tree using, say, the partykit package, or you can use rpart to compute the tree and then convert it into a party object with as.party. You can also use partynode to define a manual tree, so you literally draw it yourself; I'll show you an example in a little bit. Or you can just feed it a data frame, and in that case you need to supply a target label so that treeheatr can automatically compute the conditional inference tree for you, predicting that particular target outcome.

And because this is R/Medicine, I wanted to apply treeheatr to a clinical dataset. This dataset contains 351 blood samples from patients who were admitted to Tongji Hospital in Wuhan, China from January 10th to February 18th, and the task is to predict whether they survived COVID-19. Three features were selected by their importance scores from an XGBoost model: lactate dehydrogenase, or LDH; the lymphocyte level; and high-sensitivity C-reactive protein, or hs-CRP.
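The input types described above might look roughly like this sketch (assuming the `treeheatr` and `partykit` packages as described in the talk; the exact defaults are in the package documentation):

```r
library(treeheatr)
library(partykit)

# 1. Pass in a tree you fit yourself, e.g. a conditional
#    inference tree from partykit
ct <- ctree(Species ~ ., data = iris)
heat_tree(ct)

# 2. Or an rpart tree, converted to a party object first
library(rpart)
heat_tree(as.party(rpart(Species ~ ., data = iris)))

# 3. Or just a data frame plus the name of the outcome column;
#    treeheatr then fits the conditional inference tree for you
heat_tree(iris, target_lab = "Species")
```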
So what you do is really just put in the cleaned data frame, in tidy format, and say: okay, I want to predict the outcome. And this is your result. Once again, we see that these three variables are very important in the tree itself. But looking at the heatmap, you can already see that, wow, it looks like the patients who did not survive had very high levels of LDH and CRP in their blood samples. That's something that immediately jumps out at you. Oh, and I forgot to mention that each of these leaf nodes is labeled based on the majority vote and colored to correlate with the true outcome. Within the heatmap, the lighter the color, the higher the value; there's some scaling happening here, and I can get to that later in the questions.

But say you don't want the heatmap, you really just want a tree; you can get that as well, just by setting feats = NA. And there are many, many other options you can explore in the vignette.

Lastly, I just want to show this. It looks a little insane, but it's just a way to draw out your own decision tree, using partykit syntax. You can say: put this node next to that one, this one is the kid of that one on the right, and so on, and so build your own custom tree, with some statistics that you can bring along with it.

Okay, so I invite you to browse the vignette. There's a lot you can do: adjust all these nodes if you want to add more meaningful statistics, such as p-values; add more labeling and legends, or show no outcome at all and literally just draw the tree; choose whatever features you want; apply it to regression data. All of that is in there. So that's all I have.
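The manually drawn tree described above uses partykit's partynode/partysplit constructors. A minimal sketch, assuming the partykit API (the split variable, break point, and node ids here are made-up illustration values, not the ones from the talk's slide):

```r
library(partykit)

# A hand-built tree: node 1 splits on the first predictor
# column at 0.5; nodes 2 and 3 are its terminal kids
custom <- partynode(1L,
  split = partysplit(1L, breaks = 0.5),
  kids  = list(partynode(2L), partynode(3L)))

print(custom)  # shows the node structure
```

A partynode built this way can then be combined with a data frame to form a full party object for plotting.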
I thank my boss Jason Moore for helping me throughout this and brainstorming every little piece. This is built on ggplot2, partykit, and ggparty, and I learned a lot about heatmaps from existing heatmap packages. So yeah, I'll stop here and take any questions. Thank you.

Trang, this is super neat, and I think, as Peter pointed out, it really makes these decision trees interpretable. So I think this is a great visualization. Thank you.

There are a few questions. The one with the most upvotes is: can you use this with a random forest? Okay, this is a great question; I've thought about it before too. The thing with a random forest is that you have not just one tree but many trees. So if you can find a way to average them somehow into one single tree, then yes, after that you can just put it into treeheatr and it'll draw it for you. But yeah, everyone wants random forests.

Can you control the colors of the outcome labels or the heatmap palette? Yes. If you go into the vignette, you'll see a lot of things you can do; I went over it a little quickly, but in the second example here, I believe I changed the colors of the outcome, and I think there's some changing of the heatmap colors as well. So you can do that, and if there isn't an option for something, just put up an issue and I'll be happy to take on the challenge.

That's awesome. And then Isra is asking: can we use this for omics data with large numbers of variables? It depends. I have run this with about 5,000 features and it works; it runs within two minutes or less. But so much depends on how good your decision tree is.
I have seen cases where the decision tree doesn't do very well, where the classification accuracy is maybe 70%, and the patterns don't really jump out at you; in that case, I don't know how useful it is. So it depends on how good your decision trees are. But with omics data, and really with any large-scale data, there are many things you can do: cluster the features, select the features, whether randomly or not, or downsample the number of observations. Those are things you can try.

Very cool, Trang. Thanks for an outstanding talk. Absolutely. Thanks so much, guys. Yep.
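One way to do the feature selection mentioned above: pre-filter a wide omics table down to its most variable features before handing it to treeheatr. A hypothetical sketch, not from the talk; `expr_df` and its `"outcome"` column are assumed names, and variance is just one of many possible filters.

```r
library(treeheatr)

# Keep the target column plus the n most variable features
keep_top_var <- function(df, target, n = 50) {
  feats <- setdiff(names(df), target)
  vars  <- sapply(df[feats], var, na.rm = TRUE)
  top   <- names(sort(vars, decreasing = TRUE))[seq_len(min(n, length(vars)))]
  df[c(target, top)]
}

# Fit and draw the heatmap-decorated tree on the reduced data
heat_tree(keep_top_var(expr_df, "outcome"), target_lab = "outcome")
```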