So over the past several days, and yesterday as well, you've already seen a lot of data visualizations, so the idea of visually representing your data is quite important. What we are going to do in this module is give you some deeper understanding of how data visualizations are constructed, so that you can better create them as well as appraise them. We are going to be talking about the process of encoding and decoding information in data visualizations and what that takes. We are also going to discuss how we can think more systematically about data visualizations, so they're not just graphic design problems. In fact, I come from a statistical background, so I think of them much more as visual models, because you're applying a lot of that analytic thinking to creating data visualizations. And then the last thing we're going to talk about is the concept of a visualization design space and how we can reason about the different kinds of data visualizations that exist within that design space. That last part also goes into my own research. So this is sort of an organic area, and it is influenced by how I'm thinking about this problem and how I think we should tackle it, but it will change over time, and I hope more and more people beyond me start to think about this space. So let's start with why we should even bother to visualize data in the first place. In science, the way that we usually visualize data is as a component of communication. Usually we'll have some funky research problem, or we'll have one in the future. We're going to do some really great analysis, and you've learned all of these different analysis tools that you can use over the course of the week. Then finally, once we've got something, we want to tell people about it, and that's the point when you're like, "I need to visualize my data, because I need a figure for my paper, or I need to make a poster, or talk to someone in the general public about what I've done."
So at that point people can get pretty creative, and they're like, "Well, let's visualize it, maybe you want a funky infographic." You might go back and forth a couple of times, and eventually you may ask the question: did anybody understand this visualization and what I did? And often the answer is no, probably not, right? So maybe you will then try again, maybe you will think of something different and go at it again, and you'll either get stuck in this cycle, or eventually you're like, "That's it, I give up," or you will just declare victory and move on. So there are two issues with this paradigm that I want to bring to light. The first issue is that you can actually use data visualization quite a bit in the scientific process, in the exploration stage, so that you actually understand your data and what your models are doing. That's important for checking the validity of your work, because your data can surprise you. So this happens before you even get to a communication stage; it's in fact a very different process to use visualization for exploration than to use it for communication. So here's a nice example. You can see all of these numbers changing; I know you've got the slides in your notebooks there. Just looking at these numbers, can any of you guess what this actually represents? No. Right, so this is all that data. This is a really, really fun paper that came out of Autodesk; it was called Datasaurus. What they're showing you here is all of these different representations of the data: all this underlying data actually has incredibly different structure, to the point of even representing a Tyrannosaurus rex. But the numbers you're seeing are some of the descriptive statistics that they're calculating along the way. So all of these incredibly different underlying data forms have exactly the same statistics, which you would miss if you only relied on statistical values of the data.
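As a small illustration of that point, here is a minimal sketch in Python. It does not use the Datasaurus data itself; it swaps in Anscombe's quartet, a much older and smaller example of the same idea, because the values fit on a few lines. The helper names `pearson` and `summarize` are made up for this sketch, and the rounding to one decimal place is an assumption chosen so the four summaries compare as equal.

```python
from statistics import mean, variance

# Anscombe's quartet: four small datasets with very different shapes.
# (Used here as a stand-in for the Datasaurus data discussed above.)
x_shared = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x_shared, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x_shared, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x_shared, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(xs, ys):
    """Pearson correlation coefficient, computed from its definition."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def summarize(xs, ys):
    """Descriptive statistics, rounded to one decimal place."""
    return {
        "mean_x": round(mean(xs), 1),
        "mean_y": round(mean(ys), 1),
        "var_x": round(variance(xs), 1),
        "var_y": round(variance(ys), 1),
        "r": round(pearson(xs, ys), 1),
    }


summaries = {name: summarize(xs, ys) for name, (xs, ys) in quartet.items()}
largest_y = {name: max(ys) for name, (_, ys) in quartet.items()}

for name in quartet:
    print(name, summaries[name], "largest y:", largest_y[name])
```

The rounded summaries come out the same for all four datasets, while something as simple as the largest y value already differs between them. Plotting each dataset as a scatter plot makes the difference in structure obvious in a way the summary table never does.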
So I would say visualize your data in exploration, because you never know when there's a dinosaur in your data. I had a conversation with one person once who was like, "Is this like a frequentist versus Bayes thing?" and the answer is no, no one expects a Tyrannosaurus distribution. Okay, the other challenge with this paradigm, and you kind of get it when you flop around and try to find the right data visualization, is that actually picking the right data visualization for what you are doing is really, really hard, and this is true whether you are trying to visualize data for communication or for exploration. So we're going to talk a lot more about picking the right data visualizations today. There's no 100% solid answer; it's like picking the right statistical analysis to apply to your data. There are lots of different ways to do it, and what's important is to know the kinds of questions that you can ask along the way and the kinds of assumptions that the different models are making, so that you can arrive at something that is a good solution for the kind of data you have and the context it's being presented in. I also want to talk about what data visualization actually is, because it's important to set this framing; it really influences how I talk about data visualization here and how I think about it in my own research. So as much as I love Bob Ross, data visualization is not just a graphic design or an art project. Sometimes people are like, "I have a data vis, can you make it prettier? You're a designer, right?" And it's like, no, I'm actually a statistician and a bioinformatician, I come from a very quantitative background, and design is not my forte. In fact, design is sometimes the last thing you do, once you've got a functional data visualization.
A lot of people also think that data visualization is a programming language, right? So if you're a web developer, or you've gone to the New York Times, a lot of people are thinking, "Ah, JavaScript, you have to do JavaScript for data visualization." And we saw a lot of different tools that produce data visualizations for us yesterday as well, like GenGIS and Microreact. So often people assign data visualization to a particular language or a particular tool. But of course there is, like I said, the aspect of choosing the right data visualization as well, and both go hand in hand: once you've got the right data visualization, you need the right tool to actually implement what you want to do. That can be a tool like Microreact, if you determine that that's appropriate, or it can be something more customized that you make in R or in JavaScript. So in the tutorial we're going to talk about how we make a data visualization, but in this lecture we're going to talk about how we think about the right one. Okay, and actually choosing the right data visualization is more complex than just thinking about pictures. In fact, choosing the right data visualization often requires thinking about the limits of human perception and cognition. It's thinking about the limits of the computing hardware: what you can actually display on a screen, and how fast and how easily, because there's a finite number of pixels. It also involves a lot of data analysis, because we're not always just showing the data that we've got. All those phylogenetic trees are derived from genomic data; we're applying some calculation, we're not just staring at A's, C's, T's and G's and divining outbreak dynamics. So algorithmic and statistical processes are an underlying component as well. And then in addition to all of that is the visual design: the aesthetics and some of the things that people usually think about as graphic design. And there is a really nice way of summarizing this by Robert Kosara that
I really like, which is that you should be aware that data visualization consists of two components: encoding information into some sort of a graphic, and then, when the person sees it, decoding that information to try and get a sense of what is going on. So we saw yesterday, for example, when we had our phylogenetic tree, that we would color the nodes or change the color of the line, right? That is a process of taking some metadata and encoding it, representing it visually. But when we're reading the paper we are decoding that information: we're looking at the colors, we're looking at the text, to try to figure out what is going on. Both of those processes are in play and both are worth considering. So I want to give a small digression, because I think people have a sense of encoding information, since you do it that often, but human perception and cognition, this idea of decoding information, is something that's really different. You don't need to know it all in detail; there are cognitive psychologists that worry about that. But you need to know that it's something that is going on and that people are actively researching. So we're just going to take a small digression to talk a little bit more about the process of perception and cognition and how that comes into play in data visualizations. So here are two examples. The first one is a heat map, right? These are really, really common in bioinformatics, and they are also really, really hard to interpret for somebody who is colorblind. Science happens to have a lot of men, something that will hopefully change over time, and a lot of men are red-green colorblind. There are actual simulators out there that will show you what a heat map looks like to an individual that is not red-green colorblind compared to somebody that is. And what you can see is that the nice differences that we can detect when we're not red-green colorblind are totally absent to a colorblind individual. That doesn't mean
that you can't use red and green. We use them because they're very salient colors. In a heat map you might not be able to, but you can do it if you encode the information another way. So let's say you had a scatter plot and your categories are red and green: you might also use shape to redundantly indicate that these things are different, right? That would allow you to still use those colors, which are very salient and which your brain picks up on, but still be interpretable by somebody who is red-green colorblind. So that's this idea of encoding and decoding working together. There are also a lot of well-known tricks of the human perceptual system. So this is the dress; it was a really big deal maybe two years ago. I'm not sure how many of you saw it before, but just as a quick show of hands, how many of you see it as black and blue? Interesting. And how many of you see it as yellow and gold? Smaller. White and gold, sorry. Yeah. So this is also interesting: different lighting conditions and different displays will actually change the way that you see the information. For the life of me I could not see this as a white and gold dress; I've only ever seen it as black and blue. There's a lot of literature out there that takes a look at some of these perceptual effects. They might not really come into play if you're doing something as simple as a scatter plot, but as we get more complex data and visualizations become more complex as well, some of these things might start to affect our ability to correctly decode the information. One very practical way that this comes into play, that you should be aware of, is the way that you choose the colors for your data visualizations. So this is a perceptual study that was done by one of the authors who is also responsible for D3, which is this JavaScript library that everybody uses, and what they were doing here is they showed the
same information using different kinds of color scales, and they used Mechanical Turk, which is a service available through Amazon, to try to test out the error rates with different kinds of color choices. So this jet, or rainbow, color map is the default for a lot of packages, and it actually is hard for people to interpret and has really high error rates. Here, this is jet, and the error rates are quite high, right? Whereas different color maps are easier for people to see and elicit lower error rates. So ggplot just changed its default to viridis, as an example, because people make fewer mistakes when they look at information in viridis. And this is because humans have actual limits as to how they can interpret color. In fact, you often cannot differentiate between more than 12 different colors. So if you've got like 30 things in your data set and you code them all with a different color, people have a hard time telling the difference between some of the groups and make mistakes. So often what I do is I try to summarize the data, or, as we'll do in the exercise today, I'll limit the use of color when there are more than 12 categories. So that's perception and cognition. Now let's take a look at this in the context of something that is more relevant to you, which is genetic epidemiology. So here I've got a tree from a paper, and we're going to slowly walk through each of the different components of this tree. One thing that I also want to show you is that there's a lot of information in this tree that is represented as text. It's hard to see on the full thing, so I've blown it up, but they actually have a city as well as some information about the time of the infection in the tree. This is MERS, which was in the Middle East a little while ago. Okay, so they've got data on individual cases and genomic
sequences. So they've got this location information, they've got dates, they've derived some virus clade information, and of course they've got the sequencing data. What they've done with this data is they have actually derived some new information. You don't just look at A's, C's, T's and G's; again, you actually calculate a phylogeny, and from the phylogeny they have derived these clades. So they've done some transformations to their original data, some analysis. Okay, so now for their visual mapping: they've chosen to represent this data as a phylogenetic tree. Each case is a leaf node, and the new data for this outbreak is shown in the color red, so you can see very easily what's in this outbreak versus what was old data. The timing of the cases is in text, the city of the cases is in text, but the clades are delineated here on the side with these sorts of colored lines. So this is the way they've chosen to visually represent and encode different kinds of information from what was a spreadsheet as well as A's, C's, T's and G's. Okay, so it's really easy to see the colors and to see where the different clades are. I think we can agree on that, because you can see the clades are here. Through the phylogenetic tree you can see the relatedness of the individual isolates, and we talked yesterday about how that can be important for interpreting an outbreak. But it's really hard to actually understand location and time: you have to read everything and think about it and form a mental map. So this is a cognitively intensive process that you are doing when you're analyzing this figure. So can we actually do better? Can we redesign this in order to get out some of that information that's in text, and maybe help us understand what's going on a little bit more? And in fact the authors did this in their own paper, which was very nice, and which is one of the reasons I like this paper as an example. So they had that tree, and then what they did in another figure was actually show a timeline as well as a
geographic map. So now all that information is taken out of text and shown in these two different chart types. Okay, so you've got the same data. You've changed your geographic data a little bit, because instead of just having cities you've got latitudes and longitudes that you're bringing in, those are the GPS coordinates, and you've got very concrete geographic boundaries because you're using a map as opposed to text. The data mapping is the same: you're taking this genomic data and converting it to a phylogeny, and from the phylogeny you're deriving clades. So this is where they've made really nice and intelligent choices about some of the visual mapping that make it easier for you to understand what's going on. Each of these different points is one of the isolates in their tree, and they've actually colored each of these points with the same colors as the clades. So if you wanted to go back to the tree and see that genetic relatedness, you actually can, because they use the same colors. That's a nice way of linking information that makes it easier to understand what's going on across these different chart types. Okay, they've also got the timeline now. So they're not just relying on the phylogenetic tree as a visual: they've chosen the visual of a timeline, and they've also chosen the visual of a geographic map. Remember, this information was formerly encoded as text, and now we've brought it out. They've also got cities here. This is potentially confusing, because each city is a little point, so it could be confusing between this point and that point, but they've changed them to different sizes. So compared to just the phylogenetic tree, which did have a lot of this information, you can now actually see things over time, and you can now see the geography and the location information. You can't see the exact genetic relatedness, but they've been clever here by using the same color
scheme that they used in the phylogenetic tree, so you can see those clades, and you can actually go back to the tree and get the relatedness information if you wanted to. And it's a lower cognitive effort, because you're not reading as much: you're not reading text and making this mental map, it is literally shown to you. You're still thinking and reasoning about it, but one part of that cognitive processing has been offloaded, because you can see it; it's being done by your perceptual system. Okay, so that's great. We were able to do better and take a lot of that information out of the tree. But this is just one example, so how can we do better more consistently? This is the part where I'd like to talk about how we can think more systematically about data visualizations, to help you come up with some of those better designs and redesigns, as we saw going from that tree to that timeline and geography. Okay, so a small sort of disclaimer: the stuff in this section is to help you make and appraise better data visualizations, but it's an ongoing area of work. I'd like to have more concrete answers, like "always do this" and "always do that," but I don't just yet. So I'm just going to tell you how to think and reason about them a little bit more. I hope that you can use some of the stuff here to think through your own data visualizations and talk to your friends about it. I'm also actually transplanting content and research from a field of computer science called information visualization, which is where one of my advisors sits. It's a young field of study and their research is evolving as well, but the stuff I'm telling you about is grounded in the work that they have done. The color study that I showed you earlier comes from that area as well. Okay, so the first thing to think about in data visualizations is something called a design space. You've probably never heard about a design space before, but it's actually a concept that
comes from architecture. When you're designing a building, or even if you're designing a city, you're not reinventing what you have to use every single time, right? They've often cataloged different things that work, and over time they might say, "Okay, we designed things that way, over time we got enough information, this works well, this doesn't work well," and they actually have a bunch of variations for a given context that they can consider or use. Design spaces also come up a lot in computer science. There are patterns to the way that you write code, some of those patterns are more effective than others, and we teach them to software developers. It's also true in circuit design: there are ways that you can reasonably design a circuit for something in your computer, and somebody who's really great at doing that has a mental model of this design space and knows what works and what doesn't. So I want to also talk about visualizations as a design space. The idea of a design space is that for any given problem there are actually a lot of different ways that you can visualize the data. We did see that as well: different design decisions went into GenGIS than went into Microreact, and different tools show different parts of the data even if you're still trying to investigate an outbreak. So within this design space there are things that are good solutions, that's the little plus sign, things that are okay, and things that are not so great. In a perfect world, what you would like to do is actually consider a couple of different things, try them out, and then narrow in on the one that you like. Sometimes that's as simple as trying a couple of things and talking to somebody in your lab about it. So even though a design space is kind of an abstract concept, we have a notion for it already. For example, all of these chairs are technically things that you can sit in, right? They are all variations on a theme. But of course if you showed up to
the office and you had to sit in the electric chair or the baby chair, you would be very unhappy, whereas if you got the very nice office chair you'd be happy. And if you got the other chair, well, you could sit in it, but maybe you're not too happy about it. Visualizations are the same: some are absolutely inappropriate for what you're trying to do and don't tell you anything, others are exactly right, and some are okay but not perfect. Another way to think about this is that visualizations are more abstract. If you can imagine an alien coming to Earth and not knowing about our chairs, that alien might actually show up and willingly sit in a baby high chair and just be like, "Wow, humans sit uncomfortably all the time." So having an awareness that there are different options, and that they have different qualities, is something that's really, really useful when thinking about visualization design. Okay, but of course chairs are not pictures and visualizations are, so how can we apply some of this logic and reasoning and get a good sense of how we make good data visualizations? One thing that we can do is just give up, because there's no hope, and maybe we're not interested. So you're just going to produce whatever you want to produce, and some people take that approach. A common answer these days is also just to wait until AI solves the problem and tells you which visualization to use. As someone who's actively thinking about that, I can say that that's hard. Another solution is just to try something and get some feedback from people that are around you, which is actually a great first step to seeing if what you did makes sense, meaning that how you encoded the data can be consistently decoded by somebody else. And then the other thing you could do is try to think a little bit more systematically about it. The first thing to do is actually break down a data visualization from a picture into something that has a few more layers. So you
often create data visualizations for specific contexts, and the things that are needed in one context may not be appropriate for another context, so you actually need to think about that as a starting point. Different data visualizations have different data, and again that depends on the context, so you want to think about what data you're using as well as how that visualization is used and how that data is used. Then once you know why you're making something, what data you're using, and how that data is used, you can actually start to think about different kinds of visualizations that might suit that problem. Again, sometimes you're going to make your own; other times you make a choice about what software you use, because maybe the software allows you to do that analysis and also allows you to get the visual that you want. At the heart of it there are also algorithms. We're not really going to talk about that today, because it's not a computer science course, but there are also ways that you can analyze the data and summarize the data so that you get more effective visuals. Like I said, computers have a finite number of pixels and finite screen space, and if you want to talk honestly about big data in public health and medicine, you have to realize that eventually you're going to run out of pixels. You can't show everybody in the tree; at some point you need to simplify, or you're going to start to lose information, because you've got to scroll and do all these weird interactions, which are ineffective because they are cognitively expensive, since you've got to remember the shape of the tree as you move around it. And you can ask these questions about why you're designing something, what you're using, and how you're doing it throughout the design process, and then you can go back and evaluate them afterwards, either informally or formally, to figure out whether what you've got is effective. So just to summarize what I said: the why is, do you even need to visualize data, and how will you or
others use it? The what is: what kind of data is being visualized and what tasks are being performed? And the how is: how do you make the data visualization, and is it the right one? People tend to jump to the how and ignore the why and the what, so we'll go into these a little bit more deeply. This idea of breaking down a visualization this way comes from one of my advisors, Tamara Munzner. She's got this nested model, which I really like because it helps me think through data visualizations as well. The idea of the nested model is that you are going to work your way in from the outside: you start with your domain problem and you work your way down to the appropriate kind of data visualization, and your evaluation goes from the inside all the way back out. It can be an iterative process, depending on how intensely you want to do it. A useful way to think about it comes from a paradigm in software development called agile development. If you don't really know what you want, sometimes it's not a good idea to spend a week or a month just coming up with one thing and then finally showing it to your colleagues, because they might find it confusing, or they might take for granted some things that you haven't put in there or thought about. A better approach is actually to make little incremental changes. Start with something, and then try to get some feedback from people that are close to you. See what they find confusing or not, whether it's obvious or not, and from that work your way up to adding more bits of information and slowly changing your visual design, so that you've got feedback along the way. By the time you put it out to the general public, or actually put it in your paper, you've had some other people take a look at it and tell you whether it's easy to understand or not. So again, the domain problem is why data is visualized: what is the problem you're trying to tackle, and for
whom? In public health the "for whom" part is complex, because there are actually multidisciplinary decision-making teams. And again, in this big data world we're like, "Great, more data, more people, we're all going to come together, we're going to have a great time and come up with all these insights." But everybody's used to different things, everybody has different needs, and everybody understands the data differently. What you're going to put out for a nurse or a doctor, who has access to lower levels of data, is different from what you might need for a politician who might enact a policy, or a patient who has to make some kind of a decision. So actually figuring out "Who am I doing this for, fellow researchers or somebody else?" is a really great first step. I often find researchers just taking their research papers and trying to talk about them more slowly to patients and politicians, who have very, very different needs. A visualization that works great in a research context does not work great in other contexts. Being aware of that, and talking to those people during your process, will help you improve the visual design. Okay, so data and tasks. What data should be visualized? Is it actually available? We already talked about that; sometimes it's not. And then what is the data used for? These are tasks, right? The concept of task-based design, I'm not sure if any of you have ever heard of it, comes a bit more from engineering. It's the idea of designing around what people do. A great quote that emphasizes this is from Henry Ford, the inventor of the Ford automobile: he said that if he had asked people what they wanted, they would have said a faster horse. So there's no point in designing a faster horse. What people actually wanted to do was go from point A to point B more quickly, and this could be done with a genetically engineered horse, or a horse on some form of steroids, but it could also be done with a car, right? So
people don't often know what they want. You shouldn't take your expertise out of the equation, but asking them what they want to do is the more relevant question, and then you can be like, "Oh, you want to do that? I have a better solution for how you should do that." Okay, and we did a study a little while ago that looked at the kinds of data people used for different tasks in tuberculosis management and care. We were trying to see when they use different kinds of patient data or different kinds of contextual information, as well as when they use different kinds of whole genome sequencing data. We broke down their tasks into whether they were diagnostic tasks, treatment tasks, or surveillance tasks, and we did a survey with a bunch of people to figure out what information they used for different tasks. What we found is that patient data and prior patient records were really, really important for a lot of clinicians and nurses and lab folks; the whole genome sequencing data was less important. The other thing that we found that was really surprising was that there was a really good idea of the kinds of data you should use for diagnosis and treatment tasks, but there wasn't a lot of good knowledge about what kind of data you use for surveillance tasks; at least it was very, very inconsistent. So this also affects the way that you would design something. You need a lot more information about those surveillance problems, whereas maybe there's a more concrete idea about what they want to see for diagnosis and treatment, and if you want to introduce something new you really have to educate them and sort of break into that. So this is data and task collection, which is also an important part of visualization design. We used this information to design a clinical report; it wasn't a visual report, but it was a clinical report. One other thing that you should be aware of with data specifically is that you don't always have to visualize the data that
you are given. I call the data that you get raw data: that's the stuff that comes off the sequencer, that's whatever metadata you get from the epis. Often you'll have to derive new data. So if you've got case counts, maybe you want rates, right? And again, we do this all the time, where we have our nucleotide data and we're converting that into a phylogenetic tree. In this very simple example you've got two continuous variables, imports and exports. You could just show that, or you could do the calculation and show the net trade balance. The thing is, when somebody is looking at this graphic, they're going through the cognitive process of doing that subtraction, which is hard, and if your point is to show something like a trade balance, just show that. Derive the new data, and don't just show exactly what you got. There are also ways to run into trouble with this. Again, xkcd is one of my favorite comics. You also want to be careful when you're just overlaying data visually. Everybody's always saying "let the data drive" in a big data era, but I always say that you should be careful, because you never actually know if the data is really drunk, so you want to make sure it's doing the right thing. In this example, what the presenter, the stick figure man, is showing is effectively population maps: our site's users, subscribers to Martha Stewart Living, and consumers of, well, furry pornography, which is weird, and that's the point they're trying to make. What they're saying is that if you just show raw counts, what you're effectively getting is population maps. So just by visually looking at this you're inferring a correlation that is in fact irrelevant, whereas if you were to show rates, the information would look visually different. So now we're actually going to talk about the different kinds of visualizations that you could make. I would argue that you should explore whether other people have tried to tackle
this problem with the tasks and the data that you have at hand. Sometimes I hear people telling me that they do this anyway: if you're trying to publish a paper in a certain area and you're trying to think about a figure, you'll actually look at the different figures that people make in these different papers. I'll show you at the end that in my research we took that into consideration: we actually made a tool that takes a bunch of figures from different papers and lets you do this mapping more directly. But it's good that you think about what other people have done and not just start from scratch. And you can also not like what other people have done. Like, I'm not a huge fan of Kymes 3D sphere scatter plots, but they're there, and if you don't find a good solution you could implement your own. So we're going to do a small digression here to talk a little bit more about some of the ways that you could think about visualization construction that, again, go beyond just pretty pictures. I want to introduce to you the concepts of marks and channels. A mark is essentially the basic graphical building block, and marks are usually points, lines, or areas. You've already seen that, for example, on a phylogenetic tree we are changing the color of a line, right? Or we are adding points, right? Or if we're adding some different and weird shape, we are changing the properties of an area, right? Some of this has actually gone into the programming libraries and the very basic graphics paradigms of how the computer will even display the information for you, and there are, you know, interesting libraries like D3 as well that are also built upon the idea of these basic marks as units that you could represent data with. Now, these marks also have channels, which control the appearance of the marks. These are things that you visually see and start to interpret with your perceptual system, so it's where things are
oriented in space, what color they are, what shape they are, how they're tilted, how long the lines are, as well as the size of the area and whether they've got 3D properties. So a phylogenetic tree is effectively a bunch of lines of various lengths laid out in space, right? You could manipulate any of those to show your data differently, although it's not always easy to do that. Now, the channels in particular have varying effectiveness as well. There was research done on this, once again by Jeff Heer, who again was responsible for D3; Jeff is a very prolific and wide-ranging researcher. There was this old psychology study, usually just called Cleveland and McGill, where they actually asked people to interpret the same information from different ways of representing charts. So here what they're doing is just changing the positions: whether they're beside each other, on top of each other, how far apart they are, etc. Jeff Heer's group added a couple more things, for example the angle, like in pie charts, and area, so circular area as well as rectangular area, and they redid the study with Mechanical Turk. They wanted to see the error rates that people got, and it turns out that people are really bad at judging areas. So for example if you've got a really big circle and a very small circle, yes, those are different, and you can roughly guess that, but if they're kind of the same size it's really hard to tell the difference between them. The same is true for angles as well. So you may have heard people saying don't put your data in a pie chart, and you wonder why. Here we've got a pie chart and a bar chart, right? A bar chart is making use of positional information on a common scale, so everything is aligned to the same zero position, whereas a pie chart is making use more of angle and area, like the size of the wedge, and some people even argue just the length of the curved line. If you take a look at the two of them, it's really easy in the bar chart to
see the different heights. It's actually quite a bit harder to see that difference in a pie chart. Okay, so when it matters is kind of: do you want somebody to accurately and precisely understand this information and this difference, right? How do you want them to decode the information? If it matters that it's precise, then a pie chart is not the best option, because people are going to make more mistakes. If you want people to get the gist of things, and you want things to be in a compact space, because circles pack really nicely, then maybe a pie chart is okay, although ideally not one of those pie charts with infinite wedges. So there are also limits to the pie chart, right? Again, this is taking it back to what you want people to do and how they will use this, okay? And this is again bringing in the human perceptual system and how humans understand things. So bar charts: surprisingly effective. If you've ever used R and ggplot, you will also see this idea and this terminology come into play in how you directly encode information and manipulate the charts. So this is just simple ggplot code for coming up with a scatter plot. Now, instead of channels, ggplot uses the term aesthetic, which I personally like a fair bit more; the term channels comes from the infovis literature. What you've got here is x and y, which are aesthetics that define the position, which we'd already seen was one of the aesthetics we could use, and then you also have color, which encodes the class, right? So now you're making a choice: this place in space, this color. And then finally they use geoms as their way of talking about marks. You can have geom point; geom bar is basically a higher-level abstraction for a rectangle, right? And so on. It's not a perfect one-to-one mapping: there are geoms that are not necessarily marks, and there are aesthetics like group that are not necessarily changing your visual perceptual channels or anything like that, but the idea is
still baked in pretty solidly into the ggplot paradigm, okay? All right, and we saw this video yesterday as well, but of course the idea is also that it's not just about what you create with static visuals. There's also a whole design space for interaction that is quite complex as well, and there are some patterns that people have seen, like these are good interactions, these are less effective interactions. For today I'm not going to talk more about interactions, aside from saying it's possible and maybe it's something you want to consider, since I assume most people are making static graphics for their charts, but that is another dimension, okay? And lastly I just want to mention that there are algorithms. I'm sure all of you have seen the network hairball, where basically people are just telling you that they've got a lot of networked information. There are algorithms that allow you to deconvolute that by summarizing the topology of the network and representing that instead of every single point, and that's an example of an algorithm that you could use to simplify your data visualization. But I also think that's a bit outside the scope of what most of you would be doing. You should just be aware that different tools or different packages may have some of these in the back end that you can use. You could see that with Rob's code yesterday, sorry, Gang's yesterday, where you could collapse a node and just get this big triangle instead of showing everything, or you could collapse things that are a common genotype. The algorithm that detects that and does that for you is an example of an algorithmic implementation that also improves your visual design, and it's something that, as a consumer of tools, you can ask for. And then of course there's the process, once you've done all this design work, of going back out and evaluating it. So like I said, if it's something as simple as a figure for your paper, just talk to a couple of people in your lab, try it
out, go down the hall, talk to somebody who hasn't seen or doesn't know your project; that can often be extremely helpful. If you need something that's mission critical, like a patient has to understand it, a person has to understand it, you may want to consider actually running a study. This is unusual or different, I think; in public health we don't usually run studies on pictures, but you can, and you can do this quantitatively or qualitatively depending on what you want to do. So the more mission critical your vis is, the more hardcore an evaluation you might want to do. And you don't have to do just one round of evaluation. Especially in design, this idea of formative evaluations, growing it up in little pieces and getting feedback over time, can be really useful, especially if it's a complex visual. So when you're evaluating, you could think, you know, for the why stage: does this data visualization actually meet its intended need, you know, for the right people? Again, a visualization that's appropriate for research might not be appropriate for a patient or a policymaker, but they'll still use your data, because medicine is supposed to be evidence-based. You also want to ask what: are people using the right data for the data visualization? Is something missing that can be reasonably incorporated? Are they applying the data naively? Could they be deriving data, for example, instead of using counts, using rates? And then you could also evaluate the visualization itself, to take a look at whether you can understand the visual and interactive choices. And if you want to evaluate it on the algorithm, sometimes it's also useful to know that an application doesn't crash or doesn't take a thousand years to run, and that it's also reasonable, but that's a consideration that you may not use as often, because you just abandon the software if it doesn't work very well, is my guess. Going back to that study where I talked about the different kinds of data and tasks, we
actually did some more intensive evaluation, also because we wanted to use some of the data that we collected there for other studies. This was a mixed-methods study design, so if you wanted to see an example of a more intense design evaluation in practice, we've got one in this paper. Okay, so when we're thinking systematically about data visualizations, we're not just thinking about pictures. We're thinking about what the problem is that the data visualization is solving and for whom; what data should be visualized, is it available, is it the right data, what is the data used for. You should explore different kinds of data visualizations and implement your own solution if you don't find something that is great. And then we should evaluate it by looking at these multiple alternatives with different people and, if necessary, even doing a study of the data visualization to assess its effectiveness. So the last part of this is stuff that comes out of my own research, where we're trying to take a look at how we specifically might want to visualize data for genomic epidemiology. It's still research stuff, it's not gospel, but I'm hoping that it's something that you might find useful to think about in your work as well, because it's some more specific advice as opposed to just generic advice from this field of computer science that doesn't always occupy itself with genomic epidemiology. So what I'd like to talk to you about is a tool that I have been working on for the past little while called GEViT. It stands for Genomic Epidemiology Visualization Typology, and what it is is a way to systematically describe data visualizations for the purposes of analysis. What it does is take some qualitative descriptors, which I will tell you about in a moment, and organize them in such a way that you can do this analysis. A lot of people are hoping that GEViT will give people the answer of being like: I have this data, it's going to do
this thing, this is what I should do. When I originally started this I was hoping for that as well: I will analyze a lot of figures, and then I'll look at good examples, and then I'll organize them, and that'll help move people in a specific direction. This ended up being really hard, because there's an extraordinary amount of variability in the way that people visualize data. So we're not there yet, but we'd like to get to it. Right now it just provides you with an interactive gallery that allows you to view different kinds of data visualizations based on this why, what, how idea, and a way to consistently describe everything that's going on in those data visualizations. The why, what, how is trying to get at why data are being visualized, for example to show that there's transmission in a hospital; what data are being visualized, such as location, duration, outcomes; and how data are being visualized. This tool is actually available online. It's the first version of it, and I expect it may change over time, but you can input all of that to search for data visualizations based upon those why, what, how parameters. Okay, I'll briefly describe to you the visualizations specifically and how GEViT breaks them down. The first thing that it does is take a look at the different kinds of chart types that are present in a data visualization, because this is the basic building block. Those are things like a phylogenetic tree, a timeline, a geographic map. We actually found that there were several different classes of chart types that people were trying to visualize, so I'll tell you what they are. Again, the kinds of visualizations that we gathered came from an analysis of about 18,000 papers in genomic epidemiology. There's a smaller subset of papers for the manual analysis, but we still had about 850 figures in total that we analyzed to figure out what is being done in the field. So we found people often
try to visualize just common statistical charts, like bar charts, line charts, scatter plots, pie charts, Venn diagrams, timelines, and different kinds of distributions, but beyond that people tried to do more complex things. So common statistical charts are things that you can do quite easily in Excel or even Tableau, but of course we're not limiting ourselves to that in genomic epidemiology. We also found that people often try to show complex types of relational charts; those are network diagrams, social network diagrams, or minimum spanning trees, as well as different kinds of flow diagrams, so those were the chord diagrams that you may have seen in the Circos plots that are quite famous in cancer. We also saw that people were trying to show temporal information in various different ways, such as with the streamgraph, and people would rely on heat maps, that's not surprising, or density charts to also show different aspects of the data. When we were looking at how people spatially represent data, we did not find that it was limited to geographic maps or choropleth maps. We found that people also often showed interior images as well, and there's actually not a lot of great support in tools for showing interior charts. And then of course there were different kinds of trees, which is not surprising; you've seen a lot of them. There's a category of tree that you might not be aware of, called a clonal tree, which is coming out of cancer research. Because they are actually sampling tumors over time at multiple stages, they've got these very explicit samplings of the internal nodes, whereas with gen epi we often have just the leaf nodes, and they're using that in their phylogenies and are constructing different kinds of phylogenies. There are also a lot of different kinds of genomic maps, so often for drug resistance people do want to see the kinds of mutations in context with other things. And then of course people used tables, images, and some things that we could not
classify. So these chart types are useful if you want to get a sense of how else you could visualize your data beyond a tree, that there are lots of different options for you, and they could be useful for bioinformaticians, to let them know what kinds of visualizations their software should support, since people are trying to create things like this. We also found that, in addition to these basic charts, people were adding metadata in very consistent ways. So we talked about points, lines, and areas: we saw that people would change their size, their shape, their color, or their texture. Sometimes, when it was a zoonotic outbreak and you wanted to specify, you know, what was human, what was this animal, what was that animal, people liked little icons of the different animals to help in the understanding of the data. We also saw that in this community text is used quite a bit, so bolding or coloring text or making things italic was also really common. This was interesting because, at least in the infovis community, they don't like to encode information as text, because again reading is cognitively expensive. So this is a neat finding for us, because it was a way that you're taking this cognitively expensive task and overlaying these kinds of channels on it to hijack your perceptual system. I don't know if that made sense to anybody, but it was a neat finding that we didn't expect. We also found that people added these things either consistently or in one-off ways. So if we were to put the two together, you might have just a basic chart type like a tree. People might want to re-encode different kinds of information, which means they want to change perhaps the color of the line or the size of the points in the underlying base graphic. You might want to add different marks, which is adding the little points to the tree, or highlight different groups. And then sometimes you just want to add an annotation, so this one has an annotation that just says: this
guy, we don't know his contact. That's a one-off thing, and this one just highlights one group instead of all of them. So we saw that people did this; my guess is people often do this manually. And the last thing that we saw was that there are also consistent ways that people combined different kinds of charts in order to convey these very different aspects of the data. We saw that already, where we had the phylogenetic tree, the map, and the timeline, and the colors of the points were consistent, linking the information across all of the charts. In our taxonomy we called that many types linked, right, because people were using color. Often people just had a single chart, and they were throwing all this information at it and putting a lot of information in as text. We saw people do what we called composite: this is when you've got your classic dendrogram as well as a heat map, and you're using spatial position to kind of read across to get all the different kinds of information, but they're two different chart types. We saw small multiples, which is using the same kind of chart type to show different aspects of the data. So for example you would show the same phylogenetic tree four times, but you would color the points of the tree in each instance according to some variable: maybe the same tree with geography, the same tree with gender, the same tree with something else, with time, right? But you're repeating the same thing to show different aspects of the data. And then sometimes we saw, likely because of space limits, people just put things together in a single figure that had no connection, and there were also very complex combinations of these. So this is also something I assume that people are doing by hand, right? For you, as people that are not software developers, I don't expect you to fix this problem overnight. In fact, one of my hopes is that in talking to you about chart types, chart enhancements, and chart combinations, and giving you the gallery as a resource, you can think
about the different degrees of freedom that you can use when you're creating data visualizations for your own papers, to try to get some of that information out of text and show it in different ways. And my hope over time is that bioinformaticians, knowing the degrees of freedom that people need, will start to develop tools that allow you to do this in a more automated manner, as opposed to by hand, which is how I assume it is happening now, right? So the purpose of the tool is to try to help you explore the kinds of visualizations that you can make. It's taking that concept of a design space and making it concrete for infectious disease genomic epidemiology. Like I said, it's still in the early stages, so I think it'll change over time. I don't know if it's the right answer, but I hope that you find it to be a useful resource as you explore this design space, think about these different kinds of alternatives, and think about creating visualizations for yourself. Okay, so with that we'll wrap up. I hope that with the lecture today I have given you a sense of what it means to encode and decode information visually. I hope that in taking you through this why, what, how idea and breaking down those components for you, you have a sense of thinking beyond just pictures and thinking much more systematically about data visualization. And I hope that in talking to you about design spaces and introducing you to some of the degrees of freedom and design space complexity that we observed in GEViT by studying data visualizations that come from this community, we have given you a sense of how you might use visual alternatives and create different visualization design choices. And that's it.
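As a small supplement to the point made earlier in the lecture about deriving data rather than plotting exactly what you were given, here is a minimal Python sketch. It is only an illustration under stated assumptions: the region names and numbers are made up, and the helper names `rate_per_100k`, `net_trade_balance`, and `rank_names` are hypothetical and do not come from the lecture or from any library. It shows that ranking by raw counts mostly reproduces population size, while ranking by a derived rate tells a different story.

```python
# Hypothetical regions with made-up numbers, purely for illustration.
regions = [
    {"name": "Region A", "population": 2_000_000, "cases": 400},
    {"name": "Region B", "population": 50_000, "cases": 40},
    {"name": "Region C", "population": 500_000, "cases": 50},
]


def rate_per_100k(cases, population):
    """Derive a rate from a raw count so regions of different sizes are comparable."""
    return cases * 100_000 / population


def net_trade_balance(exports, imports):
    """Derive the single value the viewer would otherwise have to subtract by eye."""
    return exports - imports


def rank_names(rows, key):
    """Return region names ordered from the largest to the smallest value of `key`."""
    return [row["name"] for row in sorted(rows, key=key, reverse=True)]


# Add the derived variable alongside the raw one.
for row in regions:
    row["rate"] = rate_per_100k(row["cases"], row["population"])

by_count = rank_names(regions, key=lambda row: row["cases"])
by_rate = rank_names(regions, key=lambda row: row["rate"])

print("Ranked by raw count:", by_count)
print("Ranked by rate per 100k:", by_rate)
```

With these made-up numbers, the largest region leads on raw counts while the smallest region leads on rate, which is the same effect as the population-map example: the encoding you choose should be applied to the derived quantity that matches the question you want the viewer to answer.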