Hello. I'm going to start talking about visualizing multi-dimensional data. So one of the questions Tim posed yesterday was, how do you look at four-dimensional data, right? Or how do you look at a four-dimensional object? So let's just look at a four-dimensional object. That's a tesseract, which is to the fourth dimension kind of what the cube is to the third dimension. And the way to think about this is: if you take a shadow of a cube, which is a square, and join two squares, you get a cube. If you take a shadow of a tesseract, which is a cube, and join those two cubes, then you get a tesseract. And that is one way to look at four dimensions. And already it is easy to see that it's not that simple to think in four dimensions, right? Let's start with something even simpler. There's a book from 1884, written by Edwin Abbott, called Flatland. And the book is really about a flat land, a 2D place where everybody is either a line, a triangle, a square, a pentagon, a hexagon, or some regular shape. And the protagonist is this person called A Square. And viewed within two dimensions, the square appears as nothing but a line, right? And one day, this square is visited by a sphere. And a sphere is a three-dimensional object. And the sphere can actually rise in and out of the two-dimensional plane, can make himself small, big, and even disappear. And the square is not able to understand how the sphere is able to do this, right? No matter how much the sphere explains, the square cannot grok the concept of the third dimension till literally the sphere pulls him out and shows him that there is this third dimension. It's a beautiful book. Anybody who wants to, go read Flatland. And that's what I'm going to try and do here. So I'm going to try and show you as much as possible, rather than tell you. And I'm going to show you, in very basic principles, how to think about visualization and how to think about doing that in multiple dimensions, right? 
So we are using visualization because we are visually wired, right? We're really visually wired. 50% of the brain is used for visual processing. 70% of our sense receptors are in the eye. And we can actually enter any room and, in less than a second, maybe 0.1 of a second, get a sense of the layout of the room, right? So our visual processing is the biggest advantage we have. And if you have a simple phenomenon, like a pendulum swinging, and you want to abstract it, you have two choices. You can either use the data to create a symbolic abstraction, or you can choose a visual abstraction. And the power of the visual abstraction is that it allows you to use 50% of the brain. It allows you to use what you are good at, which is pattern recognition. So visualization is the transformation of the symbolic to the geometric, right? And that's what we're going to see. So we're going to see this in four contexts. We're going to look at it from small data, then large data, then big data, and then wide data, right? So what is small data? Let's take the smallest possible data set that we can start with. Let's just take five observations. Area is one column. Sales is the other column. Area is a categorical variable, so we have got five categories there. And sales is a quantitative number, right? So if you were to visualize this data, we would start with acquiring it. So we have the data. We would parse it into variables. So we'll have two of them. So we'll say x is equal to area, so it's my category. So I've made them 1, 2, 3, 4, 5, five categories. And y is a quantitative variable, which is sales, right? So I've got those two variables. I have parsed them. Then I would encode them into shapes. I will say x is a position, y is a bar. And I will scale it on a 200 by 200 pixel screen space. So I scale those numbers to that pixel size. Once I've got that, I would then render it with a coordinate system, in this case, Cartesian. 
And I will have what would be a bar chart. Now, we can use these simple four or five steps to actually create multiple visualizations. Let's take a simple example. Let's take the same data set and now render it with either a point, a line, a bar, a stacked bar, or a staggered bar. So we're taking three different geometric shapes and two position combinations in that bar space. And we're going to use different coordinate systems to see it. And we can look at the power of this: if the coordinate system is Cartesian, x and y, what we are very familiar with, we would get a dot plot, a line chart, a column chart, a stacked column, and a waterfall. I can cheat very quickly and flip the Cartesian system. So now x is y and y is x. And I would have a dot, a line, a bar, a stacked bar, and a cascade. So I've already got 10 visualizations that I can do. I can then decide to do it in polar coordinates and say x is actually theta, y is now radius. And I would get a marked radar, a line radar, a coxcomb, if some of you are familiar with that, a bullseye, and a polar waterfall. Or cheat again and flip the coordinates and say x is now r and y is equal to theta. And I would get a target, a line track, a windrose, somebody's favorite pie chart, and a polar cascade. So just with a simple application of this, I can come up with 20 different visualizations. This is the process of visualization. So the small data process is: acquire the data, parse the variables, encode the shapes, select scales, and render coordinates. Now, this is not new. I haven't discovered this. It is the grammar of graphics. And if anybody here is a data scientist and uses either R, Python, or Julia, and doesn't use the more common plotting systems like base graphics in R or matplotlib in Python, but uses ggplot2, or bokeh in Python, or Gadfly in Julia — they're all based around the grammar of graphics. 
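The acquire, parse, encode, scale, render pipeline just described can be sketched in a few lines. This is an illustrative sketch in Python (the talk's own code is in R with ggplot2); the sales figures and function names are made up for the example, and only the 200 by 200 pixel screen space is taken from the talk.

```python
# A minimal sketch of the five-step pipeline:
# acquire -> parse -> encode -> scale -> render (coordinates).
# Hypothetical sales per area; not from any real data set.
data = {"North": 120, "South": 80, "East": 150, "West": 60, "Central": 100}

def parse(data):
    # parse into an x (category index 1..5) and y (quantitative) variable
    xs = list(range(1, len(data) + 1))
    ys = list(data.values())
    return xs, ys

def scale(ys, pixels=200):
    # linearly scale the sales values onto the 200-pixel screen space
    top = max(ys)
    return [round(y / top * pixels) for y in ys]

def render_cartesian(xs, heights, bar=30):
    # encode x as position, y as a bar: (left, bottom, width, height)
    return [(x * bar, 0, bar, h) for x, h in zip(xs, heights)]

xs, ys = parse(data)
bars = render_cartesian(xs, scale(ys))
```

Flipping the coordinate system — swapping x and y, or rendering in polar instead of Cartesian — would only change the final render step; the parsing, encoding, and scaling stay the same, which is exactly the point of the grammar.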
If any of you is a web programmer and uses D3, D3 is also based around the same principle. So this is small data. Let's look at large data. So let's take an example. Let's take a slightly larger set. Let's take 24,000 PIN codes. So, all the PIN codes in India. And a PIN code would be 560076 — that's Bannerghatta Road, that's where I live. And this is the latitude and longitude. And if I want to visualize this, I would just create a scatterplot. And I could plot them. And with some playing around with alpha, I can actually start to see the PIN codes and the structure behind them. I can see the sparseness in central India, make out the deserts where it's sparse, even make out the western parts, and actually kind of get a sense of the data. So I can create these visualizations very simply. But what I normally do is not create one visualization with a large data set. I create many. So for example, I may want to look at: is there a geographic nature to the PIN code? Does the 5 in 560076 stand for something? If you want to look at that, then I would just color code the first digit. And I would now see that, yes, the 5 actually stands for two states, Karnataka and the erstwhile contiguous Andhra Pradesh. So exploration of large data is actually an iterative process, which many of you may be familiar with. So you would have an additional step of refining the data, which may mean either filtering or transformation. So our process of creating the data visualization now is: acquire the data, parse the variables, refine the data, encode the shapes, select scales, and then render coordinates. Let's take big data. So we went up three orders of magnitude. Let's go another two orders of magnitude up. And let's visualize big data. So let's take x and y, just two variables, a million data points. 
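The "refine" step for the PIN-code example — deriving the first digit so it can be color-coded — can be sketched as a plain data transform. This is an illustrative Python sketch; the sample rows are made up (only 560076 comes from the talk), not the real 24,000-row data set.

```python
# Sketch of the "colour by first digit" refinement for the
# PIN-code scatterplot: derive a categorical variable from
# the first digit, then encode it as colour.
pincodes = [
    ("560076", 12.91, 77.60),  # Bengaluru, from the talk
    ("110001", 28.63, 77.22),  # New Delhi, illustrative
    ("500001", 17.39, 78.47),  # Hyderabad, illustrative
]

def refine(rows):
    # keep lat/lon for position; add the postal zone (first digit)
    return [(int(pin[0]), lat, lon) for pin, lat, lon in rows]

zones = refine(pincodes)
# 560076 and 500001 share zone 5; 110001 falls in zone 1
```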
And if I was to just visualize them literally, no amount of alpha treatment, transparency treatment, is going to help me if I want to plot each data point. My MacBook is 1,440 by 900 resolution — about 1.3 million pixels. I have nearly a million data points to plot here. So I can't really plot all the data. I can maybe sample it. Sampling can be effective, as long as I'm over-weighting for the unusual values. I may need to create multiple views. I may have to be careful about tuning parameters. But sampling is one way to look at it. Or I can actually model it. And modeling is really effective, because the model can scale much better than visualization can. I mean, visualization is great at looking for patterns, but visualization doesn't scale, as we've rightly seen. But visualization is also good for finding out things that I don't really know. I really want to do visualization because I don't know what I don't know. Because if I knew it, I would have modeled it. So if I want to find what I don't know, I really want to visualize it. And the other way to think about it is to bin it. So binning can solve a lot of these challenges in actually visualizing big data. There's a great paper by Hadley Wickham called Bin-Summarise-Smooth: a framework for visualising large data, where he literally uses a MacBook Pro with 16GB of RAM to plot nearly 100 million data points using the approach of binning. Jeffrey Heer's team also has a paper on real-time visual querying of big data, again looking at using aggregates on one side and then layering interactive techniques on top. So binning can actually solve a lot of the visualization challenges we have in big data. And binning is not new. One of the first statistical charts that any one of you would have drawn is literally a histogram. 
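The bin step behind this idea can be sketched directly: aggregate a large x-y point cloud onto a coarse grid, then draw bin counts instead of raw points. This is an illustrative Python sketch (not the method from Wickham's paper itself); the grid size and the synthetic uniform data are assumptions, and the point count is kept at 100,000 just so the sketch runs quickly.

```python
import random

# Sketch of 2D binning for big data: reduce many points to a
# grid of counts, which is what actually gets plotted.
def bin2d(points, bins=50, lo=0.0, hi=1.0):
    width = (hi - lo) / bins
    counts = {}
    for x, y in points:
        i = min(int((x - lo) / width), bins - 1)
        j = min(int((y - lo) / width), bins - 1)
        counts[(i, j)] = counts.get((i, j), 0) + 1
    return counts

random.seed(0)
cloud = [(random.random(), random.random()) for _ in range(100_000)]
grid = bin2d(cloud)
# 100,000 points reduce to at most 50 * 50 = 2,500 cells to draw
```

A 1D version of the same function, with values on one axis and counts on the other, is exactly a histogram — which is why binning big data and drawing histograms are the same idea.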
So if you were to ask for all the ages in a room, create some buckets, put those ages in those buckets, and plot it — this is probably the first histogram we created in grade five or six. But something happened in the interim, or at least happened to me: I never created histograms for a long time, because I moved to Excel for a long time, and Excel doesn't have the ability to create histograms. So tools also matter. Even a simple thing like creating a histogram can fall away, because defaults matter, tools matter, in ways we don't always notice. If the histogram is not a default available to me, I may not actually use it. So Amanda Cox — she's a New York Times graphics editor — says, we're calling 2015 the year of the histogram. And histograms are really great, because they allow me to look at a portrait of the data for small data, and they allow me to actually plot big data in a meaningful way. So I thought I might as well come up with a quote — a conference is a good time to come up with a quote: visualizing big data is the process of creating generalized histograms. If anybody wants to tweet it and ascribe it to me as a quote, that's a good time to do that. So we covered big data. So let's just look at the data process for big data. You acquire the data, you parse the variables, you filter the data, but then you aggregate the data. You aggregate the data and then encode shapes, select scales, and render coordinates. And some of the work that we saw — for example, Apache Lens, which was being talked about yesterday — is also trying to do a lot of this. A lot of systems are trying to do this aggregation in a smarter way so that we can actually still work through this data pipeline. So we come to wide data, right? And if you are really looking at wide data, or what is multi-dimensional visualization, then we basically have these five sets of techniques to do it. 
We can either look at standard 2D techniques, or we can look at glyph approaches, or transform it into geometric shapes, or stack it, or use pixel-based approaches. So these are basically the five sets of techniques we have to actually visualize data in multiple dimensions. The important thing to understand is that the need for interaction goes up as we move down that list, towards the right. As we move towards the pixel-based approaches, we need more interactive capabilities to be able to use these multi-dimensional data visualizations. The ease of understanding also declines. So it's easier to understand the standard approaches; it's harder to actually understand the pixel-based approaches. So keep that in mind as we walk through some of these techniques. So we're gonna take a very simple data set, the diamonds data set — if anybody has used R's ggplot2, it's one of the datasets that comes inside it. It's good because it has got about 50,000 observations of ten dimensions, and it's easy to understand. So it's large data, and ten dimensions. Of the ten dimensions, five of them are basically how you would go ahead and buy a diamond. So the price of a diamond is determined by the four Cs, and the four Cs are carat, color, clarity, and cut. So price is in US dollars. Carat is weight — literally, one-fifth of a gram is one carat. Cut goes from a fair cut to an ideal cut. Color goes from J, which is bad, to D, which is good — seven levels. And clarity goes from I1, included — is there something inside the diamond — to IF, which is internally flawless. So easy to understand: two of them are quantitative and three of them are categorical. And then there are five more dimensions in the data set, which are literally describing how big the diamond is. So a length, width, and height — x, y, z. 
And basically, because a diamond is not a cube or a cuboid, you also have table width and depth, to kind of capture the taper at the top and the depth at the bottom. Not important to understand, but literally five numerical dimensions to describe the diamond. So this is how, let's say, the first six rows of that data set would look. So let's start: 1D and 2D. I won't spend too much on it, because we wanna talk really about multi-dimensional. But just for completeness, if you were looking at one and two dimensions, we literally have quantitative and categorical one-dimensional variables, and the combinations of those: quantitative and categorical, categorical and categorical, and quantitative and quantitative. So we have five combinations. And if you wanna show them as a point, a bar, a line, and an area, we literally have 20 different charts that we can choose from. Well, five of them won't be there, but these are probably the standard charts that many of you may be familiar with. If you're doing any kind of statistical visualization, some of them, like bar charts and histograms, you're more familiar with. Box plots, maybe less. Mosaic, table lens, slope graph, probably even lesser. But the idea is, basically, it's pretty easy to understand these charts, and if somebody doesn't know what to pick, this is a good selection table to start with for chart selection around just two dimensions, right? So the most useful two-dimensional representation is obviously the scatterplot. A 2D scatter is the most common, and you can clearly see price and carat are related to some extent — well, not necessarily linearly, but as the weight of the diamond goes up, the price goes up, which makes intuitive sense. And because some of you are also designing systems around interactive exploration, I want to cover interaction also in this. 
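Since cut, color, and clarity are ordinal grades, a common refinement before encoding them is to map each grade to its rank on its scale. A minimal Python sketch, using the grade orderings described above (the example row is illustrative, not a real row from the diamonds data):

```python
# Ordinal scales for the diamonds grades, worst to best,
# as described in the talk / the ggplot2 diamonds data set.
CUT = ["Fair", "Good", "Very Good", "Premium", "Ideal"]
COLOR = ["J", "I", "H", "G", "F", "E", "D"]            # 7 levels, J bad to D good
CLARITY = ["I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"]

def rank(grade, scale):
    # ordinal position of a grade on its scale (0 = worst)
    return scale.index(grade)

# one illustrative diamond
row = {"cut": "Ideal", "color": "E", "clarity": "SI2"}
encoded = {k: rank(v, s) for (k, v), s in zip(row.items(), [CUT, COLOR, CLARITY])}
```

Once graded categories become ranks, they can be mapped onto ordered visual channels — a sequential color scale, for instance — rather than arbitrary hues.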
So with 2D scatterplots, the first thing you want is some interaction for annotation, so that you can actually pick out the outliers. So you need to have a way of picking out these outliers — literally, diamonds which are bigger than four carats in this case; there are three of them. You should have some interaction capability to pick them out. And you can do what many of you would do, a log transform — log of the price and log of the carat — and now you can clearly see the chart looks linear. So there is definitely a good linear relationship between log carat and log price. It makes intuitive sense: the heavier the diamond, the more the price. The other thing you may want to do with interaction is zoom into an area. So you would say, okay, I want to look at carat greater than one, price greater than 10,000 — can I have the ability to zoom into this? So panning and zooming is one of the other abilities you want in your interaction system for visualizing multi-dimensional data, because you want to look at different areas of interest. So you zoom into that area and then start to examine it further. So let's go to three dimensions. We have a scatterplot — how do we add a third dimension to it? The easiest way to think about it is to change the dot. We can change the size of the dot, we can change the color of the dot, or we can change the shape of the dot. Very simple. If we add size to it and literally multiply x, y, and z to create a proxy for size, we can see that the smaller dots are on the bottom, the larger dots are on the top. Again, makes sense — smaller diamonds at the bottom. And literally, the size can actually start to play. The transparency can go even higher at the top so that the circles become more visible. But even at this 0.3 alpha level of transparency, we can see the pattern very clearly across 50,000 points. The other option is obviously color. 
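On the data side, the log-log re-expression and the zoom just described are both plain transforms. A minimal Python sketch — the rows are illustrative, not real diamonds data, and only the carat > 1, price > 10,000 thresholds come from the talk:

```python
import math

# Two transforms from the scatterplot discussion:
# a log-log re-expression of price vs carat, and a "zoom"
# that filters down to a region of interest.
rows = [
    {"carat": 0.23, "price": 326},
    {"carat": 1.52, "price": 11_250},
    {"carat": 2.10, "price": 14_800},
]

# log-log re-expression: straightens the price-carat relationship
loglog = [(math.log10(r["carat"]), math.log10(r["price"])) for r in rows]

def zoom(rows, min_carat=1.0, min_price=10_000):
    # panning/zooming reduces, on the data side, to a filter
    return [r for r in rows if r["carat"] > min_carat and r["price"] > min_price]

region = zoom(rows)  # keeps only the large, expensive diamonds
```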
So suppose we use color for, literally, the color variable in the data set. You can clearly start to see the pattern: the D colors, which are the good ones, are on the top — the orange ones — and the pink ones, which are the J colors, are on the bottom. And you can clearly see lines and levels there. So color is one of the best ways to look at a third dimension, right? We can also use shapes, because I said shape is the other option. Shapes don't really scale that well. So here I'm using five different shapes to denote the different cuts. But even with a sample of the data — I think this is less than the entire data set, probably 10% of it — it clutters up very quickly. So shapes don't really scale very well with large data. We have another option too. We can do a real 3D scatter. We can actually add depth — literally draw, as you would on a page, x, y, and z, with z going inward. And if you do it like that, as a static 3D plot, our depth perception is really not good. This kind of creates some occlusion; I can't really see the points very well. Even though this looks nice, it's pretty hard to interpret. The better way to do 3D is if we can do some rotation on it. So if you have the interaction capability to rotate it from any direction and spin it on many axes, then 3D can actually start to make much more sense. And this is one of the basic techniques for creating different projections. You can do this from three to four to five to six dimensions, and you can basically create projections around that. So interaction is very important if you really want to use a third dimension. So let's go up from three to four, and four to six dimensions. For 4D, the first and most popular one would be a bubble plot. So you start to use color and size both together. So color is denoting, literally, the color of the diamonds in this case, and the size is for x. 
And again, you can start to see some patterns. Color and size together is the most common way you would do four-dimensional plotting. If you have time in your data set — and we don't have time in this data set — time could actually be the fifth dimension, and you can play it as an animation. So the Hans Rosling video, which I'm sure many of you may have seen, does that in a very great way, and literally plots five dimensions, because you now have color, size, and time to actually show that. The other way to think about it is: we've been thinking about only one graph on the page. We can start to divide the page, right? And once we start to divide the page, we can line the plots up. So literally, in this case, we can take any conditioning factor — here, each of the color factors, D, E, F, G, H, I, J — and create seven plots and line them up together. This technique, called trellising, faceting, or small multiples, is a great way to start to see the patterns. Why stop at only one dimension when we can actually add another? So we can make it literally into a grid, where we have one variable along the top and one along the side, so we literally have, in this case, seven into five — 35 plots here, each showing two variables, right? So we have four variables, and we can actually even add one more variable on top of that. So trellising is another good way to look at it. Then, instead of splitting on the same variables, you can actually build scatterplot matrices, where you plot different variables — in this case price, carat, table, and depth, each combination of those. So we get these six plots, and the scatterplot matrix is another way to start to look at many of these dimensions. The scatterplot matrix is probably the most effective way to start looking at it. The other question is: why restrict ourselves to only one type of chart? What if we can create all different charts and link them together? 
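The trellising step above is, at heart, a group-by: split the rows on one or two conditioning factors, then draw the same panel per group. An illustrative Python sketch — the rows and the `facet` helper are made up for the example:

```python
# Sketch of the facet/trellis step: each group of rows becomes
# one small-multiple panel in the grid.
rows = [
    {"color": "D", "cut": "Ideal", "carat": 0.3, "price": 500},
    {"color": "D", "cut": "Fair",  "carat": 0.4, "price": 600},
    {"color": "J", "cut": "Ideal", "carat": 0.5, "price": 700},
]

def facet(rows, *keys):
    # group rows by the conditioning factor(s)
    panels = {}
    for r in rows:
        panels.setdefault(tuple(r[k] for k in keys), []).append(r)
    return panels

by_color = facet(rows, "color")         # one row of panels, one per color
grid = facet(rows, "color", "cut")      # the 7-by-5, 35-panel grid idea
```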
This is the best way of actually doing multi-variable or multi-dimensional analysis, because I can start to see two histograms here, three bar charts, a box plot, a mosaic plot, and a scatterplot, and they're all linked together. And that linking is what actually allows me to really work with this as a whole. So I can select a part of an area — let's say I can brush a part of the scatterplot — and I can actually see how that plays out in every other chart that is there. So multiple views with the interaction capabilities to do brushing and linking is really what you want. Unfortunately, the tools that we choose don't allow us to do this very easily. If you're using R with RStudio, or Python or Julia with a notebook approach, they're largely a single plot space — the plot pane at the bottom in RStudio, or a cell output in a notebook. It doesn't really allow you to do a lot of this multiple views and linking, and that actually is one of the handicaps of the tools we use for data science: we can't do a lot of the interactive work which is really needed for multi-variable visualizations. What if you want to go beyond six dimensions? If you really want to go beyond six, one approach is icon-based. Instead of just using circles and shapes, we can use icons — stars, sticks, and Chernoff faces. I will just show the star. So I can look at a star, which carries five of these dimensions — color, clarity, depth, table — with 50 observations plotted in a rectangular grid. So I can start to see some pattern, and actually I can lay them on top of a price-and-carat plot. So I literally have seven variables plotted now. A little harder to read, but if you can get the transparencies to work, the sizes to work, you can make a star plot work for you. You're now at seven dimensions. Or flip it around. 
Why don't we take each of these variables and, instead of drawing glyphs, bin the main graph and start to show them as little graphs inside the main graph itself, right? So that's something called subplots, or binned plot distributions. That would be another way to kind of flip the whole thing around and start to see more dimensions. So far, also, we've only been talking about orthogonal projections, or Cartesian coordinates. And orthogonal is great, but it takes up a lot of space. If you really want to go above five or six, one of the approaches is to just do parallel coordinates. With parallel coordinates, I can literally line up all the 10 variables that I have and link them by lines. And that's what I'm really doing with parallels. This again works very well with interaction. I might be able to sort the axes and actually see some of the interactions between them. Or, as I was doing in other cases, select parts — I have interaction and selection to actually start to make sense of it. So 10, or actually up to 20 variables, can probably work in a parallel coordinates plot. Instead of joining by lines, I can literally show them as bars. If I do that, then I get a table plot — again, all the 10 variables binned together. So binning, our favorite strategy for big data — literally 100 bins — and I can see them as plots and see how they are distributed. Again, I can zoom into selected areas and say, okay, this is the area that I'm really interested in. So I can get a table plot. Or I can do stacking. And stacking is literally about taking two or three of these variables together and creating these stacks — a mosaic plot, in this case, for three of those variables — and starting to look at how big each piece is. Treemaps are another way of thinking about the same thing. So we can run through these techniques and literally start to do a lot of this multi-variable or multi-dimensional visualization. 
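The core transform behind a parallel-coordinates plot is just rescaling every variable to a common [0, 1] range so all the axes share one vertical extent; each row then becomes a polyline across the axes. A Python sketch with illustrative columns (not the full ten-variable diamonds data):

```python
# Sketch of the rescaling behind parallel coordinates.
def normalise(column):
    # min-max scale one variable to [0, 1]
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) if hi > lo else 0.5 for v in column]

data = {
    "carat": [0.2, 1.0, 2.2],
    "price": [300, 5_000, 18_000],
    "depth": [58.0, 61.5, 65.0],
}
scaled = {k: normalise(v) for k, v in data.items()}

# row i is drawn as the polyline through (axis_j, scaled[axis_j][i])
polylines = [[scaled[k][i] for k in data] for i in range(3)]
```

Sorting or reordering the axes — the interaction mentioned above — changes only the order of keys in `data`, not the scaling.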
I'm gonna quickly touch on a couple of other ones, where I'm not using the diamonds data set — I'm just showing examples. There are other ways to compact it even further, like star coordinates, which is literally arranging the parallel coordinates in a star layout. Or what a lot of you would know as PCA-kind-of techniques, which are basically tours and projections: project the data from three, four, five dimensions into two dimensions, and actually keep rotating it in a way that lets me start to see some patterns. Really good for clustering analysis. The last approach, which is the hardest to understand, is really using each pixel to represent one data point. And pixel-based approaches — either spiral pixel curves, or pixel bar charts, or even space-filling curves — are much harder to explain, actually really hard to explain in a conference setting. But with intuition, it's like reading an X-ray: after some point you can actually get good at it. And with enough interactive querying approaches, you can actually start to use them and make sense of the data. So the data process for wide data is: we're still acquiring the data, parsing the variables, filtering the data, aggregating the data, encoding the shapes, selecting scales. We may not only render it in coordinate systems, but we may actually render it algorithmically. And then we also may add views and add interactivity to it. So if you're really thinking about doing a lot of multi-variable or multi-dimensional visualization, the advice is basically: encode wisely. So think of what you're encoding and how you can make clever choices around that. Use the space that you have, and the multiples that you have, much more cleverly, so that you are not just using one chart but using faceting or multiple views. Add interactivity, because interactivity is really needed when you want to do multiple variables. So linking, brushing — that's very important. 
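Stripped of the rendering, brushing and linking is a selection operation: a brush rectangle on one view yields a set of row indices, and every linked view re-renders with those indices highlighted. An illustrative Python sketch — the points and the `brush` helper are made up for the example:

```python
# Sketch of brushing and linking as pure selection: the brush
# returns row indices, and each linked chart highlights them.
points = [(0.3, 500), (1.2, 8_000), (2.0, 15_000), (1.5, 9_500)]

def brush(points, x0, x1, y0, y1):
    # indices of the rows inside the brushed rectangle
    return {i for i, (x, y) in enumerate(points)
            if x0 <= x <= x1 and y0 <= y <= y1}

selected = brush(points, 1.0, 2.5, 7_000, 20_000)
# any linked view (histogram, bar chart, mosaic...) colours
# exactly these row indices and leaves the rest unchanged
```

Because the selection is shared row indices rather than pixels, it carries across chart types — which is what makes linked multiple views work.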
And most of the tools we have right now are not really good at it. So people who are designing tools should really think about more and more interactive techniques that they can apply. And in some cases, even reduce the problem space to kind of show that, right? So those are the kind of four pieces of advice. That's kind of the data process you follow for wide data. All the code for these slides is basically created in R. These are the R packages used, and you can go to my GitHub account and literally see the code for every chart on these slides. I wanna end with just one quote. "The greatest value of a picture is when it forces us to notice what we never expected to see." This is John Tukey, the guy who kind of kickstarted exploratory data analysis. I'm gonna stop here. My name is Amit Kapoor. AmitKaps is my Twitter account. amitkaps.com is where you can read more about me. I like to talk about crafting visual stories with data. So that's my primary interest. And I like to work at the intersection of data, story, and visual. Open to questions. Hi, have you ever considered using MATLAB for visualization? MATLAB as a tool? MATLAB as a tool? No, I have not. I just started with R about a year back, and it's open source, so I use it. I don't know whether MATLAB is — I really don't know. Is it open source? No? Okay, then I would not have considered it. So it was clear that if you rotate 3D data, the visualization was clear. What would it be for, let's say, 5D, 7D? Would there be any interpretation if you show a rotated view of 7D data? So — tours and projections are the way to look at that. Again, the challenge in rotating data is that you have many dimensions to rotate. Where do you rotate? How do you look at it, right? So there are, again, visual techniques developed to create what are called guided tours, and there's a whole research space on it. The link there, on the tourr package — you can go and see it; it actually implements that in R. 
But it basically plots the data in many dimensions and then allows you to look at the data from many different directions, and it actually constructs a movie for you to show the interesting ones. Because otherwise, the interaction itself is the hard part in actually looking at these views. So tours are what you should look for. Any other questions? Hello, hello. Yeah, you had mentioned that there is interaction in multi-dimensional data visualization. Does that also mean that from one graph, or one portion of the graph — if you click on a particular part of the bar chart or something — it will go to another graph, another plot, and show the further details of that? Yeah, so that interaction is brushing and linking, but I'll go back a slide or two. Yes — brushing and linking is basically the ability to link. Right, so in this case, you can literally see: you select a brush. A brush is basically selecting a set of points. You can see those points becoming red, and you can actually see where they lie on each of the other graphs. So the ability to do multiple plots only works when they're linked in some way. So you selected a set of points, and you can see those points in every other view. And sadly, the tools we have don't allow us to do that very well. So this is done in Mondrian. There's a package called iplots which does that in R, but mostly we don't have this. Surprisingly, even after 20 years of having these kinds of tools, they're not open source. There are obviously commercial tools that do this pretty well. Yeah, sorry? So when you talk about interactiveness in a plot, does interaction mean that one more dimension will be interactive? Is that what you meant? Because in R, we have a manipulate function by which one dimension can be interacted with using a slider. So is that what you meant by interactive — that one dimension can be interactive — or is there any other meaning to it? 
Right, so there are two kinds of ways to think about interaction. The one that you're talking about — manipulate, or Shiny, for example, in R — is basically reactive programming. You change the slider, and the chart gets recreated. So it gets recalculated, recomposed, and then redrawn. That's one level. The other interaction is basically direct manipulation, where you can actually manipulate on the plot itself and see highlighting, annotation, and this brushing and linking — where it is not going back to R, recalculating, and redrawing, right? So that reactive programming is very important if you are recomputing a lot of these statistics and re-showing them again, right? What you also want, actually, is an implementation where you can do this kind of direct interaction much more fluidly. Hi, any suggestions for visualizing text, where you have hundreds of dimensions? There was a talk, I think three years back, that Anand gave on visualizing text here at Fifth Elephant — you can look at that. There are many techniques, but this talk is basically focused on tabular data. So I don't have anything off the cuff to say about text. I mean, graphs, text — there are more categories of data, like networks, which I'm not really covering here. This is largely tabular data. Yeah, sir, you had a question? t-SNE? Yes, that's an idea. Yeah, t-SNE is a technique that helps in doing a lot of that. Any other questions? Okay, thank you very much. You're welcome.