So why data visualization? It communicates data clearly, it doesn't have language barriers, and it works effectively with pictures, graphs, charts, and anything else that helps people understand the significance of the data. Especially if you are a data scientist, the quickest way to get people to understand what you've gained from data is through data visualization. It enables decision makers to see analytics, and it allows people to grasp difficult concepts and identify new patterns, to actually see what patterns are there rather than just going through all the numbers, and to find insights in the presentation and outcome of the data analytics.

So very quickly, I'll spend the next 10 minutes going through the different data visualization charts and graphs out there. So there's a dot plot. Bar charts could be vertical, could be horizontal, right? And the floating bar chart or Gantt chart: if anybody is a project manager, they'll be used to these sorts of charts. Pixelated bar charts: each of these is an individual person, you can see, blue for male, pink for female. And a histogram. A slope graph, where the slope actually shows how much the numbers differ. Radial charts. Glyph charts, this one is pretty cute. A Sankey diagram: see, the thickness of the lines also tells the story. And this is the area size chart, join Facebook, okay. Small multiples, [unclear]. It's a word cloud, [unclear].

Pie charts, square pie, treemaps, circle packing diagrams, bubble hierarchies, tree hierarchy. Line charts, we are inside somewhere here. Sparklines. Area charts: the difference between a line chart and an area chart is that an area chart needs to start at zero so that you can see the area. And there's a stacked area chart, horizon chart, stream graph. For a stream graph, basically there is no positive or negative, and you only concentrate on the area. Candlestick charts, barcode charts, flow map, scatter plot matrix.
Data scientists would be familiar with scatter plots: basically you just plot every single data point to find the clusters. Radial networks, network diagrams, dashboards, maps, choropleth maps, dot plot maps, bubble plot maps, topological maps, particle flow maps. This one basically shows the wind over time. Cartograms: basically this is indicating the population size, so the bigger the area, the larger the population. A Dorling cartogram similarly shows population size by using circles. A network connection map: this is, I believe, Virgin and Singapore Airlines. A spatio-temporal cube: you have your map, you have your timeline, and then you have your cities. Animated maps, 3D maps, virtual reality, and that's roughly it. So those are all the different types of tools, the different types of charts and graphs you can use to visualize your data. Pick any one that is pretty and that actually conveys your idea in the right manner, and you can do amazing things with them. Thank you.

I'm really excited to be here today sharing a topic I'm passionate about, which is data visualization. So can I have a show of hands, who among you already works in the data field? Very cool. And who uses R or Tableau? Nice. So let's get started. In this session we'll cover the basics of visualization and visual storytelling before moving to our hands-on session, which will be fun.

So once you receive a data set, there are a few things you can do. One thing is you can look at the summary statistics, but they never show you the whole picture. In this example, every frame of the animation is a data set, and they all have the same mean, standard deviation and correlation, but when you plot them out on x-y axes, they're distinctly different.

So the kind of chart we make will depend on the kind of data we have. There are three main data types. One is quantitative, one is ordinal, and one is nominal. Quantitative data are numerical data, for example, your age. Ordinal data are categories, but they have some order.
For example, the food you had just now, is it good or bad? So they're not all equal. They have some sort of ordering. Then nominal, or categorical, data are categories without order. So for example, countries, or your job, or your industry.

And then there are also three data types that are a bit special. One is network. Network data comes with nodes and edges. For example, your friendships on Facebook, your connections on LinkedIn, all the actors and actresses who starred in the same films. There's also spatial data, the kind of data that you would want to plot on a map. There's also time series, for example, temperature this month, rainfall this month. So those are also kind of special data.

Also, the usage of color in visualization is never random. There are three major color palettes. One is sequential, often used to show quantitative data: you use the light color to show a small number and the dark color to show a big number. Then there's the diverging palette, which is very suitable for ordinal data, for example something that ranges from bad to good, so you can use different color tones to designate that. And there's also the qualitative color palette. It doesn't have any sort of ordering or importance among the colors, which makes it suitable for categorical data.

So just now you have seen dozens of chart types. Some are very complex, some are relatively simple. Most of the chart types are made of three primitive geometries: lines, dots and areas. And then there are dozens of visual encodings that you can leverage to show different data attributes. For example, you can vary the length of the bar, the color of the dots, the shape of the point, and all of this can help you visualize multi-dimensional data. But one thing to bear in mind is that not all visual encodings are perceived equally. For example, humans perceive length far more accurately than area or even color.

So the kind of chart we use, as we said, depends on the data we have. It also depends on the question we want to pose.
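To make the sequential palette idea concrete, here is a minimal Python sketch of "light color for a small number, dark color for a big number". It is only an illustration: the function name `sequential_color`, the two RGB endpoints and the sample ages are assumptions made for this example, not something shown in the talk.

```python
def sequential_color(value, lo, hi, light=(222, 235, 247), dark=(8, 48, 107)):
    """Map a number in [lo, hi] to an RGB colour between `light` and `dark`."""
    if hi == lo:
        raise ValueError("lo and hi must differ")
    # Position of the value inside the range, clamped to 0..1.
    t = min(1.0, max(0.0, (value - lo) / (hi - lo)))
    # Linear interpolation per channel: small numbers stay light, big numbers go dark.
    return tuple(round(l + (d - l) * t) for l, d in zip(light, dark))


ages = [18, 35, 52, 70]  # made-up quantitative data
colors = [sequential_color(a, min(ages), max(ages)) for a in ages]
for age, rgb in zip(ages, colors):
    print(age, rgb)
```

A diverging palette could be sketched the same way with two such ramps meeting at a neutral midpoint, while a qualitative palette would simply be a list of unrelated colors, one per category.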
So when you look at a big data set, sometimes it has multiple relationships or patterns within it. For example, if you want to show a comparison for discrete or continuous data, you probably want to use a bar chart or line chart. If you want to see correlation, a scatter plot or bubble chart is suitable for this purpose, and you can even fit a regression line there. For those special data types such as networks or hierarchies, you can use some sort of tree structure or network chart to show that.

There's a tool I want to point out. It's called Visual Vocabulary. You can select, for example, that you want to visualize the distribution, and it will give you some recommendations and also some use cases. For example, I select the distribution here, and it's showing us that you can use a histogram, box plot, population pyramid or dot plot. There are all kinds of options depending on your need.

In terms of chart choices, there's often more than one option, and it really depends on your unique perspective and what you want to show. So in this example, the blogger behind FlowingData, which is a visualization blog, visualized the same data set in 25 different ways. This data set is life expectancy by country from year 2000 to 2015. In the chart on the left, he basically just dumps all the countries into a line chart and puts everything in one big spaghetti. In the chart on the right, he employs a technique called small multiples and arranges the countries alphabetically. So in this case, you can easily zoom into a country of interest to see how its life expectancy varies across the years. Both charts have some pros and cons. You may find the one on the right a bit more readable.

So data visualization is often a process of exploring, understanding your needs, and also iterating to find what's best for you to communicate your point and what's best for your stakeholders to drive your point through.
Just now we've seen multiple examples of static charts, so we can move on to some animated charts. One of the data visualization gurus, Stephen Few, said numbers have an important story to tell. So oftentimes it depends on us to make them speak, bring them to life, and give them a convincing voice. Next, I'm going to show you two examples I find interesting, and I'm going to take them apart to see what kind of techniques they use that maybe we can use as well.

So this is a story about deaths caused by U.S. gun violence. Each arc is an individual. They also annotated his name, the age at which he was killed, and the age he could have lived to. And then in the top right and top left corners, they show big aggregate numbers: the total number of people killed and the total number of years stolen from their lives.

So what's wonderful about this? One technique they use is comparison. It's comparing the years these people lived with the years these people could have lived. It also uses a combination of detail and aggregates. You can see the individuals in the arc diagram, and you can also see the big aggregate numbers, so you have both sides of the story. Another thing done very masterfully here is the color. Instead of choosing some random color, he used a high-contrast black and white, which befits the topic. And he also speeds up the process. At first, it is very slow. It lets you slowly get used to the topic and digest what's going on. And then the graphic starts to speed up, which brings you to the animated, humanized story about the catastrophe caused by gun violence.

The next example is relatively simple, but also interesting. This is more in the category of data visualization for fun. The visualizer asked a question here: what if I compare Olympic swimmers with some ocean animals to see who can swim faster? So basically he pooled Olympic swimmers together with some fish and so on.
And instead of just showing a boring bar chart, he actually made it very intuitive. You can actually see who swims faster. And you can also input your own speed. So it allows you to participate in this visualization, which makes it an interactive experience.

So those are the two stories I wanted to share. In your day-to-day life, if you work with data, there could be visualizations you want to make. Or if you are on the consuming end, you can use some of the techniques we mentioned to detect visual lies, or to critique other people's visualizations to see what makes sense for you. Are you getting value out of this visualization? So any questions so far? We can take a short break before going on to Tableau and our demo.

So in this session, we are going to run through some Tableau stuff. You don't have to do it hands-on. We can do hands-on in the last session, because with Tableau there are some licensing restrictions, I guess. In this session, we will work through creating a simple dashboard like this, and you will see it's really easy to do something like that.

As a brief overview, Tableau is an exploratory, analytical, and also presentation tool that enables easy interactivity. Let me briefly walk you through its functions. It has a data pane. You have dimensions, which are like categorical and ordinal variables, and measures, which are like quantitative variables. You have an area here, which is something like a pivot table. You can drag and drop stuff onto the rows and columns. And then, very simple, you have Show Me. Once you have data there, when you click Show Me, it gives you multiple charting options.

The sample data we use is Superstore. It shows purchases and profitability of things like furniture. And so one thing we can do is we can open a sheet.
We can open a view to create a chart like the top half of this, which is profit ratio and sales by region. How we are going to do that is we can search for the field we want to use using this search pane, and then just drag it here. Similarly, you can search for the quantitative field you want to show and just drag it to the columns here. It's really simple. And then you want to add something else, which is profit ratio. Again, drag it here. So it's really fast and simple to do something like this.

For the next half, we want to do a map, which is very easy to do in Tableau. In this one, we are visualizing profit ratio by state and postcode in the US. We drag it here. We can search for the state. It automatically gives you some sort of recommendation, but you can change it using Show Me. Right now, we'll be using the offline map because of the network.

And here is where visual encoding comes into play. All of these are visual encodings. For example, you can drag a field onto Size to vary the size of the bubble. You can drag a field onto Color to vary the color as well. For example, we can drag sales to the size. All right. And then you can generally scale everything bigger or smaller. Seems okay. And then maybe you want to add a little bit of translucency to make it look fuller. And we can drag profit ratio to the color to make the profitable and non-profitable ones show in different colors.

So we have a very high level of granularity with state. Let's see if we can go one step lower. We have region. Adding cities. So by now it is showing sales by city, colored by profitability.

Right now we have made two simple charts in a very short amount of time. And then we can name them. For example, this one is sales versus profit. This one is a map. And then we can put both charts on a dashboard. So we can start a new dashboard here. And again, basically drag it here, drag it.
We can change the size, the width and height, to make it fit nicely into one screen. And then we can add a title. So that is really, really simple, and I'm sure all of us can do this. We can use one chart to filter another chart as well. For example, here we click South and it's highlighting all the states and cities in the south. So that's basically a very simple Tableau walkthrough.

At the end you can save the file. There are several options. If you save it as TWBX, it will package the data within your workbook, so when you share it with other people, they can open it. If you save it only as TWB, it will detach the data and other people may have difficulty opening it. So the rule of thumb is to generally save as TWBX. Any questions on Tableau?

Oh, yes, that's a very good question. So for growth percentage, you can look up table calculations. In order to have growth, you need to have a date. Yeah. Right. You can try it on your own. Basically it's very easy. You put date on the Y axis. And then you can right-click it and use a table calculation. It will give you options like month-on-month or year-on-year growth. And you can do that.

Here I will demo the usage of labels. So this is the field where you can add labels. The difference between a label and a tooltip is that a tooltip only shows when you mouse over, and it pops up, whereas a label is something that is just there. So as I drag this field, which is profit, onto the label button here, the label is showing the profit by region. Similarly I can add in profit ratio here. Also sales. And then you can also format your labels. You can change their color. You can change the unit. And sometimes you can choose to label only the min and max instead of every data point, which makes the chart a bit cleaner.
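The growth percentage from the question above comes down to comparing each period with the one before it. Here is a minimal Python sketch of that arithmetic; the function name `growth_percent` and the yearly sales figures are made up for illustration, and this is not a claim about how Tableau implements its table calculations internally.

```python
def growth_percent(values):
    """Percent change of each value relative to the one before it."""
    growth = [None]  # the first period has nothing to compare against
    for previous, current in zip(values, values[1:]):
        if previous == 0:
            growth.append(None)  # growth from zero is undefined
        else:
            growth.append((current - previous) / previous * 100)
    return growth


yearly_sales = {2014: 100.0, 2015: 120.0, 2016: 90.0}  # made-up numbers
years = sorted(yearly_sales)  # growth only makes sense in date order
growth = growth_percent([yearly_sales[y] for y in years])
for year, pct in zip(years, growth):
    print(year, pct)
```

This is also why a date field is needed: without an ordering of the periods there is no "previous" value to compare against.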
And suppose you have a map with tons of data points: you can select "allow labels to overlap other marks", or unselect it to automatically hide some labels and avoid cluttering the chart.

Sure. In this case, let's say, this is the button to duplicate. So in this case, let's do a stacked bar on only one measure, which makes sense. If I put state as the stack, there seem to be too many states, but you get the idea. Basically when you put some field onto the color, it will slice by that field. I think it's because some states have negative profit, which means this is not a great example. Let's say instead of profit, let's just show the count. So in this case, it will start from zero, and it's basically the number of sales or whatever, the number of records in this state, in this region.

So suppose, like right now, we are using the aggregations, sum and so on, that are available. If you want to do something which is not available here, do we have to do it in the data itself and create a column there?

In terms of non-aggregated data, I can show you an example. Right now it's showing profit by region, which is aggregating to sum, but you mentioned you don't want to aggregate. So you go to Analysis and deselect Aggregate Measures. The measure is then going to show you the distribution, and you can do things like a box plot.

But can we do any complex calculation, not aggregation basically, which I want to do on a certain basis, like if this row is this, then add it to...

Right. So actually the profit ratio is a calculation. To differentiate a calculation from a normal field existing in the data source, you look for the equals sign in front of it. So in this case you see the equals sign, you know it's calculated, you click edit, and you can see its formula. In this case profit ratio is profit divided by sales. Sorry. So you can right-click a field and click edit.
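As a rough illustration of the profit ratio calculation described above, the following Python sketch divides total profit by total sales per region. It assumes the ratio is taken on the aggregated sums rather than row by row; the rows and the helper name `profit_ratio_by_region` are made up for this example.

```python
from collections import defaultdict

rows = [  # (region, sales, profit): made-up numbers
    ("South", 100.0, 10.0),
    ("South", 300.0, -30.0),
    ("West", 200.0, 50.0),
]


def profit_ratio_by_region(rows):
    """Sum sales and profit per region, then divide profit by sales."""
    totals = defaultdict(lambda: [0.0, 0.0])  # region -> [sales, profit]
    for region, sales, profit in rows:
        totals[region][0] += sales
        totals[region][1] += profit
    # Assumes every region has non-zero total sales.
    return {region: profit / sales for region, (sales, profit) in totals.items()}


ratios = profit_ratio_by_region(rows)
print(ratios)
```

Dividing the sums is not the same as averaging the per-row ratios: in the South rows above, one order makes 10% and the other loses 10%, but the larger loss-making order dominates the regional total.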
So how do you create one?

Oh, it's just as easy. For example, we want to disregard this. Let's say we want to create a calculated field, which is double the profit. So you've just created a very simple calculated field.

Can you use conditions there also?

You mean...

Can you use different conditions? You actually need to aggregate it or make it zero according to the conditions?

Yes, you can say, let's say if profit is less than zero, show one. If it's more than zero, show minus one. You can definitely do things like that.

Yeah. That's a really great question. You can actually combine different tables and you can do joins. If those two tables have the same field, you can join them together. Yes, and then you can put in all the fields from table one and table two.

I think in a business setting, in most cases Tableau works really well, unless you want to create something very customized like the U.S. gun deaths chart, which is more advanced than this. But for the basic chart types that people use in business or even in academic fields, it works: treemap, bar chart, and line chart are the most common chart types I've seen in a business setting. So I guess it's okay in that area. But there are multiple tools on the market, like QlikView and MicroStrategy, and each has its own pros and cons. It really depends on the use case.

And then next month, my design doesn't change, but I would have some new data.

So you can do a simple refresh. It depends on your connection. If you are connecting to Excel and your Excel file got updated, you probably just need to right-click and refresh, which is really fast as well. If you are connecting to a live database, then chances are you don't need to do anything because it will refresh itself. Yeah, you can also schedule it to refresh daily. Yes. And so let me show you this. Basically, you can connect to different kinds of data sources. Excel is one of the options. And Excel is kind of static.
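The two calculated fields mentioned above, double the profit and a condition on the sign of profit, can be written out as small functions. This is a plain Python sketch of the logic only, not Tableau's formula syntax; the function names are made up, and returning 0 for a profit of exactly zero is an assumption, since the talk does not say what happens there.

```python
def double_profit(profit):
    """The simple calculated field from the demo: twice the profit."""
    return profit * 2


def loss_flag(profit):
    """1 when profit is below zero, -1 when it is above, as in the spoken example."""
    if profit < 0:
        return 1
    if profit > 0:
        return -1
    return 0  # assumption: the talk does not cover a profit of exactly zero


profits = [-20.0, 0.0, 50.0]  # made-up row-level profits
table = [(p, double_profit(p), loss_flag(p)) for p in profits]
for row in table:
    print(row)
```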
Excel, CSV, text files, those are kind of static data sources. You can also connect to a server, which is more like a dynamic data source because it will refresh itself. You can also connect to things like Google BigQuery, all those more advanced data sources as well.

Can you show us how the tables are related, and where you actually relate or join the data?

This one, let me see. For this one I don't really have an example, but you can try with two data sources with a shared field. For example, name and age, and name and job title. When you input these two data sources, it will show two circles here with an overlapping area. And you can select that to confirm that you want to join these two sources, and it will pull in both data tables.

Okay, so the question is, when you build something in Tableau, does the management or your consumer need to install the software as well? I think this is more of a question to be answered by Tableau themselves, but I can touch on that. Basically, you can publish your workbook to Tableau Server, and then it's basically web-based. So anyone who has the login to that website will be able to access it without installing software. It's not free, which is why in the next session we will talk about R, which is open source and free and great. There's Tableau Professional, and actually there is such a thing, now they have Tableau Public, which is the unpaid version, where you can only publish to Tableau Public.

So can the dashboard be exported and used in a PowerPoint?

So the question is, can a dashboard be exported to PowerPoint? It can be exported as a PDF, for sure. I do recall it can be exported as a PDF, and maybe you can screenshot it and put it in a PowerPoint.

But then the interactive working tool is a different thing.

Oh yeah, that's true. Yeah, that's right. Yeah, unless you make a GIF of it and then plug it into Google Slides or something.
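The join described above, two sources sharing a name field, with the overlapping part of the two circles kept, corresponds to an inner join. Below is a minimal Python sketch of that idea; the records and the helper name `inner_join` are invented for illustration and say nothing about how Tableau performs the join internally.

```python
people = [  # first source: name and age (made-up records)
    {"name": "Ann", "age": 31},
    {"name": "Ben", "age": 45},
    {"name": "Cy", "age": 28},
]
jobs = [  # second source: name and job title (made-up records)
    {"name": "Ann", "job_title": "Analyst"},
    {"name": "Ben", "job_title": "Engineer"},
    {"name": "Dee", "job_title": "Designer"},
]


def inner_join(left, right, key):
    """Keep only rows whose key appears in both sources, merging their fields."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    joined = []
    for row in left:
        for match in index.get(row[key], []):
            joined.append({**row, **match})
    return joined


joined = inner_join(people, jobs, "name")
for row in joined:
    print(row)
```

Cy and Dee drop out because each appears in only one source; a left or right join would keep one of them with the missing fields left empty.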
We want to show our visualization to the consumers. So my concern is, I'm working in a public [unclear], and most of the time the data is confidential. For example, if I build something but want to keep my original data confidential, then perhaps the only thing I can do is aggregate the data. Is it possible? Because there are two ways to save, right? One saves the graph with the data. The other one is, can you save just the graph or something?

That's a good question. So if your data is confidential, can you prevent people from seeing the data? I think it's a possibility. One thing is you can restrict who can access your dashboard, so you give access only to a limited number of people. Another thing is you can prevent people from downloading the workbook, because once people can download the workbook, they can actually unpack the data. So if you prevent them from downloading the workbook, you can basically give them only view access. Then you prevent them from seeing the underlying data.

Yeah. So this is only when you want to pass the workbook around. Most often you just publish it to the server, which is easier to share. Because when you pass around the workbook, people actually need to install Tableau Reader, which is a hassle for them. But if you do need to share the workbook, I recommend TWBX, which packages the data. Otherwise, sometimes they cannot open it. Yeah, it doesn't work well. Any further questions? Okay.

And so, I saw some people have brought laptops. Do we want to do the R demo? Anyone want to do the hands-on session in R? Okay. Let me know if you have R open, with the notebook installed and everything. Okay. Let's get started.

So the demo we are doing now is based on one of the built-in data sets, called diamonds. It basically contains the four Cs or five Cs of diamonds. In order to view the data set, first you need to load it. So you can run this snippet.
To run it, you can click the little green button here, and basically it's running. What it did is it loaded the data set. So in this area, you can now see your data set. And when you click on it, a window pops up which gives you a view of what's in there. Similar to clicking on the data set, you can also run head on the data set, which basically shows you the first six rows of the data.

So is it always six? Because I've seen head always show six.

Yeah, for head and tail the default is six, but you can pass something like 10, and then it will show you 10. Another very nifty function is summary, which shows you summary statistics: the mean, max and quartiles of each field.

So this is where things get interesting. We are using the ggplot library. Basically the gg in ggplot stands for grammar of graphics. It's the most used graphics library in R, and one of the most used libraries in R as well. It has a fixed set of syntax, and you can use it to do very dynamic visualization.

So here we can build a histogram. This is a histogram of price. Basically, the encoding you pass through is that you want to visualize the price variable. And for your histogram, you want to have a bin width of $200, I guess. You can vary the bin width from 200 to 500. And then you can actually color it by clarity. So now it gives you a stacked column.

You loaded the diamonds data set, so when you run ggplot, you provide the data set you loaded. So you loaded two things in this?

Yes, that's correct. It's the same. ggplot2 is the library's name. The syntax is actually calling ggplot. So it's the same. No, there's no version one and version two. There's one version, ggplot2.

So instead of viewing it as a stacked column, you can turn it into small multiples. If you're still following, what happens here is I didn't pass clarity as a color variable. I use clarity to break the chart out into multiple rows. This function is called facet grid.
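To see what changing the bin width from 200 to 500 does to a histogram, here is a small Python sketch that only counts values per bin. It is not ggplot2 code: the prices are made up, the helper name `histogram` is hypothetical, and the bins are assumed to start at zero and include their left edge, whereas ggplot2's `geom_histogram` may align its bins differently.

```python
from collections import Counter


def histogram(values, binwidth):
    """Count values per bin; each bin covers [start, start + binwidth)."""
    counts = Counter((v // binwidth) * binwidth for v in values)
    return dict(sorted(counts.items()))


prices = [326, 340, 554, 2757, 2760, 2801, 4999, 5000]  # made-up prices
narrow = histogram(prices, 200)  # more, thinner bars
wide = histogram(prices, 500)    # fewer, fatter bars
print(narrow)
print(wide)
```

With the wider bins, the prices 2757, 2760 and 2801 fall into a single bar, while the narrower bins split them in two. The same data can therefore look smoother or spikier depending on the bin width chosen.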
Basically if you put the variable on the left, it will show as rows. If you put it on the right, it will show as columns. Not clear at all. Let's stick to rows. Any problems so far? If you don't follow, you can just watch along and play with it at home.

Next, instead of a histogram, we can do points, a scatter plot. The default color is black. And what you pass into this graphic is that you're saying x equals carat and y equals price. So it's a price versus carat scatter plot. And then we can add in some colors. Again, we color it by clarity. So this is what it gives you. It also comes with a little legend to help you identify which color means which clarity.

So this one, the I1, right? Is there a way to change the order?

This one? 73? 953. This one? Yes. You can switch the order of the clarity by re-leveling the factor, which is a little complex to do here. But if you Google re-leveling a factor, it will show you how to put some sequence into this clarity field. And after that, when you pass in the clarity, it will show the order you want. I think so, yeah.

And we can try a little bit of coding ourselves. Instead of coloring by clarity, we can try coloring by cut. Cut is also really distributed everywhere. Just now when we saw the clarity, there was a lot of overlapping. One way to deal with that is we can introduce alpha, which is transparency. Or we can make the size a bit smaller. So in this case, we make the dots smaller, which is slightly clearer in some sense.

Would you like to change the color? The colors which are fading out like before?

Yeah, the color is this default rainbow palette, but you can either use your own palette, or there are multiple libraries of palettes. One is called viridis, which is supposedly scientifically designed so that people perceive the colors as an even sequence.
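One way to see why transparency helps with overlapping points is to work out how opaque a stack of semi-transparent dots becomes. The Python sketch below assumes ordinary "source over" alpha blending, where each new dot covers a fraction alpha of whatever is still showing through; the function name `stacked_opacity` and the numbers are illustrative only.

```python
def stacked_opacity(alpha, n_points):
    """Combined opacity of n overlapping marks, each drawn with the same alpha."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    # Each mark lets (1 - alpha) of the background through; n marks multiply up.
    return 1.0 - (1.0 - alpha) ** n_points


for n in (1, 10, 100):
    print(n, round(stacked_opacity(0.05, n), 3))
```

A single dot at a low alpha is faint, but where many dots pile up the stack approaches full opacity, so dense regions of the scatter plot stay dark while sparse regions fade.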
Some people say rainbow is not great because people don't perceive each color as equal, something like that. And then you can change the size by cut. Basically, we're exploring. It's pretty cluttered here. We can also add in transparency.

So we have this scatter plot. One thing we can do is fit a linear regression line. So you're adding a line on a black background, because we didn't pass in any color or anything like that.

Sorry, the plot is more of the regression, right?

Yeah.

What is alpha?

So alpha is here, 997. So what is alpha? Alpha is the transparency. When alpha is zero, it's like almost no color. When alpha is one, it's the original color. We can try a really low value. It will be very, very faint. Yeah, because there are so many data points here, it's still showing very strongly.

And then instead of a straight line, we can show a smooth line, basically by passing geom_smooth. In this case, we fit one smooth line to the entire data set, but maybe we want to fit multiple smooth curves, maybe to multiple cuts. In this example, we can cut by clarity. So in this case it is fitting one line to each clarity. And the color is also aligned with the clarity legend.

SE is standard error. If you don't set se equal to FALSE, it will show a band for the confidence interval. So this is what happens if you don't say se equal to FALSE, because the default is se equal to TRUE. It will show a band. The band should be the confidence interval. If I'm not wrong, it should be 95%. You can check that out further.

So here we fit multiple lines. We can do something more adventurous. We can color the points, but only fit one line. So we can have both. In this case, we colored the points, but we didn't color the smooth curve. What's happening here is basically we're passing group equal to one to tell ggplot to show only one regression or smooth line.
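The difference between fitting one line to everything and fitting one line per clarity can be shown with a few lines of arithmetic. This Python sketch uses ordinary least squares on a straight line as a stand-in; it is not ggplot2's smoothing method, and the carat, price and clarity values, as well as the helper name `fit_line`, are made up for illustration.

```python
from collections import defaultdict


def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


rows = [  # (carat, price, clarity): made-up numbers
    (0.5, 1000.0, "SI1"), (1.0, 2000.0, "SI1"), (1.5, 3000.0, "SI1"),
    (0.5, 2000.0, "VS1"), (1.0, 4000.0, "VS1"), (1.5, 6000.0, "VS1"),
]

# One line for the whole data set, like passing group equal to one.
overall = fit_line([(x, y) for x, y, _ in rows])

# One line per clarity, like letting the colour variable define the groups.
by_group = defaultdict(list)
for x, y, clarity in rows:
    by_group[clarity].append((x, y))
per_clarity = {clarity: fit_line(pts) for clarity, pts in by_group.items()}

print(overall)
print(per_clarity)
```

In this toy data each clarity has its own slope, and the single overall line lands between them, which is the trade-off between the grouped and ungrouped versions of the chart.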
Okay, so compare this graphic, where we don't have group equal to one, with this one, where we have group equal to one. When we don't group it and we are coloring by clarity, basically it will fit a smooth line to each clarity bucket.

That's right.

When group equals one, it ignores the color you put in. It basically fits a smooth curve to the entire data set.

Just now we showed facet grid, which goes either by row or by column. Here we can show facet wrap, which gives you small multiples, X by Y. We can have multiple rows and multiple columns. And here we are seeing this gray background everywhere. Maybe we want to apply a more minimalistic theme. So one thing we added is this theme_minimal. It will show a white background with a faint gray grid. So that's the section on scatter plots.

Next we can do a box plot. The difference is that instead of passing geom_point, where geom is kind of the geometric object, we're passing geom_boxplot. So it's very simple and it looks decent. It shows you the distribution of price by color. One variation of the box plot is the violin plot, which looks a bit funky. It looks like this. And the same as what we did with the scatter plot, we can also do small multiples of the box plot. Basically here we are dividing it by cut. So each small chart is about a particular cut, whether it's a Very Good cut or a Premium cut.

So we're almost at the end. Just as a brief review, we did a histogram, we did a box plot, we did a scatter plot. We also turned them into small multiples, either by row, by column, or by both. And we're just missing a little something. We can add a title. So in the last example, we will add a ggtitle, and you can basically add any title you want. So that's basically it. It's not super difficult.
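For readers who want to know what a box plot is actually drawing, here is a Python sketch of the usual summary behind it: the quartiles, whiskers that reach to the furthest points within 1.5 times the interquartile range, and anything beyond that marked as an outlier. The helper name `box_stats` and the data are made up, and the quartile method used here is an assumption, so the exact quartile values may differ slightly from what ggplot2 computes.

```python
import statistics


def box_stats(values):
    """Quartiles, whisker ends and outliers using the 1.5 * IQR rule."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],    # smallest value inside the fences
        "whisker_high": inside[-1],  # largest value inside the fences
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }


stats = box_stats([1, 2, 3, 4, 5, 6, 7, 8, 30])  # made-up values
print(stats)
```

The value 30 sits far beyond the upper fence, so it would be drawn as a separate dot rather than stretching the whisker.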
You just need to understand where to put each piece of syntax and what to put for each visual encoding, whether it's the x variable, the y variable, the color, size, alpha (transparency), or the small multiples you want to split the graphics by.

Which row is it? 143.

Okay, so 0.8 is showing a legend for the alpha, but maybe you don't want to show it. You can actually hide this. Yeah, it's showing a legend for transparency as well, but likely you don't want to show that. Right? So it's a standard box plot. It's showing the upper quartile and lower quartile. Yeah.

When you try to use the help to just see what the syntax of the functions is, it's quite difficult to get examples. So sometimes it's too tricky to know what exactly they are trying to say. For example, now I understand why alpha is used, but building something like that is quite difficult.

Right.

So are there any files or places where we can start with these examples?

Yeah. So there's a built-in help function here. Also, with Google, you will be able to find a lot of R graphics examples. There is also a website called r-graph-gallery, which basically gives you some fairly common chart types with the code. Yeah, there's a lot of documentation on charts. There are many books on it as well. So I'm pretty sure you will be able to find it. All right. So that's it for the demo session. Any further questions?

Sure. So one thing you can do is save it. You can do something like png with the chart name, chart name.png, something like this. And then afterwards you call dev.off to shut off the graphics device. But an easier way is to just run it in the console instead of the notebook, and then it will give you an export button. I think the notebook is more interactive. It lets you code and see the results immediately at the same time. The console is like a normal programming language.
You enter something into the console and a chart pops up in a different chart window here. If you need to, you can export multiple charts as well. You can apply some function or even do a loop to export a series of charts. Yeah.

So the question is how to get the console. Are you on RStudio? In RStudio you should have the console as one of your panes. To run the code in the console is very simple. You just copy it, and you should have a console area in some part of your screen. It depends on your own layout. For me it's on the bottom left. So just copy it, paste it, and then press enter. It will show. Yeah. Are there any further questions?

Cool. So really well done, everyone, for bringing your laptops and participating in this session. I hope you like visualization and enjoyed the introduction to the basics of it. So that's it for this session. You can also fill in our feedback form; the shortened URL is www.called.workshop. If you like, it's a feedback form you can fill in. Oh, so the feedback form is www.called.workshop. You can fill it in to help us improve, maybe help me improve, and tell us what kind of topics you like. Go young. Thank you.