Data visualizations are often used as a means to communicate complex information to the public in ways that cannot easily be done with words. So what's involved in creating them? Today we'll look at the different steps and operations that are typically used. Hi, I'm Chris Davis, and I'm an assistant professor at the University of Groningen. In this video, we'll look at the steps that are used for data visualization and analysis. More specifically, after this lecture, you should be able to describe what these steps are, and you should have an understanding of how you could apply them in your own work with data.

When you do data visualization and analysis at a very basic level, you have to get the data, process it, and create the visualization itself. There are often many more steps than this, and a good overview of what typically has to be done is provided by Ben Fry in his PhD thesis on computational information design. In it, he talks about the steps of acquire, parse, filter, mine, represent, refine, and interact. The link to the original PhD thesis can be found on the platform. We'll go into more detail about these steps in the next few slides.

While you can think of these steps as part of a linear process, where once you're done with one step you move on to the next, you could also conceive of a situation where each of these steps loops back and changes the output of the previous steps. For example, while people are interacting with visualizations, they may filter out a different subset of the data, or select options that alter how the visualization represents the data. People may even want to load in new data sets, given something that they discovered using these later steps.

This whole process of creating visualizations is a mix of science and art. Fry argues that this whole chain of steps draws on skills from different backgrounds. As you progress through the different steps, you're using knowledge from fields such as computer science, mathematics, statistics, data mining, graphic design, human-computer interaction, and information visualization.

To start, you of course need some way of acquiring the data. When acquiring data, it's generally good to automate the process as much as possible, especially if you expect the data to be updated in the future. While you can just go to a website yourself and download the data, you can also set up a program to download the data directly from a URL, which allows you to easily repeat the process in the future with new data.

For the next step, parse, there needs to be some way to read the data so that the computer can understand it. There are actually several components to this: first, the format of the data; second, what the data means; and third, what the data really should mean but currently doesn't, due to errors. Regarding the format of the data, there are different file formats that the data may be written in. A key thing to remember is that in whatever programming language you're using, there should be some library or piece of software already written to help parse data from common file formats. The next thing to be aware of is what the data actually means. Specifically, there are numerous data types that represent things such as text, dates, numbers, and geographic coordinates.
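To make the acquire and parse steps concrete, here is a minimal sketch in R. The URL, file name, and data set are hypothetical placeholders for illustration, not something from the lecture:

```r
# Acquire: download the data directly from a URL, so the whole
# process can simply be re-run whenever the data is updated.
# (This URL is a hypothetical placeholder.)
url <- "https://example.org/data/energy_use.csv"
download.file(url, destfile = "energy_use.csv")

# Parse: base R already knows how to read the common CSV format;
# most languages have similar libraries for common file formats.
energy <- read.csv("energy_use.csv", stringsAsFactors = FALSE)

# Inspect which data types (text, dates, numbers, ...) were inferred.
str(energy)
```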
One thing that you'll likely experience is that parsing dates will sometimes cause problems. For example, a date that looks like this, 8.1.2016, may actually mean 1.8.2016, depending on whether you expect the day or the month to appear first in the date. Dates such as August 2016 may also be ambiguous if the computer expects a day of the month to be specified as well.

Even if the data can be correctly interpreted, there may still be issues with incorrect values in the data, such as inconsistent units of measurement being used. For example, you may find data that is listed in units of millions of euros, but also in euros. It may be easy to spot the values in euros, since they are so much larger than the rest of the reported values. One way to deal with this would be to set up your program so that it spots these issues and automatically fixes the data. If there are a lot of diverse types of issues in the data, you may want to use a free, open-source tool such as OpenRefine. The developers of OpenRefine describe it as a powerful tool for working with messy data: cleaning it, transforming it from one format into another, and extending it with web services and external data. In practice, this is a very useful tool for dealing with data that needs a lot of cleanup in order to be useful for later visualizations. On the platform, you will find a link to the OpenRefine project page and a tutorial that demonstrates the different operations that you can do with it.

The next step is filtering, which is relevant if you need to extract some subset of the data instead of using all of it. The key question to think about here is which attributes of the data need to be filtered on. For example, are you looking at categories or classes, numeric values, ranges of dates, geographical areas, or a combination of multiple features?

For the mine stage, filtering may not be enough, and we may have to use various statistical techniques in order to find patterns of interest. We may be interested in highlighting outliers, showing what average values are, or identifying what appear to be clusters in the data. What is shown here is part of a reference sheet for the dplyr library for the R programming language; we'll discuss this in more detail in a later lecture. As you can see, it provides many functions by which you can summarize data using statistical functions. It also allows you to join datasets together based on matching variables, among many other features. A short sketch of what this looks like in code follows below.

For the represent stage, it's a question of how you would like to visualize the data using different techniques. For this, we show a reference sheet for the ggplot2 library for R. This highlights several techniques that can be used for plotting one variable or two variables. Just from this, you can see different examples, such as histograms, bar charts, scatter plots, stacked area charts, density charts, and so forth.

For the refine step, the question is how to change the visualization to highlight things of interest. This step can also relate back to the output of the previous steps. For example, when selecting a subset of the data, you may want to make the rest of the data more transparent in order to de-emphasize it. You may also want to create objects that become larger when you place your mouse over them. A sketch of the represent and refine steps together also follows below.
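Looping back to the filter and mine steps, here is a minimal dplyr sketch using R's built-in mtcars data set. The particular subset, grouping, and summary chosen here are illustrative assumptions, not from the lecture:

```r
library(dplyr)

# Filter: extract a subset of the data based on an attribute,
# here only cars with automatic transmission (am == 0).
# Mine: group and summarize to show averages and spot patterns.
mtcars %>%
  filter(am == 0) %>%
  group_by(cyl) %>%
  summarise(
    mean_mpg = mean(mpg),   # average fuel economy per group
    n_cars   = n()          # how many cars fall in each group
  )
```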
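And for the represent and refine steps, a hedged ggplot2 sketch: it represents the data as a scatter plot, then refines it by making the unselected points more transparent to de-emphasize them. The choice of variables and of the selected subset are assumptions for illustration:

```r
library(ggplot2)

# Represent: a scatter plot of weight versus fuel economy.
# Refine: make everything except the selected subset (8-cylinder
# cars, an arbitrary example) more transparent to de-emphasize it.
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(data = subset(mtcars, cyl != 8), alpha = 0.2) +
  geom_point(data = subset(mtcars, cyl == 8), colour = "red") +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon")
```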
The final step, interact, applies if you're doing dynamic visualizations instead of static ones. There are many free, open-source software options that can help you create these. If you're used to programming in JavaScript, you can use libraries like d3.js. For R programmers, there's the Shiny package, which allows you to write your data processing code in R and have it interact with a dynamic web page; a minimal sketch of this appears at the end of this section.

The interaction with the data should allow the user to explore some complex issue in a way that helps them to understand it better. A good example of this is a visualization that gives people options for balancing the budget of the U.S. national government. The links to this can be found on the platform. This is a very complex issue, where the amount of media coverage a budget item receives doesn't necessarily correspond to its overall size. This visualization gives people more insight into the complex trade-offs that have to be made, and helps them to understand some of the difficulties that are involved.

All of the steps that have been mentioned are part of the storytelling process, which is, to a large extent, what data visualization and analysis is about. There are several very famous examples of this. One relates to a cholera epidemic in London around the 1850s. At that time, people thought that cholera was caused by bad air. A physician named John Snow decided to actually plot the homes of the cholera victims and the locations of the water pumps. In doing so, he was able to locate the water pump that the disease was spreading from. Another famous example was done by Florence Nightingale, who created a visualization showing that during the Crimean War, surprisingly, most soldiers were not dying of wounds from the battlefield, but rather from preventable diseases. Through this work, she was able to argue that improving sanitary conditions could go a long way toward saving lives.

As we have seen, there are many diverse steps involved in performing data analysis and creating data visualizations. Furthermore, each of these steps requires different types of skills. Ultimately, this is about storytelling, and about using techniques that allow us to communicate complex data to the public. In a later lecture that I will give, you will be able to see actual examples of how this works, and you can see this directly through the link in the footnotes on this slide.
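Finally, here is the promised minimal sketch of the interact step using Shiny. It again uses R's built-in mtcars data; the particular control and plot are illustrative assumptions:

```r
library(shiny)

# A tiny interactive visualization: the user filters the data by
# number of cylinders, and the plot updates dynamically.
ui <- fluidPage(
  selectInput("cyl", "Number of cylinders:",
              choices = sort(unique(mtcars$cyl))),
  plotOutput("scatter")
)

server <- function(input, output) {
  output$scatter <- renderPlot({
    # The filter step here is driven by the user's selection.
    chosen <- subset(mtcars, cyl == input$cyl)
    plot(chosen$wt, chosen$mpg,
         xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
  })
}

shinyApp(ui = ui, server = server)
```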