The Data Analysis Workflow. I'm Professor Vernon Gale. I'm Professor of Sociology and Social Statistics at the University of Edinburgh and part of the National Centre for Research Methods. At the current time, the National Centre for Research Methods is unable to provide any face-to-face teaching. I hope that you and your families are all healthy during this difficult period. This presentation and its associated resources are all about the Data Analysis Workflow.

A thought experiment. Be honest. Have you ever lost a file? Have you ever wondered whether you've deleted a file? Have you and a colleague ever been working on different versions of a file? Have you ever struggled to identify data files, for example a file called chapter1-2019.dat and a second, almost identically named, file? If you've experienced any of the above, you could improve your social research workflow. The examples in this presentation lean towards statistical analyses of social science data, but many of the issues around the workflow and being well organised permeate a whole range of research processes, so it's worth continuing to listen in.

Here's me one Saturday morning. I look fairly frazzled. I'm working fairly hard at this stage, but if I'd improved my workflow, it could be a different picture. Here's me on an alternative Saturday morning. This is the West Coast of Scotland, believe it or not, at a time before coronavirus when we could still go outside.

The workflow. What is it? Well, for me, it's the whole process from conceiving of an idea right through to its publication. Commonly, you'll download some data; this is the UK Data Archive at Essex University. You'll be using a computer, either your laptop or a machine on your university network, and you'll probably be using statistical software or a programming language, something like SPSS, Stata, R or Python. The first stage will be data wrangling. The next stage will probably be something like exploratory data analysis, followed by something more formal and statistical, such as statistical modelling. Then these results will be written up and submitted, for example, to an academic journal, and after peer review, hopefully they'll be published. For many of you the process will look much the same, but the output might be something like an MSc dissertation or your PhD thesis.

Analysing data without a planned and organised workflow can be compared to drinking and driving: in both situations, it doesn't matter how careful you are, it's still highly likely to end in a wreck. These are the wise words of Professor Philip Stark, UC Berkeley. So, just as with drinking and driving, we strongly warn against working without a systematic workflow.

The workflow should be planned and carefully orchestrated. It must not be ad hoc: it mustn't be worked on piecemeal and it mustn't be developed in reaction to mistakes. The workflow cycle is relatively straightforward to conceptualise: plan, organise, compute and then document. After gaining access to your data, having downloaded it (this is a picture of the UK Data Archive at Essex University), you'll probably have it on a laptop or on a machine on your university network, you'll tend to be using statistical software such as SPSS or Stata or a programming language such as R or Python, and you'll begin your data wrangling.

The first piece of advice is: don't use drop-down menus. I'll say it twice. Don't use drop-down menus. If you use drop-down menus, you'll have no audit trail. The audit trail is nothing more than a line of breadcrumbs that leads you back to where you started, and you won't have one if you use drop-down menus. Write out your data wrangling commands in syntax. Using GUIs, graphical user interfaces, will leave you in a sticky mess.
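To make that concrete, here is a minimal sketch in R of what writing the commands out looks like in practice. The file and variable names are purely illustrative; the point is that the saved, commented script is the audit trail, whereas a menu click leaves no trace of the decision.

```r
# wrangle_chapter1.R -- every step is written out, so it can be re-run and audited

# read the raw data (illustrative file name)
bhps <- read.csv("raw_data/bhps_a_indresp.csv")

# recode sex from 1/2 into a labelled factor; a menu click doing the same
# thing would leave no record of the coding decision
bhps$sex <- factor(bhps$sex, levels = c(1, 2), labels = c("male", "female"))

# save the analytical file; the script itself documents how it was made
write.csv(bhps, "clean_data/bhps_a_indresp_clean.csv", row.names = FALSE)
```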
Data wrangling. What does the data wrangling phase involve? It involves a number of steps. Commonly, selecting variables: surveys tend to have a large number of variables, but we'll only want a small number of them for our specific analyses. We'll need to operationalise measures: the data may have information on income, which may be net income, gross income or household income, and we need to operationalise the measure given our research question. We'll need to recode variables to get them into the specific format we need for the analysis we're going to undertake. We'll often need to select cases: surveys tend to be large, and we may only be interested in a subset of cases, for example just married couples, or just households with people from certain ethnic minority backgrounds, and so on. And inevitably, any dataset will have missing data or missing information, so we'll need to think hard and work out how we're going to deal with it.

The question is: can these actions be traced in my audit trail? Every action that transforms the raw data into the analytical dataset should be traceable within the audit trail. Minor actions in the data wrangling phase can have major consequences later on.

After we've transformed our raw data into an analytical dataset in the data wrangling phase, we tend to go forward and do some exploratory data analysis, after which we move on to something more formal, usually statistical modelling. But it's not a straight linear process. We quite often have to loop back and do some more exploratory data analysis, and sometimes we have to go further back and do some more data wrangling as well. Then we end up with a set of results that we write up and ultimately submit for peer review.

In the data analysis phase there are many operations that we routinely undertake that are very easy to overlook. For example: which cases do we choose in an analysis, and what format are the variables in? How do we treat missing data? Which estimation method do we use? We may be estimating a model and have to decide between maximum likelihood estimation and generalised least squares, for example. We may have to think about which weights suit our analysis, or how we're going to represent the structure of the survey. We may be doing something exotic like bootstrapping and need to think about setting a seed for generating random numbers. We may be fitting a random effects model and trying to decide the number of quadrature points: do we use the default number, or do we need a different number? When working with software such as R or Python, we may have to think about which library to use for a certain procedure, or indeed which version of a software package we're using. Many of these operations are overlooked, but they're vital parts of the data analysis phase. The question, again, is: can these actions be traced in my audit trail? Once again, every single one of these actions should be traceable in the audit trail.
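As a sketch of how these wrangling and analysis decisions can be kept traceable, here is an illustrative R fragment. The variable names, cut-offs and file names are invented for the example; the point is that each decision, the variables kept, the recodes, the case selection, the missing-data treatment, the seed and the software versions, sits in the script where it can be audited later.

```r
# --- data wrangling: every decision is a line in the script ---

raw <- read.csv("raw_data/survey_wave_a.csv")            # illustrative file name

# select only the variables needed for this analysis
dat <- raw[, c("pid", "sex", "age", "mastat", "fihhmn")]

# operationalise income: use monthly household income, logged
dat$log_hh_income <- log(dat$fihhmn)

# recode marital status into a binary 'married' indicator
dat$married <- ifelse(dat$mastat == 1, 1, 0)

# select cases: working-age respondents only
dat <- dat[dat$age >= 16 & dat$age <= 64, ]

# missing data: for this analysis, drop cases with missing income
# (a decision that should be visible, and revisitable, in the audit trail)
dat <- dat[!is.na(dat$fihhmn), ]

# --- analysis-phase decisions that are easy to overlook ---

set.seed(20140506)   # seed for anything involving random numbers, e.g. bootstrapping

# record the software and package versions the results depend on
writeLines(capture.output(sessionInfo()), "logs/session_info.txt")
```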
Improving the workflow. The workflow should better support you and what you do, not change you into something you're not. What you really want is to improve your workflow so that it supports you and how you work, rather than implementing a very regimented, almost Stalinist, view of how you should work, because if you do that, you won't be able to stick to it. You need a workflow that better supports you and what you do. Remember: all serious work must be reproducible. There must be an audit trail. I'll return to this point again, but I'll say it now: all serious work must be reproducible. There must be an audit trail.

A planned workflow has a number of benefits. I sometimes call these the four pillars of wisdom: accuracy, programming efficiency, transparency and reproducibility. What do they mean? The first is accuracy: minimising information loss and errors in analyses and outputs. Programming efficiency means automation, maximising the use of features in the software. For example, rather than doing something ten times for ten waves of data, write a loop that loops over the ten waves, so you're using the software or the programming language to work more efficiently. Transparency means being able to show what you did, why you did it, when you did it and how you did it. And this leads on to reproducibility: can you get the same result every time, whoever is doing it and wherever it is done? Do you get the same result if you run the analysis on your laptop or on a university network machine? The work should be reproducible. This helps especially when editing, a point I'll come back to: when you're rewriting parts of your thesis and having to undertake new analyses, for example, or when you've submitted a paper, the evil referees' comments have come back, and you have to return to the work and do more or change things. Reproducibility is a natural benefit there.

I'll tell you Long's law now; this is J. Scott Long from Indiana. It's always easier to document today than it is tomorrow. Corollary number one of Long's law: nobody likes to write documentation. Corollary number two: nobody ever regrets having written documentation. And the final thing Long asks: has anyone in the history of data analysis ever said, these files are too well documented?

Keep calm and write comments. I'm going to get you to visualise for a moment. It's about 4.15 on a Friday. You're working on something and you need to go relatively soon; in fact, members of your department are meeting up for a drink after work. What do you do? It's very tempting to just save your file and think, I'll put some comments in first thing on Monday morning. That's a good intention, but we can all guess what's going to happen when you come in on Monday and there are another 40 emails to answer and students waiting to see you: it's all going to fall apart. Write comments, however cursory. Comment, comment, comment. The key to a good workflow is having lots and lots of commentary, lots of narrative, in your documentation.

But this is a good news story: improving a workflow can be done with a modest amount of effort. The other bit of good news is that the less experience you have, the better, because you can just start from scratch. If you're a very experienced researcher, one of the things you can do is wake up and say that from tomorrow, for every new project you engage in, you're going to improve your workflow. But if you're starting out, if you're a master's student or a PhD student or an early career researcher, you can simply say: from now on, from tomorrow, I'm going to try to follow good workflow practices.
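To make the automation point under programming efficiency concrete: rather than repeating the same block of code ten times for ten waves, let the software loop over them. A minimal R sketch, with invented file and variable names:

```r
# loop over ten waves instead of writing the same block ten times
waves   <- letters[1:10]                   # waves a to j
results <- vector("list", length(waves))

for (i in seq_along(waves)) {
  w   <- waves[i]
  dat <- read.csv(paste0("clean_data/wave_", w, "_indresp.csv"))  # illustrative file names
  # the same recode applied identically to every wave
  dat$employed <- ifelse(dat$jbstat == 1, 1, 0)
  results[[i]] <- dat
}
```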
At this point, you may be sitting there thinking: please give me some practical strategies, tactics and tips for improving my workflow. There are a number of things you can do quite easily that will help.

Having a standard directory structure will help you enormously. Here's one suggested by J. Scott Long; it's quite elaborate. He has codebooks in one folder, clean data in a folder, raw data in another, his Stata do files in another, and further folders for documents, figures, logs and tables; a temporary folder that could be used, for example, when merging files or matching data; a trash folder, clearly, for the things he doesn't want; and a working folder that he may use for day-to-day work before something is lodged more permanently in one of the other folders.

File naming conventions are very important. That takes us back to the start of this presentation, when I asked whether you've ever wondered about a file. Here's another J. Scott Long suggestion, one that I've worked with for many years now. A file name, for me, has a name, a date, the depositor's initials, a version and a type. The name is something eye-readable, something that gives me a clue as to what the file is all about, and the versioning is particularly important for keeping track. Take, for example, the British Household Panel Survey (BHPS) Wave A individual response file. You can see what it is from the eye-readable name, which follows the data depositor's own protocol, and it's a very good one: BHPS, a_indresp. Then I have the date of the file: the year, then the month, then the day. You'll produce many files in a year, so having the date this way round is very important; you don't want loads of files ending in 2014, and I have produced some other files since 2014. So I have the eye-readable name, the date, the depositor's initials (that's me, VG), the version (v1), and the type, .dta, so it's a Stata data file. This is a BHPS Wave A individual response file, deposited on the 6th of May 2014 by Vernon Gale (VG), version 1, and the file type is a Stata .dta file. A good file naming convention is, again, something you can start from tomorrow, and if you do, it will help you enormously. There will certainly be a time in the third year of your PhD, perhaps late in the third year, when you're looking for files produced in year one, and this sort of protocol will save you lots of time, lots of anxiety and lots of stress. It will help you keep track.

Then there's the analysis file. Depending on what software you use, if you use SPSS it will be a syntax file; if you use Stata it will be a .do file; if you use R it will be an R script, or you might be working in R Markdown or using RStudio; and in Python, for example, when I work in Python I tend to use a Jupyter Notebook.
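As a practical aside, here is a minimal R sketch of how the directory structure and file-naming convention above might be set up. The folder names follow Long's suggestion and the file name follows the name, date, initials, version, type protocol, but all the details are illustrative.

```r
# create a standard project directory structure (after J. Scott Long's suggestion)
folders <- c("codebooks", "clean_data", "raw_data", "do_files",
             "documents", "figures", "logs", "tables", "temp", "trash", "work")
for (f in folders) dir.create(f, showWarnings = FALSE)

# build a file name following the name_date_initials_version.type convention,
# e.g. "bhps_a_indresp_20140506_vg_v1.dta"
file_name <- paste("bhps_a_indresp", format(Sys.Date(), "%Y%m%d"), "vg", "v1", sep = "_")
file_name <- paste0(file_name, ".dta")
```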
But whatever system you use, whatever software or data analysis language you're using, have a system so that when you write out your syntactical commands, both in the data wrangling phase and in the data analysis phase, the file carries some general information. For me, this is the sort of information that might come at the start of a file: the author, the project, the sub-project it's part of, the date of the next meeting or my next supervision, the date of the latest update, and a record of the previous updates. And notice that I have my next actions at the top of the file. Lots of people write some kind of note to themselves about what they've got to do next, but it comes after a thousand lines of code, so they have to scroll all the way down. When I open one of my files, I can see what I'm supposed to do next right at the top. That kind of headlining helps me understand where I've got to and what I've got to do next.

Variable naming protocols. Good data providers tend to stick to clearly defined variable naming conventions. If you download a dataset like the PSID from the US, the SOEP from Germany, the British Household Panel Survey, the UK Household Longitudinal Study or any of the British birth cohorts, the data providers will have done a lot of work to curate the data and will tend to have fairly good naming conventions. For example, in wave one of Understanding Society, the UK Household Longitudinal Study, there's a gender variable called a_sex: in wave A of the survey, wave one, it's a_sex; in wave two it's b_sex; and so on. If you go to the UK Data Archive, you'll find this in study number 6614. Similarly, in the Youth Cohort Study time series for England, Wales and Scotland, which spans 1984 to 2002, there's the school type variable t0schtype. Once again it's eye-readable and very useful: t0, time point zero, school type. It's the type of school people attended in year 11, so it's at time zero of the survey. At the UK Data Archive, this is study number 5965.

However, look at this survey question. It says: here are some things, both good and bad, which people have said about their fourth and fifth years at school. We would like to know what you think. Please tick a box for each one to say whether you agree or disagree. "School does help to give me confidence to make decisions." The pupil either agrees or disagrees. In the dataset (this is, by contrast, the Youth Cohort Study of England and Wales, Cohort Four, study number 3107), this variable is given the opaque name dx11_a. That's not particularly eye-readable: you can't guess what it could be, or how you'd ever find out. So again, think about your protocols. This isn't a particularly useful one for naming variables in a dataset.
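To pull the last two points together, here is a minimal sketch of the kind of file header, with next actions at the top, and of renaming an opaque variable to something eye-readable. The project details, dates, file names and the new variable name are invented for the example.

```r
## NEXT ACTIONS: check the recode of school type; rerun the wave 2 models
##
## Author:       Vernon Gale (vg)
## Project:      youth transitions (illustrative project name)
## Sub-project:  chapter 1 descriptive analyses
## Next meeting: supervision, date to be confirmed
## Last updated: 2014-05-06 (previous updates listed below)

# read the data (illustrative file name)
ycs <- read.csv("raw_data/ycs_cohort4.csv")

# give an opaque variable an eye-readable name:
# dx11_a says nothing; the new name says what the item measures
names(ycs)[names(ycs) == "dx11_a"] <- "school_gives_confidence"
```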
I'm going to step aside for a moment and talk about estimating work time, because it's often overlooked. I worked at Stirling University for a long time with my colleague and personal friend, Professor Paul Lambert, and the students there would often joke about something called the Gale-Lambert constant: the idea that work takes five times longer than you think it will. I actually got some people to test this, a load of data analysts over the last couple of years, estimating the time certain tasks would take, and the five-times constant is very close to being correct a lot of the time. Essentially, if you think a piece of work is going to take an hour, it will probably take you five hours; if you think it's going to take a day, it will probably take about five days; if you think it's going to take a week, it will probably take five weeks; and so on. Why is this important? It matters when we plan work, when we agree deadlines for turning work in to supervisors and when we estimate how long things are going to take. It becomes absolutely critical when writing grant proposals, when costing data analysis, when doing consultancy work and so on. So I would say: data wrangling and data analysis always take five times longer than you estimate, so think really hard about that. And that's five times when you've got a good workflow; you can make it much longer than five times with a shoddy workflow. Get into the habit of thinking about how long a task is going to take, and then thinking: I need a slot about five times that long to do a good job, even with a good workflow.

The benefits of a good workflow. There are a number of them. First, it limits the duplication of effort. For example, if you've got a good workflow and you've analysed a single wave of a household panel study, you'll be able to analyse subsequent waves without duplicating as much effort. Next, systematic work minimises errors; I would say that systematic work in most walks of life minimises errors. It also helps us to detect errors. It becomes critical when editing work, whether that means revising thesis chapters or making the revisions required after evil referees have sent back comments, which sometimes involve quite a lot of work. One thing we know about the publication process is that there is often a long lag between undertaking the work, submitting it and getting comments back, and if you can rapidly pick up where you left off, go back into the workflow and sort things out, that is a major benefit. A good workflow also aids collaboration. If you're working across teams, doing interdisciplinary research, or working with people at another university or even in another country, a clear workflow means they can look in, see what you've done, add to it or change parts of it, and you can understand what they've done and do the same; that will aid your collaboration and speed up the outputs from it. Finally, it supports additional work in the future, whether that's work that immediately follows on or new work that draws on what was previously done; a well-documented workflow will always help to support additional work. This researcher here was aided by an early career researcher: it was this researcher here. Help your future self by having a good workflow.

What to do? First, always have an audit trail. Second, don't use drop-down menus; a GUI will land you in a sticky mess. Write out the commands you need for your data wrangling operations and the commands you need for your data analysis operations.
Have a systematic directory structure. Have a convention for file names and for variable names, and pay attention to version control; it helps you keep track of things. Finally, and most importantly, plan. Don't do ad hoc work or work on the fly. And write comments. You can never have too many comments. Write comments; comments are your friends.

J. Scott Long wrote a book called The Workflow of Data Analysis Using Stata. It's a real bible on the subject of the workflow and, whatever age or career stage you're at, it's worth a read. He has also posted a really good PDF version of a talk on the workflow; the link is below. And a few years ago, my colleague Paul Lambert and I wrote The Workflow: A Practical Guide to Producing Accurate, Efficient, Transparent and Reproducible Social Survey Data Analysis, an NCRM working paper that's available from the address here. Remember: all serious work must be reproducible. There must be an audit trail.