Hello. Welcome to this webinar on reusing data for dissertation projects. My name is Maureen Haker. I've worked with the UK Data Service for about eight or nine years now. I also teach research methods at the University of Suffolk. All right. So what we plan to do today is go over some key ethical and methodological issues of reuse projects, which you'll need to consider for a dissertation project. And before we get any deeper into that, I just want to give a very brief overview of what the UK Data Service is for those who have not yet had the pleasure of exploring the archive. I've tried to pull in a few case studies throughout so you can see some of our data in action. And I'll signpost you at the end to some further resources, which can help if you're planning a reuse project. This is an introductory webinar. So we have assumed most of you will probably not have used the archive before, and may have only had some introductory modules on research methods.
And before we get too far along, I want to address one simple point first. What is secondary analysis? So in short, secondary analysis is a method which asks new questions of old data. It's analyzing data that you've not collected yourself. But there is a complicated nuance here around terminology. And you may have heard of some other terms used to describe this method, including secondary data analysis or reuse projects. They all refer to the same method. So don't get too confused by this. But there is an ongoing debate about what to call this method. In 2007, Libby Bishop wrote the article you see here on primary/secondary dualism. Basically, she was making the argument that there may be a privileging of methods where you go out to collect data yourself. And using terms like secondary analysis or secondary data reinforces that hierarchy.
But actually, primary and secondary analysis are a lot more similar than different if you fully consider the key methodological issues of secondary analysis as a method. Consequently, you'll find an increasing use of the term data reuse. And that's a term that I'll use throughout this webinar. It takes into account the huge range of ways that data can be used and reused. And it also doesn't imply that any projects are secondary to the initial use of data. So you can use whatever term comes to mind first. Both are used. But just do be aware that there are different ways of describing this process of reusing data.
Okay. So now you know what secondary analysis is. But where would you find data that's already been collected? And this is where the UK Data Service, which holds the largest collection of social science data in the UK, comes in. We're a comprehensive resource funded by the ESRC. Our main job is to be a single point of access to a wide range of secondary social science data. The main purpose then is the collection, ingest and processing of data, and then further dissemination of that data to people for reuse. In addition to that data infrastructure core, we also have a service layer which provides extensive support, training and guidance.
Who is it for? Well, we like to think it's really for anyone who has an interest in data. Traditionally, the main audience, the people who probably both deposit data with us and use our data the most, tend to be academic researchers and students. There are a lot of other groups that are well represented, including government analysts, charities, foundations, businesses, research centers and think tanks. All give us data and use our data. Given the importance of data, how it's used, how it's disseminated, we're trying to reach out and support a wide range of communities.
What kind of data do we hold? The majority of our data, at least judging by the number of collections, is certainly quantitative.
We hold over 6,000 collections, of which 5,000 are quantitative collections. And we hold a wide variety of that data. So there's survey data, both cross-sectional and longitudinal, aggregate statistics, domestic and international macro data, census data, and micro data. And then of course, we also have qualitative and mixed method data as well. So just to check that you're still with me, what kind of data do we hold? Excellent. We do hold all of it. And it looks like everybody across the board is answering all of it. Excellent. Yes, we have all of these types of data. So quite a wide range of data really.
Okay, so where does all of this data actually come from? Well, again, that varies. It depends on the data type, and some of the sources that you see listed here, including agencies and statistical time series, are clearly main sources for our quantitative data. Most of our qualitative and mixed data comes through individual academics. So, you know, research grants that are often funded by the ESRC, but not exclusively. We take data that's been funded by others like the Wellcome Trust or Leverhulme, and other independently funded work. And of course, we also hold some originally paper-based public records and historical sources, including things like the census.
And where would you be able to find information about it? We do have a website, ukdataservice.ac.uk, which holds quite a lot of information. And from there, you can find the data catalog. We also have hundreds of pages which discuss methodological issues like gaining consent, anonymizing data, and storing data. And there are also student-specific tutorials. So we do have some data skills modules, for example. There are some workbooks and exercises that are based on collections we hold. And there's also an FAQ page if you have specific questions about how to use the website. But getting back to dissertation projects.
What kind of dissertation project would you be able to do reusing data from the archive? Well, really, we need to go back to the beginning to answer this and think about the research process. And you're hopefully familiar with this model that you see on the screen now. It starts with some kind of topic or general direction for your research. You do some background research into the literature already written on your topic. And from there, hopefully, you're inspired to ask a research question, which builds on that body of research. Once you have that research question, you then decide what's the best way to answer that question, and you design your project. And once you've settled on your method, you then actually go out and collect the data, analyze it and begin your write-up, which is what you would submit for your dissertation. Now, there may be a few extra steps there, or you may swap a couple of these steps depending on your theoretical foundation. But generally speaking, this is how we would normally think about the research process.
When you're doing a project using secondary analysis, however, this process looks a little different. The model clearly shows how the research question is built from your chosen topic area, your preliminary search, and possibly the literature that you find. However, with a secondary analysis project, the research question is derived directly from the data. So you would start with the topic you're interested in. But instead of looking for the literature, you would look for the data and start evaluating collections. You might be able to hear my dog getting quite excited now. There might be the postman walking by. So apologies if you can hear her barking. When you do find a collection that intrigues you, you then ask a research question of that data. And from there, you would then go on to find what literature exists on that question.
You then would not need to collect the data, you just need to access it, which is, of course, one of the key advantages of secondary analysis. The data is already collected. You just need to get your hands on it, either by downloading it from the catalog page, or you may need to go into the archives if it's only available in paper form. Increasingly, that's not the case. Most data that we get now is already in a digital form. Once you have it, you then analyze it and write up your dissertation.
So the key point I'm trying to make here about reusing data for dissertation projects is around where your research question comes from. It would take a lot of time and inside knowledge about the data that's held in the archives to be able to come up with a research question and then search for the perfect data to answer it. You'd be searching for the right data for a really long time, unless, as I said before, you've got really good knowledge about what collections are already held by the archive. For a dissertation project, when your time and resources are limited, you'll want to look for the data first, and then from there develop a research question, which gives a new take on that data. You can of course spend some time looking for the right data, but again, for a dissertation project, you may want to first look and see what data exists on your general topic area before nailing down your research question.
With that being said, it's probably important to have some kind of idea around what kind of project you want to do. While you might be exploring data without a specific question in mind, you may want to think about what kind of research design your project will follow. And there are four types of reuse projects that lend themselves well to dissertation projects: reanalysis, replication study, comparative study, and re-study. So reanalysis is probably the one that comes to mind when thinking about what secondary analysis is.
And this involves thinking about the wide range of approaches you can take in the analysis of a data set. It usually means asking some kind of different research question than what the original researchers were asking. So for example, Clive Seale and Charteris-Black did a study using comparative keyword analysis of illness narratives. Now the original illness narratives had been collected exclusively for health research. So the interviews were meant to explore how diagnoses were made. When Seale and Charteris-Black came along to do the comparative keyword analysis, they were much more interested in an analysis of the discussions between patients and doctors rather than the actual health issues that came up in the interviews. So the question can be very different in that sort of way. Or sometimes the question can be on a similar topic to the original research, but have a slightly different focus. So for example, Joanna Bornat looked at gerontology as a topic, and she found two different data sets that look specifically at this topic. However, Bornat's research question was on racism, which wasn't the focus of the original work, but the data set was rich enough to allow her to explore this theme within the existing data.
If you want to use the exact same analysis strategy, that would be a replication study, and that's also possible. So right now there is some concern over the reproducibility of research, and replication studies can reveal the messiness that's involved in working through data. One of the most famous, or infamous really, examples of replication is from Thomas Herndon, who was a postgraduate student at the University of Massachusetts. He was assigned an assessment to replicate the results from a published study. So he chose Reinhart and Rogoff's 2010 paper, Growth in a Time of Debt. Basically, this paper came up with the proportion of your GDP that your national debt can reach before you see negative economic growth.
Thomas Herndon pulled the OECD data to rerun the analysis as the paper had laid out, but he got a completely different answer. The published paper said that debt cannot exceed 90% of GDP. However, he calculated that the debt can actually exceed your GDP, and even then there's only a minimal impact on economic growth. So after contacting the original investigators, he found a flaw in their data set where basically they had miscopied some of their cells. The full story was published in 2013 in the New Yorker. A replication study hopefully won't always find flaws in the original study, but nonetheless this is a study design that's worth considering, and it certainly helps you develop an appreciation for the research process. You could even develop a project where you rerun a series of studies on the same topic, or explore a complicated data set with missing data, transforming variables, and so on.
You can also do comparative work, so you might be looking at an international comparison between two countries or comparing social subgroups of the population based on a shared social characteristic. Our key data page for quantitative data sets outlines some of the large national surveys held at the archive, any of which would allow you to do some kind of comparative work without having to collect two sets of data. You could compare samples across time, geographic place, gender, or ethnicity. These characteristics are usually collected as standard for these larger surveys.
The final type of reuse is exemplified by a case study that I'll go through with you now. And this case is a re-study, which is where you replicate the methods of a study for purposes of comparison, so it does a bit of secondary analysis, but it also allows you scope to collect a little bit of your own data too. The example of this kind of reuse project is from the School Leavers Study collection.
The original study was conducted by Ray Pahl in the late 1970s as part of a much wider community study on the Isle of Sheppey. As part of the project, Pahl asked teachers to set a particular essay just before students were due to leave school, prompting them to imagine that they were reaching the end of their life, and something made them think back to the time that they left school. They were then assigned to write a short essay on what happened in their life over the next 30 to 40 years. In 2009, Graham Crow, who's shown here with Ray Pahl, and Dawn Lyon decided to reanalyze this data set and focus solely on student aspirations. Using the very same methodology, they conducted a re-study of school leavers on the Isle of Sheppey in 2009. The prompt supplied to students in the 2009-2010 data collection was nearly the same as that supplied to students in the 1970s: imagine you're at the end of your life and reflect back on what you've done since leaving school. They then transcribed the essays and compared the themes from the new set of essays to the set of essays that had been collected by Ray Pahl, and you can see the wording of the prompt here and a small snippet of one of those essays here.
The findings are fascinating and show a difference in young people's aspirations after one generation, 40 years of time, had passed. But how exactly were they different? Well, in 1978, students expected much more grounded and arguably mundane sorts of jobs. Career progression was gradual and followed on from hard work, and sometimes there was talk of periods of unemployment or even death. You can see a few examples in the left column of some of the quotations from those essays, such as the one at the bottom: "I longed for something exciting and challenging, but yet again I had to settle for second best. I began working in a large clothes factory."
The 2009-2010 essays, however, showed students imagining well-paid and instantaneous jobs filled with choice but also with some uncertainty. Crow and his research team also noted a clear influence of celebrity culture in these essays. So for example, you have the quote at the bottom of a girl who writes: "In my future I want to become either a dance teacher, a hairdresser, or a professional show jumper horse rider. If I do become a dancer, my dream would be to dance for Beyonce or someone really famous."
This study is a larger one than what might be realistic for a dissertation project. The goal was to engage the community alongside the research and find innovative ways of including participants in the research outputs. So as part of this initiative, they established the Living and Working on Sheppey website, which helps create a shared history and memory of what living on the Isle of Sheppey means for this community. While this would be an ambitious project to say the least for a dissertation, it is nevertheless a good example of how you can combine a bit of data collection and data reuse into one project. And for those who are doing something like a PhD, this is certainly something you could consider for your projects. For others, you can design a more feasible study with perhaps a smaller sample or different sorts of outputs.
So hopefully you're now budding with ideas of what you want to look for in the archives or what kind of project you might be able to do reusing data. So if you could answer the poll, what kind of project do you think you might be interested in doing? And it's probably worth noting, of course, that these are four different types of study design that you could do, but actually there are more than that, and there are much more creative uses of the data as well. So don't feel constrained by what's presented here, but hopefully it just gives you some ideas. Okay, some of you are looking at the re-study.
So combining a bit of your own data collection with a bit of reuse, a fair few comparative studies, and definitely reanalysis, which again is probably the thing that most people think of when they think of secondary analysis.
Now, I just want to cover a few key ethical and methodological issues, but I do just want to check if anybody has ever tried to download data before. It's certainly not a requirement to take part in this webinar by any means, but just to get a sense of anybody who has used the archives before. Okay, so a fair few of you have actually downloaded data before. Excellent. I won't go through a step-by-step of how to download the data, but I will reference it. And there are other webinars that specifically talk about downloading data, and we have some video tutorials as well, if it's something that you do need help doing.
What I'm going to do now is cover some key methodological issues and ethical issues. I'll cover the ethical issues first. This is probably going to be what you encounter first as you write a proposal for your dissertation project. Since you're not collecting data yourself, you'll find reuse projects actually have very few ethical considerations, comparatively speaking, and hopefully you won't hit too many snags with ethical review boards. However, that doesn't mean that there aren't any ethical considerations, and one of the most important things to consider starts right at the point of access. How do you get hold of the data? When you're reusing data that's in an established archive like the UK Data Service, we've taken a lot of the pain out of that access point by negotiating licensing issues with the person who collected the data. So that usually means that you'll need to sign what's called the end-user license. This is a legal document, and as such, it's probably about as exciting as any legal document you've seen. But there are two really important things to note about it.
The first of these is that there's no sharing of data onward, and that includes with your supervisor. If you need help with your analysis, and your supervisor needs to see the data, then he or she will need to separately register and download the data for themselves. The end-user license stipulates you cannot, under any circumstance, share data or your login details with anyone.
Second golden rule. All the data that we hold has been anonymized already, which is, again, likely to be the case if you are reusing data from an archive. However, just because it's been anonymized doesn't mean that it's completely impossible to figure out the identities of participants. Mark Elliot has some excellent YouTube tutorials which go through anonymization theory, but in short, he makes the argument that no anonymization strategy will be 100% effective, and if it were, it would deplete the value of the data itself. So consequently, should you inadvertently uncover the identities of any of the participants from the data that you're reusing, then this document you signed says that you will not reveal that identity to anybody. Okay, so, like I say, it's already anonymized, but if you did figure out an identity somehow, then you are signing something that says that you won't share that identity with anyone else.
So those are the key issues to recognize when signing the end-user license, but what if you're reusing data from another source? So, for example, I supervised a student who was reusing data that was collected by schools, so she sought permission to use it first and foremost. She spoke with the person who controls the data, the head of school, and also sought information about what people were told when the data was collected. So in this case, they were told it was possible the data would be used by third parties.
There's no access agreement like you would see from an archive, but she had done her due diligence to ensure that she was able to reuse the data for a new purpose, and she clearly specified in her proposal that she would not share the data onward and she would keep identities anonymous. So if you are looking to reuse data from online sources, so perhaps there's an online repository, you should definitely look into the terms and conditions agreements that would have been signed by those whose data was collected. Check if there's anything else you would need to consider before reusing the data. Another angle you might want to look at is whether the data is discoverable and open. So if you don't need special access conditions such as registration in order to see it, if it's open access in that kind of way, then you can probably make an argument that, as something that's openly available, you can reuse that data. Whatever the access agreement is though, it's probably a good idea to ensure that you won't share the data onward without permission and that you will endeavor to keep identities anonymous.
Once you have sorted out the access, then you need to ensure that you're citing the data. So in short, citing archived data helps data creators track the impact of their study. It also supports reproducibility and it makes it easier to find the data you used for your project. The issue is so important that the UK Data Service actually has its own page on the website which goes into a little bit more detail about data citation, and it will also help explain why it's an important ethical issue. With the UK Data Service, it's really easy to do data citation because you're already supplied with the citation that you need on the catalog page. So if you look at the catalog pages, you'll find a citation and copyright section and you can literally just copy and paste that. This is a little bit trickier when the data that you're reusing is not actually archived.
You won't have a preformed citation. So for example, my student reusing the school data didn't have the sort of thing that she could just copy and paste. Nevertheless, she did know the organization who collected it, when it was published to the system, and what system it was stored on. So all she had to do was follow the citation guidelines for the referencing system that was used by our university and make a citation for that data set. If you've pulled your data from online sources, again, you'll need to find out who collected the data, what year it was published online, and any URL or DOI that might help locate the data. And that citation would then end up in your reference list.
So you've got access, you've sorted out your citation, now comes actually doing secondary analysis. And the first thing you need to do, whether you're reusing quantitative data or qualitative data, is to orient yourself to the project. And I think the main point to make here is to not underestimate the amount of time that it will take to get acquainted with the data set. There may be multiple levels of context to get through in order to really understand the data. And what I mean by that is you may have more to consider than just the data that's collected at the time of the interview or the data collection. You may need to start thinking about the basic social characteristics of the participants, perhaps the historical time period in which the data was collected, or perhaps the location where the data was actually collected. So really the idea is that you need to understand the data set as a whole in order to really get at the root of what that data can convey. Every collection archived at the UK Data Service comes with some kind of documentation provided with the data set. And that is a really useful starting point.
It often contains much more information about the methodology: it could have an interview schedule, it could have a call for participants, or sometimes it includes segments from publications arising from the original study, and I've also seen funding applications included. Sometimes, however, the documentation isn't readily available. So for example, April Gallwey's PhD research explored identities of single mothers between 1945 and 1990, and she used the Millennium Memory Bank collection at the British Library. In her publication, which is shown here, she explains that context, so understanding how the data was collected, how each interview went, and the basic demographic details of each interviewee, was imperative to validly analyze the data. The Millennium Memory Bank has over 6,000 short oral history interviews within it. They were conducted by the BBC for a radio show, Voices of the Century, at the end of the millennium. However, there was very little accompanying literature and no metadata. And metadata is basically data which describes the interviewees and summarizes the interview. So as part of her project, she then pieced together that information herself by interviewing those who were involved in making and curating the collection. In her publication shown here, she makes the final point that we should view oral history, and I would say arguably many types of data, as an ongoing creative project which goes far beyond just the initial recording. This exploration into its creation, the contextualization, the reanalysis, is just as important as the original analysis of the data. And this process of recontextualization helps you put that data in perspective and ensure that you have a full appreciation of the limitations and the opportunities of the data.
In addition to the context of the data, you may also need to consider the sample.
So for example, if the data set is too large, you may need to take a subsample, or you might be interested in a particular subgroup of the population. The UK is really good at collecting data on its own population, so there are several longitudinal studies which regularly collect data from a representative sample of the UK. If you're looking to draw conclusions about the UK as a whole, these surveys are a great place to look, but you'd have a lot of data to work with. Qualitative collections tend to be smaller anyway, but many of the archived data sets are funded, and therefore they collected a considerable amount of data. For small dissertation projects, you may want to be realistic and decide if you want to limit the number of participants to a smaller subsample of those larger collections. So for example, the Edwardians collection, which was put together by Paul Thompson and is widely considered to be the first oral history of Britain, contains 453 interviews of 80 pages or longer. Conversely, most dissertation projects probably have an expectation of maybe six to ten interviews, so you would need a clear sampling strategy to help you choose which interviews to look at.
You might also want to combine data from different collections that complement each other. So remember, it would take a really long time to sift through and find different pieces of different data sets to pull together, but that is another possibility if you feel the data all speaks to the same topic and the same research question, and you've done the work needed to actually harmonize the data to make sure that it's comparable.
And the last point that I want to talk about before starting to wrap up here is the write-up itself. So you've probably covered extensively how to describe and write up reports when you've collected data yourself. However, it does get a little bit more complicated with secondary analysis.
So for a dissertation, you'll need to describe your research process and that method in detail, so secondary analysis and what it is. But then you also need to describe the data and the details of how that was collected. So essentially, you're describing both your method and the methods of the primary investigators of the data set you're using. And there's no single way of doing this, but I would offer the following advice. Keep the focus on your method, secondary analysis. It gets confusing describing your method and the method of the data, but that's the task at hand basically, and you need to remember that your method is secondary analysis. So for example, this study on lone mothers has a section for describing secondary analysis, followed by another section giving the background of the data set that was reused. Often with quantitative data sets, where reuse is much more common, you'll find that these jump right into describing the data before moving on to the sampling and statistical tests of the secondary analysis. But it might not actually even be titled secondary analysis. So it's really important you discuss with your supervisors what the expectations are for describing your methods. It may be that they want a more in-depth critical discussion of secondary analysis as a method, or they may want the emphasis on describing the data set and the methods used to collect the data. So your supervisor will know the assessment criteria and what kind of dissertation your markers will be looking for, and they'll be able to help you structure the write-up of your methods chapter.
Hopefully, this gives you a little bit of inspiration for a project that you could do and a good idea of what is entailed in doing a research project. This was just a quick overview, so I just wanted to recommend some further resources that will help you navigate through your research project. So there are a few publications which might be useful if you're doing a quantitative project.
I highly recommend Emma Smith's book, Using Secondary Data in Educational and Social Research. There's also a chapter in the SAGE Handbook of Social Research Methods on secondary analysis of quantitative data sources. And we also have some data skills modules, which will help guide you through the steps of statistical analysis. For qualitative projects, there is a chapter in Silverman's qualitative research textbook on secondary analysis. And there's also a working paper series through the Timescapes archive. Kahryn Hughes and Anna Tarrant have also recently published a handbook of qualitative secondary analysis. And finally, if you need more help with finding, accessing, or downloading data, then we have video tutorials which walk you through that process with step-by-step procedures.