Okay, we'll go to the next speaker, Sarah Callaghan. Well, first, let's get the PowerPoint up. So, let me introduce Sarah first. Dr. Sarah Callaghan is currently a senior researcher and project manager within the British Atmospheric Data Centre, responsible for bid preparation, project management and technical research. She's the project manager for the EU-funded Metafor project and has previously managed the JISC-funded OJIMS project, which aimed to develop journals for data publication. So, next slide. She's currently data manager for the NERC flood risk and extreme events programme at the BADC, and she will give us a step-by-step guide through research data set creation, big data versus long-tail data, metadata, data centres, data repositories. So, lots of information. Sarah, the floor is yours. Thank you. Okay, and this is the point where I realise that I really need to update my biography on the website. No worries. Okay, can everybody hear me at the back? Yes. Can everybody see the screen? Yeah, plus or minus. Okay, starting off: hands up, everybody who's created a data set. Ooh, that's better than I was expecting. Good. Okay, hands up anyone who's involved with data management on a regular basis. Okay, good, right. So, I'm covering an awful lot here, and I do have a tendency to talk very fast, so I will try to slow down. And if I say things that are very, very obvious to people, please forgive me; if this is all a bit too simplistic for you, you can play Count the Cats and entertain yourself that way. Okay, so let's start off. Just to give a brief introduction about myself: I work for the British Atmospheric Data Centre, which is a data centre funded by the UK's Natural Environment Research Council. And our job is to take in all the research data that is produced by scientists funded by NERC in a variety of fields, primarily environmental sciences and geosciences, and we archive that data. 
And we have to do this because observational data can't be reproduced. We don't have a time machine; we can't nip back to last week to do the same measurements in the same place again. Once we've measured it, that's it, that's our only chance. And these environmental data sets are really, really important. So that's our job. That is our entire reason for being: we take in data, we look after it, and we give it to other people. So, taking it back to first principles, the scientific method, the whole research method: you start off, you define your idea or the problem, you form a hypothesis, you run your experiment, and you collect your data. Does the mouse appear? Yes, the mouse appears. So we've got to the organise-and-analyse-data step. And the way that this is done is crucial to being able to reproduce the science. But more often than not, that is done by the researcher, and that organisation and analysis of data doesn't really get shown to anyone else, ever, except possibly the collaborators. What gets communicated to the rest of the world is this communicate-results bit. And we want to be able to change this, because, as we all know, if your conclusions are based on faulty data, then your conclusions won't stand; they're rubbish. We need to be able to find the data. So, here's the research data lifecycle. I got told I had to show this, so here it is. As I said, researchers are used to dealing with these three green blobs here. They are used to creating data, they're used to processing it, and often that's all they need to do. They run the experiment, they take their data, they analyse it, they write up a paper, end of story as far as they're concerned. Preserving and giving access to data isn't something that researchers generally do. That's where they tend to farm it out to data repositories or institutional repositories. 
And then we have the whole business of reusing data, which can be done by third parties, but can equally be done by the originating researcher ten years down the line, when they've gone back to their data set to do some extra work on it and realised that they can't; they haven't documented it well enough, and what they thought they'd remember, they can't. So, you're always going to get this question: what is a data set? DataCite has a definition, and their definition says it is recorded information, regardless of the form or medium on which it may be recorded, including writings, films, sound recordings, pictorial reproductions, drawings, designs or other graphic representations, procedural manuals, forms, diagrams, workflow charts, equipment descriptions, data files, data processing or computer program software, statistical records, and other research data. So, in other words, a data set is anything you can possibly think of, right? In my opinion, a data set is something that is the result of a well-defined process. Because I'm a scientist, I trained as a physicist originally, I think it has to be scientifically meaningful, though it could be just as meaningful from an arts and humanities perspective. It has to be meaningful. And I maintain that it has to be well-defined: it has to be very, very clear what is part of the data set and what isn't part of the data set. If you've ever created a data set, and I know that some of you have, you know that it is hard work. You spend years of your life on this: you collect it, you interpret it, you write about it, you distil it down. And then, at the end of the day, you have a graph, and other people look at it and go, is that it? Unless people have actually gone through the process of creating a data set, they don't get what it is and how much work goes into it. And oftentimes, when people say, where's your data? I want to see your data. 
They don't actually want to see your raw data. They want to see your analysis or your graphs or your notes or something like that. Even then, they expect you to kind of explain it to them. So dealing with data is tricky. I'm going to give you an example from my own personal experience, because I didn't always work for the British Atmospheric Data Centre. When I started work as a researcher, fresh out of university, I got hired to work on a particular experiment. Basically, I was creating a data set, a radio propagation data set, which means that there was a satellite, ITALSAT F1, up there in geostationary orbit, and we were taking measurements of the beacons that it had on it. The beacons were at radio frequencies where, if you get a rain storm or a cloud between your satellite and your receive path, your signal drops out. And that's not very good if you want to use that particular frequency to broadcast, say, the FA Cup final. People might get a bit annoyed. So we had an experiment. We put instruments out in the field; there was a receiver inside a portacabin, and those are the receivers inside the portacabin. This is an example, just a graph, of one day's worth of raw data. It was a particularly horrible, wet day. It rained a lot. And you can tell, because the black line with all the jaggedy bits, that's the signal, and you can see it drops out all the time. At the bottom there, it drops out so much that we've lost the signal completely. That's the raw data. These data sets span years. I had to take each day, run it through a number of programs, which were written by a colleague of mine, and he took that and turned it into that, and then, finally, at the end of it, came out with some nice cumulative distribution curves. This process took four major steps, four different computer programs, and 16 intermediate files for each day of measurements. 
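A processing chain like this — four programs, sixteen intermediate files per day — is the sort of thing that nowadays would at least be captured in a small driver script, so that the workflow itself is documented alongside the data. A minimal sketch of the idea; the stage structure and file-naming scheme here are hypothetical illustrations, not the original ITALSAT processing:

```python
from pathlib import Path

def run_stage(stage: int, day: str, workdir: Path) -> list:
    """Stand-in for one of the four processing programs.
    Writes four intermediate files per stage (4 stages x 4 files = 16 per day)."""
    outputs = []
    for part in range(4):
        out = workdir / f"{day}.stage{stage}.part{part}.dat"
        out.write_text(f"stage {stage} output {part} for {day}\n")
        outputs.append(out)
    return outputs

def process_day(day: str, workdir: Path) -> int:
    """Run all four stages for one day of data; return the intermediate file count."""
    workdir.mkdir(parents=True, exist_ok=True)
    total = 0
    for stage in range(1, 5):
        total += len(run_stage(stage, day, workdir))
    return total
```

Even a trivial script like this records the number of stages, their order, and where the intermediates live — exactly the provenance that otherwise only exists in the researcher's head.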
Each month's worth of collected data took me somewhere between a couple of days and a couple of weeks' worth of effort, depending on how rubbish the weather was. And it wasn't something where I could just set the computer going, because the computer didn't know what it was looking for. It wasn't clever enough. It had to have a human's eyes on it. I spent far more of my life than I'd care to remember doing this. All right. This is how not to preserve your data. This is the ITALSAT data archive: on CDs, on a shelf, in my office. This is the point where I hold my hands up and say, yes, I did it wrong. Definitely wrong. This is what the processed data set looks like on disk. There it is: a whole heap of file names. You might be able to guess that there might be a date in there somewhere, but who knows, and what on earth is a .000 extension when you get right down to it? Luckily, these ones happen to be ASCII text, so you can actually open them up in WordPad, but then you look at them and go, what are these again? I do have documentation. Here's the documentation, or some of it. You can't read it, but you can see that there's a bit of a flowchart going on there, and there are software file names in the documentation. The IDL files that I used to create that were the software; I have absolutely no idea if they will run in current versions of IDL or not. There we go. Documentation can produce some mixed feelings, right? Because what everybody wants out of life is to be an expert, really. They want to feel that they're important, that they're indispensable, and sometimes, in the cut and thrust of the economic climate, knowing that you are indispensable, that you are the only person who understands this very important data set, is really good by way of job security. So that is my example. This is my past. My data organisation was horrible. It was bad. And I wasn't even preserving my data properly. 
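Part of the "CDs on a shelf" problem — opaque file names, no checksums, no context — can be addressed by writing a machine-readable manifest next to the data, so a future curator can at least verify that nothing has been lost or corrupted. A minimal sketch; the manifest layout is my own illustration, not a BADC format:

```python
import hashlib
import json
from pathlib import Path

def build_manifest(data_dir: Path) -> dict:
    """Record each data file's size and SHA-256 checksum in manifest.json,
    so bit-level integrity can be checked years later."""
    manifest = {}
    for f in sorted(data_dir.glob("*")):
        if f.is_file() and f.name != "manifest.json":
            manifest[f.name] = {
                "bytes": f.stat().st_size,
                "sha256": hashlib.sha256(f.read_bytes()).hexdigest(),
            }
    (data_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return manifest
```

A checksum manifest doesn't explain what a .000 file *means* — that still needs written documentation — but it does turn "I hope the CDs are okay" into something you can actually test.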
As for giving access to it, I spent more months of my life than I care to remember working with legal and contracts people in the department, trying to figure out non-disclosure agreements. I'm not a lawyer. I never wanted to be a lawyer. If I wanted to be a lawyer, I would have gone to law school, right? It was a pain. Sharing was difficult. And I was wet behind the ears, a newly-minted researcher. The head of my group was wary about the prospect of making it all open data, because this was a few years ago, when open data really hadn't made itself felt, but also because, in theory at least, these data sets could have been commercially valuable. We never managed to sell them, but in theory they were valuable. And I felt like I didn't get the credit for all the work I put into sharing the data and documenting it and all the rest. The very first publication that was based on this data didn't have my name on it at all. I had given the data to a PhD student by request, and we got named as a group in the acknowledgements. I didn't get a personal mention at all. My only consolation is the fact that that particular paper has had two citations since, so it's not exactly been that impactful. So, the good news is, I have seen the light, after a bit of poking and being told to put your data on the BADC or else. The data is all on the BADC now, which means I don't have to worry about it any more. It is there. There are people there who will manage it for me. Okay. So, any knitters in the audience? No? Oh dear, this analogy could fall really, really flat then. Yes, there's one. Okay, so: my scarf. Because data sets are so varied, I thought I'd come up with an analogy for why a scarf is like a data set. So how is my scarf like a data set? The raw material, the wool it's made from, doesn't contain any information. It's just all rolled up into a ball. 
But the process of knitting encodes information into the scarf. So the scarf is the result of a well-defined process: it's very clear what the act of knitting actually does, and there is a particular method used to create it. I need to be able to describe my scarf. I need to be able to find it. I need to store it properly so it doesn't get lost or corrupted, so it doesn't get eaten by mice or moths or anything like that. I keep information about it, because I might want to recreate it someday in case I do lose it. And possibly the most important thing is that I've put an awful lot of time and effort into making it; I'm very personally attached to this scarf. If scarves aren't your thing, think of something that you've created yourself. This could be a garden, or a document that you've written, or a Lego model, or any other sort of model: something that you feel you've personally invested in and put a lot of time into creating. It's that sort of feeling. Just like not all scarves are the same, not all data sets are the same. Just because somebody says, oh, I've got a scarf, don't assume that you know everything about that scarf. Scarves come in different shapes, different sizes, different colours; they're used for different purposes; they're made in different ways. And data sets are like that. If somebody says, oh, I've got a data set, don't assume you know what that data set looks like or contains. If in doubt, talk to the person who has the data set; ask the creator; get the information from them. Right, metadata. Metadata is important because we need to be able to describe things. The example here is longitude and latitude. 
They're there to locate us on the surface of the planet. Longitude and latitude are artificial; you're not going to find big massive lines drawn along the lines of longitude and latitude. We've put them there artificially, and arbitrarily, but we use them by consensus: they allow us to communicate about places on the Earth, and they were designed by those who needed to navigate the ocean. So they're a useful feature, and metadata can often act like this, as a surrogate for the real thing. So here's my scarf again. I have metadata about my scarf. I have descriptive metadata: it's a scarf; its colour, teal blue; its dimensions, how big it is. I've got its locative metadata: it's either hanging around my neck or hanging on the door of my wardrobe. And I can also assign a Knitted Object Identifier to my scarf, should I feel the need, no matter what my husband might say. I've also got the information needed to create it: the raw material, the specific type of wool that I used, and the colour, the dye lot; these are all important things if you're really getting into knitting geekery. I've got the knitting needle size I needed to create it; the algorithm, effectively the program, that I used to create the pattern that is in the scarf, feather and fan stitch; the number of stitches I cast on; how I started it, how I finished it, how tightly I knit it. This is all the information you'd need to go ahead and recreate this particular scarf. Metadata can get really complicated really quickly. This is a metadata taxonomy that was written up by my boss, Bryan Lawrence, and it goes into lots of detail about how you do it. It can get even more complicated than that; I'm not expecting anyone at the back to be able to read any of that. The most important thing, I think, is that when you are dealing with researchers, if you show them documents like these, they will 
need to go away and have a nice quiet lie-down in a darkened room for a while, because this stuff is complicated, and most of the time the researchers don't need to know about it. So what do we as data centres do? Here is the data curation lifecycle, courtesy of the Digital Curation Centre, and you'll see that data is right at the heart of it, with extra layers of stuff going on all around it. That's basically what we as data centres do: we look after the data, we make sure that the metadata is there, we curate and preserve it, we migrate formats occasionally; occasionally we even throw stuff out, although that doesn't happen very often, I have to say. Data repository workflows, again, are complicated; again, don't show these to the researchers, they don't really need to know about it. What matters is having a good relationship with the data producers, because you want to be able to interact with them to get the metadata about your data set. Workflows are going to vary across all the different data centres and all the different libraries anyway; everybody's got their own way of doing things. What is important to remember, though, is that you really want to be in the situation where you've got good interaction with the people who are submitting the data. What you don't want is what we call a data dumper, where somebody goes, here is all the data from my 50-year career in astrochemistry or something, I'm retiring now, bye, and they go off and leave you with half a billion files named mydata, and you have no idea what to do with them or what they are. The only thing you can do in that case is write them to disk and hope that, sometime in the future, somebody might find some money and deem it important enough to go and do some data archaeology on this stuff. So you will get people asking, why should I bother putting my data into a repository? Has anyone ever had that sinking feeling as they've just clicked close on a document and they 
haven't saved their final changes? Yeah. That's the reason why: we don't want to lose data. And people could say, it's all right, I'll just do regular backups. Yes, that's true. However: this dates back to about 1700 BC, the Phaistos Disc, and I don't know quite how old Linear A is. These pieces of information, these documents, have been preserved for thousands and thousands of years. Unfortunately, we can't read them. So it's not enough to store the files; we need to actively curate them, to preserve the information in them. There's no point storing something if you can't actually read it. So, big data versus long-tail data. All right, I'm going to give you an example of big data, one that is different from the Large Hadron Collider. This is the fifth Coupled Model Intercomparison Project, CMIP5. This project brings together loads and loads of climate models, and the results of these climate models will feed into the next Intergovernmental Panel on Climate Change assessment report, which will then inform governments and policy advisors all around the world about what they should be doing to deal with the effects of climate change. What we've got is lots of distinct experiments, defined with very different characteristics, and these all influence the configuration of the climate models, what they can do and how they can be interpreted. And all these climate models run on massive supercomputers. Climate models are getting more and more advanced as time goes by: they're getting finer in scale, there's more data coming out of them, there are more processes being looked at in them. For example, in the mid-1970s the models covered little more than carbon dioxide, rain and the sun; by AR4, the previous assessment report, we had atmospheric chemistry, interactive vegetation, the sea, the ocean, all sorts. So it's getting more and more complicated, and we're getting more and more data. The BADC is part of the Earth System Grid Federation, and we host some of the CMIP5 model archive. We 
are talking about petabytes' worth of data here. This is big data. There are lots of numbers up there: 90,000 years' worth of simulation, 60 experiments, 20 modelling centres from around the world. We are talking big numbers here; not quite as big as the LHC, but still pretty big. Handling it all is a job. It's a major international collaboration; you probably can't see, but inside that green circle there are lots of different countries: Australia, European countries, the United States, China and Japan, I think. And it's funded by a wide variety of sources: the Framework 7 projects IS-ENES and Metafor, US funding, and other sources like NERC in the UK. This is a big international collaboration, and it needs major physical infrastructure: networks, computers, all the rest of it. We need to comprehensively collect all the information, all the metadata, about these climate model runs, and we need to make sure that they are also saved properly. The Metafor project that was mentioned earlier, the entire project, was just about collecting the metadata for climate model runs. Major distributed systems are social challenges as well as technical ones; you've got to remember that computers pretty much do what you tell them to, but people can get a bit funny. So CMIP5 is an example of big data: it's got lots of different participants, lots of different technologies, an international community who are willing to work together to standardise and automate data and metadata production and creation. Going back to scarves: big data is industrialised and standardised data. It is the industries that crank out 50,000 of exactly the same scarf, in exactly the same shape and exactly the same colour, and they all have little tags sewn into them to say who made them, what their ID number is, and all the washing instructions. Big data has large groups of people involved, and it also has methods for attribution and credit for data creation 
established. It is all wrapped up; everybody knows what their job is. It's an industrial process; everybody knows their role. Long-tail data is my hand-knitted scarf. It is bespoke data and metadata creation methods; it is an individual researcher, or a small group, working away in an office somewhere, creating stuff that works for them but may not work, or probably won't work, for the office next door, should they actually want to share it. And for long-tail data there generally aren't any accepted methods for attribution and credit for data creation, which is why the people who do long-tail data creation are so protective of their data. They feel they have to hang on to it to get every last scrap of publication-worthy material they can, because if they don't get the credit for making their data open, then why should they bother? So, I got asked to say a few words on the future role of the library, and I'll start off by saying I work for a domain-specific repository, and I feel that we really do have it lucky. We get to pick and choose what data to keep and what data to throw away. We can ask for, and we sometimes even get, the detailed metadata that we need to make the data sets more interactive, more valuable, more standardised. We can provide specific tools and services; we can do visualisations, we can do server-side processing; and we can do all this because we can mandate things like file formats and standardised metadata schemes. We can deal with big data; we have the industrial processes; we can do this. Libraries, you guys are going to need to pick up and manage and archive the long-tail data, where there might not be a domain repository, and you're going to have to have a generalised, widely applicable system that can cover everything from astronomy to zoology, and everything in the middle and to the sides as well. And you're going to have to be ready to deal with pretty much anything. So I think, yeah, good luck. The important thing is 
don't panic. There's a lot of information out there about managing data; some of it will work for what you're trying to do, some of it won't. But the important thing is, don't feel like you're in isolation on this one. Come and talk to the people who've done this before, or the people who are trying to do it, and learn from other people's experiences, the good ones and the bad ones. It can often be as valuable to learn, oh god, no, don't do it that way, as it can be to learn, yes, this is the way to do it. And, as I said, good luck. So, to summarise, and maybe draw a few conclusions, I'm not entirely sure: we all know that data is important, and it's getting more important. Well, it's always been important, and we are all starting to realise just how important it is to people. And by people I mean not only other researchers, both inside the domain the data comes from and outside it, but also people like politicians. The UK's Chancellor of the Exchequer has gone on record saying how important data is; if the politicians are talking about how important data is, it must be serious. Conclusions and knowledge are only as good as the data they're based on. If your data set is rubbish, your conclusions are going to be rubbish, and the only way you can find out if the conclusions are rubbish is if you can actually dig into the data and see whether it supports them or not. That's why we need open data, we need shared data: because science is supposed to be reproducible and verifiable. There's a great quote that I heard at a meeting, which says that research without data isn't research, it's advertising, and that's quite important. It's up to us as researchers to care for the data we've got and ensure that the story of what we did to the data is transparent. And researchers are going to need help from data management experts, whether they're in domain-specific repositories or in libraries, because we're going to want to use the data again, and we're going to want people to trust our results. It's not an easy job, 
but someone's got to do it. Thank you very much. Yes, thank you, Sarah. Any questions at this point from the audience? We've got all kinds of questions coming up; let me come to you. Thanks, Sarah, very nice talk. Just one question: do you have any data for the claim that long-tail data are only made by a few people? No, this is based on my own personal experience. It seems sensible to me, because if you've got big collaborations there tend to be more people, so they tend to standardise more, because they have to share with more people. It would be interesting to analyse, because we always hear, for example, about the Large Hadron Collider and the big data that is produced there, but these are a handful of people compared to the number of researchers in the world; it's a really tiny fraction. So it would be interesting to look at. I'm coming from Stoen, so I can perhaps add a little bit: we do have a lot of people working on the LHC, and they do produce a lot of data; however, all the theorists also produce a lot of plots, a lot of other things, that they don't store in standardised formats at the moment. So we do also produce lots of long-tail data. If you want an example of long-tail data, and of failure in data curation: you'll notice that the slides where I showed the images of the graphs I plotted as a result of the data analysis were photographs of hard-copy printouts of the plots, because I didn't save, or couldn't find, the original soft-copy plots on my computer. So, yeah, data management applies to everything, even the things you don't necessarily think you should be data-managing. Do you have the feeling that, with the big data flows you are dealing with, this is a solved case, more or less? I think it's more or less a solved case for that particular situation. CMIP5 is just one model intercomparison project; there are still people 
doing stuff with climate models that doesn't necessarily fit in with CMIP5. But the key thing with CMIP5 is that you've got the opportunity to get enough people together to agree that this particular standardised way of collecting metadata is good for CMIP5, but can also be used for other projects that aren't directly related to CMIP5, so you've got a chance of standardising a bit better. And was it easy to do? No, it took a long time. Metafor was three years and three and a half million euros; it was a big project, and that was not to do any of the infrastructure, that was solely to create the web-based questionnaire used to collect the metadata about the model runs. Any more questions? No? Well, we have... sorry, Nadja. Thanks a lot, Sarah, that was a great, energetic talk. I wonder if you could talk about DOIs: when you assign them, when researchers can be credited for their data, at NERC in particular. Okay, so NERC now has the ability to mint DOIs for data sets held in our archive, and we are using DOIs in a very particular way: we are using them to give credit to the researchers for the work they've done, but we also want to assign a DOI to a data set that then becomes part of the scientific record. So we have rules for ourselves that say the data set has to be complete and frozen before we can assign a DOI. We also say that it has to be of good technical quality, so the metadata has to be there and complete, and the files have to be in the right format. We're still at the early stages of assigning DOIs; we've done it for a few data sets, and we're hoping to roll it out for more legacy data sets. We've also changed the NERC guidance for authors, so when people apply for a NERC grant they will see text, assuming they read the handbook, about how you go about getting a DOI and how you go about using it. The DOI work is all in the context of the wider work we're doing on data citation and publication, 
and Fiona Murphy is going to be talking about data publication later on this afternoon. Yeah. Okay, thank you, Sarah. So we have now a little break; let's see what the time is at the moment.
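The DOI rules described in the Q&A — the data set must be complete, frozen, and of good technical quality (metadata present, files in the right format) before a DOI is assigned — amount to a checklist that can be automated. A toy sketch of such a pre-mint check; the field names, approved formats, and required metadata here are illustrative assumptions, not NERC's actual system:

```python
# Hypothetical policy values for illustration only.
APPROVED_EXTENSIONS = {"nc", "csv"}
REQUIRED_METADATA = {"title", "creator", "description"}

def ready_for_doi(record: dict) -> list:
    """Return a list of problems; an empty list means a DOI could be minted."""
    problems = []
    if not record.get("complete"):
        problems.append("data set is not complete")
    if not record.get("frozen"):
        problems.append("data set is not frozen")
    missing = REQUIRED_METADATA - set(record.get("metadata", {}))
    if missing:
        problems.append("missing metadata: " + ", ".join(sorted(missing)))
    bad_files = [f for f in record.get("files", [])
                 if f.rsplit(".", 1)[-1].lower() not in APPROVED_EXTENSIONS]
    if bad_files:
        problems.append("files in unapproved formats: " + ", ".join(bad_files))
    return problems
```

The design point is that "complete and frozen" is a human judgement recorded as a flag, while metadata completeness and file formats can be checked mechanically before the identifier is made permanent.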