Hello, my name is Venkat and I work for the Digital Curation Centre, or DCC, in Edinburgh. We are a group that advocates and provides training in best practice for managing data and for making data as freely accessible as possible. In this session, I'll take you through a brief introduction to Research Data Management, or RDM, with an emphasis on what one might encounter when producing or reviewing data management plans, also known as DMPs.

Let's get started, then, and define what RDM actually is. Data may come in many different forms, and sometimes it is easy to overlook that some of the things we work with are, in actual fact, data. Especially when comparing quantitative with qualitative data, it is easy to overlook the many types that fall into the latter. Whichever kind of data you deal with, a common thread is that data should be acquired, processed, analysed, stored, and disseminated in a consistent way. Specifically, for any given dataset, one can map a life cycle that it will typically progress through, from its creation through to its eventual preservation. This life cycle should hold true for data from across the spectrum of disciplines, and making sure that each step is managed properly is crucial to how well data can be, for example, re-used. It will have a bearing on other aspects too, of course, such as data integrity and reproducibility.

As part of the RDM process, one should keep in mind FAIR: making data Findable, Accessible, Interoperable, and Reusable. The FAIR principles are a relatively new concept, originally developed in the life sciences to address the issue of making data as useful as possible; they now encompass the full spectrum of disciplines. The value to be maximised can be financial, societal, and temporal. That is, we want to minimise replication of data that is largely funded by taxpayer money. By following these simple rules, we can hope to minimise wastefulness and increase the value of data.

So, drilling down a little further, good RDM needs to consider the following concepts. Fulfilling these criteria adequately, especially to satisfy the FAIR principles, will allow researchers to manage their data more efficiently and meaningfully. Let's look at each of these concepts in turn.

There is a vast array of data formats for each discipline. Some are more widely used than others, but some of the most commonly used may be proprietary and may have long-term disadvantages. When choosing data file formats, a key consideration is longevity. While something like a Word document may be popular right now, there is no guarantee that it will exist in perpetuity. We need to make sure that a file remains readable further downstream, and therefore, when archiving and preserving data, it is advisable to use formats that are universal. As well as longevity, it is advisable to use data formats that are lossless. For example, in image and audio data you may frequently encounter heavily compressed file formats such as JPEG for images and MP3 for audio. The preferred lossless formats in these cases would be TIFF for images and FLAC for audio. There are various places where you can get guidance on preferred file formats, such as the following table, which shows the preferred file formats for different data types with an emphasis on making data sustainable. As you can see, for different disciplines there are various file formats that could be used, but in all cases there is a recommended format.
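As an illustration of the lossless-format advice above, here is a minimal sketch in Python, assuming the Pillow imaging library is available, of re-saving a lossy JPEG as a TIFF copy for archiving. The file names are purely illustrative, and note that detail already discarded by JPEG compression cannot be recovered; ideally the lossless format is used from the point of capture.

```python
# Minimal sketch: re-save a lossy JPEG as a TIFF copy for archiving.
# Assumes the Pillow library is installed (pip install Pillow);
# file names are illustrative.
from PIL import Image

def archive_as_tiff(jpeg_path: str, tiff_path: str) -> None:
    """Open a JPEG and write a TIFF copy alongside it."""
    with Image.open(jpeg_path) as img:
        # Pillow writes TIFF without compression by default, so no further
        # information is lost (although the original JPEG loss remains).
        img.save(tiff_path, format="TIFF")

archive_as_tiff("figure1.jpg", "figure1.tif")
```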
To ensure maximum interoperability and reproducibility, data should be accompanied by full, enriched documentation. Metadata is data about data: descriptors of how, for example, an experiment was carried out. There are a number of established metadata standards available for different disciplines. The appropriate standard may be discipline-specific or more general, and it is advisable to use one that is as specific as possible. This might not always be possible if a particular field is very niche or has not yet been fully addressed. Regardless, if your data can be assigned to a set of standards, this allows cross-comparison with other datasets that use the same standards, and consequently reuse of the data itself. Here we see some examples of metadata standards that are already available: on the one hand, a very broad set of standards in Dublin Core, and on the other hand, some more specific ones. And where can these standards be found? There are directories such as these, where you can search for an appropriate set of metadata standards for your particular case; these should be the first place to refer to when seeking an appropriate set of standards.

And what happens when you ask a group of people to describe something in their own words? This example shows the answers to a question asked at a Cambridge University workshop involving 33 participants. When asked what species the study was on, a range of answers denoting humans was provided. Of course, we as humans can see that all the answers mean the same thing; however, this does not hold true for computers, which cannot tell that the different answers are equivalent. The best course of action is to restrict the possibilities, for example by using multiple choice. Controlled vocabularies circumvent this ambiguity: limiting the number of choices removes the ambiguity. Moreover, structured controlled vocabularies, also known as ontologies, add further organization, an example being the anatomical components of an organism, where you have parent-child relationships. This has very important implications for interoperability. For example, if there are thousands of terms in a vocabulary and you want to compare a term or terms against another, similar vocabulary, computational analysis becomes possible, leading to faster analysis, for example cross-species anatomical searching.

Moving on: once data have been established with defined file formats and metadata standards, they should be made as freely available as possible. How you license your data has profound implications for its reuse. Creative Commons licensing describes a set of licenses of varying openness. By stating an appropriate license for data, third parties will know how they can reuse those data. Since many data are publicly funded, it is encouraged that they be made as freely available as possible, which in the case of Creative Commons equates to the two topmost options in this table: public domain and CC BY, the attribution license. The latter is the preferred option in many cases, since it requires third parties to state where the data they used came from. To determine which level of license to use, this tool from EUDAT can be used; it takes you through a wizard to determine appropriate licenses. A useful additional feature of the EUDAT tool is that it can also be applied to software.
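To make the metadata and licensing points above concrete, here is a minimal sketch, in Python, of a Dublin Core-style record stored as JSON alongside a dataset. The element names are genuine Dublin Core elements, but the values, file name, and choice of JSON serialisation are illustrative assumptions rather than a prescribed format; the dc:rights field shows how a CC BY license can be stated in a machine-readable way.

```python
# Minimal sketch: a Dublin Core-style metadata record for a dataset,
# saved as JSON next to the data files. All values are purely illustrative.
import json

record = {
    "dc:title": "Workshop survey of study species descriptions",
    "dc:creator": "Doe, Jane",
    "dc:subject": "Homo sapiens",  # ideally a term from a controlled vocabulary or ontology
    "dc:description": "Responses from 33 workshop participants describing the study species.",
    "dc:date": "2024-01-15",
    "dc:format": "text/csv",       # the sustainable file format the data are archived in
    "dc:identifier": "https://doi.org/10.xxxx/example",  # placeholder persistent identifier
    "dc:rights": "https://creativecommons.org/licenses/by/4.0/",  # CC BY attribution license
}

with open("dataset_metadata.json", "w", encoding="utf-8") as fh:
    json.dump(record, fh, indent=2, ensure_ascii=False)
```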
In many cases, third-party software and scripts may be produced to analyse data, and these should themselves be considered data. Of course, it may not always be possible to use the most open license, due for example to intellectual property rights or to information that could identify an individual, and these issues must of course be taken into consideration.

How data are preserved for long-term use and reuse is a separate question from simple archiving, where data may end up inaccessible. Data repositories are a very useful way of preserving your data for the long term and of allowing your data to be discovered and re-used. Re3data.org is a directory of the many different repositories available and can be searched to find an appropriate repository for your data. It is always advisable to use a repository that is as closely aligned with your data as possible. This is true even if a researcher's institution provides its own repository, since institutional repositories will most likely be discipline-agnostic. If it is not possible to use a closely aligned repository, then the next most appropriate one should be used. It may be the case that there is no appropriate repository for your data, perhaps because it is very unusual, and in that case a generic repository such as Zenodo will be the preferred option. In any case, look for certification; this will give peace of mind. Moreover, it is also useful if a chosen repository assigns persistent identifiers, which is the final topic to be covered now.

As well as making sure that data can be stored in the long term through repositories, how data are identified is a major consideration. As the number of data objects grows at a very fast pace, on top of those that already exist, problems with uniqueness and ambiguity can arise. To address these problems, and to make sure that a data object can be retrieved in the long term, unique persistent identifiers should be assigned. Some examples are DOIs, which are commonly used; they allow direct access to the data and in many cases can be resolved directly from a web browser address bar (a short sketch of resolving a DOI programmatically follows after the closing remarks below). To look at a specific example, here is ORCID. This is a not-for-profit organization that was initially established to address the problem of ambiguity in attributing researchers to their work outputs, such as peer-reviewed papers. In this case, the researchers are the data objects, and the solution is to assign a unique persistent identifier to the researchers themselves. Therefore, if there is another researcher with the same name in the same field, the unique ID provides disambiguation. The scope of this is far-reaching, and in combination with employers, who can also be assigned unique IDs, it allows provenance to be established.

Therefore, in conclusion, these are some key factors to take into consideration when managing data, and they will be encountered in data management plans. Fulfilling these criteria will allow value to be added to data and increase the possibility of new discoveries being made. Okay, thanks for watching; please follow these links for more information, and good luck in your RDM.
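As mentioned above, DOIs can usually be resolved directly from a web browser, and the same can be done programmatically. Here is a minimal sketch, assuming the Python requests library is available, that asks the public doi.org resolver where a DOI currently points; the DOI used is the DOI Handbook's own identifier and is included purely as an example.

```python
# Minimal sketch: resolve a DOI to its current landing page via doi.org.
# Assumes the requests library is installed (pip install requests).
import requests

def resolve_doi(doi: str) -> str:
    """Follow the doi.org redirect chain and return the final landing-page URL."""
    response = requests.get(f"https://doi.org/{doi}", allow_redirects=True, timeout=10)
    response.raise_for_status()
    return response.url

# Example: 10.1000/182 is the DOI of the DOI Handbook itself.
print(resolve_doi("10.1000/182"))
```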