Hello everyone, this is Nancy. I am here to talk about text and data mining, and you may think that the topic is not suitable for this kind of audience, or that it is too technical or something you don't understand, but I'm going to try to make things easy and show how this topic fits into the work of research support staff, administrative staff, librarians and other related staff who deal with research. I would like to thank Gwen for inviting me to talk today. My work is mainly at CORE, which is a global aggregator of open access content, but I am also involved in the FOSTER project, which you may know already and which tries to facilitate open science, and I have been involved in other European projects as well. I'm mentioning this because I'm going to talk about the work we did in one of them, called OpenMinTeD, whose purpose was to find text and data mining solutions for scholarly outputs. So I will proceed with the presentation and I hope you will find it interesting.

What is text and data mining? You may have heard the term, but in reality I wouldn't want to give you a definition that may not mean anything to you. What I would like you to consider is that there is currently a massive corpus of information everywhere around us, and those of you who work with libraries or repositories probably see this even more in your everyday lives. The problem that researchers, and society in general, are facing is that because of this massive amount of information, we do not have the time to read everything. And even if we had the time to read everything, we would not be able to make all the connections in our minds between the things we read. Text and data mining is the practice that tries to make these connections between the various readings: with the help of software installed on computers, and the processes that take place via this software, it tries to create new knowledge and a new understanding of the world around us.

There are a lot of benefits to text and data mining, and some of them may seem very obvious. For example, if someone does not have to read 4,000 papers to understand a topic, to become an expert in it, to make connections, to make inventions, to discover new knowledge, then there is a clear benefit in time for the person who can discover all this knowledge via text and data mining. At the same time, apart from the time component, there is also a benefit in cost: a university, for example, that needs to hire research assistants to go over some specific literature so that a problem can be solved can instead have those research assistants work on the solution of the problem rather than on finding out what the problem is and how it could be solved. Of course, this also means that information that was not known in the past now becomes known, comes in front of us and is visible. And with text and data mining we have also seen that, especially nowadays when scientific fields are not restricted to one discipline but want to borrow knowledge from other, similar fields, we can explore new horizons.
So for example, people in medicine may need to read, and may need to borrow, practices from others who work in the healthcare area more broadly. In that way, with text and data mining, people can explore new horizons. And all in all, we can see an improvement in research itself, because the amount of knowledge we get is much larger, the data is more accurate, and the connections appear more obvious to researchers and to anyone who wants to discover information.

You may be thinking, for all these five minutes that I have been talking: why should we support text and data mining, and what is the connection between what I do, or what I have access to, and the topic being presented here? Maybe you will find it more familiar if we talk a little bit about open access: about the definitions of open access and the initiatives, but also the recommendations made by organisations that have open access policies and organisations that support open access. To start with the Budapest Open Access Initiative: there it is stated that an output is considered open access not only when people's eyes are able to read it, and not only when there is an open licence attached to it, but also when the item is machine readable, in the sense that there is no restrictive licence preventing machines from accessing the document and taking advantage of it. So the Budapest Open Access Initiative supports text and data mining. Here in the UK we also have the periodic Research Excellence Framework, which has introduced an open access policy for the materials that are eligible for submission to it, and that policy says the deposited copies should be machine accessible as well. And the Confederation of Open Access Repositories (COAR) asks that repository content can offer the foundation for a whole host of text mining activities to be developed on top of it. This means that repositories in academic institutions, and subject repositories, currently hold a very rich amount of information, millions of research outputs, and it is very important that these outputs are accessible for text and data mining purposes.

How can repository managers make this possible? This slide is a little bit technical and I'm not going to go into a lot of technical detail, but I believe some repository managers may be attending this webinar, or may watch it online in the future, so I thought it would be a nice idea to mention the most important points that allow the harvesting of information, so that other services can take it and make it available in a machine-readable way for text and data mining purposes. As I said earlier, I work for CORE, which is a harvesting service: we harvest content from thousands of institutional and subject repositories around the world, but also from open access journals and from gold and hybrid journals. All this content is not very easy to get if some of the rules are not followed as they should be. For example, the first bullet point in this list is to choose an industry standard platform. This doesn't happen so much lately, but in the past the tendency was that universities would build their own platforms.
That was a very good gesture from universities that wanted to find a solution for hosting open access outputs in repositories, but a lot of the time those instances did not support the major interoperability protocols. Therefore, one recommendation is to choose a standard industry platform.

The second point is to register the repository's OAI-PMH base URL. This is a small link which allows machines to access your repository. It should look fairly similar to the link that you give your users to access your repository, download the outputs and read them with their own eyes, but the OAI-PMH URL is the address for machines to get to your repository. At CORE, for example, we mainly find the repositories that we harvest in the Directory of Open Access Repositories (OpenDOAR) and the Registry of Open Access Repositories (ROAR), and we get the journals from DOAJ, the Directory of Open Access Journals. So if you have an open access instance at your institution, it is important to have it registered in a registry.

The third point is a very technical one, but just to show you a little bit how it works: imagine that in your repository there is one file which gives instructions to the machines that come and knock on your repository's door and say, "knock knock, hello, I would like to come into your repository and get information from it." This robots.txt file is the first "person" they meet; think of it as a butler who tells them, "you are allowed to do this now that you have entered the repository, but you are not allowed to do that." A lot of the time there are many limitations in this document, so the butler does not give us much freedom, and this results in the outputs not being shareable with the rest of the world in a machine-readable way.

The fourth point is the use of meta tags. Meta tags are metadata for web pages. A repository is itself an online instance, and each record is another, smaller instance that appears online, so it would be useful if the metadata tags for those online outputs were properly described, so that machines are able to understand what kind of information each page contains and how they can retrieve it.

Item number five relates to the metadata of the single record: the dc.identifier field is where we hope to see the full text linked. The point about external URL resolving concerns the URL that appears at the very top of the metadata page, where there is a handle for the specific record; there are already tools out there, from services such as Handle.net or DOI, that can resolve these handles and make it possible to find the full text.

In general, you may be doing all of those things, but of course this list does not contain everything that could go wrong.
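To make the OAI-PMH and robots.txt points a little more concrete, here is a minimal sketch, in Python, of the kind of check a harvester performs before visiting a repository. The repository address and the path of the OAI-PMH endpoint below are made-up examples, since the real path depends on your repository platform.

# A minimal sketch of how a harvester might check a repository before harvesting.
# The repository address is a made-up example; replace it with your repository's
# real base URL, and note that the OAI-PMH path varies by platform.
from urllib import robotparser
import urllib.request

REPO = "https://repository.example.ac.uk"        # hypothetical repository
OAI_BASE = REPO + "/oai/request"                 # typical, but platform-dependent

# 1. Does robots.txt (the "butler") allow harvesters to reach the OAI-PMH endpoint?
robots = robotparser.RobotFileParser(REPO + "/robots.txt")
robots.read()
print("OAI endpoint allowed by robots.txt:", robots.can_fetch("*", OAI_BASE))

# 2. Does the OAI-PMH endpoint answer a basic Identify request?
with urllib.request.urlopen(OAI_BASE + "?verb=Identify") as response:
    xml = response.read().decode("utf-8", errors="replace")
print("Identify response received:", "<Identify>" in xml)

If the butler blocks the OAI-PMH path, or the Identify request fails, the repository's outputs will not reach harvesting services no matter how good the metadata is.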
Therefore, it would be nice if you would also monitor your repository, so that even if you have done bullet points one to six, you can confirm that the end result shows well to the outside world in a machine-accessible form, and that there is nothing else in the way, something you could not foresee or something too technical for your knowledge to reach. So monitor your repository and make sure the outputs show well to the outside world in a machine-accessible form.

The reason I'm saying all this, and the reason I insisted and spent a little bit more time on the previous slide, is that there are services like CORE, and what these services do is take this content and give it out to everyone who needs it, to use in any way they want. One of the ways we give the content out is through the CORE API. Users who have the technical skills to do text and data mining can download the whole collection of the CORE service via our API and then either build new services on top of it, or create queries and find out, for example, about the impact of the work at their own institution, or compare the impact of their institution with another institution. This creates new knowledge; as CORE we are not so much interested in showing it to others ourselves, but we give out the raw data so that others are able to discover everything they want to know themselves. Apart from the CORE API, we also provide the whole collection as a dataset. If you don't know the difference between the API and the dataset, consider the API to be a live representation of the content that we have, while the dataset is something we generate every three or six months, so it is not updated on a daily basis the way the CORE API is. The CORE API is being used by other services: very recently CORE started being indexed by ProQuest, and there is also a scientific search engine called Iris.ai, and they use the CORE API to enrich the content of their services.

But to extend this even further, there are other things that can be done with the content that repositories and journals hold. In this diagram you see the work we did in the OpenMinTeD project, the European project I mentioned earlier, which focused on collecting scientific outputs and creating an infrastructure for text and data mining of those outputs. At the very bottom of the diagram are the various categories of data providers, and you may think of yourselves there: yes, the work I'm doing, and the place I'm affiliated with, belongs to one of those data providers. Then, apart from CORE, there are other services, like OpenAIRE, which I'm sure you are all familiar with, and Agro-Know, which those of you in the agricultural field may know as well. In the middle level, the one with the purplish background, you have all the services that collect the content, and these services feed the content into the OpenMinTeD text mining services cloud.
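To give a flavour of what "creating queries" over an aggregator's API can look like, here is a small illustrative sketch. It is not the documented CORE API: the endpoint, parameter names, response shape and API key below are placeholders, and you should check the current CORE API documentation for the real syntax.

# A rough illustration of the kind of query a researcher might run against an
# aggregator API such as CORE's. The endpoint, parameter names and API key are
# placeholders, not the documented CORE API.
import json
import urllib.parse
import urllib.request

API_KEY = "YOUR-API-KEY"                          # hypothetical key
BASE = "https://api.example.org/search"           # placeholder endpoint

query = urllib.parse.urlencode({
    "q": "text and data mining",   # search the aggregated records
    "limit": 10,                   # how many results to bring back
    "apiKey": API_KEY,
})

with urllib.request.urlopen(BASE + "?" + query) as response:
    results = json.load(response)

# Print one title per result, assuming a simple {"results": [{"title": ...}]} shape.
for record in results.get("results", []):
    print(record.get("title"))

The same pattern, with different search terms or filters, is how someone might look at the outputs of their own institution or compare them with another institution's.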
This is just an example from this specific project, but in general what I would like to introduce is this idea, because you may think to yourselves: I run an open access journal and it is very small, or I run an institutional repository at a very small institution, and we only have a small amount of content, or only a small amount of it is open access and can be shared openly with others. Nonetheless, think of the bigger picture. What I haven't mentioned so far is that, for text and data mining to be successful, the people who practise these methodologies to discover new knowledge require a very large corpus of text, and the bigger the corpus, the better. So even if you think your repository is only a small portion of a bigger whole, every output is beneficial for the general purposes of text and data mining.

Some other work we did in this project is that, apart from harvesting the open access outputs from institutional and subject repositories and from open access journals, we tried to see whether it would be possible to harvest hybrid gold open access outputs. This means we wanted to see whether it would be possible to take journals from publishers that do not have open access as their traditional business model, Elsevier or Springer, for example, but that nonetheless offer an open access option, so that some of the articles in a specific issue may be openly accessible to everyone while others require a fee or a subscription. We managed to get a very large amount of research articles from these publishers, and we discovered that there were a lot of technical and legal challenges in this effort. I'm not going to bother you with the technical challenges of this work, but perhaps you should consider that, in general, if there is a technical challenge, the end result is that the output doesn't reach the outside world from a machine-access point of view.

We presented this work at the FORCE2017 conference last year. We chose this conference on purpose because it is a conference that a lot of publishers attend, and we wanted to make sure that the work we had done reached publishers. Thirty-three publishers attended; we had some feedback cards, and we also gave presentations where we showed the technical challenges that we faced in this work. In the beginning we thought that perhaps publishers would not see this work very positively, or that even if they did, they would not express any interest, especially those who had not participated in the first round of our effort to extract open access content. But the opposite happened: after the presentation we managed to add four more publishers to our systems. This was very, very important for us, and we welcomed the gesture. So we are seeing that there is a tendency, a shift in the culture even among publishers, to embrace the openness of outputs, not only from the reader's point of view but also from the machine-access point of view.
And of course this is only to the benefit of text and data miners, who are now going to have access to a wider variety of content. Here are some statistics on how we managed to do that. From OpenAIRE and CORE we have a certain amount of open access articles in this text and data mining collection, and then via the CORE Publisher Connector, which is the work we did with the publishers, we managed to get even more content. This resulted in a very large database of content that others can use. And of course this is content that data providers give us: it is the people who work in the repositories who make sure the metadata is correct and who make sure that authors at their universities have deposited the outputs they published in a journal. I see it this way because I am also a very big open access fan, and I think open access is the work of many people: a very nice collaborative effort of many people around it who make it possible, so that in the end the end user sees the benefit of it.

If you would like to read more about the work we did, you will find a blog post we wrote for the LSE Impact Blog; the URL is at the bottom of this slide. It describes the technical issues and what we did in a little more detail, and it also provides links to the useful tools that we created from this work. There is also a GitHub page with information about the technical limitations that we found per publisher, and recommendations.

Now, if I have not convinced you yet to start thinking about text and data mining, to start talking about it at your institution, to go back and see whether your repository is ready to share its data with the rest of the world, then I've added this slide here, just in case you wanted something extra to convince you. I personally believe that research support staff and people who work in libraries are exactly the kind of people who could promote text and data mining in all its forms. First of all, this group of professionals knows how to spread information: we have done this in the past for other things that were important in our field, and we have done it well. People who work in research support or in libraries are aware of the needs of researchers, and usually there are research support teams within libraries or academic institutions who support the researchers of that institution. These people can do advocacy, they know how to do it well, and they know how to approach others, spread messages and educate people about new concepts and new ideas. Perhaps some of you also know the right people at your institution, the people who, if you go and talk to them, can make a change happen. And I'm confident that the vast majority of librarians know the copyright terms that apply in their country, because a lot of text and data mining relates to copyright issues and allowances: it is very important not to infringe copyright, but at the same time it is very important for open access outputs to be out there and be usable by others where there is no copyright restriction.
In general, I believe that research support staff can start introducing these concepts of text and data mining, be proactive, learn how they can support their researchers with text and data mining, and exchange information among research support teams so that they make sure they cover researchers' needs.

You may say: yes, okay, you have convinced me, I would like to do that, but I don't know where to find material; I still believe that text and data mining is a very practical thing, I am not a technical person and I don't know what to do. Well, you are partially right: text and data mining requires technical knowledge and it is a technical practice. Nonetheless, for you to be able to assist, there is a lot of information out there on text and data mining from a non-technical point of view. I'm going to show you now the OpenMinTeD page, which is hosted on the FOSTER platform. On the FOSTER main page, apart from the FOSTER outputs, which relate to open science, we also have a dedicated page for the OpenMinTeD project, and there you will be able to find information about text and data mining. You may also find material that combines, say, open access, open science or open data with text and data mining. So you can go to the FOSTER website, and what I've just realised is that I failed to add the URL at the bottom of the page, but I will say it now: it is fosteropenscience.eu, and when I update my slides I will give them to Gwen so that you also have the slides with the URL added.

On this OpenMinTeD page of the FOSTER platform we have divided the content into various sections. The first sections, at the very top, are the taxonomies. The purpose of the taxonomies is to show how the field of text and data mining is evolving, and we hope that the moment you see the various terms in the taxonomy, you will be able to understand the concepts that revolve around text and data mining. There is also another taxonomy, on TDM methods and workflows. Apart from giving you the basic ideas around text and data mining and its components, the taxonomies are also useful because if you click on each one of the terms, you will see the content that relates to that term. Further down there are the OpenMinTeD tutorials and courses. To be honest, apart from one course, the rest of the OpenMinTeD tutorials are a little bit technical. So I'm not giving you this information so that you can perform those text and data mining techniques yourselves, but so that you know, if an academic comes to you in the future and asks about tutorials on how to use a text and data mining infrastructure, that this is one of the places you can send your academics or researchers to. There are also some educational training videos that introduce the text and data mining concepts; those are not technical at all, they are very brief videos. And something else I would like to share: when I started working on this project, my background was that of a librarian, with expertise in open access.
I didn't know a lot about text and data mining, and perhaps some of you know more now than I knew back then. By watching those training videos I learned a lot about text and data mining, because they are short, they are not technical, and their main purpose is to introduce you to the concepts. You will also find that a lot of the videos reflect the taxonomy terms that I showed you earlier. Then there are the top resources in text and data mining, and the coloured corners on each item describe what I explained earlier: for example, an item with a blue corner has a topic on text and data mining, while an item with an orange corner has a topic on open science. The second item in the first row, which I'm looking at right now, has both corners, open science and text and data mining. Here is a closer view of the text and data mining taxonomy, and you should feel free to go to the website and explore further. At the end of this slide I have the specific FOSTER URL for the OpenMinTeD section of the website.

In the project there were essentially two groups. One group was very technical and dealt with the infrastructure; the other group was not very big, I have to say, and it was LIBER and myself from the Open University. I was the non-technical part, because other people from CORE were doing the technical work. At the LIBER conference, I think it was two years ago, the one that took place in Patras, LIBER ran a workshop, and because the LIBER conference audience is librarians and people who work in libraries in general, the feeling there was that, even though there are technical outputs for people to understand what text and data mining is, there wasn't something for research support and administrative staff. This resulted in some work that we did last year between CORE at the Open University and LIBER, and the people who also got highly involved in these conversations were from the research support team at the University of Cambridge. So we joined forces and created a course which we call Introduction to Text and Data Mining, and to our knowledge at least, it's the first course that is not too technical but tries to educate those who don't have technical skills and show them a little bit of how text and data mining is done. We also give some examples of how others can do text and data mining.

Just to show you a little snippet of how this course looks inside: we have lists of suggested readings, and we did not create something from scratch; we took readings that are already out there, available on the internet, did some research and collected them. We also embedded the introductory videos that I talked about earlier, just to show you how it works, and we have quizzes as well. We separated the course into chunks and sections, so that if you don't want to do all the sections you don't have to, and you can focus on the ones you are interested in.
At the end, if you have an account on the FOSTER platform and are registered there, you can also claim the course and receive a badge showing that you have completed it. What I would like to highlight about this course is that the Introduction to Text and Data Mining has sections on how research support staff can deal with text and data mining and what the roles of libraries and of research support staff are. The other component of the course is a technical section, where we try to explain in simple terms some of the technology that is required for text and data mining. We have divided that section into two parts. The first one we call the beginners' introduction, which we feel the vast majority of the audience can follow. The second one we call a little more advanced, and there we have created some examples using the CORE API, so that we can show people what is happening and give them some examples. Maybe those are easier than you expect, and I would suggest that you try the course, including the technical part, because in reality we don't ask you to write code. We just ask you to change some small numbers in the code, which don't really change the code itself: for example, how many outputs you would like to see back, or which dates you would like, say a range from 2015 to 2017. It's not that we ask anyone to code, and if you have time and would like to play with it, it's also safe, because you are not going to break anything. You can play with it, create your own queries, and actually see for yourself what, in practice, this text and data mining that everyone keeps talking about looks like.
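To show what "changing small numbers in the code" can look like, here is a tiny illustrative snippet; the parameter names and the query format are made up for this example and are not the course's actual exercise code.

# Only the numbers need editing: the year range and how many results to return.
# These names and the query syntax are placeholders for illustration only.
year_from = 2015     # try changing this to another starting year
year_to = 2017       # ...and this to another ending year
page_size = 25       # how many outputs you would like to see back

query = f"text mining AND year:[{year_from} TO {year_to}]"
print("Query that would be sent to the API:", query)
print("Results requested per page:", page_size)

And that is all. Thanks a lot.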