[Introduction in Welsh, unintelligible in this transcription; the host introduces the talk and mentions Jeni Tennison of the ODI.]

So, over to you. Oh, just a few housekeeping things. First thing is, if you're watching from the live stream, please use the hashtag #ODIFridays to ask any questions or to talk about the event, and likewise if you're in the room. And if you want to ask a question, please wait until the very end and ask it into the mic, so that people at home or at work who are watching will be able to hear you. Thanks, over to you.

Thanks Anna. Hi everyone, and thanks for taking the time. My name is Laura Koesten, this is Emilia Kacprzak, and we'll be talking about some of the research that we've been doing here at the ODI over the past two years, where we've looked at data search and data discovery. We are both in the third year of our PhDs, and we're part of an EU Horizon 2020 programme called WDAqua, in which we work on a question answering system that uses web data to answer questions.

We're going to tell you a bit about the background of our work: what we mean by data, and what we mean by search. We'll outline why data search is complex, both from a technical and from a user or interaction perspective, describe how we imagine the future of data search, and discuss some of the research that we've been doing in this area. But first of all, to make sure that we all talk about the same thing: there are many different definitions of data, but what we mean by data here is information that's organised explicitly, so that's structured in some way. So imagine spreadsheets, web tables, maps, things like that.
And the question that we've been wondering about is: why is data still so hard to find? Who here has ever struggled to find data? We definitely have as well. We've been interviewing data professionals, people who work with data in their day jobs, about how they feel about searching for and finding data on the web, and this is what common answers looked like.

To situate this a bit: broadly speaking, when we search for data, we try to answer a question with data, or we try to find a data set to work with. So here is an example of a very high-level question: can we find links between Trump's political decisions and his personal business interests? The problem with such a question is that it would need to be broken down into several sub-questions, and those would be the types of searches that people do to answer it. One of the problems with data search is that it is often exploratory: we don't really know what data we can expect to find, so we don't really know how to ask for it and how to search for it. In the case of our question, we'd probably try to find a data set that lists the companies owned by the Trump family, and also a list of all the executive orders signed during the Trump administration, and relate them to each other. In reality, we'd probably also try to find various other data sets and add them to it, depending on the actual detail that we'd want to find out.

In addition to talking to data professionals, we've also done a search log analysis. We looked at the logs of four national open data portals, including the Office for National Statistics and data.gov.uk, and we looked at the queries, that is, the keywords that you type into search engines in order to find what you're looking for. We found that people search for data differently than they search for traditional websites: the queries are shorter, people ask fewer direct questions, and they also put more temporal information in the queries.
That means they specify a year or a time period that the data talks about.

To understand why it's so hard to find data, we are going to walk through a number of challenges that people come up against. Some of them are purely connected with the technology that is used to build data search functionalities, and those can be summed up in three points. First, data is often hidden in files that are not visible to search engines. That means search engines cannot see the content of the file, and the only thing they can rely on when searching for such content is the metadata that is provided. This results in the vast majority of data not being indexed by search engines: it doesn't really matter what kind of query you put in, those data sets would simply not be found. Second, Google and other web search engines do not directly index structured data content. They are built to index web pages, and those web pages contain mostly text; that was the primary idea behind the functionality. So when we think about structured data, we cannot index it in exactly the same way. And the last one: a data set is conceptually different. Search functionalities need to mirror the fact that we think about and use data differently than other sources of information. For example, when we need to answer a question or complete a task with data, we often need more than one data set, as we showed earlier in the example. This indicates that we need to rethink how search for data should work in terms of indexing, ranking, and presenting the results to the user.

But also, if we look at data search from a user perspective, things don't quite work the same way. It's currently not as easy to find data sets as it is to find websites. There are more steps required to, first of all, get to the data, but then also to get to an answer that might be within the data. It's also harder to keep track of where things are.
You often need to click through a couple of pages to get to a download link, or you need to go to dedicated data portals and somehow remember where a data set was sitting when you try to find it again. And as Emilia already said, we often need to combine different data sets in order to answer a question. We also know that, in comparison to reading, there are a number of different or additional skills required to find and understand data. In order to access and find data, we need to be able to deal with different formats and to understand the attached licences. We might want to combine data sets from different domains that come packaged up differently, that have different naming conventions or schemata, and we somehow need to find a way to put them together. But we also need to be able to interpret and understand data in its context. Because if we look at something like this, we don't immediately understand what it means, or I don't. And as much as this might be true for some text that's online as well, we think about data differently, and it is not the same as reading: one of the characteristics of data is that it is not self-descriptive in the same way that text is. So we need tools that help us understand data in its context, and that provide some information about the data alongside it.

So what about the future? We imagine search for data being as easy as searching for websites, but also for images, maps, or flights on the web. We think that data as a source of information is unique enough that we can't just reuse the approaches that have been used for other domains. So we think about tools that help people make sense of data. For instance, tools that recommend data sets similar to one that you already have or have just found, or tools that summarise data based on the type of query or question you put into a search engine.
But also tools that allow zooming in and out of data collections that you find in search results, so that you get an easy overview of a data collection but are also able to explore it in more detail, to actually understand what it is you're presented with. We believe that we need tools that provide context to data, that let you understand where a data set comes from and how and why it's been created. Because one of the things we realised when talking to data professionals, and in the course of our research, was that so much of understanding data actually comes from understanding how it's been put together and for what purpose.

Some of you might ask why this is so relevant, or why an organisation such as the ODI invests in such research. I think the answer is quite simple. If we make data easier to use and data search better, data can be used not only by a small number of data professionals or fairly technical people, but by many more. We think that if access to data is easier, and if we remove the barriers that we currently have between people and data, then we make data truly open. So we think that if no one can find and understand data, it is not open. And with the growing amount of data that's being put online, data search is becoming more important, and is becoming an essential part of a general data infrastructure.

So, going back to reality for a while, we will discuss some of the work that we've been doing in this direction. We started from the observation that the existing descriptions of data provided on, for example, open data platforms often do not prove useful; here we have an example of a description for one of the data sets on data.gov.uk. So we decided that we would like to know what the features of a summary that would actually be useful to a user should be, and what kind of things such a summary should contain.
So we asked 80 people, in a written task, to summarise a data set for us. This gave us 400 summaries, which we then asked people to rank according to their quality. This experiment allowed us to determine the key attributes of a summary that are necessary to give context to a future user of the data. We formalised this as a template that could be provided to, for example, the publisher of the data, and the template consists of questions. Here we just see a few of the questions, to give an overview. Some of them could be generated automatically; some of them need to be provided by a human, by a publisher. We believe that such a summary would prove useful not only to the user of the data, but could also be helpful for search and indexing functionalities, as has worked well for text on the web.

One of the other things we looked at was how to provide context to numbers in data sets. Numbers are, on the one hand, the most popular data type, especially in data sets on open data portals, but they are also the ones that require the most context for us to understand what they mean. So, looking at numbers in data sets, we were working on an approach that analyses the rows and the columns in a data set and tries to automatically work out what these numbers might mean; we try to assign semantic meaning to such numbers. For example, if you have a column in a spreadsheet that lists the populations of all countries, we would take all the numbers in that column, look at their distribution, and then match this distribution against existing knowledge bases on the web, such as Wikidata or DBpedia. And if we find a distribution that's the same, or fairly similar, we can infer that this column actually talks about the populations of countries. The approach is a bit more complex than that, but that describes the idea. The aim is to automatically understand what the numbers in data sets are talking about.
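The distribution-matching idea can be sketched in a few lines. This is a simplification, not the actual system described in the talk: the reference value sets are hard-coded toy data standing in for values that would be pulled from a knowledge base such as Wikidata or DBpedia, and a two-sample Kolmogorov-Smirnov statistic is used as one possible way to compare distributions.

```python
# Toy sketch: label an unlabelled numeric column by comparing its value
# distribution against candidate reference distributions.

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum gap between
    the empirical cumulative distribution functions of a and b."""
    def ecdf(xs, v):
        # Fraction of xs that are <= v.
        return sum(1 for x in xs if x <= v) / len(xs)
    values = sorted(set(a) | set(b))
    return max(abs(ecdf(a, v) - ecdf(b, v)) for v in values)

def best_label(column, references):
    """Return the reference label whose values are closest in
    distribution to the unlabelled column (smallest KS statistic)."""
    return min(references, key=lambda label: ks_statistic(column, references[label]))

# Hard-coded stand-ins for value sets a real system would fetch from a knowledge base.
references = {
    "population of a country": [5_000_000, 67_000_000, 83_000_000, 330_000_000, 1_400_000_000],
    "year": [1990, 1995, 2000, 2005, 2010, 2015],
    "percentage": [3.2, 12.5, 48.0, 75.1, 99.9],
}

unlabelled = [9_000_000, 60_000_000, 140_000_000, 25_000_000]
print(best_label(unlabelled, references))  # -> population of a country
```

A real pipeline would of course need far more candidates, robustness to outliers and units, and evidence from the other columns; this only illustrates the matching step.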
Because if we can do that, then we can store this information together with the data in the metadata, and it becomes easier to find. So yes, both the summaries that Emilia talked about and these semantic labels for numbers in data sets could be added to existing metadata. We see metadata as the point of interaction between a user and a data set, but at the same time the point of interaction between a search engine and a data set, so in both cases it is really important for data search. The general aim of our work is to understand what the right amount, and the right content, of information is that should be stored together with a data set in order for it to be findable. And with this, we hope to build on and extend existing metadata standards.

So, in summary: data search is complex, both in terms of technical approaches and from an interaction point of view. We hope it will mature as a research area, hopefully soon. In our work, we try to look at some of the aspects that could improve data search: analysing how people look for data and how people search for data, what summaries of data should look like, how to understand the context of numbers in data sets, and, resulting from all of the above, ideas for metadata guidelines. That's all from us today. Thank you for coming and thank you for listening.

OK, it's time for questions. Again, can you use the #ODIFridays hashtag if you're using the live stream, and can you talk into the microphone to ask a question? Otherwise people watching on the live stream won't be able to hear you. I've got questions to kick off with; here's one I prepared earlier. What can large data portals do to improve how people use them to search?

I think the main thing they could do is to work together with researchers, or with organisations like the ODI, and work on removing the barriers that are between people and data.
This is very much a new area where we can still try many things out, and it's not settled how things should work; we don't really know how data sets should be presented in search results, for example. Removing these barriers and making such collaborations easier would be one thing.

Could you give us some idea about semantic labels for numbers? Is it more than big, small or medium?

A semantic label, in our context, is when you look at a data set: you have a number of columns, and each column can be described by a concept that is well known in the semantic web. For example, we know that the population in a specific column is exactly the same population that is in different data sets, and also the one described in a knowledge base such as Wikidata. In the ideal world, we would see everything connected to the same concepts, which everybody understands and can search for. So, for example, we have a column with the populations of the countries of the European Union. We know that these are exactly the same populations; maybe the numbers are different, right, because information can differ between sources, but we know that this is what we mean by the specific number, as opposed to just having a number.

So, if I've understood: the semantic label refers to the labelling of the set, rather than some kind of indication of the actual values of the numbers, because obviously it's very difficult to get the numbers to be the same. They'll always be different.

Yes, but we can know what the concept of this number is. So we know that these numbers mean population, for example, of the cities in this specific data set.

Is there any way you could assign some sort of figure of merit? If you have a population list, for example, could you assign some sort of word which says how similar or how different these lists are?
Which one might you not trust, or something like that?

I think it very much depends on whether you trust the source, right? If you trust the publisher of the data, for example in the case of open data portals, or if you trust Wikipedia as a source. Some people might say, oh, I don't trust Google Maps because it's Google, or something similar. But I guess nowadays we all need to validate the data that we look at, so you just need to make an educated guess as to whether the data is trustworthy.

Thanks for the talk. You mentioned these summaries you were collecting from people, and I was wondering how these summaries are going to be used for the data sets you're describing. How could they help?

Yes. From a research perspective, we wanted to find out how people describe data sets: if you see a spreadsheet and have no idea what it is about, how would you describe it to another person who can't see it at the moment? But they are meant, first of all, for data publishers, in order to give context to a data set, just as has been proven to work for websites, where we use snippets of text. In search results you get these little elements of text that help you understand what a page is talking about before you have to click on it. So we would like these summaries to act as guidance for data publishers on how such a summary should be written: what it should contain, and what the different elements are that should be there, so people can understand what a data set is about without downloading it. What we would like to do is create a tool that takes the elements of these summaries that can be created or extracted automatically from the data and provides them automatically, so the publisher only has to input the elements that are not as easy to extract; that would help people. If something like that worked, it would be much easier to create good-quality summaries for data sets at a larger scale.
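The semi-automatic idea could look roughly like this. The template fields here are illustrative, not the ones from the actual study: some are derived mechanically from the file, and the rest are left empty for the publisher to fill in.

```python
# Toy sketch: pre-fill the machine-derivable parts of a data set summary
# from a CSV file, leaving the human-judgement parts for the publisher.
import csv
import io

def draft_summary(csv_text):
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    return {
        # Derivable automatically from the file itself:
        "columns": header,
        "row_count": len(body),
        "numeric_columns": [
            h for i, h in enumerate(header)
            # Crude numeric test: digits with at most one decimal point.
            if all(r[i].replace(".", "", 1).isdigit() for r in body)
        ],
        # To be filled in by a human publisher:
        "what_is_it_about": None,
        "how_was_it_collected": None,
        "why_was_it_created": None,
    }

data = "city,population\nLondon,8900000\nCardiff,360000\n"
print(draft_summary(data))
```

The point of the design is the split: the tool never guesses at provenance or purpose, which only the publisher knows, but it spares them from typing out structure that is already in the file.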
And the idea behind it was that we know, or research has shown, that we actually find it quite hard to create free text without any sort of narrative structure or guideline. So starting with such a guideline was what we were trying to do here.

Is there a blurred line between searching for structured data and searching for information? When I search for flights using Google, I don't necessarily search for a data set that contains data about when flights are happening and when they're not; rather, Google increasingly serves me up structured data about that. So is there a bit of a blurred line between structured data and information, and searching for both?

Yes. If you take the example of flights, that is structured data, but the difference to the scenario we are talking about is that with flights we know exactly where that data is, and we also know exactly what people want to know about it. So it's much easier to build applications that work for a set scenario where we know what the data is and what people search for, and we can create functionalities around that; it's a bit more like searching within a spreadsheet or within a database. Whereas in the scenario of the open web, we don't know where the data sits, it comes in different formats, and there's no schema behind it. So it's much harder to actually get to the data. Does that answer your question?

So you're talking about the semantic part and the labelling part, but a lot of the slides have things like a Trump companies dataset, which is obviously a fairly ethereal search if you're working out where the dataset is. Have you thought at all about how things would work when you start graphing them? When you have Trump and dataset and companies, then what is a Trump company?
Because there are companies that Trump's heavily involved in, companies that Trump might own, companies that Trump's family might own, and they're only really a click or a link apart. I'm sure a search engine could understand that difference, but it's a question of knowing that there is that link to follow and then following it, without making the person go into it. Is that something that you figure fits into the big picture?

I don't know if I understood correctly, I'm sorry. We've been focusing on what kind of metadata could help users who input a query such as, for example, 'Trump companies'; but then, depending on the task, you would want to assess which companies you actually want to take into account, right? Something like related datasets, is that what you mean?

Yeah, related, but not just loosely related: related in a very specific, predictable way that a computer would actually be able to realise, like, oh, and this company is owned by this company, which is owned by this company, and they have directors.

Yeah, there's a lot of reasoning to do in such a case. I know that research is also looking at how to do such reasoning over the semantic web, but with data that is not in the semantic web, we don't really know what is inside datasets yet. So I think we are not that far yet with, for example, open data; maybe some of it would work, but I don't think we are there yet. And I guess one of the reasons we haven't been looking in that direction was that, when we started, we looked at what the most popular data types actually are, and we found that on open data portals only roughly 2% of datasets are RDF. So we tried not to focus our research too much on the semantic web area.
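The kind of link-following the questioner describes is easy to state once the ownership edges are explicit. The graph below is entirely made up for illustration; a real system would read such edges from a knowledge graph, which is exactly the part that, as the answer notes, most open data does not yet provide.

```python
# Toy sketch: given explicit "owns" edges, find every company reachable
# from a person by following ownership links transitively.

def reachable_companies(owner, owns):
    """All nodes reachable from `owner` via ownership edges."""
    found = set()
    frontier = [owner]
    while frontier:
        node = frontier.pop()
        for company in owns.get(node, []):
            if company not in found:
                found.add(company)
                frontier.append(company)
    return found

# Hypothetical ownership graph (all names invented).
owns = {
    "person_a": ["holding_co"],
    "holding_co": ["hotel_co", "golf_co"],
    "golf_co": ["catering_co"],
}

print(sorted(reachable_companies("person_a", owns)))
# -> ['catering_co', 'golf_co', 'holding_co', 'hotel_co']
```

The traversal itself is trivial; the open problem raised in the discussion is getting trustworthy, machine-readable edges out of ordinary published datasets in the first place.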
Going back to the semantic labels, two things. One: do you expect, or hope, that they could be automatically generated by an algorithm? And second: do you see the semantic labels structured in a sort of Dewey classification sort of thing, or otherwise what sort of structure would the semantic labels have, to maintain consistency and be useful throughout a large number of datasets?

In the ideal world, it would be automatic. As it would work in the real world now, it would be semi-automatic: we would suggest a list of the most probable labels for given concepts. Obviously, something like a city and its population would be much easier to disambiguate to a specific semantic label, as opposed to concepts that simply are not there yet in the semantic web, for example in Wikidata or DBpedia. In terms of what such a semantic label would look like: in our work we looked at how to disambiguate specific columns, so every column would be disambiguated to a property or a class. For example, it would be one unique identifier from, say, Wikidata, describing population. So we would know that there is this group of populations, and then, by the class assigned to a different column in the dataset, which is for example city, we would know that these 20 datasets in a collection talk about cities and populations. So for the query 'city population', or 'London population' if we know that London is a city, we can point to those 20 datasets, for example.

Hi, I missed the first part of your presentation, so I'm not sure if this question is really relevant, but anyway: I get the impression that you are trying to push for a standard for metadata. I was just wondering if you could give us an example of maybe an organisation, or a local or national government, which you think is the gold standard, or has one of the best standards, for all of this semantics and metadata.
The reason I ask is that I think your research assumes really well-structured data. That's the ideal, but the reality is that the vast majority of data is very much either unstructured or uncollected. I'm coming from my own perspective: I come from the Philippines, a developing country. Collecting data is still not part of the priorities, or even the mindset, of our local officials, and it would be great to see, as we shift into collecting more data, that we try as much as possible, at the early stages, to structure it at this kind of level. So that's my question, essentially: is there a gold standard, a good standard, that organisations starting out can look into?

If you think of a summary, for example, I don't think the data needs to be very, very clean. In that case maybe the automatic generation wouldn't work as well as it would on very nicely structured data, but a summary for a user could still be written for every data set, and it would already improve the user experience when searching.

And maybe to add to that: the data sets we used for this summarisation experiment were not necessarily clean. They had missing values, the formats of dates were mixed up, and that was intentional, because what we wanted were realistic summaries of the actual data that is there, summaries that didn't assume clean or extremely structured data. But in terms of standards, there are metadata standards; data.gov.uk, for example, is, I think, based on DCAT, or there is schema.org, yeah, schema.org's Dataset. What we looked at was: these existing fields in metadata standards, how can they really be used for search, and what could we add for things to be more understandable from a user or interaction perspective?
How can we take it a step further and think about what is missing there, and what could be added? We're not creating a standard per se; we're just looking at the standards and asking what could be added that would be more useful, or would improve the discoverability of the resources. So, for example, all of the standards do have a description field. It's just that, when we look, nobody really requires a specific kind of information to be put into the description, so everybody writes whatever they think is important. The most common definition of a description field in metadata standards is 'textual description', and that doesn't actually tell us a lot about what people would want to see there, or whether, even if we put in the things that people would want to see, this would actually help them to find the data.

I thought the template you came up with was really interesting, in particular the first question, where I think you said you got people to say which variables they thought were the most valuable or interesting. If you're going to look at generating some sort of summaries, would you determine which ones were the most important programmatically, or would you just look at what these people have said and say, oh, in general people find this type of information interesting, so that will be the sort of thing we prioritise in the summary?
Yeah, that's a really interesting question; this is something we've been wondering about as well, because this was the first step of the sort of experiments that we would like to do. What we found is that, across all of the summaries we had, almost everybody included one high-level sentence, a subtitle or description of what the data set actually talks about, and these would point to what we would for now define as the key columns, or the subject columns. There are also some ways of automatically determining the subject column, the main topic, of a data set. Or it could be determined by the publisher, by the person who created the data. But yeah, that's a big question we've been wondering about as well.

Any more questions before we finish up? No. Thanks everyone for coming. We've got a three-week break for Easter, so we're back on the 24th of April. Can we just give Laura and Emilia one last round of applause?