The meeting is being recorded. So next we have our keynote for this year, which will be given by Yolanda Gil, who is Senior Director for Strategic AI and Data Science Initiatives at the Information Sciences Institute (ISI) of the University of Southern California in Los Angeles. She's also a Research Professor in Computer Science and in Spatial Sciences, Director of Data Science Programs, and Director of the USC Center for Knowledge-Powered Interdisciplinary Data Science. She's a Fellow of ACM, AAAS, IEEE, and AAAI, a lot of triple-As and a lot of fellowships. And she is in fact the past President of AAAI, which is a very prominent role in the world of computer science. Yolanda's research is on intelligent interfaces for knowledge capture and discovery, closely related to the topics of this weekly workshop. She's a leading researcher on semantic wikis. She's done important research around making the data from semantic wikis readily available for researchers, and also around understanding how the communities that create semantic wikis organize themselves. I'm very excited about her talk today, which will be on crowdsourcing to synthesize scientific knowledge. Welcome, Yolanda. The screen is yours. Thanks for being with us.

Thank you. Thank you, Bob. Very nice to be here. I really appreciate any opportunity to interact with the Wikimedia community and the larger wiki community. I'm going to talk today about crowdsourcing. We used to call it knowledge collection from volunteers when we started to work on this from the point of view of AI, knowledge capture, and knowledge collection. But I'm going to talk about crowdsourcing to synthesize scientific knowledge. So I'm interested in how crowdsourcing enables us to capture new forms of scientific knowledge, and I'll give you some examples of what we've been able to accomplish so far.

Why am I interested in science? Because the questions and the problems that science is tackling these days are very close to my heart. Health, understanding intelligence, anything that has to do with questions about human health and well-being is important to me, as well as the health of our planet, sustainability, the environment. These are big questions that will be key for humanity, and I think that working in these areas makes a huge difference long term for humanity. And these problems are getting more and more complex. So how can we bring AI into the picture? That's my personal interest.

When I look at what's happening in science, I see that collaborations that pool knowledge from people with different backgrounds and different expertise are more and more important. This is a figure from Barabási, who studies scientific collaborations, showing co-author networks over the years. You may recognize some of the pictures on the left: Watson and Crick starting to collaborate in the middle of the 20th century, where you can see pairwise and small-group collaborations. The next picture is the Human Genome Project, where you start to see large networks of co-authors. The one on the very right is from the ATLAS collaboration that led to the discovery of the Higgs boson; the main paper had about 4,000 authors. I've seen their collaboration networks, and they're quite incredible artifacts, a testament to these heroic findings in science. We want to help make them more ubiquitous. That's something that I would like to see.
So one of the ambitions that I have is that we should be able to combine human abilities with other human abilities, but can we insert AI abilities in the middle? I want to share with you this quote from Garry Kasparov about freestyle chess. It's a new kind of competition that he designed after being defeated by Deep Blue. Kasparov had never been defeated like that before, and he knew it was coming; I've heard him say that he was honored to be the first grandmaster to be defeated by a computer. So when he created freestyle chess, he said a player can be any number of humans and any number of computers, no matter how many grandmasters or supercomputers you use. And what they found in this competition is that even if the humans were not grandmasters, and even if the machines were not supercomputers or the best chess-playing programs, what really made the difference was that the winning teams had identified very good roles in the game, roles in which each member of the team had strengths. So I'm very intrigued by this idea: how do we insert machines and help people collaborate in a way that results in teams that are better than they would be otherwise? And there's a diversity of talents and skills that it is important to bring in. That's my overall interest and what drives my thinking.

And I'll come back to this slide, because eventually what I would like to see is AI systems that understand enough about science to really gather knowledge and be able to write papers and make contributions to science. In order to do that, we have to go beyond what we have today, which is a description of a protein, a description of the rainfall and the weather conditions in a location, to having much richer interpretations of what scientific knowledge is, how people work together, and so on. I'll be giving you examples of this throughout the talk.

I'm going to start by talking about crowdsourcing non-scientific knowledge. That's something that we started to do earlier, and I'll share some of our lessons learned and our findings there. Then I'll move on to talk about crowdsourcing two types of things: one is collaboration tasks and collaboration networks, and the other is vocabularies, standardizing vocabularies in science through crowdsourcing. And then I'll reflect a bit on what have been our successes and what remain challenges.

So let me start with crowdsourcing non-scientific knowledge. Quite some time ago, and as I mentioned, we called this knowledge collection from volunteers, we had a pretty sizable project on collecting common sense knowledge, or common knowledge about the world. This has been a bottleneck in AI. We were working on helping people with their to-do lists. If you jot down your to-dos, could an AI system help you organize them, prioritize them, give you a warning if you're running out of time to prepare for something? In order to do that, the AI system had to have quite a bit of knowledge about the world, so we went off to collect it. We were collecting knowledge about objects in the world, like a copy machine, but also about tasks: if you're going to give a talk, if you do a video conference, what do you need, and so on. You can tell this is quite dated because I'm talking about LCD projectors, which is quite something. And a very useful kind of knowledge is: if something fails, how do you repair it?
So we were collecting all flavors of common knowledge about how the world works in an office environment. And through this, we found many techniques that worked well. For example, we wanted to be proactive in the prompts that we gave the user so that we would broaden the coverage: if we say a bicycle has these parts that we know about, that makes people think, oh, it's definitely missing a certain other part. We were validating across volunteers: very early on, we noticed that if three volunteers, and definitely if four volunteers, coincided on a certain assertion, we could trust it. We were also organizing the collection, and the frames that we were presenting to the user, by types of knowledge that our system could understand and relate to each other. So we collected quite a large knowledge base, actually one of the largest on the early semantic web with this kind of common sense knowledge. The system was called Learner, and then Learner2. This taught us many lessons. Among other things, we learned that volunteers loved to teach the system new things, and they really came back and enjoyed it, especially when we were raffling t-shirts, for some reason. So lots of lessons learned, and we have some papers about this. We could experiment online and really work out which designs of these forms gave us better contributions. So this was a very interesting way to come to this topic.

Another project that we did early on, when semantic wikis were beginning to be used more broadly and in fact Semantic MediaWiki had hundreds of sites, built on something called WikiApiary, which collected data from individual wikis. We used it to create what we called the provenance wiki, which really looked at where all of the content of those wikis was coming from and how it was created. You can see on the front page that the content was going up and up over time as these wikis were being used more and more. We learned quite a few things there; just to flash one at you, it was very interesting to see that creating classes and categories was not important to these communities. I think we were looking at almost 600 wikis, and classes were not very important, while properties were used very, very broadly. The composition of the group was also interesting: there were just a few people who edited properties, and everybody else provided assertions based on those properties, and so on. So lots of interesting things about how communities and crowds use semantics to capture knowledge.

I've always been interested in these topics. Denny Vrandečić was in my institute for some time, and I remember discussing this idea that he had conceived about having a wiki with assertions about entities that was a little bit more structured than Wikipedia. I remember having long discussions about what would be a fact versus what might be controversial, and the importance of capturing provenance. And then, of course, this all eventually transformed into the Wikidata project, which we use very extensively. We have a large project on using Wikidata to do integration of data sources for science. This is another example, a page in Wikidata that we visit quite frequently, and we're always looking for more information to be captured there. So we've been working on understanding how to crowdsource and capture non-scientific knowledge, this common sense knowledge for AI.
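As a minimal illustration of the cross-volunteer validation heuristic mentioned above, trusting an assertion once three or four volunteers independently state it, here is a sketch in Python. The data and names are invented for illustration; this is not the Learner system's actual code.

```python
from collections import Counter

# Hypothetical assertions collected from volunteers, as
# (volunteer_id, (subject, relation, value)) pairs.
contributions = [
    ("v1", ("bicycle", "has_part", "wheel")),
    ("v2", ("bicycle", "has_part", "wheel")),
    ("v3", ("bicycle", "has_part", "wheel")),
    ("v4", ("bicycle", "has_part", "banana")),
]

def trusted_assertions(contributions, threshold=3):
    """Keep assertions independently stated by >= threshold volunteers."""
    counts = Counter()
    seen = set()
    for volunteer, assertion in contributions:
        # Count each volunteer at most once per assertion.
        if (volunteer, assertion) not in seen:
            seen.add((volunteer, assertion))
            counts[assertion] += 1
    return [a for a, n in counts.items() if n >= threshold]

print(trusted_assertions(contributions))
# [('bicycle', 'has_part', 'wheel')]
```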
Now let me talk about scientific knowledge itself. I'll talk about collaboration tasks, and then I'll talk about crowdsourcing vocabularies.

I'm very intrigued by crowdsourcing collaboration tasks because, as you saw in those images, collaboration is so pervasive in science, and we don't understand it very well. I do a lot of cognitive task analysis when I see groups of scientists collaborate, which is every day. It's very dynamic, very ad hoc, very much an emergent collaboration. It's seldom the case that they come into a meeting or a collaboration knowing exactly what each of them is going to do, with a plan in hand. Everything evolves as they come to understand each other. So it's kind of the blind leading the blind in these collaborations. And it's really a necessity: any scientist who has data or is looking at a problem is reaching out to collaborate with others who have more expertise in certain types of data, or understand some other type of model, or some other type of physics or chemistry. This is something that is only going to increase as we have more scientific data available.

So we worked on a framework for collaboration that we built on top of Semantic MediaWiki. We were interested in task-oriented collaboration: how scientists formulate tasks, how they form groups to accomplish those tasks, how the group that works on one task relates to the group that works on a different task, how those tasks are related to each other. And very importantly, something that will resonate with all of you, we were not interested in the traditional AI frameworks for teamwork, where, say, you're flying a group of planes: it's clear who is the lead pilot, it's clear who has to fly the helicopter in a certain pattern, people play prescribed roles, and if someone leaves the team, someone else comes in and takes on that role. That kind of prescribed teamwork was not our interest. Our interest was to enable anyone to come in and, based on their expertise, their interests, and the goals of a collaboration, find their way to tasks that they could contribute to, and to create on their own new tasks that they were interested in.

One of the major difficulties, and I could talk about this for a long time, is that in looking at successful collaborations online, we saw that it's very important to enable newcomers to absorb what was done before they came into the collaboration, what the status of tasks was, and what people were working on, without placing a tremendous overhead on the existing collaborators. Otherwise it's unmanageable. So we really wanted to understand how the collaboration process forms and to let it keep evolving. We call this organic data science, since data science really captures this interest that scientists have in analyzing data together. What are the tasks they're working on? Who's working on each task? When do they work on them? How are the tasks accomplished? All kinds of information about tasks. I've worked for a long time on planning and task representations and workflows, so I was really interested to see what we could capture from these.

You can recognize here the look and feel of a wiki. On the left we had a special area where we could see tasks and their subtasks, and in the circles you can see the degree of completion of those tasks.
I think the more reddish they look, the more they're running behind schedule. For each of the tasks, in the middle, we had subtasks, with an indication of how much of each had been accomplished. Then we had some basic metadata: who the participants were, the dates when it was supposed to be done, who owned the task and was responsible for it, and what kind of expertise and focus it had. This is very important for connecting people to the right places. The rest of the task page was devoted to how exactly the task was going to be accomplished. And if the task used a certain software library or model, then that would have its own page on this wiki, and the collaborators could see a lot of detail about it. So every task had a unique page, a unique URL, and was connected with other tasks through this structure.

What was challenging for us was to build all this on the wiki architecture. I show this slide not because I can answer detailed questions about how we designed the architecture of this platform, but just to show you that we built many functions on the Semantic MediaWiki platform; at the time, Wikibase did not quite exist yet. We were developing APIs that allowed us to assert new facts, to query for facts, to compare the content of different pages. Then we added this idea of handling different types of categories: if we were displaying a task, the page and the metadata looked a certain way, and if we were talking about a piece of software or a data set, it would look completely different. A very important piece for us was tracking provenance: who had contributed this? For example, a software library can have authors who contributed to it, but then, in the wiki environment, who was the editor who had provided a certain fact about a task or something else? All of that was built as extensions of MediaWiki at the time. So our universe had these crucial types of scientific objects in the framework, and users could instantiate these types of pages.

This really facilitated task coordination. I'm showing you a page that has a lot of green tasks; a lot of pages have a lot of red tasks. A lot of things that scientists endeavor to accomplish are too hard, or they get distracted by doing something else that seems much more interesting or important. This was my own page, and you can see the places where I paid more attention or contributed more; we had lots of ways to reflect that for individuals. And you can see here that there's a legacy of how each task was accomplished. We had tasks about organizing a workshop, tasks about how to write a paper about something. In a lot of ways, this is knowledge about the tasks that we as researchers do every day that is really not captured in many places. It's kind of a giant, group-wide set of ongoing to-dos, and so on and so forth. So very interesting.

Users were truly collaborating. A lot of people were viewing other tasks, and most tasks, 72 percent of them, had several people signed up. But I think what's very interesting is that we started to detect what we were calling social task networks.
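The platform layer just described, asserting and querying facts over Semantic MediaWiki, can be illustrated with a minimal sketch. Semantic MediaWiki's standard `ask` API action is real, but the endpoint URL and the "Has participant" property below are hypothetical placeholders rather than the actual Organic Data Science schema; the second half builds the co-task network discussed next.

```python
import requests
from itertools import combinations
from collections import Counter

API = "https://example.org/wiki/api.php"  # hypothetical wiki endpoint

# Ask Semantic MediaWiki for all pages in Category:Task together with an
# illustrative "Has participant" property.
params = {
    "action": "ask",
    "query": "[[Category:Task]]|?Has participant|limit=500",
    "format": "json",
}
results = requests.get(API, params=params).json()["query"]["results"]

# Build the social task network: an edge between two users is weighted by
# the number of tasks they are both signed up for.
edge_weights = Counter()
for task, info in results.items():
    participants = info["printouts"].get("Has participant", [])
    users = [p["fulltext"] for p in participants]
    for u, v in combinations(sorted(set(users)), 2):
        edge_weights[(u, v)] += 1

for (u, v), shared in edge_weights.most_common(10):
    print(f"{u} -- {v}: {shared} shared tasks")
```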
In this network, the users are the nodes, and the connections are weighted by the number of tasks that two users had in common, that is, tasks they were both signed up for. You can see some groups of users that are very connected; they probably have similar, or quite complementary, specialties. And then sometimes, even if you see a user that's not connected by thick lines to the rest of the collaboration, they might be really crucial in that they're bringing a very unique kind of expertise. So a very interesting kind of network, and we could see them evolve. This is an example of a group that started out really large and then split into two subgroups that became quite distinct over time, over a period of several months. So it's really interesting to see that there are subgroups and sub-collaborations, and this can give fodder to the many of you who work on social network analysis, here very focused on collaboration tasks. It's very intriguing to me to study how humans collaborate, how they accomplish things together, and how they connect based on different expertise. I think these kinds of social task networks are greatly understudied, and it would be very valuable to understand them better.

I want to spend some time talking about crowdsourcing vocabularies, an approach that we call controlled crowdsourcing, and the success we've had there so far. I talked early on about environmental research and the environmental sciences. One of the challenges here is that it's very, very different from biology, or perhaps chemistry and other sciences, where the community seems more inclined to standardization. The situation you see here, and these are pictures from a colleague of mine, Tom Harmon, from the University of California at Merced: his group sets up these little robots that you see floating in the river. They collect data about the river, water quality, et cetera. They store it in their own database, and they publish papers about the water quality levels of the rivers they have instrumented. They go out on canoes when the sensors break, in the rain or in the heat. They really work very hard to collect all this data. And when other people ask him, well, would you share this data so that we can see the trends in water quality across this region or the whole state, he'll say, yes, sure, it's all here, I'll open up a port and you can download it. But people don't typically want that. They want him to deposit his data in a repository. They want him to annotate it. And he's just not very excited, and not just because it's a lot of work. I hope that you sympathize and understand: sometimes you go to a shared repository and they start to ask you for, say, the synchronicity periods, and you think, I don't know what a synchronicity period is. They ask you for things that your data may not exactly fit, and you may not know how you fit. So we have these very strongly diverse data collection situations, and in environmental science, if you ask about global trends, it's very hard to assemble global data. So this is our focus.

You see on the top right the famous hockey stick diagram for climate, showing that temperatures are rising. This is paleoclimate.
It's a community that has been studying the climate over the last many thousands of years. And every time they produce one of these hockey stick diagrams, it takes them years. The reason is that the data is collected by individual researchers, and it really takes a long time to pull all of it together; it's not easy for them to describe the data in a way that can be easily aggregated and analyzed. So this has been our target. To do these hockey stick diagrams, to really understand the climate dynamics, they take data from samples from all over the world, and that involves many, many researchers. So the authors of these papers are consortia, as you see in other sciences as well.

We started a project called LinkedEarth, and we told them that through wiki crowdsourcing, we were convinced they could converge on a good way to describe data uniformly without any meetings; they could do it from the comfort of their own desks without talking to anybody. You can imagine the two or three hundred people jumping up and down and saying, OK, let's hear about this. So this was a big hypothesis, that we could do this, and it was not clear. The people in this community will go out on cruise ships and set up drills like you see in the picture to get samples of coral that have been deposited over a thousand years; in the cylindrical core that they extract, every ring is a different time frame. There's no absolute measurement; these are all proxies for the climate. If the coral looks a little bit sad, then it was too hot for that species of coral, whatever too hot is for that particular species of coral. So these are the things they work on, while you have another contingent that works on ice cores, where they're looking at what's trapped in the ice, at the size of the bubbles. Other people look at lakes.

We had one single workshop at the beginning of this effort where they actually got in a room, discussed the process, and agreed to it. The main thing was to say, yes, let's do this, let's try this out. The champions for this are listed in red. They really rallied the community, and the community trusted them as paleoclimate scientists to try this new approach, because taking years to do each of their global analyses is not very viable.

I'll go quickly over this; there are some papers on it that you can read. But LinkedEarth really tried to say: as you are using your data set and you want to describe something unique about it, go to the wiki and create a description for it. If there are properties of a data set that you don't like, don't use them; but if there's a property that you would like to highlight, add it. That way we would crowdsource how they would each like to describe their data. And then we created a process very analogous to what you see in Wikipedia, with editors who oversee certain areas and decide what gets accepted from these crowdsourced properties. The result of all this process, and I'll show you a little bit of detail about it, is a new standard for paleoclimate data that resulted from just one meeting.
So at least we have one piece of evidence for the hypothesis that we could do this fully online. The standard is widely used today, with increasingly broad adoption for paleoclimate data. I'm excited to tell you that this kind of wiki crowdsourcing is helping science in new ways and helping scientists synthesize new kinds of scientific knowledge, namely these agreements on properties. This is the paper they wrote about the standard, which was actually selected for a centennial collection by the largest geoscience society. We're very excited about it. And this is how our wiki welcomes you. This is a very old picture, I didn't have time to make a fresher one, but the idea is that if you're doing lake sediments or coral cores, the properties may be very different, and you may want to add your own properties as you are editing. The data set that you're describing is really your ground truth: you need this property because it's important for describing this data set. So it's very much bottom-up. No one is just sitting at their desk thinking, oh, wouldn't it be nice to have blah. It's more, I'm working with this data set and, gosh, I cannot find this metadata, and if it's the same as this other data set, let me add it as I go. Pay-as-you-go semantics.

Every data set was downloadable. They had properties; some were provided, and others the contributors did not provide. We ran into a lot of issues because some properties required separate pages to really provide all of the information, so each data set became a network of pages, which made things very awkward. If we had to redesign this, we would do that very differently.

We allowed them to define what we called crowd properties. So after a few weeks, you would have a bunch of new crowd properties that were not in the core, which is the first gray rectangle that you see. As they were adding a new property, we would offer completions; as you can imagine, this encouraged them to adopt what was already there. They could see the descriptions of those properties, though they did not always adopt others'. A lot of what we did was about getting the community engaged: we were trying to attract them to community discussions, to maps where they could see the number of described data sets grow; we had ways to give authors credit, and the usual kinds of things that you see. But a very crucial thing is that we would ask for their opinion. We had all these polls to decide whether a certain property was popular beyond the group of people who had used it. So we did a lot of polling of the community.

One of the things that we did well was to start out with a very clean initial ontology. This took quite a bit of convincing, because from their perspective they already had a clean initial core ontology: there had been a researcher called Bob Evans who had created a core set of standard terms, and they all lived by it. But of course, when you bring mathematical logic into the picture, you start to say: well, if this is really not measuring temperature, if you're really measuring the health of the coral, which is correlated with temperature, then that's not really a temperature measurement, that's not really the observable, that's really something else. Oh, OK, well, let's call that a proxy.
OK, but it's an observation. OK, let's call it a proxy observation, as opposed to a real temperature observation from... I can't come up with the word... a real measurement of temperature. A thermometer, that's what I was trying to find. So we completely turned their clean standards on their head, because for every term that they were talking about, we could find a community ontology. We would use FOAF to talk about researchers, their papers, and their collaborators. We would use PROV. We would use GeoSPARQL. We used schema.org; I think we were possibly one of the first science ontologies to build on schema.org for data, and we really helped them think through it. SSN, the Semantic Sensor Network ontology, was extremely helpful. So that guided how we modularized and how we created that core set. And then we just let them in, and they would add new terms.

Periodically the editors would merge the new terms into the core terms, and these editors were the ones making those decisions. If the upgrade was simply a monotonic change, where we would just add a term and that was fine, then we could automatically upgrade all the data set descriptions that they had been working on so far. But sometimes it required semi-automatic processes. The papers describe this pretty sophisticated way of letting contributors annotate data sets and suggest properties, with the editors that you see on the left having different functions in updating and upgrading the ontologies. It's surprising to me that this kind of approach, updating an ontology while people are using it, is not more common. I know that the Gene Ontology has done that from the beginning, but we're always surprised that it is not more present in a lot of projects. Once a new version was approved, we included it and annotated where the terms came from. We never removed any term, practices that are now increasingly common.

What you get are these kinds of subcategories of annotations. Some of them are particular to tree cores, others to coral and marine sediments, so you can see that they form these sub-communities and sub-vocabularies, while some annotations were applicable to all of the data sets. And then there was something called chronologies, which holds temporal hypotheses: this ring belongs to this period, with very flexible ways, depending on the uncertainty they had, of describing the dating of certain parts of the core. That led to its own common vocabulary as well. So very interesting work.

I mentioned how important it was to take votes. We thought that if someone doesn't use a property, it doesn't mean they don't find it useful; they're just not using it for their data sets at the moment. So we decided to run polls that they could really answer. It was hard to do them often enough through the wiki, because not everybody was on the wiki every day, but we tried to keep the community engaged. We would tweet polls, and that was getting a lot more continuous engagement. And they would get together face to face every now and then, as they met for other reasons or other meetings, to comment on the overall process; but no one ever got together to discuss whether to use this term or that term or anything like that.
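As a rough illustration of the editors' upgrade rule just described, here is a minimal sketch of the distinction between monotonic (add-only) revisions, which can be applied automatically, and revisions that need the semi-automatic, editor-reviewed path. Term names and function names are invented; this is a sketch of the idea, not the LinkedEarth implementation.

```python
def classify_revision(old_terms: set, new_terms: set) -> str:
    """A revision is monotonic if it only adds terms."""
    if old_terms <= new_terms:          # nothing removed or renamed
        return "monotonic"              # safe to auto-upgrade
    return "needs-editor-review"

def upgrade_descriptions(datasets, old_terms, new_terms):
    if classify_revision(old_terms, new_terms) == "monotonic":
        # Purely additive changes leave existing annotations valid,
        # so every dataset description can be upgraded automatically.
        return datasets
    raise RuntimeError("non-monotonic change: run the semi-automatic process")

core_v1 = {"temperature", "proxyObservation"}
core_v2 = core_v1 | {"chronology"}       # a purely additive revision
print(classify_revision(core_v1, core_v2))   # monotonic
```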
So they were able to engage 135 researchers worldwide, as you see in the map, voting on more than 600 properties. Part of what they wanted to establish was whether properties should really be required, if they were really important, or optional and nice to have, because that determines the lift needed to annotate new data sets or legacy data sets. Just to give you an idea: for each of the different aspects of the standard, this is one of the older data sets, and the metadata that is actually available is lacking in many areas. For example, you can see that the uncertainty aspects the community considered required and essential are really not present in this data set, so they would have to contact the authors of the data set and dig all of that out. So they wanted to be careful about what to require.

Today they are moving along. They're using this standard to decide how to include new data sets in the wiki and in their repositories; some of the repositories are run by NOAA, and they also work very closely with PANGAEA, which I believe is out of Germany. They are also using the ontology and the standards to guide the creation of notebooks and analysis tools, a lot of their software. You just mention a property of the data sets that you want to analyze, and all the data sets that have that property can be pulled together. So their ability to do aggregate analysis is really amazing these days. I'm keeping my fingers crossed that I can tell you soon that the next hockey stick diagram will be done in five minutes. So that's a big success story and something that's really making a huge change for this community.

We're working now with a neuroscience community. It's a worldwide effort: I think there are about 50 countries involved, hundreds of institutions, thousands of researchers. They self-organize into groups. They don't share the data; they share the knowledge that they have data about something. So they'll say, we have a study on schizophrenia in people from 16 to 34; oh, we have a different kind of population, we have schizophrenia in older adults. So they share this metadata, not so much the data sets themselves. It's very self-organizing. And if you're studying schizophrenia, you might have data about smoking and how that changes things with aging, so another group looking at smoking and aging might reuse this data. So again, this very organic idea of collaboration. I apologize, these are very old screenshots. But we describe data sets; we describe the machines, the MRI machines that they use, which makes a huge difference when you integrate the data sets; and we describe the cohorts that come from each of the data sets. It's a way to help them create broad studies with very different kinds of populations. So the crowdsourcing of vocabularies is an ongoing process there. We really have to find the right way to engage; this is a much more fragmented and diverse community, but very exciting.

So I'm going to wrap up and leave time for your questions, which I'm very interested in. I want to highlight for you what I think have been our successes and what are our challenges ahead, which are very, very many.
And I hope you appreciate that we're coming to wikis from a completely different perspective and with certain objectives, really benefiting from all the community's work on making these platforms available and usable for the rest of us. This has really enabled a lot of research for us.

As a summary of what I've talked about, you can take away this message: through crowdsourcing, we are really creating added value and synthesizing new forms of scientific knowledge. You often see, for example, that there's a description of proteins, there are many databases about proteins, and you see Wikidata helping identify that several IDs refer to the same protein, so that we can create a single place where we can really describe that particular entity. The description of scientific entities is super important: describing papers, describing data, all kinds of entities. But we are also synthesizing new forms of scientific knowledge. One is these tasks: what are the tasks and subtasks, and what are the workflows that scientists follow as they do the science? These are not prescribed workflows, where someone says, oh yes, this is how hydrologists and agricultural modelers work together, and they just download that workflow and follow it. They're really creating a new one from scratch every time, and we're trying to capture that new form of collaboration that they are creating. I talked about the social task networks: how do people relate to one another through the work they do and the tasks they pursue in science? And I've also shown you that through crowdsourcing we can catalyze consensus vocabularies that did not exist before. I think there are many, many other possibilities in science to capture new forms of scientific knowledge as we do this work.

In the case of the consensus vocabularies, which might be of broader interest to all of you: we did a careful seeding of the core initial terms. That's really, really important. It was the same approach that Mike Ashburner took in the Gene Ontology, and something I would recommend to anyone; then you can grow that core outward. That was a key to making everything work for us. Also, you have to involve the key experts: if you're a paleoclimate researcher, why would you be inclined to spend time squeezing your brain to contribute to this effort? We need to give them immediate benefit, and in our case it was this immediate description of their data set and the immediate ability to query the data sets with the new properties. This property of semantic wikis, being able to exercise queries immediately with new properties, was very important for them. And then you really have to make it part of their workflow and their ecosystem: they care about writing software to access the data sets and analyze them, so that integration has to be there. All of these aspects really contributed.

It's still very challenging for us to support this continuous evolution of vocabularies. If you've annotated 500 data sets and now someone convinces the community that it was all wrong to do it this way, that instead it should be done some other way, it's really necessary to make that change, but it's really painful. There's an entire process to doing that, and we need to facilitate it and make it more agile.
I think these areas of task-centered collaboration, and how we help people work together in ad hoc, organic ways, are also something we need to study better. And finally, we're finding, at least in the neuroscience world, that they need to do selective sharing. In wikis, we normally open all the content to everyone; maybe some cannot edit, but everybody can view. Here they're really pushing for selective sharing: they may want to expose a lot of details about their data only to close collaborators until the paper is out. So we say that selective sharing is better than none. It's counter to the way wikis are designed, but hopefully it's something we can support for them, and we've been working towards that.

I showed you this slide at the beginning, but longer term, this is really the vision: how can we capture a lot more scientific knowledge that has to do with processes and with synthesizing ideas and expertise, and eventually have AI be part of the scientific ecosystem, and have humans and machines not just play freestyle chess, but really collaborate to solve the hard science problems of the future? I'll conclude here and take questions. Lila and Bob, thank you.

Thanks so much, Yolanda. Q&A will be managed by Christina and Tiziano, who have been collecting questions. Hi, Christina.

So we have some questions. How much time do we have? Because we have Ismael with four questions, so maybe we can ask the first and leave the others if we have time. Maybe Ismael, if you want to ask the question directly; otherwise I can read it from the chat. Let me know.

OK. Do you hear me? Yes. OK, OK. Hi, Yolanda. I have read something about your work, because I'm doing the final project for my master's thesis on data papers. When looking for ontologies around that, I found your Linked Paleo Data ontology. But I found a previous one which to me is really, really similar, the Ecological Metadata Language. Why did you prefer to start a new one from scratch instead of reusing this other one?

So we did reuse some of the community ontologies, as I pointed out. The trickiest part for us was to distinguish the many, many different kinds of observations and indirect observations called proxies. That was really, really tricky. There are lots of ontologies. There's one called ENVO that is used a lot in the life sciences. There's another one called SWEET that was put forward by NASA. Those have maybe a dozen terms that we can really connect to. But our challenge was really representing observations. In fact, in the ontology paper, we mapped the ontology to ENVO; it seemed like a good idea to do that when publishing the ontology. But that was our main challenge: the observations were not properly fleshed out in the other ontologies, and that was what we were focusing on. Thank you for your question.

And thank you for clarifying. We also had another question from Adam, I think very relevant. The question was: how do you handle this potentially sensitive data about patients? Do you store the data or just the metadata? And, in general, when collecting neuroscience data sets, how do you ensure the protection of personal information?

Yes. So our wiki, and this is about the ENIGMA project, the neuroscience project, the wiki only stores metadata. We'll say, for example, the hospital in Potsdam has patients in the 18-to-35 age range who are non-smokers, and they have data from MRIs, from EEGs, from this, from that.
So we describe the content of the data set; we never see the data set. In fact, when the neuroscientists analyze diverse data sets from separate hospitals and universities, they actually send out the workflows and the instructions to run the analysis, and the analyses are done locally. Then the results are sent to a central place here at USC, where they are aggregated; they do what they call a meta-analysis. So the data never leaves the organization that collected it.

And what I'm trying to point out is that the data, of course, is sensitive and has its own reasons not to be shared broadly, but the metadata does too. They often worry about publicizing their data very broadly, because then everybody will ask for it, or everybody will tell them, oh, can you take from your cohort this or that or the other. So sometimes they just don't want a lot of publicity about their particular data set. Sometimes they'll say this data was collected with this instrument, and if you have a certain kind of instrument, then a lot of people will want to talk to you about how you collect your data, your processes, and so on. In some sense, and I think this is pervasive in science, the more you expose, the more attention you get, and if your data is valuable, that's certainly the case. So they're very private even about their metadata. That was something that surprised me; I assumed that metadata was something freely shared, but they have many reasons not to share it.

OK, there is another question from Bob about the limits of crowdsourcing. Basically, how far can we push crowdsourcing, and what are the limits of what can be achieved by a crowd in terms of science?

Thank you, Tiziano. So, you know, we call it crowdsourcing because it's over 100 people; that's a crowd. And in the case of ENIGMA, there are thousands, so that's a crowd, but it's a very peculiar kind of crowd. When we were collecting common sense knowledge, we called them netizens or web volunteers: just anyone out there who could answer questions about how you go to a meeting, or how long a lunch break is, and things like that. Here, these are bona fide experts. They really have a lot of expertise in their field. They're not just anyone off the street or any kind of volunteer; they have a serious job, they want to do science. So it's a very different kind of crowd and a different kind of crowdsourcing, kind of its own niche of crowdsourcing, if that makes any sense.

So what are the limits? I think not everybody has a community-minded, outward-looking view of science. That's one of our challenges, even though the benefits are clear and the needs are very clear. The good news is that I believe there are more people with a community-oriented view today than there were 10 years ago among science researchers, so I think things are shifting in a good direction. And I also think that, because the experts are busy and scarce, we need to find ways to route to them the most controversial or delicate questions, while getting the rest of us involved in science in more common ways.
So our initial concept for LinkedEarth, and that's where the name came from, was that it didn't need to be the paleoclimate scientists annotating every climate data set that existed, the thousands and thousands of data sets they've collected over decades; we thought we could actually get regular citizens to annotate them. But it turned out that there's just not enough context and information for regular citizens to do it, so it ended up being just the expert crowd. Still, finding pockets of science where citizens can contribute is, I think, one of the challenges for science, because there are just too many hard questions, too much data, and too many problems; opening science to other collaborators is very important, and that's a challenge that we have. You asked a very broad question, and I'm sure there are many more answers, but I'll start there.

Yeah, just to connect to that, I think it's very related, and just in two sentences because we are running out of time: there was a quick question about, broadly, the barriers that hinder this type of research. What are your thoughts? What are the major challenges?

You know, this type of research is really a labor of love. We could be doing what 99% of the community does, which is to download data from Reddit or from Wikipedia or from wherever and study the data that's already available there. I say it's a labor of love because it takes a long time to create these environments where we can collect proper new kinds of data and use them; doing that just takes a very long time. It takes years to forge these collaborations, to establish them, and to explore them. LinkedEarth was not born in a day, and they were not convinced in a day; it took maybe two or three years of convincing and really going into the experiment. So I think that's the biggest one.

What a nice way to put it; it's inspiring to end on that note. So I think we are now scheduled for the break. Yeah, I'll go ahead, I think. Yes, so Yolanda, thank you so much, this was an amazing keynote. Thank you for spending some time with us. Yes, let's thank her. Let's clap. It was great. You are welcome to stay until the end of the workshop.
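The distributed analysis pattern described in the Q&A above, where each site runs the workflow locally and returns only summary statistics, typically ends in an inverse-variance (fixed-effect) meta-analysis at the coordinating site. Here is a minimal sketch with invented numbers; it illustrates the standard statistical step, not ENIGMA's actual pipeline.

```python
# Each site runs the analysis locally and reports only an effect
# estimate and its variance; the raw data never leaves the site.
site_results = [
    {"site": "A", "effect": 0.32, "variance": 0.010},
    {"site": "B", "effect": 0.25, "variance": 0.020},
    {"site": "C", "effect": 0.40, "variance": 0.015},
]

# Fixed-effect meta-analysis: weight each site by 1/variance,
# then take the weighted mean of the effects.
weights = [1.0 / r["variance"] for r in site_results]
pooled = sum(w * r["effect"] for w, r in zip(weights, site_results)) / sum(weights)
pooled_se = (1.0 / sum(weights)) ** 0.5  # standard error of the pooled effect

print(f"pooled effect = {pooled:.3f} +/- {1.96 * pooled_se:.3f} (95% CI half-width)")
```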