All right, we're live. Good morning and good afternoon, everyone. This is Dario from the Wikimedia Foundation's research team, and I'm very excited to invite you all to our August edition of the research showcase. Today we have two guest speakers, and I'm thrilled to have Sneha Narayan and Andrew Su presenting. Sneha is going to speak first: she'll present her work on The Wikipedia Adventure, an attempt to build and study onboarding mechanisms on English Wikipedia. In the second part of the showcase we'll switch to Andrew Su for a presentation on the Gene Wiki project. We'll have a short Q&A at the end of each presentation, and I invite people to stick around after the showcase; there's also an IRC channel where people can follow the live discussion. So without further ado, Sneha, the stage is yours.

Thank you so much for that introduction, Dario. Hello, everyone. Hang on while I get my slides set up. Excellent. Hopefully I'm sharing my slides right now, and assuming that I am: today I'll be talking about The Wikipedia Adventure, a gamified tutorial designed to help onboard new editors on Wikipedia. The system was designed and built by Jake Orlowitz and Jonathan Morgan from the Wikimedia Foundation, with the help of a number of Wikipedians, and it was evaluated by researchers from the Community Data Science Collective, a research team consisting primarily of folks from Northwestern University and the University of Washington, including Benjamin Mako Hill, Aaron Shaw, and me.

The broad question we try to address with this project is: how do you mobilize and socialize newcomers in an online volunteer community? This is an important question because online communities that rely on voluntary contributions need a steady stream of newcomers to ensure they continue to exist. When new people show up in an online community, they go through an adjustment phase during which they need to learn how to navigate the community's existing norms and practices. We call this adjustment phase the socialization process, and if socialization doesn't occur, newcomers often just get frustrated and leave. This is a vexing problem on Wikipedia in particular. Prior work by Aaron Halfaker and others found that the number of active editors on Wikipedia has been in slow decline since 2007, largely due to a poor environment for newcomers. Since no editor edits forever, if newcomers aren't socialized and retained in large enough numbers, we can expect this kind of shrinking; if we don't solve the issue of new editor retention, it puts this really important public resource that everyone uses at risk.

However, it's often pretty hard to be a newcomer on Wikipedia, because over the years Wikipedia has become an increasingly formalized and complex community. If you're a new editor, there are lots of skills and even tacit knowledge you need to pick up in order to be an effective contributor: you need to figure out what you want to work on, then how to interact with other editors, and there are wiki markup conventions you need to understand and follow.
There are also community-wide philosophies such as the neutral point of view principle, and of course there's finding reliable sources, which, as we know, is a problem that extends beyond Wikipedia. What makes this even more complicated is that new editors are mostly expected to seek out all this knowledge on their own, or at least on their own initiative. Here's an example of a welcome message sent to a new user on Wikipedia: each of the links you see leads to a long essay, which I'm sure many of you are familiar with. This orientation towards onboarding makes the socialization process quite challenging for many new editors; seeing an instruction manual like this soon after you create an account and start editing can be pretty overwhelming, and also quite dry.

So my co-authors and I had to ask: would a more structured and engaging orientation help new editors contribute better to Wikipedia? That question was the impetus behind designing and building The Wikipedia Adventure. For the rest of my talk, I'll discuss the system design principles we used to build the tutorial, show a few screenshots from the tutorial itself, and present the results of two studies we conducted to evaluate it. First, we conducted a survey of users who had tried The Wikipedia Adventure, and we found that they responded quite positively: they enjoyed the system and learned from it. That prompted us to conduct a field experiment in which we measured the effect of playing the tutorial on aggregate subsequent contribution to Wikipedia. Spoiler alert: there was no measurable effect of playing the tutorial on aggregate subsequent contribution. We thought this was an interesting finding precisely because these two studies of the same system seem to tell us different things, so I'll follow the description of the studies with a discussion of our results.

To start with our system design principles: one of the things we aimed to do was create a structured orientation to Wikipedia, a decision based in part on findings from organizational theory. Organizational theorists distinguish between individualized and institutionalized forms of socialization. Individualized socialization is like on-the-job training: it's informal, typically initiated by the newcomer, there's no set sequence in which the newcomer learns things, and learning about an organization this way leads to a lot of variation between newcomers. Incidentally, a lot of previous socialization efforts on Wikipedia have built on this type of socialization: they expect people to take their own path, and efforts like the Teahouse and Adopt-a-user essentially provide spaces where a newcomer with a question has a forum or a mentor to direct it to, so they can learn as they go. Individualized socialization is contrasted with institutionalized socialization, which looks more like a formal orientation.
For instance, a week-long college orientation at the beginning of the semester is a form of institutionalized socialization. It's typically initiated by the organization, it's more formal in nature, the tasks that newcomers are exposed to are presented sequentially, and the experience among newcomers in a cohort is more uniform. What we know from meta-studies in organizational research is that the consensus from this body of work comes down in favor of institutionalized socialization: a number of studies have shown that institutionalized forms of socialization lead to increased self-efficacy among newcomers, more social acceptance from existing members of the organization they're joining, and better retention of newcomers. So we definitely wanted to see what institutionalized socialization would look like on Wikipedia when constructing this system. As an aside, institutionalized socialization isn't new to online communities: many citizen science projects use these approaches, as do MMORPGs such as World of Warcraft, which have newcomer-only spaces that train people before they join the community at large.

But we didn't just want the onboarding process to be more structured; we also wanted to ensure it was engaging. So we also drew upon the literature on gamification and learning outcomes. The consensus from that body of work is that gamification elements such as achievements and badges, levels, and narration, when deployed in educational environments, lead to positive learning outcomes and increased engagement.

Based on these principles, my co-authors Jake Orlowitz and Jonathan Morgan, in collaboration with other volunteers from the Wikipedia community, designed and built The Wikipedia Adventure, which, as stated earlier, is a gamified tutorial that helps newcomers learn how to edit Wikipedia. If you haven't heard of it or tried it yet, it's still up and running on the site, so definitely feel free to check it out. In case you haven't played it, I'll show you a few screenshots. The tutorial is divided into seven missions, each of which addresses a different facet of editing Wikipedia. For instance, this is mission two, which focuses on learning how to communicate with other editors. Each mission asks you to complete a series of tasks, which you're prompted to do via little pop-up messages, visible at the top right here. Here, for instance, you're asked to edit a talk page, and the system responds by teaching you wiki markup and encouraging you to use it in a simulated environment. Once you're done with a mission, the system posts a badge to your user page. The badge is meant as encouragement for the new user, but it also serves as an indicator to veteran editors that the user is being oriented to Wikipedia: it's a way to demonstrate to the community that this is a person who is trying to learn and wants to contribute in good faith.

So that's a little summary of how the system works. After building the system, we first decided to evaluate it through a user survey. To that end, we invited a number of new users via their talk pages, and some of them played the game.
We then sent another talk page invitation asking people to take a survey and tell us about their experience. We targeted the survey at English Wikipedia users who had created accounts recently, had made two or more edits, and hadn't been blocked or given a level-four warning, which serves as a basic test of good-faith editing. The survey included questions about user confidence, user engagement, and overall design satisfaction.

Here are some results. We asked people which facets of editing Wikipedia they felt more comfortable with after using the tutorial, and over 85% rated highly the tutorial's ability to explain concepts such as making edits, viewing page histories, adding wiki links, and the policy of neutral point of view. Over 90% also agreed with statements to the effect that it helped them get started as editors. Looking at user engagement, we found that 80% of users highly rated statements such as "it made me want to edit more" and "it made me feel welcomed and supported." The survey also included a spot for open-ended feedback, and one user shared that the Adventure was an easier and better way to learn the basics than just reading about them, which, if you recall, validates our decision to design for institutionalized socialization. Finally, we looked at design satisfaction. Overall, survey respondents liked the gamification and design of the system, and one user specifically shared that they enjoyed aspects such as the challenges and badges, which made it feel more like an educational tool or game than a lecture, as well as the way it recorded their achievements to date. This validates our choice of gamification techniques as a way to engage newcomers.

To summarize quickly: the survey results indicated that participants enjoyed the tutorial, it was well received overall, and user responses validated the design principles we used to build the system. At this point, we could have declared victory and simply deployed the system on Wikipedia. However, we wanted to know: does The Wikipedia Adventure affect subsequent participation by newcomers? That's why we followed up with a field experiment. In light of the positive survey results, we wanted to test the tutorial in the wild, that is, in the environment in which it would actually be used, to see whether this intervention had the potential to reverse or slow the newcomer attrition problem that Wikipedia has. To this end, we conducted an invitation-based field experiment that mimicked the way The Wikipedia Adventure would be deployed on Wikipedia and measured the effect it had on subsequent newcomer contribution. Specifically, we had a few hypotheses about how The Wikipedia Adventure could affect new user participation: we hypothesized that after playing it, a new user would contribute more edits to Wikipedia overall, would contribute more edits to talk pages, and would make contributions of greater average quality. In terms of the design of the field experiment, we identified 1,967 users to be part of our study.
We used similar inclusion criteria as in the survey study: new users on English Wikipedia who had recently created accounts and who had passed a basic test of editing in good faith. We invited 89% of these users via their talk page to play The Wikipedia Adventure; the rest formed our control group, who were not sent an invitation but whose behavior was still observed for the study. Of those invited, 386 users actually played the tutorial, so uptake was a separate characteristic we noted: being invited to play doesn't mean that people actually played. We then tracked the behavior of all these users over a 180-day period and collected a number of dependent variables. Specifically, in the six months following each user's entry into the study, we tracked the total number of edits they made, the number of edits they made to talk pages, and the average quality of the edits they made during that period. For edit quality, we used the content persistence metric developed by Halfaker and others; the logic behind it is that the longer an edit sticks around, the more likely it is to be of high quality.

We then ran a couple of models to analyze the results of our experiment. We first estimated the effect of inviting a newcomer to play The Wikipedia Adventure on their subsequent contributions, that is, on those three dependent variables. We also estimated the effect of actually playing The Wikipedia Adventure, conditional on being invited, on subsequent contributions. The distinction between the two matters because, in an experiment, randomization is important, and randomization here occurs only at the invitation stage: the set of people who actually went on to play the game is not a random subset of the invitation group, so it cannot be directly compared to the control group. We therefore used a technique commonly used in economics, two-stage least squares regression, which helps separate out these effects and identifies the effect of the tutorial itself on subsequent contribution.
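To make that two-stage setup concrete, here is a minimal sketch of how such an instrumental variable model could be fit in Python with the `linearmodels` package. This is not the study's actual analysis code: the file and column names are hypothetical stand-ins, with the random invitation serving as the instrument for actually playing the tutorial.

```python
# pip install pandas linearmodels
import pandas as pd
from linearmodels.iv import IV2SLS

# Hypothetical dataset: one row per study participant.
#   invited: 1 if the user was randomly invited to play (the instrument)
#   played:  1 if the user actually completed the tutorial (endogenous treatment)
#   edits:   total edits in the 180-day follow-up window (the outcome)
df = pd.read_csv("twa_study.csv")  # hypothetical file name

# Stage 1 predicts `played` from the random invitation; stage 2 regresses
# the outcome on the predicted values, isolating the effect of playing.
model = IV2SLS.from_formula("edits ~ 1 + [played ~ invited]", data=df)
result = model.fit()
print(result.summary)
```

The same pattern would repeat with talk page edits or average edit quality as the outcome variable.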
So what did we actually find? Our results, unsurprisingly, since I spoiled it for you at the beginning, showed no aggregate effect in either of our models on any of the dependent variables: all of the effects were statistically insignificant. We were pretty surprised by this, so we ran a number of models, and it turns out the result is robust to multiple parameter specifications of basically all the models we tried. We also conducted a post hoc power analysis, which assured us that the sample size was large enough to detect even very small effects. So this is very much a null result.

The question, then, is why we have a null result when there was positive feedback from the first, survey study: why is there a difference in the effect the system appeared to have on subsequent contributions? One of the first things people ask is whether there was something wrong with the system, but as noted earlier, users responded positively to it, and even if this were the problem, we don't really know what we would have done differently from a design perspective, since the design was based on prior research in the design space. So we think "the system was bad" is probably not the most compelling explanation for this result. It could be that gamification itself has limits: respondents to our survey said they enjoyed playing The Wikipedia Adventure, but there is a difference between playing The Wikipedia Adventure and editing Wikipedia, and the engagement they felt using the system might not have fully translated into editing Wikipedia as community members. That could explain why we didn't see an effect on aggregate contributions. Finally, a potential reason the system might not have worked is that the deployment necessarily required users to opt into playing the game, and perhaps not enough people opted in for it to have an effect on the overall pattern of contribution to Wikipedia. So even though this is a form of institutionalized socialization, in that it is structured and is presented, or intended to be presented, to all newcomers, it still requires the newcomer to engage with it; perhaps not enough newcomers seek it out despite being invited to do so, and as a result it can only be weakly effective in the aggregate.

I also want to clarify, for this audience, what I don't think the implications of these findings are. The first would be that the system is useless. I don't think that's an obvious conclusion, because the null experimental result reflects a finding from a very specific kind of deployment. What the experiment actually tells us is that if you deploy such a system and expect people to take it up on their own, that alone might not change the aggregate pattern of contribution to Wikipedia. There are other environments where people could try this game out, especially since we've had positive feedback: for instance, I've used The Wikipedia Adventure in classrooms, and it has potential to be used in hackathons and similar settings as a gentle overview of all aspects of editing Wikipedia. Informally, the students I've worked with have liked it. More research is needed to see whether such a system has positive effects in these kinds of situations. I also want to be clear that a null effect doesn't mean that trying to onboard newcomers is a futile task. We're still faced with the same major problem of needing to attract and retain new editors on Wikipedia, and even if one intervention does not solve the problem, it remains a very viable and compelling area of research to figure out what can work.

In terms of implications for the community: one of the things we learned from this experiment is that there's a tension between the self-selecting nature of contribution to Wikipedia and the very centralized idea of what the community is. Even though we invite people to contribute, say that anyone can edit, and offer many tasks they can do, there's still a very centralized and shared notion of what it means to contribute to Wikipedia as a member, and these ideas are in tension with each other. And finally, many efforts to redesign the onboarding process have asked what newcomers can do in order to fit better into Wikipedia.
But we should also ask: in what ways can the community, or the systems on Wikipedia, change in order to be more accessible to newcomers? For instance, redesigning user talk pages into something like message walls, and studying what effect those kinds of system-wide changes have on newcomers as well as on current community members; VisualEditor is another example of that kind of change. And finally, coming back to testing in the wild: no matter what interventions we try, it is extremely important that we test the effect they have on users in the field; we can only know how effective they are when we conduct experiments of the kind I presented. And that is essentially it for my talk. Thank you all very much.

Thank you, Sneha, for the great talk. I think we have room for one question, and we'll save the other ones for after the next talk. Jonathan, is there anything you want to relay from the channels? Yeah, we have one question. An anonymous Wikipedian asked on YouTube: is mentoring a solution for users to stay on Wikipedia? Yeah, that's definitely one aspect of newcomer retention; there's been a lot of research on it, and other systems like the Teahouse and Adopt-a-user have been used in the community. But the reason we tried this approach is that not every new user on Wikipedia can find a mentor. There are thousands of new users coming into Wikipedia every day, and not every veteran community member necessarily makes a good mentor. So matching people to mentors is definitely an important part of continuing to build the community, but this was very much an exploration of how we can scale this process up, make it more uniform, and make sure people don't fall through the cracks. I do think there are many avenues to explore for responding well to newcomers; this was one of the things we tried, and people have explored, and absolutely should explore, other ways as well. Cool, thank you.

Thanks, Sneha. It looks like we have a few more questions, but we're going to save them for later, so I'm going to switch to Andrew's presentation at this point. Andrew, the stage is yours: again, about 25 to 30 minutes, and we'll have questions at the end.

Fantastic. I'm assuming everybody can see my screen; if not, holler at me. So yeah, your audio is not really clean at the moment, so try again and let's see if it gets any better. Can you hear me okay? It's not great. Maybe, Sneha, you can switch off your video so we're saving bandwidth. You sound better now, Andrew. I think it's much better. Yes, thank you. Okay, good.

Fantastic. So thanks, everyone, for the invitation to present. Let me close that. I'm going to be speaking about the Gene Wiki project, and I'll start with a slide that acknowledges the great team we have working on this project. I lead an academic team here in San Diego at the Scripps Research Institute, and we have collaborators at UBC, Washington, Maryland, and in Belgium. Most of this work is generously funded by the NIH. Of course, we want to thank all the Wikipedia and Wikidata contributors. And lastly, we are recruiting: this link goes to an ad for a postdoctoral fellow, but anybody who's interested in applying the principles of open data to biomedical research, we'd love to hear from you.
Okay, so I'm going to start with a few slides that introduce the motivation for our project, which is summarized by this slide: the biomedical literature truly is massive. This is a good problem to have, but the rate of biomedical publication is growing exponentially. There are now over 1 million new articles published every year; averaged over the whole year, that's roughly one new article every 29 seconds. So it's clear that no scientist is reading all the literature. How, then, do we think about accessing the knowledge represented in this corpus? Here's one example: if you're interested in one topic, say fibronectin, which is a human protein, and you search for it in PubMed, the primary biomedical literature search engine, you can get tens of thousands of different articles about it. That reflects the document-centric view of the world. Typically, what we as a scientific community rely on is experts in the field summarizing the current state of knowledge into review articles: instead of trying to digest 30,000 articles, we see a snapshot in time of what one expert or a small group of experts thinks is most important. Of course, review articles sound a lot like encyclopedia articles, and so the idea was: can we apply the same dynamics as Wikipedia to summarizing gene function? That was the birth of the Gene Wiki project.

So in the neighborhood of 2007 to 2008, we created the Gene Wiki project, whose goal was to create a collaboratively written, community-reviewed, and continuously updated review article for every human gene. We created about 10,000 of these articles, and the basic structure looks like this: on the right-hand side is an infobox where we imported data from structured biomedical databases, reformatted it, and put it into these infoboxes, with the hope that this would encourage contributions from the community on the free-text side to summarize current knowledge. At this point, those roughly 10,000 articles get somewhere in the neighborhood of five million views per month in aggregate, and roughly 1,000 edits per month.

Just to show you what one particular page looks like: this is reelin, a human protein involved in neurobiology. It's a really great article, and most of it was written prior to our involvement; what we did was add the infobox on the right-hand side, which systematically displays content from structured databases. The article goes on and on with lots of great information, and to highlight one particular nugget that I'll come back to later, here's a sentence that captures some of the real biomedical significance of this protein: it describes how the expression of reelin has been found to be significantly lower in schizophrenia and psychotic bipolar disorder. So the article is filled with great information, and for humans to read, it's a fantastic way to get a summary of the current state of knowledge. The basic model at this point looks like this: we take data from the structured biomedical databases that we as scientists know and love and use, reformat it, and put it into Wikipedia.
And the hope is that we have a two-way interaction, where the community both benefits from what we've synthesized and contributes back to Wikipedia. I'll summarize very quickly, because it's all published work, a few applications we built on the Gene Wiki collection of pages within Wikipedia. As one example, we did text mining within the Gene Wiki pages to extract structured biological annotations; that's in this paper down here. We also developed partnerships with journals, in particular the journal Gene, where we ran a dual publication model for invited review articles: in addition to submitting a peer-reviewed article published through traditional means, authors would also contribute a Wikipedia article. This was a way to align incentives for people to contribute to Wikipedia. The third one I'll highlight is a somewhat ambitious, but in retrospect misguided, approach to embed structured data into Wikipedia articles using Wikipedia templates; we called these semantic wiki links.

To illustrate what was going on here, let me come back to that little snippet about reelin: reelin has lower expression in schizophrenia. That's great for humans, but it's actually difficult for computers to access. Using just that free text, there's no way, for example, for a user to query "tell me all proteins that are significantly lower in schizophrenia"; that's not something Wikipedia was designed to do. So in an effort to make facts within Wikipedia articles a bit more structured, something that in theory could easily be mined out and queried, we created these semantic wiki links. In the wikitext, instead of a plain wiki link to schizophrenia, we would wrap the link in this SWL, semantic wiki link, template: the target would still be schizophrenia, so it's displayed as a normal link, and the type was essentially the nature of the relationship, in this case that reelin has decreased expression in schizophrenia. In retrospect, this is a pretty convoluted way to represent structured data, and it was published in late 2011. Thankfully, in 2012, we were rescued from this treacherous path by a fantastic project called Wikidata. I probably don't need to introduce it very much, but the basic idea is that what Wikipedia is to text, Wikidata is to data. The mission statement, as articulated by Denny Vrandečić, is to provide a database of the world's knowledge that anyone can edit. This is a much more proper way to represent structured data in a crowdsourced community infrastructure, and our Gene Wiki project then expanded to make Wikidata a primary repository of biomedical knowledge. That's the transition to the second part of my talk.

So if we think about how we represent knowledge within Wikidata: I showed you the Wikipedia page for reelin; now let's look at what the Wikidata page for reelin looks like. It's shown here, and it's obviously too small to read, so let me summarize at a high level. The reelin page in Wikidata is broken down into a series of statements in which the subject is always reelin: reelin is a subclass of, or a type of, protein; reelin physically interacts with these two other proteins here; reelin regulates neural development. And down here, the one we highlighted: reelin has decreased expression in schizophrenia. Because Wikidata is fundamentally a database and not free text, all of these concepts, from reelin to the targets it's related to, to the types of the relationships, have unambiguous identifiers: items have Q identifiers, and properties have P identifiers, which form a semi-controlled vocabulary of the ways genes can be related to other concepts. Of course, I'm showing a screenshot of the web interface here, but Wikidata is really about being accessible to computers. In addition to the web interface there is, for example, a computer-readable JSON version of the item at this URL up here, where all of this is more tightly structured, and the data is available in a number of other formats as well. Importantly, note that the statement "reelin has decreased expression in schizophrenia" also includes qualifiers, in this case how that expression difference was determined, as well as references, so we can see exactly which scientific paper is cited to support the assertion.
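To make that machine-readable view concrete, here is a small Python sketch that fetches an item's JSON and lists which statements carry qualifiers and references. It's illustrative only: Q7187, the item for "gene", is used as a stand-in QID, since I'm quoting identifiers from memory.

```python
import requests

# Fetch the machine-readable JSON for a Wikidata item via Special:EntityData.
qid = "Q7187"  # stand-in item ("gene"); swap in the QID you care about
url = f"https://www.wikidata.org/wiki/Special:EntityData/{qid}.json"
resp = requests.get(url, headers={"User-Agent": "gene-wiki-demo/0.1"})
entity = resp.json()["entities"][qid]

# Statements are grouped by property ID; each statement may carry
# qualifiers and references alongside its main value.
for prop, statements in entity["claims"].items():
    for st in statements:
        n_refs = len(st.get("references", []))
        n_quals = len(st.get("qualifiers", {}))
        print(f"{prop}: {n_quals} qualifier group(s), {n_refs} reference(s)")
```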
Okay, so that's the basic Wikidata model, and we were very attracted by it. The Gene Wiki project then expanded a little bit: instead of taking information from biomedical databases and putting it into Wikipedia, we put it into Wikidata instead. Because Wikidata has tight integration with Wikipedia, we could of course use that data to populate infoboxes, and we hoped we would still have two-way communication with the community, here accessing and appealing to a different community of information scientists, data providers, and the like, while assembling biomedical data in Wikidata. That was the model.

This is a snapshot of where things are right now: a class-level view of the neighborhood of data we have loaded into Wikidata. Let me zoom in. In the middle here we have genes. Genes have different properties, like their gene symbol, various identifiers, and links to other databases, and we have roughly 505,000 of these gene items that we've loaded, created, or maintained within Wikidata. Genes are related to proteins by the properties "encodes" and "encoded by", and proteins can then be linked to things like structural motifs, binding sites, and protein families. Coming back to genes: genes can be related to different diseases, of which we've loaded something like 8,000, and those gene-disease links come from the database GWAS Central. Diseases can be related to chemical compounds, chemical compounds to pharmaceutical products, and chemical compounds can also be related all the way back to the proteins they act upon. Everything in orange here is a class of items that our team has taken a special interest in and plays the primary role in organizing. In purple you see classes where other teams in the biomedical research sphere, with whom we've loosely coordinated, have loaded information on things like sequence variants, genetic variants, and biological pathways. And in green are efforts we were completely unaware of until they started showing up.
One was an effort by the CDC to load information on chemical hazards; another was the WikiCite initiative to load bibliographic records that can then be cited. Together these really expand the ecosystem in Wikidata that is useful for biomedical researchers, and as you can see from this snapshot, there are links out to various other entities within Wikidata that also contribute to the richness of this knowledge network. In total, we're touching or maintaining about a million Wikidata items at this point, and trying to get the relationships between them as right as we can.

In the same way as before, I want to highlight some applications: how do we demonstrate the utility of all the information we've loaded into Wikidata? The first is simply using the SPARQL endpoint to demonstrate integrative biomedical queries, and let me walk you through a series of examples in this space. Suppose I want to do some relatively simple data retrieval: find all genes with some genetic association with asthma. To perform that query, we use the SPARQL query language; I won't go into the details except to note that the query references those QIDs, this one for asthma and this one for gene, and the PIDs for the properties. In these five lines of SPARQL, executed against the Wikidata SPARQL endpoint, we get this output table here, which tells us very quickly that at this snapshot in time there are 39 genes that satisfy the query. Now suppose we want to start refining the query: all genes with a genetic association with asthma where the gene product, the protein, is also localized to the cell membrane. To do that, we add essentially three lines to our SPARQL query, referencing the QID for membrane, and we narrow those 39 genes down to these 22. Suppose that instead of focusing specifically on asthma, we want to generalize to any respiratory disease. This is where we take advantage of the Disease Ontology, which we loaded, and which describes the relationships between diseases: we touch three lines of the SPARQL query and get back up to 31 genes corresponding to eight related diseases, so in addition to asthma, we see genes related to other respiratory diseases as well. Finally, suppose we want to look at all those diseases and show the causative chemical hazards that were loaded; this touches the data loaded by the CDC, which we had no direct involvement in loading. We add these two lines of code and get four diseases related to six chemical hazards. The take-home message is simply that you can chain together increasingly complex queries spanning multiple data resources in a pretty concise and expressive query language: all of that is encoded in 17 lines of SPARQL. It takes some getting used to, but once you're accustomed to it, SPARQL is a reasonably straightforward query language. That, I think, is one way we're trying to use Wikidata and demonstrate its utility to the biomedical research community.
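To give a feel for what running such a query looks like in practice, here is a minimal Python sketch against the public Wikidata SPARQL endpoint. It mirrors the first query from the talk rather than reproducing the slides verbatim, and the Q and P identifiers are quoted from memory, so they're worth double-checking on wikidata.org.

```python
import requests

# All human genes with a reported genetic association to asthma.
# Assumed identifiers: Q7187 = gene, Q15978631 = Homo sapiens,
# Q35869 = asthma; P31 = instance of, P703 = found in taxon,
# P2293 = genetic association.
QUERY = """
SELECT ?gene ?geneLabel WHERE {
  ?gene wdt:P31 wd:Q7187 ;
        wdt:P703 wd:Q15978631 ;
        wdt:P2293 wd:Q35869 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""

resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "gene-wiki-demo/0.1 (example script)"},
)
for row in resp.json()["results"]["bindings"]:
    print(row["geneLabel"]["value"])
```

Generalizing from asthma to any respiratory disease is then just a matter of replacing the fixed disease QID with a variable constrained through the subclass hierarchy, for example a `wdt:P279*` property path, which is what pulls in the Disease Ontology relationships mentioned above.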
The second thing we're really working on is building domain-specific web applications. Why? If we summarize what we've done so far: we have loaded all sorts of data from biomedical databases into Wikidata, and we've encouraged other people in the field to load their data as well. But this alone hasn't fundamentally enabled any new biomedical analysis; as an individual researcher, it would have been more painful, but I could have done that data integration in my own local database. Where Wikidata really shines is that we can start to access data that hasn't previously been structured: the knowledge contained in the brains of our domain experts. A domain expert might know that gene X is related to disease Y even though that's not in Wikidata. Clearly, individual users can add those individual statements to Wikidata using the web interface, but we've thought a lot about how to facilitate this in a more streamlined manner that directly engages domain experts.

As one example of what we're pursuing here, we built an application called ChlamBase, at this URL: a portal specifically for the Chlamydia research community. You can imagine there are dozens to hundreds of labs across the world that work on Chlamydia and need efficient ways to share knowledge between their labs, so we built ChlamBase to organize all sorts of Chlamydia-specific information in a portal that's tailored to their needs, and we hope Chlamydia researchers will find it useful. Importantly, down below, though I haven't shown it, there are edit buttons, so a researcher can say, "this knowledge here is great, but I know something in addition." Clicking that link is where somebody would add, say, a molecular function for a protein: they fill out the exact ontology term, how it was determined, the reference, and so on, and behind the scenes that writes directly to Wikidata. We think this idea of domain-specific interfaces is a powerful way to engage domain experts in contributing structured data, knowledge that had not previously been structured at all.
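Behind an edit button like that, the write path can be quite small. Here is an illustrative sketch using the team's WikidataIntegrator Python library; the item ID, credentials, and the particular statement are placeholders, and exact call signatures may vary between versions of the library.

```python
# pip install wikidataintegrator
from wikidataintegrator import wdi_core, wdi_login

# Placeholder credentials; a real deployment would use a bot account
# rather than a hard-coded password.
login = wdi_login.WDLogin(user="ExampleBot", pwd="not-a-real-password")

# Hypothetical edit: add an "instance of: gene" (P31 -> Q7187) statement
# to a placeholder item Q12345.
claim = wdi_core.WDItemID(value="Q7187", prop_nr="P31")
item = wdi_core.WDItemEngine(wd_item_id="Q12345", data=[claim])
item.write(login)
```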
Okay, so that wraps up what we've built, and I want to end very quickly with some navel-gazing and pontification. I'll go through this quickly, because it's mostly conversation starters for later and for offline, but I want to give people a sense of what we're thinking about for the future. First, we think a lot about incentives for biomedical data owners and how to convert them into data contributors. Typically this works as follows: the NIH gives money to different resources to produce databases; if those databases are valuable and useful, they get used; usage comes back in the form of citation and attribution; and that is what the NIH uses to decide which resources deserve more funding. In the Wikidata setting, the CC0 license removes the requirement for citation and attribution, so data providers are understandably worried that this cycle breaks and they lose their funding. We're therefore thinking about direct measures of usage. This ties into ideas like mining SPARQL query logs, the topic Markus Krötzsch spoke about a couple of months ago in his talk, and metrics of network interconnectedness, and we're definitely open to other ideas as well.

We also think a lot about how to push Wikidata content to Wikipedia, and I think there's some recognition that we need better and tighter integration of the Wikipedia and Wikidata edit histories, including statement-level filtering: not everything from the corresponding Wikidata item, but specifically the facts that get transcluded into the Wikipedia article, sourced across all the linked Wikidata items, not just through the arbitrary access feature. There's a really illuminating discussion with the WikiProject Medicine community, at the link down here, that goes into some of the issues better integration would address.

The last one is that we've been thinking a lot about more expressive data modeling and reporting. Biomedical data models get pretty complex pretty quickly, and we need to figure out ways to share them and, as a community, decide what those data models should look like. We're investigating a modeling language called ShEx, shape expressions; this is work by Andra Waagmeester. As an example, you can express things like: a gene must be an instance of gene, with one and only one such statement, and must not be an instance of a subclass of protein, and so on. It's a way of expressing constraints that is complementary, I think, to the constraint system that already exists within Wikidata. We're thinking about how to visualize and disseminate these models, a topic I think Dario has thought quite a bit about, and also about how to report violations so they can easily be fixed or addressed by the community.

So that covers essentially what we've done and what we're thinking about for the future. I'll end on this slide again, because I get to speak for the great work of many other people. With that, I'm happy to take any questions if there's time.

Thanks so much, Andrew, a fascinating talk and a great overview of everything the Gene Wiki team has accomplished. Jonathan, do you have questions to relay from IRC? And I guess after this we can also open up the discussion about Sneha's talk. No other questions about Andrew's talk from IRC, so you are good to go, Dartar. Okay, so I'll start directly with mine. Andrew, I have one comment and one question about the point you're making on attribution and the problem with licensing, which I think is a very important one. We're still struggling to figure out ways of engaging expert communities in Wikidata, and as someone who is involved in Wikidata, I think the benefits of the CC0 license vastly exceed the issues of attribution, under the assumption that attribution can be set by norms in the field; it doesn't need to be enshrined at the license level. However, I know this is a serious issue: every time you talk to lawyers from the IP unit of a university, they'll tell you, "no, no, this is the reason we don't do CC0." This is probably also one of the reasons database rights exist, to try to protect the curated version of a knowledge base. So it's a fascinating question, and my comment is that I would like to explore better ways of addressing this concern; I think you're completely right that it's a big issue.
The more specific question I have for you is to what extent this can be generalized to other fields. I know the Gene Wiki folks have put some thinking into how this model could be adopted by different communities; can you speak a bit to what's specific to the biomedical community versus what's generalizable? Sure. First, let me just say that my comment on attribution was in no way backing away from the CC0 license; I am 100% behind it, I think it's great. There's a lot of education and outreach that we need to do, and having engaged a lot of different data providers, I think we've learned at least some of the issues that are important to data owners in this field. In terms of what generalizes: it's a good question. We've tried to generalize our infrastructure as much as possible. For example, we developed the WikidataIntegrator Python library, which helps bot owners load their data. We also have an automation and reporting system that keeps source databases in sync with their representation within Wikidata, which is another important issue; that one isn't quite as mature yet, but it's definitely something we think is generalizable. And some of the discussions around data modeling apply too: other areas also need more expressive data models. So these are all things we are happy and excited to work on with people in other domains as well. Thank you. Can I pass it on to Jonathan?

So we have two questions for you, Andrew, that have just popped up. The first one I just put in the chat sidebar: how does Gene Wiki handle the volatility of life science results? The questioner's experience with personality genetics, associations between personality traits and SNPs, is that some claims in scientific articles are later revised in meta-analyses, and the current scheme with lists of properties seems grounded in frequentist, p-value-focused thinking. Do you think a more flexible numerical method is needed? So I think Wikidata made a very important design choice early on, and maybe this applies to Wikipedia as well: it represents claims and assertions rather than truth, because truth is not always certain. That is important because it lets us represent conflicting information; we don't have to reduce it to a binary yes or no. As for the latter part of the question, whether a more numerical system storing raw data would help: I don't know. I think there's potential value in that, but I also think there's value in the current level of assertions and claims, which is a nice layer of abstraction where you're not dealing directly with the raw data, and I really like it as a way to facilitate the kinds of integrative queries I demonstrated. Awesome.

We have another question for Andrew, and then a question for Sneha. The second question for Andrew is: besides search, could one mine this network of concepts to discover new facts about genes and other entities? Yes. This idea of latent relationships in the network is key, and this is where you can imagine all sorts of machine learning algorithms to identify those latent relationships, from existing drugs that could be used to treat new diseases, to finding new genes involved in diseases, and things like that. It's an active area of research, but right now everybody has their own silo of knowledge on which they're training their models.
And I think the argument is that if we do that in the context of Wikidata, we collaborate on that piece, and then everybody competes on the algorithms, which is where the real innovation happens. Awesome.

So I think this is probably our last question. This one is for Sneha, from Giovanni Ciampaglia on IRC. He asks: did you consider other measures of retention, like survival analysis? We did have a few other variables that we didn't report; I think we also looked at contributions to articles and so on. We didn't report them because they told very much the same story as the results I shared with you. I do not believe we did a survival analysis, and the reason is that, given this is an intervention that occurred at the very beginning of a newcomer's tenure on Wikipedia, the further out you go from that point, the less attributable any effect is to something they did months ago. We did test the results for a number of different timescales, and they told the same story; we chose six months as one timescale that represented the results we got. So yes, we did do some other analyses, but not survival analysis, because on longer timescales effects seemed not attributable to The Wikipedia Adventure. Awesome, thank you. And that, I believe, is all the questions we had.

All right, thanks so much, Jonathan. I want to close by thanking our speakers again, Sneha Narayan from Northwestern University and Andrew Su from the Scripps Research Institute. It was awesome; thanks a lot for joining today. Thanks to everyone on the channel for your questions, and we'll see you again a month from now for our next showcase edition. Thanks and have a good day. Bye-bye.