All right. Good morning, good afternoon, everybody. Welcome to the July edition of the Wikimedia Research Showcase. My name is Dario, and I'm very excited to have here with me today Andrew Holt from the University of Minnesota to present on his current work. Andrew has been working with us at the Wikimedia Foundation to help us understand the value of structured data, specifically Wikidata, and he's been doing some really fascinating work to understand how peer production communities work around the creation and reuse of structured data. So I'm really excited to have you here today. I think it's going to be a fascinating discussion for many people in our audience: Wikidata contributors, OpenStreetMap contributors, people working on structured data on Wikimedia Commons. So without further ado, over to you.

Thank you, Dario, for the introduction, and thanks for inviting me, and thanks to everybody who is attending. So let's see here. Looks like we're screen sharing. So I'd have to say that everyone here is familiar with the massive success of peer production communities like Wikipedia and OpenStreetMap. Over the last 15 years or so, Wikipedia has produced 40 million articles, and OpenStreetMap has produced nearly four billion nodes. So as Benkler pointed out, one of the key factors in the success of peer production thus far has been contributor freedom. People can generally do what they want in the way that they want. Recently, there have been some entirely new initiatives in these communities. Rather than focusing on largely unstructured data, these initiatives have put more focus on producing structured data. So these sorts of efforts have surfaced in the form of Wikidata and in the increased emphasis on tags in OpenStreetMap. So structured data in these contexts are essentially key-value pairs.
And if you've ever used infoboxes in Wikipedia, you've used peer-produced structured data, or if you've used Google Knowledge Graph in Google Search, you probably have as well. And we're going to focus today on OpenStreetMap, or OSM for short. So if you've ever used Craigslist, for example, you've used structured data from OSM. Apple Maps, Foursquare, and others also use OSM's data. So OSM produces geographic data about roads and buildings and many other things. And this data can have metadata, which describes properties of those things. So metadata, structured data, and tag data are all really referring to the same thing in OSM. So these new initiatives towards structured data, such as the one in OSM, are different because applications using structured data need it to be standardized in order for it to be useful. They really need it to follow rules. However, we talked about contributor freedom being important in these communities, and this potentially runs into conflict with this need for data standardization. And so we decided to explore this tension between freedom and standardization, and we asked a research question: how does OSM's strong commitment to contributor freedom affect its efforts to produce standardized data?

And so here's what I'm going to do in my talk. I'm going to tell you what you need to know about the OSM tagging standard, then I'll tell you about my method and results, and then I'll reflect on some implications for OSM and beyond. So in order for data to be standardized, there must be some sort of data rules. The OSM community has a tagging standard, and it's actually a wiki. So here's a page for a tag in the wiki. This tag is amenity=fast_food. So in this case, amenity is the key and fast_food is the value. Each tag in the wiki has instructions on when it should be applied. So we can see that right here. The standards in the wiki are global standards that are created through community consensus.
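As a rough illustration of the key-value structure just described, the short Python sketch below splits the tag from the talk, written `amenity=fast_food` in the wiki, into its key and its value. The `parse_tag` helper and the dictionary representation are assumptions made for illustration only; they are not part of any OSM tool or API.

```python
from typing import Tuple


def parse_tag(text: str) -> Tuple[str, str]:
    """Split a 'key=value' string into (key, value).

    Hypothetical helper for illustration; not an OSM API.
    """
    key, sep, value = text.partition("=")
    key, value = key.strip(), value.strip()
    if not sep or not key or not value:
        raise ValueError(f"not a key=value tag: {text!r}")
    return key, value


# The tag discussed in the talk: 'amenity' is the key, 'fast_food' the value.
key, value = parse_tag("amenity=fast_food")

# A mapped object's tags can then be modeled as a simple key-to-value mapping.
tags = {key: value}
print(tags)
```

A plain dictionary is used here only because it mirrors the "key and value" wording in the talk; real OSM data carries more than this.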
I'll talk more about the wiki throughout this talk.

So moving on. To explore the tension of freedom versus standardization, we decided to interview OSM contributors, and we performed 15 interviews in a semi-structured format. We used the OSM mailing lists and an OSM forum, and also snowball sampling, to recruit. Our questions focused on the process of creating OSM's tagging standard and applying it. After the interviews, we performed a data-driven analysis on our interview transcripts. This involved affinity diagramming and led to high-level results related to freedom versus standardization becoming apparent in the data.

So what were the results? Well, we arrived at three high-level concepts which were all, as I said, related to freedom versus standardization, and we'll talk about two of these today. So we'll talk about code and correctness. Code relates to how data entry tools and OSM's data standard affect data production. Correctness relates to the factors affecting correct or incorrect application of OSM's data standard. Let's talk about code first, and I'll talk about correctness afterwards.

So let's get into a bit more detail about what I mean by code. Lawrence Lessig argued that the code of a system constitutes an architecture that constrains human behavior, and it does this by allowing or forbidding different actions. Our interviewees talked about how OSM's tag data model constrains contributors' ability to represent the real world correctly. So as we know, a tag in OSM is a key-value pair. Additionally, an entity you're mapping can have only one instance of a certain key, and that instance can have only one value. Well, these constraints sometimes do not align with entities in the real world, and this can cause problems during the process of tagging. One of our participants talked about this. He used mapping a hotel that is also a restaurant as an example.
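To make the constraint just described concrete, here is a minimal Python sketch, assuming a mapped entity's tags are held in a dictionary so that each key can appear once and carry one value. The `MapElement` class, its `set_tag` method, and the place name are hypothetical names used for illustration, not OSM's actual data structures; the second value, `cafe`, is simply an illustrative alternative for the same key.

```python
class MapElement:
    """Toy model of a mapped entity with one value per key (illustrative only)."""

    def __init__(self, name):
        self.name = name
        # Dictionary keys are unique, so a key can hold only a single value.
        self.tags = {}

    def set_tag(self, key, value):
        """Set a tag and return the value it replaced, or None if the key was new."""
        previous = self.tags.get(key)
        self.tags[key] = value
        return previous


place = MapElement("Example Place")  # hypothetical entity
place.set_tag("amenity", "fast_food")

# A second value for the same key cannot sit alongside the first:
# in this sketch it overwrites it.
replaced = place.set_tag("amenity", "cafe")
print(replaced)
print(place.tags)
```

Under these assumptions, whichever value is applied last is the only one a downstream application would see, which is the kind of information loss the interviewees described.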
He said tagging is hard in this context since "it's a restaurant with a hotel upstairs, so it isn't really two things. It's just one thing." To further shed light on this point, let me give you another example. So we see a couple of different Panera Bread locations in OSM right here, and since these Panera Breads are instances of the same business and they're both located in the same country, the US, we'd think that they'd be tagged the same way. Well, the left-hand Panera Bread is tagged as amenity=fast_food and the right-hand one is tagged as amenity=cafe. The result of these two different taggings is two different icons on the map. Which of these taggings is correct? Well, that's kind of a complicated question. I go to Panera Bread sometimes, and I've bought coffee there and I've also had some quick meals. So it's sort of both a cafe and a fast food restaurant, but because of the constraints of the OSM data model, only one of these tags can actually be applied. And another example here plays out for, say, a Dairy Queen. I'd like to be able to indicate in OSM that a Dairy Queen that I'm mapping has a cuisine of both ice cream and burgers, but the code in the data model constrains me here to just one tag. And the result is that applications using this OSM data are going to be provided with an incomplete picture of this entity.

So moving on. As we said, Lessig argued that code constrains human behavior, and he was arguing that there's power in code, and we can actually see this power in OSM as well. Contributors are allowed to create tools. For example, some tools provide tag recommendations. These recommenders are called presets, and they tend to really influence what tag data people apply. Our interviewees noted that presets actually have their own tagging standards. For example, one of our interviewees, P8, said, "There's not a single source for those presets. It's all manually done by the developers of the tool." So those who create the preset code constrain other contributors by creating de facto standards, and importantly, the contributor freedom of code creators limits the freedom of others to influence OSM. It also results in code creators not following community standards. And finally, of course, the wiki itself, which tells people what tags to use and how to tag objects, is itself code. It is a constraint on behavior itself.

So now moving to correctness, which is the other high-level concept or theme within our results that we're going to talk about. We found that because OSM originated in the Western world, Western concepts and definitions have been embedded into the correctness standards within the wiki. Because of this fact, OSM standards sometimes do not work well when mapping outside of the Western world. However, interestingly, we found that the OSM community's ethos of contributor freedom allows for rules to be broken in these cases. Incorrect data can then be applied. Specifically for roads, the Western assumptions in the tagging standard confused our participants and resulted in them breaking tagging rules when mapping outside of the West. For example, one of our participants, P11, was a Westerner and she mapped remotely in Africa. Given her Western background, she was hesitant when applying a Westernized tagging standard in Africa. She said, "Someone says this is a highway and I'm like, I disagree. I'm really afraid to map that as a highway if I think, for example, a vehicle can't go down it." So as someone from the US, when I think of, say, the concept of an average highway, I think of roads that are probably paved and have a couple of lanes for car travel, something like this, which is here in the United States. However, as our participant mentioned, mapping throughout the world with a tag that represents this Western-world concept doesn't always work well.
For example, in less developed regions, both of the roads seen here had the same highway classification tag in OSM. It's worth noting that OSM originated in Britain, and our participants actually talked about how British terms and concepts have had a particularly heavy influence on OSM's tagging standard. And even for me as an American mapping in the US, there are scenarios where OSM's cultural assumptions are confusing. When I go to Walgreens or CVS, I don't think of these stores as a chemist. I think of them as pharmacies.

So here's another example of freedom allowing for incorrect data application. This one occurred during the mapping efforts of Humanitarian OpenStreetMap, or HOT as it's called, after the Nepal earthquake in 2015. In this example, HOT was mapping remotely using aerial imagery. So our participant told us, "We do bend the rules slightly with Humanitarian OpenStreetMap. We wanted to help the aid agencies to reach remote places. So what we were doing online was trying to identify helicopter landing sites, and what we did was we found the tag for helicopter landing site." Let's pause there. So here are the wiki instructions for this tag. The tag is aeroway=helipad. As summarized in this picture, it describes a helicopter landing site as infrastructure explicitly made for the purpose of a helicopter landing. Let's jump back to the rest of this quote. He continued by saying, "And what we would do is look for a clearance of 30 meters that was level land near a village. And we labeled that as a helicopter landing site." So based on the tagging standard, they were clearly bending the rules here. The tag was designed for something like this top photo that we see here, but they were labeling areas that were more like the fields in the bottom photo.
And it turns out that half a month after the Nepal earthquake, the OSM wiki was actually updated with a new tag to account for this different definition of helipad. This new tag was emergency=landing_site. So HOT's actions led to incorrect data, but eventually succeeded in driving the evolution of the global tagging ontology in OSM. Some work by Palen et al. stated that HOT is "a driver of OSM's evolution." And this is a good example of that right here. We certainly see that. So as a final example of freedom allowing for incorrect tag applications, one of our interviewees even created a tag to warn that "pretentious pubs serve bad in-house ketchup." So we see that contributors are free to break road and helipad tagging standards and even to go so far as to create some ridiculous ketchup tag. And we also saw with the helipad example that HOT has the power to influence the standards; they can use their freedom to enact change.

But are there other groups that can or cannot define what is correct in OSM? Well, there's been some work by Monica Stephens that has discussed how new tag proposals for feminized or nurturing spaces are given less "attention." Also, one of our interviewees mentioned that she had "heard of women not being listened to or respected in OSM." More generally, hostility also seems to be commonplace in OSM, and this may have the effect of limiting its victims' influence and activity in the community and leaving just those who are hostile or who are willing to tolerate hostility. Another of our female participants talked about this issue. She said, "Sometimes the email lists can be very toxic. People feel like they can say things that they wouldn't say to someone else's face."
Here it's worth saying again that, related to our high-level concept of code, the code in tools such as presets establishes de facto standards and lessens contributors' freedom to influence standards. So moving up a level, while freedom can allow contributors to break the rules, freedom when left unchecked may allow groups such as code creators or males or hostile contributors to limit others' freedom to influence the community.

So now that we've discussed results, I want to talk about some implications for OSM. As we discussed, our interviewees talked about this problematic one-key, one-value data model restriction. So it seems that the community should consider some data model changes to better accommodate the nature of entities in the real world. I think an interesting direction for future work is to try to understand the process through which mapping projects in OSM interact with the global community. We saw that HOT, Humanitarian OpenStreetMap, broke and then updated rules. It'd be interesting to see how often this process occurs. I think it'd also be interesting to look into mechanisms that can facilitate how projects like HOT negotiate tagging standards with the rest of OSM. And finally, I think that future work should look into ways to better incorporate these diverse views that we talked about into OSM's tagging standard.

Now I want to briefly move up a level here and talk about peer production communities more generally. So there's this tension between freedom and standardization that we talked about in communities like OpenStreetMap and in Wikidata. And we've demonstrated some problems resulting from freedom in this talk. However, I think that simply lessening freedoms is not likely a good solution. Contributor freedom has, after all, played a large role in the success of these communities. And when you do start enforcing too many rules, there can be negative effects on the community.
For example, there's been work showing that too many rules hurt newcomer retention in Wikipedia. So some balance between freedom and rules, or you can say between freedom and standardization, needs to be found. And I think that the issues that we talked about today that contribute to freedom affecting standardization are general problems. And they may likely appear in, for example, Wikidata as well, because in Wikidata there are, after all, global concepts and entities just as there are in OSM. And I'd like to reiterate that, as I illustrated at the beginning of the talk, we all use this data. We all use peer-produced structured data, whether in the context of Google Knowledge Graph or Wikipedia infoboxes or other applications such as Craigslist and Apple Maps. And that's why this tension between freedom and standardization in this context really matters. So with that, I'll wrap up the presentation. I'd like to thank my collaborators at the University of Minnesota and Northwestern University and also at Macalester College. I'd also like to thank those who provided feedback on this work, and the NSF and the US Department of Education for funding. So thank you, and please feel free to contact me at the email listed right here. And I'm also happy to take some questions now.

Thank you, Andrew. That was a fascinating presentation. So I have several questions, but I want to first ask Jonathan if you can relay anything from the IRC channel.

Yeah, we have two questions from IRC. One is from user pigs on the wing. The question is: what is being done in OSM to address these issues, and who leads on it?

It's a good question. To my knowledge, I'm not aware of what is being done to, for example, resolve gender gaps. And it seems like it's been a recurring issue for several years now at this point. And I guess it's a common problem across other communities as well, unfortunately.
I think I may have misunderstood the question. I think it's possible that pigs on the wing was also asking something else. So the user observes, "I know some women who are involved in RL OSM events, but it seems to me a more predominantly male community than Wikimedia." And so in that case, the follow-up question is: are there things being done at OSM to address the gender gap issues specifically?

From our participants, I was not aware of anything that was occurring. I do know that there is a large gender disparity in OpenStreetMap. It's quite distinct: maybe 96 or 97% of contributors were male, if I remember right from one study that I saw. So it's quite polarized; it's quite biased towards males.

Yeah. One other question from IRC, from Aaron Havaker, user Havak: "Can we have our cake and eat it too? Is adaptability of the data model based on new understanding a solution, or is the problem more social?" So I think he's asking: if we know that these contested tag values exist and the reason why they're contested, is making the data model itself support, say, multiple values for a tag the solution, or is there an underlying social set of challenges that wouldn't be addressed by a strictly technical solution?

That's a good question. And I'm not aware of social issues that have... Everybody comes into OpenStreetMap from a different view. So there are oftentimes cultural differences regarding what something should be called. We didn't talk about that so much in this presentation, but in the interviews there was a lot of discussion about how somebody from Britain would say, well, this is called this. For example, there was some sort of entity that they were trying to tag for boat supplies. And the British group of OSM contributors came in and said, this is a ship's chandlery or something like that, some word that's very distinct.
It's not something that you would necessarily understand if you're in the US, or probably in other parts of the world. And so I think that even if the data model changed to allow multiple values, maybe something more similar to Wikidata, there would still be a lot of debate regarding what those values should be in the data standards. So in some ways it's very much a social problem too.

Thank you. That's all from IRC so far.

Okay, I'm going to ask a question next. I have a question and a comment; I'll save the comment for later. The question is related to the role that downstream consumers play in these decisions. Having worked with structured data communities myself for quite a while, I'm aware of the fact that very often what matters to downstream data consumers is consistency and correctness, often at the cost of diversity or inclusiveness, which is funny because it ends up creating major data quality issues if they have data that is biased from the beginning. But I've seen communities of experts, ontologists, trying to grapple with the idea that not just the contents but even the data models could evolve, in Wikidata for example. And I'm curious if in any of your interviews you had any indication of how to accommodate these needs, which are legitimate, in a way that still allows the community to preserve its freedom. Has there been any discussion in your interviews with people either directly representing downstream consumers of data or voicing concerns from these groups?

Yeah, we did talk to downstream consumers of the data. Well, one thing that not just downstream consumers of the data but pretty much everybody we talked to, or a large number of people we talked to, reiterated was that I don't believe a priority of OpenStreetMap data is to have something that's consistent globally.
That's not the first priority. The first priority is making maps that are usable locally by people. For example, we talked to some people who are mapping in Asia, and when they were going out mapping, they wanted to map not for the Western tourists or people who would be coming there from someplace else, or some large application that might like to use this data; they wanted to be mapping in a way that made sense for locals. For example, you're not tagging some sort of a fuel station in a country as a gas station if it doesn't make sense to be tagged that way. If it's not the Western definition of a fuel station, then don't tag it that way. Tag it in a way that the locals will understand. That was a really common theme that we saw in OpenStreetMap: tag for local users. Does that answer your question?

That's good. Yeah, this is something we could discuss for hours. There's one more question, and I may have another comment later on.

Awesome. So this is a question from me. My question is about detection. I know, Andrew, you've been doing research on Wikidata as well, and I'm curious whether you have any insights or ideas around what approaches we could take to potentially detect these kinds of tensions or conflicts around structured data on Wikidata. We can assume that, because Wikidata is a big project and it reflects, like all technology does, the biases of the people who created it, there could very well be these kinds of conflicts over the intended use or meaning of a tag, say. And in your study, you used interviews to uncover some of these tensions. Can you think of other methods that could be used to detect, either retrospectively or even in near real time, where conflicts like this are occurring?

Yeah, that's a good question.
And in the case of Wikidata, I haven't talked to people who are making these decisions as much as I have in OpenStreetMap. And so I'm not exactly sure about the processes that are going on there. That's a very tough question. In OpenStreetMap, a lot of these discussions are going on in mailing lists. And it's a very informal process to talk about how something should be tagged. And our interviewees often talked about how very few people actually were involved in this process. You might have 10 people involved in the process of talking about how this new tag should be used and what it's going to look like. And people didn't feel like that was authoritative at all. And that was a big problem in OpenStreetMap. And so I think finding ways to increase the authority of the process and to increase awareness of the process is important in both communities. And like I said, I'm not as familiar with the Wikidata process of proposing properties or things like that. But that was a huge thing in OpenStreetMap. You'd have these big decisions being made and only a few people actually chiming in on them. I'm not sure if it was just a result of the mailing lists being kind of hard to keep up with. I think certainly some other factors played into it as well, including things like the hostility and sexism that we talked about. And as one of our participants said in the presentation, sometimes females didn't feel like their voices were being heard. And so not everybody is able to take part in this process, unfortunately. So there are some underlying factors that need to be worked on as well in order to make this more inclusive. And when that happens, then I think it'll be a more successful process of defining these standards.

Awesome. Thank you.

No other questions from IRC? Okay, so I might just make this comment and ask this question.
It relates to the comment that Jonathan just made about the medium and the expectations that people have around community spaces where data modeling work happens. My experience has been that, both with OSM and with Wikidata, most of the data modeling discussions tend to happen on a wiki-style, free-form page, mostly text-based and with the typical structure of a wiki: a page anyone can edit, threaded discussions, and so on. And it strikes me that when we're talking about the issues of including more voices, not just when it comes to cultural breadth but also bridging the divide between experts and volunteers, which is another big topic that we've seen in Wikidata, for example, these spaces are not really suited for doing the job. In other words, the constraints on the channel, both in terms of design and in terms of norms of where discussions happen, themselves affect who participates and how they can participate. For example, it is really, really hard for a bibliographic metadata expert to just make a technical comment on a specific data modeling solution if that discussion is part of a very complex thread on a wiki page that they don't even know how to interact with. So the comment that I have, something that I'm seeing more and more, is that, A, all of these communities do the data modeling work using basically wiki-based interfaces, which to me is a suboptimal solution, and B, this creates barriers to participation in these efforts that I don't think have been fully articulated. So I don't know if you have any thoughts on this.

Yeah, at least one of our participants echoed that view, that some of the decisions about tagging are made on the wiki pages in OpenStreetMap, aside from the mailing list. And it's technical; you have to know how to edit the wiki pages, and a lot of people that we talked to didn't really do that.
In fact, very few people that we talked to seemed to put much time into editing the wiki pages, and oftentimes it's really confusing to edit these pages. So that was definitely something that we saw. And you'd also sometimes see people who just did not want to get caught up in this process of defining standards. They just wanted to go out and map. They didn't want to get bogged down in this bureaucratic process of figuring out, you know, what should we call this? "I'm just going to go out and map all of these roads or trails or whatever I like." Oftentimes they're interested in a very specific thing, and they'll just go do that. They won't worry about the process. And so it's about streamlining that process of making people aware of what's being voted on, so that they can weigh in if they're interested in that domain within OpenStreetMap; maybe they're interested in bike paths or railways or something like that. There are a lot of niche areas that people are interested in, and if they're interested in those areas, they're probably going to have opinions on how those things should be tagged. And so I think facilitating them in that process of getting involved is very important and is not being fully done at this point.

Right. Yeah, totally consistent with what I've seen. Is there anyone on IRC or in the audience who knows of research on this topic, basically how to increase engagement in collaborative data modeling efforts? I'd be really, really interested in learning more about that. Thank you. So Jonathan, any final comment, blessing, or question from IRC?

Nothing. Nothing at the moment.

So, yes. Thank you very much, Andrew. This has been great.

Okay. Thank you again for having me.

And happy to continue the conversation on wikis, and see you all next month with our next showcase. Thank you, everyone.

Thank you.