All right, and we are recording, is that right? But we are not live streaming, we are but recording. Okay, so take one; if this doesn't work, we're gonna do the entire presentations all over again. So hello, my name is Jonathan Zittrain, I'm here with my dear friend and colleague, Joi Ito, and we're so pleased to have a chance to take stock of our unusual project, now in its third year, called Assembly, and thank you all for coming out in your various roles as people presenting tonight, people among the advisory board who advised on this project, children and infants for whom childcare could not be obtained, and others from our relevant communities who are not at the moment in Tunis at RightsCon. Thank you, we know you had a choice in relevant conferences and we are grateful for your choosing this one. And in some ways I was thinking this is not RightsCon, this is WrongsCon, this is thinking about everything that can go wrong, but maybe more accurately, the counterpart to rights here is not wrongs, but duties. And if there's one through line among the projects that are about to be presented tonight, the four projects, it's thinking about the duties, moral and otherwise, that might be attendant on those who are so deep into and helping with the rapid deployment of the technologies that we'll be discussing tonight, technologies loosely grouped around the words artificial intelligence, and thinking about what those duties are, who should be mindful of them, how to be aware of them, and then how to discharge them responsibly. That's the big question that all of us have been thinking through in various guises over the past several years, and for some of us, some of you, long before that. So I'm just so pleased to have a chance to see how these projects have turned out, to record it for posterity, so that 20 years from now, we can say at least we tried, and to have everybody here willing to experiment with a new form, playing with boundaries among institutions, among categories of institutions, academia, industry, the world of foundations and other nonprofits, of government, international boundaries, having all of those things be thought of in new ways to bring people together from their regularly scheduled lives. Thank you for taking a chance to participate and help develop this new form, and one that we hope will continue into the future. So I should turn it over to Joi, who in this year received his doctorate, and if there is a way to try to describe his work, which he happily describes, that of himself and of the Media Lab, as anti-disciplinary, it's around theories of change, and this is part of the theory of change. And Joi, I should just let you say what you're moved to say.
So when we started, you started, but then when we joined the Assembly project, like three years ago now, the idea of sort of ethics and governance was kind of a new thing, and now we've got a billion-dollar effort at Stanford, we've got a secret effort at other rival institutions, and we've got this big thing at MIT called the College of Computing, which is trying also to think about ethics and governance. So I used to joke that the singularity was when every day was an AI conference, and now the singularity is when every day is an ethics and AI conference. But I think the biggest difference, the Media Lab is very much, we call it constructionist, where we learn through doing, and I think what's interesting right now, there's a lot of people doing without much thinking, and there are a lot of people thinking without much doing, and I think what's really fun about the Assembly program in this year, and I think we keep working on this, is the sort of tightly coupled relationship between the theory and the practice, and the engineering and the social sciences. And I think a key thing is that word duty. I was at a dinner a while ago with some MIT faculty, and they said, well, we don't wanna be political, and I was like, well, when it's truth or not truth, if that's politics, then we have to be political; if it's science or not science, we have to be political. And so I think there's some duty right now to do what we at least think is the right thing, and so I think we can't not be political, and so I do think a lot of these projects have a political, normative position, and I think that's okay, and so I want us to own the fact that in a sense we are political. And I think there's a tendency for academia not to want to be kind of advocates, but I think we are. And so I'm excited that the group also was quite a diverse group, and we had Navy SEALs and the ACLU in the same room working on projects with post-its, and so I think that was also- Together they are the ACLU, am I right? Thank you. His primary contribution, as you can see, is, if you go outside and you see an acceptance letter here, all emoji, all full of self-humor, that's his. Anyway, so with, and I'm not gonna congratulate you until after we see the project proposals, so with that can I hand it over? Do you have any last? I think there's just one other thing for us to say now, before we hand it over, which is that the progenitor of this idea is here tonight, somebody who's just been so helpful to us all along the way, and that is, hopefully being pointed to, Mr. Jordy Weinstock, and we just owe him a huge round of thanks. And the person to whom we're about to turn it over has been somebody who has been, thank you for again pointing out the person in question in case I just have a moment and thank some entire stranger for this, but somebody who has really bound us together, kept us focused on our compasses and the common purpose we share, and really just been the sinew and muscle that's kept it all together. And so, Hilary Ross, I just wanna have all of us thank you for doing it. And with that, we turn it over to you, Hilary, to shepherd our teams through their presentations. Thank you. Right, thank you so much to both of you. So, as Jonathan said, my name's Hilary, and I manage the Assembly program, and it's been such a joy to work with this cohort. There are 17 people in the group. They range from Navy SEAL to ethicist to product manager, and all in between.
And we have been working since March, starting from the theory, as Joi said, thinking about what projects we wanted to work on, to now, when they're presenting their finished projects, though I think they might chide me for saying finished, because they will keep working on them in different ways and carry them forward, but these are the projects that they'll share tonight. So, there are four projects. I'm going to introduce each one, starting with the Kaleidoscope Positionality-Aware Machine Learning project. The project interrogates the creation of classification systems, and had a five-person team who worked on it: Yawande Aleide of Google AI, Elizabeth Dubois of the University of Ottawa, Christine Kaeser-Chen from Google Research, Chintan Parmar at Dana-Farber, and Friederike Schüür at Cityblock. It's been really a wonderful team, and Elizabeth is going to share more about their project. So, I am Elizabeth. I'm going to talk to you today about positionality-aware machine learning. We are going to start off with a question. Tomato: fruit or vegetable? What do you guys think? Fruit, vegetable, right. Okay, it's a matter of perspective. It's also a matter of the context in which you want to use that answer. If you're a botanist, you say fruit. If you are a nutritionist, you say vegetable. Lawyers and judges in the US have agreed: vegetable. Computer vision researchers say it's miscellaneous. This idea of classification, it's the process of assigning a name or a category to a particular idea, concept, thing. It's a process that we go through in our daily lives continually. The idea of classification is also the idea of creating taxonomies for understanding the world around us and usefully reducing the large amounts of nuance and detail that there are in the world. We see it when we're trying to understand how diseases spread around the world. We use it when we're trying to understand what online harassment looks like. We use it to understand differences in race or gender. Gender is a particularly interesting one, particularly in the kind of Western societal context where at one point we saw gender as pretty much agreed upon as a binary variable. There were two options, but now that's no longer the case. So for thinking about these classification processes and trying to embed them into the machines we're building, we need to be thinking about them critically, in the context that we're currently in, and about what might change moving forward. This is a quote that we had from one of the many user interviews we conducted with different ML and AI engineers. This woman is at a major US news organization and she talked about the idea of classification and when it might present problems in terms of harms it could cause. She said, there are some times when it just doesn't matter, it's not an issue to do with harm. She started with the idea of, okay, our autoencoder model for image recolorization, well, that's not gonna cause anyone harm. Paused, thought about it. Actually, maybe it does. It's kind of a weird algorithm that may lead to whitewashing. And so this is something we saw time and time again as we were asking these practitioners to think about their classification choices and when they might be problematic: that once they started digging into that problem, they realized there was this potential for problematic decision making in something that on the surface wasn't an issue in the first place. And so this is where we come to our idea of positionality.
Positionality is the specific position or perspective that an individual takes given their past experiences and their knowledge; their worldview is shaped by positionality. It's a unique but partial view of the world, and when we're designing machines we're embedding positionality into those machines with all of the choices we're making about what counts and what doesn't count. So this is a very, very simplified data pipeline. This is when we go from data into the ML model that we are trying to train. I'm gonna use the context of online harassment. Let's imagine we have a whole bunch of tweets and we wanna decide whether or not those tweets are exemplifying harassment. Well, we would grab our data, we would apply labels to that data, harassment, not harassment, and then we'd train a model on it so it could predict. This requires a really complex classification system, right? So we have a system to decide what counts as harassment and what doesn't. We have to train a whole bunch of annotators to literally go through the data piece by piece and assign those labels. Then they apply that, and that's what we get to feed into the model. So thinking about online harassment, in a project I worked on, we started with three categories. That was our classification system, our taxonomy, right? We had positive, neutral, and negative. Every tweet was going to fit into one of these categories. Our annotators could not agree. Three categories did not work, and it was because there were a bunch of boundary cases between neutral and negative. It caused tons of problems, and we could not get good inter-coder reliability, or inter-rater reliability, which is a common tool used to assess agreement. We added a fourth category called critical, and all of a sudden our annotators agreed the majority of the time. We had to redesign our classification system in order to respond to the actual data, the way that data was being presented and the way humans interact with it and understand it. So what we're saying is, to interrogate these classification systems, we need to be thinking about what counts in what context. We need to be thinking about who those annotators are, why we've selected them, how they've been trained and at what moment in time, and we need to think about the actual application of the classification systems and question whether or not there is sufficient agreement and whether or not our approach has been reliable. That was an example of a homegrown classification system for a very specific project, but this idea of positionality is embedded even in very old, institutionalized classification systems that are used around the world. So the International Classification of Diseases is a tool that's used internationally to identify and classify health problems, and it actually underpins a lot of the US healthcare billing system. This is an example of the different codes you can use in the ICD for being harmed by birds, okay? So there is a code for having been harmed by a chicken or a goose or a parrot. There is no code for ostrich, though, okay? Think about how big an ostrich is, then think about maybe living in Australia. If you ask an Australian what is gonna be a more risky, harmful health situation, being kicked by an ostrich or being bitten by a goose, probably they're gonna think the ostrich is the more important thing to count, but the ICD, it wasn't developed with that in mind.
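Before turning back to the ICD, a concrete aside on the inter-rater reliability check mentioned a moment ago: below is a minimal sketch, not the team's actual code, of how agreement between two hypothetical annotators might be measured with Cohen's kappa. The tweets and labels are invented for illustration only.

```python
# Minimal sketch (not the project's actual code): measuring annotator
# agreement on a harassment taxonomy with Cohen's kappa.
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels from two annotators over the same ten tweets,
# using the original three-category taxonomy (positive / neutral / negative).
annotator_a = ["negative", "neutral", "neutral", "positive", "negative",
               "neutral", "negative", "neutral", "positive", "neutral"]
annotator_b = ["negative", "negative", "neutral", "positive", "neutral",
               "negative", "negative", "neutral", "positive", "negative"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")

# If kappa stays low, the response described in the talk is to revise the
# classification system itself, for example adding a "critical" category
# for the boundary cases between neutral and negative, then re-annotating
# and re-checking agreement, rather than forcing data into the old labels.
```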
The ICD was developed with its origins in the 1850s, and it's now maintained by the WHO, and it was designed primarily by white men in Western Europe and North America. Their positionality is embedded in the ICD today, and it will continue to be unless we routinely question what that positionality looks like. Ultimately, choices here are inevitable, and this idea of removing bias just doesn't jibe once we understand that these choices are gonna happen regardless. A lot of the conversation about de-biasing algorithms is about adding rows. If you just add enough data, you'll be able to get a representative view of the world, but if you limit yourself to only the columns for parrot, chicken, and goose and you don't have a column for ostrich, you will never capture how many ostrich kicks there were. So, if the de-biasing debate isn't helpful, what do we do instead? We argue that you could look towards being positionality aware, and we suggest that there are three basic steps that machine learning engineers and others involved in the process can take. The first is to uncover positionality in your own workflows. Look not only at the classification systems but also the data and the models that you're making use of, and think about where positionality enters. Keep track of it. Next is to try and ensure there's context alignment. That's an alignment between the classification system and the context in which it was developed and the actual application scenario for the machine learning tool that you are creating. And here let's return to that online harassment example. We developed that for Twitter. Maybe we wanna use it on Reddit now. If you're thinking about just taking the model that was created for Twitter and applying it to Reddit, there are very few options for embedding a positionality-aware approach. Maybe you're thinking, well, if I just feed in a bunch of new data, I can solve the problem. So you trained it on Twitter data, now you're gonna train it on Reddit data. That'll get you closer. But what you actually need to do is question that classification system. You need to go back and look at how you're actually assessing what counts as harassment and what doesn't, because the way people communicate on Twitter is different from Reddit. On Twitter you have a short character count. You might use hashtags, @-replies. On Reddit you are probably talking in very specific subreddits. You're probably using particular language because you know there's a moderator watching what you're doing and keeping track to make sure that you're within the bounds of what that community has deemed to be acceptable. You have way more space to do it, right? And so the ways that we classify content for Twitter and Reddit, they're probably gonna be different. Certainly the ways we train our annotators have to be different, because those approaches do not work when the content and the context are completely changed. The last step here is to remember that you need to be continually trying to ensure that that alignment exists. The models might change, the data might change, the classification systems themselves might change; the ICD is changed by the WHO relatively routinely, and so if you're making use of it, you need to update your approaches.
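One lightweight way to act on "uncover positionality and keep track of it" is simply to record where a classification system came from and to check that record whenever the system is reused in a new context. The sketch below is an illustration of that idea, not a tool the team built; all field names and example values are assumptions made for the example.

```python
# Minimal sketch, not part of the team's tooling: recording where a
# classification system came from so that context alignment can be checked
# before a model or taxonomy is reused somewhere new.
from dataclasses import dataclass, field

@dataclass
class ClassificationSystemRecord:
    name: str                      # e.g. "harassment taxonomy v2"
    categories: list               # the labels annotators can assign
    developed_for: str             # platform / population the labels were designed around
    annotator_pool: str            # who labeled the data, how they were trained
    last_reviewed: str             # when the taxonomy was last re-examined
    known_gaps: list = field(default_factory=list)  # the missing "ostrich" columns

    def aligned_with(self, deployment_context: str) -> bool:
        # Crude placeholder check: flag any reuse outside the original
        # context for human review rather than assuming the labels transfer.
        return deployment_context == self.developed_for

record = ClassificationSystemRecord(
    name="harassment taxonomy v2",
    categories=["positive", "neutral", "critical", "negative"],
    developed_for="Twitter (short posts, hashtags, @-replies)",
    annotator_pool="trained graduate coders, 2019",
    last_reviewed="2019-06",
    known_gaps=["sarcasm", "non-English tweets"],
)

if not record.aligned_with("Reddit (long-form posts, moderated subreddits)"):
    print("Context mismatch: re-examine categories and re-train annotators.")
```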
It's also important to recognize that the context in which you're building something might change whether you like it or not, and so having a lack of control there kind of requires you to be aware of what's shifting in order to build a reasonable and responsible tool. So with all of this in mind, what we did was run a workshop with ML engineers, and we've got a number of other workshop proposals already submitted, to EPIC and FAT*. We've created a white paper that's available on our website and plan to write a more detailed position paper that we can make available widely. I will let you go explore the website on your own, but before I do that, I just wanna leave you with this. Right now, ML and AI systems kind of are like a one-size-fits-all t-shirt. They fit very few people. A lot of us end up kind of unhappy, but we can do better. We can harness this opportunity to be aware of the very specific contexts in which these tools can be deployed, think about how they can be tracked over time, and find ways to serve the specific needs of the users and the developers, in order to be aware of the particular perspective from which we are designing, that perspective which is embedded in all of the tools we're creating. Thanks. Thank you so much. From their team, I learned that there is a category called "struck by a duck" in our healthcare system, which I did not know before, so thanks, you learn. Our next team is the Surveillance State of the Union. Their project highlights the risks of pursuing surveillance-related work in AI. The team is B. Covello, who works with the Partnership on AI, Carl Governelli, who is working at the US Navy as a Navy SEAL, as we said before, and Peakscraft, who is at the Oxford Internet Institute, and I will let their team take it away. All right, thank you all so much. So you've been hearing a lot about the Navy SEAL and I just wanted to say that that is not me. Our team is interesting; Peaks unfortunately couldn't be with us here tonight, but we're a small team, although the image here might mislead you. We are small in part because we're working on a pretty controversial topic, and that is the topic of algorithmic warfare. That's what brought us together, and in particular, when people hear that term they think of a lot of different things. They may think of Skynet, the Terminator, thinking about robots and, here in Cambridge, Boston Dynamics, but we actually wanted to challenge that paradigm a little bit. When we began this project we were looking into what is the role of AI in warfare? What is algorithmic warfare? And we came to the conclusion that there's kind of this interesting misunderstanding. So there's a really big effort underway to address the dangers of these kinds of robotics, and it's real, right? That's a valid effort, but there's also this very real threat that a lot of people kind of overlook, which is that at the end of the day it doesn't actually necessarily rely on a robot being the delivery mechanism of violence if you're still being targeted, and so that's really what inspired this project.
We wanted to look at how surveillance, how these algorithmic decision-making systems and surveillance systems, feed into this kind of targeting decision-making, and in particular what we're gonna talk about today is the role of the AI research community, how that research ends up in the real world being used with real-world consequences, and then talk a little bit about what we found in our investigation of the space and what we invite you all to join us in doing as we move forward. So to take a step back, there's a pretty diverse crowd here, so I just wanna contextualize a little bit about what we're talking about when we're talking about this kind of algorithmic surveillance. Here you're seeing depicted a video surveillance system that has image recognition technology, so it's identifying, it's putting boxes around, various people and vehicles and so on. So we're looking at a couple of different types of systems here: computer vision systems, systems that can listen to voice or audio, as well as systems that may look at social networks or things that we post on social media to come to conclusions about whether or not we fall into a certain group. Now in this project in particular we're talking about an incredibly rich and interrelated system, and I'll talk a little bit about this data visualization in a second, but I wanna first just contextualize that when we're talking about this research community we're talking about a lot of different players in a global ecosystem, and as researchers we wanna share, we wanna collaborate, and that's a beautiful thing. But what can be challenging about this space is that these threads of connection could lead to places that we didn't originally intend, and we may as researchers not even have been aware, and so our work in particular is exploring how some of those outcomes, how some of the end uses of research that may be done with the thought of being theoretical or benign, can actually have real harm in our world. So with that I'm gonna turn it over to Carl. I'm Carl, I'm the one they've been talking about. And I'm gonna teach you how to hunt people. So many practitioners in this space will often say that the technology just isn't mature enough to be a true threat, that you couldn't use facial recognition alone to build a target deck, and I promise you that grossly understates the threat. How you hunt people is actually with a series of overlays. You don't use one individual technology; you use them together to home in and find the corollaries to build a smaller list until you have your approved target deck. So if I wanted to identify all of the protestors in Hong Kong yesterday, I wouldn't just use facial recognition on CCTV camera footage. I would take that and put it on an overlay of metro cards used in the subway system in Hong Kong. And that might give me a more accurate deck to begin our interrogations. So jumping off of that short example, a quick deep dive. This is not exhaustive, but if you're not familiar with the Uyghur crisis in Xinjiang, China, I'll give you just a couple of sound bites. So since 2014, the Chinese People's War on Terror has been a systemic or systematic effort to oppress the Uyghur population, which is an ethnic minority in China that practices Islam, roughly 10 million individuals. So 1.5% of the Chinese population, but one that also accounts for 20% of all arrests in China. Consequently, there are over 1 million Uyghurs currently assessed to be interned in Xinjiang province in reeducation camps.
These are the camps of Holocaust lore, where there's killing, torture, and primarily this method of assimilation via submission. It's a heavy workload; genocide is tough. So the People's Liberation Army has leveraged the tech industry. This region, this small province and cross-section of society, accounts for about $7.5 billion of the security industrial complex in China. And they leverage facial recognition, voice recognition, and other forms of AI to compile lists of people who are either demonstrating Islamic practice or who demonstrate phenotypic features of this ethnic minority. Now, where you stand depends on where you sit, and in the interest of positionality, we're gonna bring this home to MIT. A bunch of institutions in America serve as key nodes in the surveillance supply chain. But here at MIT, through our research, we found that there are about 15 projects ongoing in some way, shape, or fashion on surveillance types of technology. So the nature of this ecosystem lends itself to this really complex interdependence. The research ongoing here at MIT is done in affiliation with, or with funding from, both private and public institutions. So a private institution would be NEC, a Japan-based security firm whose product is basically predictive policing. It's used here in the States, it's used in the UK. Other companies that have contributed funds to this type of research at MIT include SenseTime and iFlytek. Those two companies are both implicated in the provision of services in Xinjiang province. Also, one fantastic supplier, if you can use that term, of funding for this type of research is the US government. So the same researchers that are linked to these companies are also receiving funding from DARPA, the Office of Naval Research, the Marine Corps Warfighting Lab, and the Army Research Lab. Occasionally they actually appear in the same publications, in the acknowledgments section. So we wanted to present this, and for some folks in the room it's probably new information, but we are not the first people to talk about this issue. The issue of surveillance tech is really important, and it's been an emerging and burgeoning conversation here in the United States as well as around the world. There are actually folks in the room here who are experts in the topic, but really there's this common theme here, which is that researchers and product teams have an important role to play in actually helping to think about how this technology is used, or prevent it from being used irresponsibly. And in particular I wanted to talk a little bit about what we've covered so far. So we've looked at, in reality, thousands of different projects and contracts from the US government or journal publications. We honed that down to a couple hundred that were surveillance-tech specific, and then we ultimately came down to about 65 research institutions here in the United States with 49 funders, as Carl said, ranging from Chinese surveillance tech companies to US surveillance tech companies to our own US government. And they covered a broad breadth of different types of surveillance research, including facial recognition, social network analysis, and person re-identification, which for me was a new phrase, and which is a little bit like what Carl was talking about at the beginning of the presentation here. So what we've been doing is putting together all of this information and really trying to literally visualize it.
So I know this kind of looks like scribbles on the screen here, but this is a data visualization we've been putting together representing all of those different connected nodes. And ultimately what we would like to do is to be able to move forward on a project that has a better understanding of the reality of what's on the ground. We wanna know: is this data reliable? Can we get access to more information? For instance, today we were only using public sources, and we also wanna really involve the voices of the people who are developing these tools. One of the common themes in the work that we were doing to research this project is that we found that a lot of the people involved actually didn't know how their work was being used in the world, and that is both concerning but also a really exciting opportunity for us to educate and do better. But ultimately there are also a couple of things that, even if you're not a machine learning researcher, even if you're not in this kind of surveillance tech space, there are a couple of key points that we would really like you to push on as you're having conversations with your representatives about what we can do about surveillance tech, at least here in the United States. So this is where we go from fact to opinion, so the opinions herein are not representative of the US government, but they are ours. So, call to action. Two points where, through our research, we feel we can apply some leverage to this problem set and ideally make a difference rather than just admire the problem. First is the Department of Commerce. That's the office that holds the Export Administration Regulations. These are essentially regulations that govern the export of commodities, including dual-use technologies and software. So adding facial recognition and surveillance technologies to that list would be a forcing function for anybody in this complex ecosystem to have to apply some level of risk mitigation and threat modeling to their decision calculus. And then finally, we should probably call for a moratorium on US government funding for the research of these technologies. Not saying that the requirements are not going to be fulfilled by the US government, but the acquisition of this technology should probably happen in a highly regulated commercial market as opposed to being directly funded, because the complexity and the low density of these skill sets just don't lend themselves to ethically sound and transparent funding profiles. So let's take that basic research funding and put it somewhere else. Thank you. All right, thank you. That was an enlightening presentation, and to Joi's point that sometimes our work needs to be political. So the next team is the AI Blindspot team. Their project offers a process for preventing, detecting, and mitigating bias in AI systems. And the four team members are Ania Calderon from the Open Data Charter, Hong Qu from the Shorenstein Center, Dan Taber from Indeed, and Jeff Wen, who's soon to be at Stanford. So on behalf of my group I'm extremely excited to be presenting our product, which is known as AI Blindspot. I actually wanted to start by just reflecting a little bit, looking back on March. We started this program in the middle of March with an initial two weeks. And right after that, there was an amazing flurry of activity where major tech companies were making all these bold declarations, major statements about what they were doing to address AI ethics.
In the span of two days, you had Microsoft announcing they were gonna add an ethics checklist to their product release. You had Google announcing their now-infamous ethics panel. You had Amazon announcing a partnership with the National Science Foundation to study AI fairness. But as I was watching these headlines come out, you didn't quite get the sense that this was what was really going on, because right about the same time I was starting to have conversations with several tech companies, with people who are involved in ethics initiatives, even ones who are really trying to do their best to implement changes to study bias in AI systems. And frankly, a lot of them just didn't know what they were doing, as they openly admitted, because there were just no structures or processes for assessing bias in AI systems. There were a lot of tools that had come out in the past year, including those that came out of Assembly, like the Data Nutrition Label, and others like Model Cards for Model Reporting from Google. But you need both tools and structures and processes. And those structures and processes are what we decided to address with AI Blindspot. AI Blindspot is a discovery process for spotting unconscious biases and structural inequalities in AI systems. When I say blind spot, I'm referring to oversights that can happen in a team's natural day-to-day operations during the course of planning, building and deploying AI systems. There are a total of nine blind spots, which you see in this diagram on the left. It all starts with purpose, right in the middle, because everything in an AI system should always come back to purpose and what it's being designed for. And then if you start at the lower left and go clockwise around, the other blind spots are representative data, abusability, privacy, discrimination by proxy, explainability, optimization criteria, generalization error, and right to contest. And again, these are all cases where oversights can lead to bias in AI systems that in most cases is gonna harm vulnerable populations through unintended consequences. But AI Blindspot, it's not just a fancy diagram with lots of nice colors. We wanted to turn it into an actual tool that teams could use. So we created these blind spot cards. And we created these because we wanted to design something that was a little bit more accessible. There are also a lot of impact assessment tools that are coming out, and I can say from personal experience that they're very cold and technical. We wanted to create something that was a little bit more light and accessible, that teams would be a little bit less intimidated by. By the way, this photograph is courtesy of our professional photographer, Jeff, and our professional hand model, Ania. I'll walk you through the layout of the cards. The left represents the front side of the card and the right represents the back side. So it starts on the front with a description of what this blind spot is, doing our best to phrase it in non-technical language so we can reach different audiences. And then on the back side, we have a "have you considered" section that talks about some of the steps you can take to address this blind spot. So in the case of explainability, examples could include surveying individual users on whether they actually trust recommendations made by your AI system, considering different types of models that may be more explainable than others, and factoring in the stakes of the decision.
Are you just recommending a movie to somebody, or are you deciding whether somebody's gonna get a home loan or not? And then potentially modeling counterfactual scenarios that would enable people to see what would have to change in order to achieve a more desired outcome. And then we provide a case study to give a real-world example of where this blind spot arises and potentially, or in many cases actually, has harmed vulnerable populations due to oversights the company made. And then there's a "have you engaged with" section that highlights specific people or organizations that you may wanna consult due to their expertise, either within your own company or organization or outside. And then there's a "take a look" section that provides a QR code that'll take you to different resources that'll help you address this blind spot. And then this shows our website; it's amazing, actually, that this video I recorded this morning is now out of date because Jeff keeps making so many changes to the website. But it shows the cards and then enables users to just explore the different blind spot cards. And if you click on one, like explainability here, it'll show you the same content as the card, and it'll show you the actual resources that are behind that QR code. You can take the links to different places to learn more about this blind spot or how to address it. And then there's a "what is missing?" button where you can provide suggestions. It's a good thing we added this, because we've actually gotten feedback already. We got our first feedback from somebody at the University of Washington who I think mostly had good things to say, fortunately. So with that, we wanted to give an example of a case study of how this could be applied in the real world. This is a semi-fictional case study; it may or may not have been informed by an actual incident that happened at a major tech company I may have mentioned earlier in the presentation. But so, hypothetically, let's say there's a tech company that has a lot of internal data on their historical hiring practices. And so they wanna use AI to identify candidates for software engineering jobs. So they go to their data science team and they say, okay, we want you to build a model that will help us screen through resumes so we can fill these software engineering jobs. So the data science team does that: they build a model, they deploy it, but then they realize that they're just getting white men being recommended. So what happened there, and more specifically, how could AI Blindspot have prevented this? I'm gonna give examples of one card from each of the three stages: the planning, building and deploying stages. Again, it all starts with purpose and really asking yourself, what are we trying to accomplish here? This would involve talking to the team about why you wanna use AI. Like, are you just trying to get through resumes faster, or are you trying to identify better candidates, or are you trying to increase diversity, and then really asking yourself, is AI really designed to achieve all three of those goals?
If you just wanna get through resumes as fast as possible, then AI may be able to help you with that. If you wanna identify better candidates, you would have to question your historical hiring practices, and certainly if you wanna increase diversity, AI may not be the right tool for that. So we encourage teams to really question whether AI is even suited to their purpose. But in this case let's say that the team says, okay, it's number two, we really want the best candidates and we really think AI can do that. So we move on to the building stage and then address the issue of discrimination by proxy. That refers to situations where you may not include features like race or gender or other protected classes in your model, but you may have other features that are highly correlated with race or gender, such as historically black colleges, or all-women's colleges, or sports like lacrosse that white men are more drawn to, features that are so correlated that they ultimately lead to discrimination. And we would encourage the team to consult with social scientists or human rights advocates who are just more knowledgeable about historical biases and can help you identify certain features that may be problematic, that could lead to discrimination. So let's say the team has done that, and now we move on to the deploying stage, and in this case I'm even gonna give the company the benefit of the doubt and say that they actually want to increase diversity, and they realize that AI can't do that, so they realize they have to go back and fix their recruiting pipeline first by getting more diverse candidate pools, and then maybe they think, okay, now AI can help us increase diversity. But that's not actually the case, because that brings up the issue of generalization error, where if you have a history of not recruiting diverse candidates and now you do recruit diverse candidates, the model that was built on historical data is not gonna be set up to evaluate new candidates with different backgrounds. So you'd have to consider something like maybe an anomaly detector that enables you to identify circumstances, like candidates that have more unique backgrounds, where AI is just not suited and where you do need a human to review. These are just suggestions, but this gives you some idea of the ways teams could work through the planning, building and deploying of AI systems to identify their blind spots and understand and brainstorm how to address them. So with that said, what's next for us as a team? We have a lot of ideas for potential use cases for this, some of which we got from peers in our cohort. I can see a lot of potential uses in different settings, such as a product manager leading a design sprint and seeing potential use of the blind spot cards to help through the design thinking process. It could be a new director of data science at a startup where there aren't really structures or processes for how data scientists go about their job, and the blind spot cards could potentially help guide data scientists' work. On the other hand, it could be a city task force that's responsible for auditing AI systems but has a less technical background and similarly needs some sort of guidance on what blind spots to be looking for. And there could be other potential uses as well. So our plan for next steps is to engage with users, doing user studies to figure out what the best audiences are, and then hopefully getting testimonials from organizations where the blind spot cards have been helpful in helping them assess bias in their AI systems.
And then, in the grand scheme, hoping that this could be part of some certification process through an organization like IEEE, possibly setting up some standards where, say, if an organization has processes like AI Blindspot combined with tools like the Data Nutrition Label, it could certify itself as using AI responsibly. Those are kind of our long-term goals; that won't happen anytime soon. But we see a lot of potential for what this could do. We wanted to close with the Joker card. That's one of the blind spot cards. All of you, by the way, should pick up a set of blind spot cards; we have them on our table outside. But the Joker card kind of represents the idea of the unknown unknowns. We've identified these nine blind spots, but there are other blind spots too that we probably didn't think of. We've identified potential use cases, but there may be other ones that we haven't thought of yet, that maybe some of you in the audience will think of. So definitely come talk to us if you have ideas for where and how this could be applied, because we really see potential to help those organizations I was talking about at the beginning, the ones that really want to evaluate their systems for bias as best they can and just don't know how to do it. So with that, I thank you very much. Great, thank you, Dan. So our final presentation is called Watch Your Words. They are examining the expansion of natural language processing and natural language understanding systems. And their team is Walt Frick from Harvard Business Review, Bernease Herman at the University of Washington, Eric Ludwig at Indigo Ag, Joseph Williams, who's working for Washington State government, and, as an advisor, Iason Gabriel, who's at DeepMind. All right, thank you everybody. So we are Watch Your Words, and the premise of our project is really that we are surrounded by machines that are reading what we write and judging us based on whatever they think we're saying. The results of these systems can really matter. You can imagine a chat bot that's doing customer service or potentially even doing a job interview. These use cases are not necessarily new, but what's new is that really, really powerful natural language processing systems, from an older field concerned with computers understanding language, are now something any developer can pick up and use to do pretty unbelievable things. And our premise is essentially, what could go wrong when that happens? And our answer is: actually, quite a lot. So you could imagine a non-native speaker looking for medical advice from a healthcare bot, not being able to be understood, and essentially going untreated as a result. You can imagine an employee finding out that they've been passed over for a key promotion because an analysis of their Slack messages and their email messages deemed that maybe they're a poor collaborator. These decisions have real weight, and unfortunately we have good reason to think that they're quite biased. So as part of our project we conducted a literature review, finding evidence both that these systems work poorly for historically marginalized groups and also that they can pretty quickly learn very problematic stereotypes and potentially exacerbate them, like the idea that some people are better suited for some jobs than others based purely on their gender. Beyond that literature review we also tested these systems ourselves, and for that I'll turn it over to my colleague Bernease. Hi everyone.
So what I wanna say here is that NLP services are brittle, and what I mean by brittle is that if we give them two things that we would consider fairly similar or innocuous, they give unexpectedly different results. This is largely true for algorithmic systems in general, but in the NLP systems that we studied, misspellings, even just differences in spacing, and changing the pronouns or proper names within a sentence give different results. We chose natural language processing in particular because we believe that the misunderstanding of text may impact groups that are less studied, different from the gender and race categories that we typically speak about in algorithmic bias, and that's extremely interesting to us and important. So to conduct our analysis we queried the natural language processing services of four large tech companies: IBM Watson, Microsoft, Google and Amazon. This is done using public endpoints which can be used by anyone, including those with no machine learning or, certainly, bias mitigation expertise. And we pass sentences to these services programmatically using what's called an API. We focus on sentiment analysis here: a numerical value expressing whether an opinion that is expressed in the text is negative, neutral or positive. Okay, so our first data set of two is of non-native English speakers. This data set comes from the Treebank of Learner English. It's a little over 5,000 sentences written by adult non-native speakers during a certification exam for English. It was collected at the University of Cambridge but annotated with these corrections at MIT. The data set consists of an original sentence, annotations of things like spelling errors, missing words and out-of-order words, and corrected sentences, and these annotations were done by graduate students at MIT. So the next thing we do is pass these to the APIs, as I've mentioned. And what we find is that spelling and grammar mistakes influence performance in a lot of these cases. So for this example, we have two sentences that we would expect to be very similar. The original sentence written by the non-native speaker was "that was very disappointed." So they got a couple of things wrong, a misspelling and maybe a slightly different form, and so it was corrected to "that was very disappointing." And what you find is that there's a large difference in some of these APIs' results. And then what's very interesting is that those aren't even consistent across the different companies and services. Google finds that the corrected sentence is more positive, but IBM, Microsoft and Amazon find that the original sentence seems to be more positive. So here we have another example, and this is actually not a spelling error, which, for lots of reasons, you might expect natural language tools not to handle well; this is, similarly, a grammatical error. So the correction changes the word satisfying and replaces it with satisfactory. There's also a small grammatical error. And we actually see something we would hope to see for every single example in our dataset, and that is that Microsoft and Amazon find the same sentiment for both sentences. Unfortunately, that's not the case for the other two APIs, and in addition to that, they are also flipped. So Google finds the first more positive, IBM finds the second more positive, and if you look at the IBM example, the difference is by a large margin. So our second data set is where we investigate these four proprietary services with the Equity Evaluation Corpus.
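Before moving on to the second data set, here is a minimal sketch of the kind of programmatic comparison just described for the learner-English pairs. It is not the team's actual harness: `query_sentiment` is a placeholder stub standing in for whatever client library or REST endpoint each vendor exposes, and only the example pair comes from the talk.

```python
# Minimal sketch, not the team's actual test harness.
# `query_sentiment` is a placeholder: in a real run you would replace its
# body with each vendor's own client call (IBM, Microsoft, Google, Amazon).

def query_sentiment(provider: str, text: str) -> float:
    """Return a sentiment score in [-1, 1] for `text` from `provider`.
    Stubbed to 0.0 so the sketch runs end to end without credentials."""
    return 0.0

# (original learner sentence, corrected sentence) pairs, as in the talk.
PAIRS = [
    ("That was very disappointed.", "That was very disappointing."),
]

PROVIDERS = ["ibm", "microsoft", "google", "amazon"]

for original, corrected in PAIRS:
    for provider in PROVIDERS:
        delta = query_sentiment(provider, corrected) - query_sentiment(provider, original)
        # A large |delta| on a pair that differs only by a spelling or grammar
        # fix is the brittleness described above; the sign of delta can also
        # disagree across providers, as it did for this example.
        print(f"{provider:10s}  delta = {delta:+.2f}")
```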
So this is an existing corpus that builds on research on gender and racial bias in sentiment analysis systems, and we extend that work to investigate proprietary APIs like Google's and Amazon's, which were not explored in the original work. So they created a data set using templates like the one above, "person made me feel emotion," and they have lists of terms that they substitute for things like "person." So on the left we see a list that they use for analyzing gender. They might replace it with some gendered subject: my daughter, this boy, she, he, him. And then on the right they are exploring both gender and race, using traditionally African-American names and European-American names. So one example from this preliminary analysis shows the sentiment for a number of sentences with this particular template. Really interesting, I think, if you look at the right of this, sorry, it's hard to see: "my uncle" has the most positive sentiment when you say "my uncle made me irritated," "my mom" is next, and with the least positivity is "she," "she made me irritated." So this mostly illustrates the brittleness and the messiness of these systems, that seemingly very similar sentences, which shouldn't really change between "my mom" or "my mother," have different results all the way across. And with that I will pass it on to Joseph to speak a little bit about the pipeline. Thank you. So who's responsible for this brittleness and this set of really odd results, right? Inconsistent results across everything. So I investigated, through interviews with 20 companies who have revenue-generating operations in this space, asking them what they are doing, in how they build their models, to normalize for bias and those kinds of results. And initially what we discovered was that this is a very complex ecosystem. There's a shortage of NLP scientists out there, a severe shortage. So at the very top, companies like Comcast and Hipmunk and Amtrak, they wanna build these things but they don't have the right people. So they're either motivated to build their own API engine or they're going to use the existing API engines that are out there, but even that is hard. And so we end up with a lot of platform vendors, a lot of third-party consulting companies, a lot of work-for-hire companies that are trying to help these companies develop chat bots and other types of vehicles. By the way, these are economically important, because we have these rankings, Net Promoter Scores, that customer VPs are using for actually getting their bonuses and things like that, and so this is a way to get the metrics to drive those NPS outcomes. So what we have is a very extended ecosystem, not a lot of expertise, and a reliance on the API providers. And so when you ask, do you care about bias, they all sort of say, well, we don't really think about it. Our focus is on developing a chat bot or something that actually works. Functionality is more important than taking a look at bias. And so then when you interview more and you say, well, who should be responsible for bias? Is it you, or whoever? They all do the same thing. They all point to the API providers and they say, well, it should be Google or Microsoft, we expect that they will de-bias, and so we don't really worry about it. And so what we ended up with is an ecosystem that really isn't thinking about this at all. And with that, I'll pass to Eric. Thanks.
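For readers who want to see what the template substitution Bernease describes looks like in practice, here is a minimal sketch. The template mirrors the one mentioned above, but the word lists are illustrative stand-ins, not the Equity Evaluation Corpus's actual term lists.

```python
# Minimal sketch of Equity Evaluation Corpus-style template substitution.
# The word lists below are illustrative; the real corpus uses carefully
# chosen person terms (including African-American and European-American
# names) and emotion words.
from itertools import product

TEMPLATE = "{person} made me feel {emotion}."

PERSON_TERMS = ["my daughter", "my son", "my mom", "my uncle", "she", "he"]
EMOTION_TERMS = ["irritated", "happy", "anxious", "grateful"]

sentences = [
    TEMPLATE.format(person=p, emotion=e)
    for p, e in product(PERSON_TERMS, EMOTION_TERMS)
]

# Sentences in each group differ only in the person term, so any systematic
# gap in the sentiment a service assigns across those terms (for example
# "my uncle" scoring higher than "she" for the same emotion word) points at
# the kind of brittleness and bias described above.
for s in sentences[:6]:
    print(s)
```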
So I'm gonna just summarize this stuff and then give some recommendations, because obviously, coming out of this, I think we have some things that we would like to say and recommend for folks to do. And one of the questions I, as a product manager, always ask is, does it work? But for whom does it work? For whom does it not work? So, our key findings here, three key findings. First, based on what we've seen and based on the articulation of harm that can happen from these, we believe that real harm is happening, or can happen, by using these systems blindly. And we believe that because, second key finding, the APIs and the systems that we tested produced these wildly inconsistent and, what we're calling, brittle responses. So based on that inconsistency and that brittleness, going back to the first piece, we believe that there's harm that is happening. The third thing is that, as Joseph just mentioned, nobody's thinking about this, and when they are thinking about it, they're assuming somebody else is taking care of it. That's not a good way to build a responsible system. So we have some recommendations. The first set is for these API providers. Number one, transparency. Could you tell us a little bit about your training data? Maybe you can't tell us exactly what it is, but can you tell us, is it about news, and was that news corpus collected over the last five years? Is it Twitter? Where is it coming from? There are wildly different sets of people that use and create that training data, and that will impact who's able to use these systems effectively or not. So tell us a little more about what's going on. Number two, give us some expectations of when the system should work or when you expect it to fall over. You have tested this stuff, you know where this is gonna work, please tell us a little bit about that. And three, please do some audits for specific biases and publish those results. So you can tell us this works well for these communities, this works less well for these other communities; especially when you're talking about a market with choice, help your customers make an informed choice. Second, third-party developers: if you're anywhere in that stack above the API providers and you're doing engineering and development, here are some recommendations for you. Please be bias aware; understand that these API results can be biased, and take responsibility for mitigating that in the products we build. So especially think about the language of the humans that are using the thing that you are building. Are those humans English-as-a-first-language or English-as-a-second-language speakers? Do they use particular dialects or accents that may show up in their written language? Test against that. And that goes to the third one here: incorporate those vulnerable groups into your testing. If you're building a government services system for a variety of people, understand what groups exist within that population and test against them. And that kind of also incorporates the second one: think about your users, right? Who's gonna actually use this? And how might that challenge the APIs that you're relying on? And third, researchers: for folks who are in academic institutions, there are also recommendations for folks in this space. We would like to see an expansion of the machine learning fairness conversation to think about the full stack. Often, and I would say we did this to some extent, we look at a single layer of this.
But really what you see with that stack is that the opportunity for bias to come in can happen throughout, and it may be not totally transparent. So we have to look at the whole system, we have to look at training data all the way to the users. And so we would like to see more of that happen. Potentially with our group; potentially many other people can certainly do this. And then we would like to see some creation of templates for disclosure. So even if I work at one of these big companies and I wanna tell the world, hey, this is what our API is good for and is not good for, there's not a standard format for that. I think the Data Nutrition Project did a great job of kind of putting something out there into the world. But there could be more of this, of telling and helping companies understand how they can talk about the things that they're building in ways that practitioners who are implementing this stuff can understand. So with that, I would just like to take my moment at the end of this to give a big thanks to Hilary, specifically, for guiding us along this path, and to all the MIT Media Lab and Berkman staff who've helped this program exist. And if you'd like to come talk to us, we have a poster out there, we have a little more data on that poster, we'd love to talk to you about our project. Thank you very much. So wow, I feel like this is the part of the game show where, Joi said he was withholding his congratulations till the end, so it's like, well, Joi, what do you think? Is it... Yeah. He's assuming this is a trick question of some kind. It is, though, isn't it? Yes, it absolutely is. So first, like, a huge round of applause. I mean, this is one of those moments where I think we just feel so privileged to be part of a program like this. We were fortunate to have a chance to reassemble and teach a version of the class that we do around this at the beginning of the program, as some of you may recall, and how much fun that was, with a sense of possibility and with a sense of humility, presenting some of the stuff we've been thinking about and felt that we'd learned. And I can already see so many ways in which we might revise what we teach when we're working with people, in light of what you all have learned and come to and proposed. And to see the evolution over time of each of these projects, in the checkpoints that the advisory board meetings represent, is just, it's so gratifying to see just how far each of these projects has come. And I think in a lot of the work we do focused around impact and policy, whether private or public policy, there's always this trade-off between trying to identify the need for, and figure out how to engage in, systemic change, which is hard, versus solving the problem in front of you, which is practical, but at times incremental. And I'm just aware of how much each of these presentations and projects is so nicely threading that needle and balancing a kind of window into the systemic issues with solving problems in front of us. And I just see this as a conversation that comes to certain milestones and that makes various recommendations and then invites further conversation along with it. I'm struck, just in the most recent presentation, by the slide identifying the panoply in the stack of those institutions or firms that might bear some responsibility for some of the stuff going on here, which is normally where the corporate sponsorship slide goes, with many thanks to Cisco for its responsibility for some of these problems.
And I'm just so pleased to have an opportunity, institution to institution. And honestly, our institution should be on that slide, not as sponsors, but as ones that need to look at ourselves. I was. That was my gentle way of saying so, and saying that, not only "there but for the grace of God go we," but our institution belongs there as well. Sorry, you were gonna say more, no. But absolutely. So we're just really grateful to have been part of providing a platform, both to be thinking about these issues, to be broaching them so authentically and directly and practically and, in all the best ways, conversationally. And also, as things now shift into the next phase, to try to be helpful as platforms to get these ideas out there amidst the cacophony, so that you can project them forward and have them find their greatest audiences and those you wanna engage with. So thank you. And I guess just for my part, I was only half joking then. I remember when I first met you guys and there were all these post-its on the walls. First of all, when I saw the list, every year it seems like we're pushing the edges of the diversity thing. Well, let's get even more people who look like they would never get along. And so I was excited to see how much the teams coalesced. And the other thing is, it's been great teaching with Jonathan in his Harvard Law seminar style, where we read all these papers and we study and we learn. But the problem is I always end in despair, because we see- We call that another satisfied customer. We see all of these problems, but not many solutions. And not that there are solutions to everything, but I think one of the reasons why the Media Lab tends to be overall probably happier than the law school is that we actually get a sense that we then get to go out and try to build solutions. But I think one of the problems of building things is that you can also build things that are bad. And so I think what was, for me, very wonderful about this was that we started out with, I think, a moment of despair for you guys. I mean, you're trying to figure out what to do. It's harder than it was in the previous years, where you could just think, okay, let's just clean up data. Let's go after, I mean, the obvious stuff is all done. And so you had to find some non-obvious but still critically important things, which I think everybody did. But then to me what was really great, and why it was a contrast to the, I do, I'm only half joking, I mean, I think the classes are great, but they do end with this sort of moment of everybody writes a paper and we go home and we still have this sort of thing like, did we actually do something? And what do we do now? Whereas I think the presentations today were wonderful because they kind of had a happier feel, because it felt like they showed a way forward. So I hope you don't end with just, we did a presentation, ready to go, that either- That's why for the next hour, we're pleased to have time. So either literally or maybe metaphorically, trying to continue this stuff. Because I do think that the things that you've found are quite unique. I think the combination of skills in each of the groups is pretty unique. And I also think that you have some interesting paths, and this is still kind of the beginning in many ways.
And so if you guys wanna follow up, and I know we're gonna have some more, maybe two more, other times to do this, I think people would be interested in either collaborating, funding, or otherwise supporting some of these things. So I'd love to have this be the beginning rather than the end of this project. Well, thank you all again for so much food for thought and for further action. And I'm so pleased that there is food for food outside, with apologies to those watching this later on video. I just invite you, on video, to go eat something. We should have a notice at the beginning of the video: prepare. Prepare your own reception after the, yeah. But we'll adjourn to that reception and also have a chance, poster-session style, to engage with each of the groups and learn more about them and see what next steps can look like. So, thank you again for a fantastic year.