All right, thank you everybody. We are Watch Your Words, and the premise of our project is that we are surrounded by machines that read what we write and judge us based on what they think we're saying. The results of these systems can really matter: imagine a chatbot doing customer service, or potentially even conducting a job interview. These use cases are not necessarily new. What's new is that really powerful natural language processing, an older field concerned with computers understanding language, is now something any developer can pick up and do pretty unbelievable things with. Our premise is essentially: what could go wrong when that happens?

Our first belief is: quite a lot. Imagine a non-native speaker seeking medical advice from a healthcare bot, not being understood, and going untreated as a result. Imagine an employee finding out they've been passed over for a key promotion because an analysis of their Slack and email messages deemed them a poor collaborator. These decisions have real weight, and unfortunately we have good reason to think they're quite biased. As part of our project we conducted a literature review and found evidence both that these systems work poorly for historically marginalized groups and that they can quickly learn very problematic stereotypes and potentially exacerbate them, like the idea that some people are better suited for some jobs than others based purely on their gender. Beyond that literature review, we also tested these systems ourselves, and for that I'll turn it over to my colleague Bernice.

Hi everyone. What I want to show here is that NLP services are brittle. By brittle I mean that if we give them two inputs we would consider fairly similar or innocuous,
they give unexpectedly different results. This is true of algorithmic systems in general, but in the NLP systems we studied, misspellings, even just differences in spacing, and changing the pronouns or proper names within a sentence all give different results. We chose natural language processing in particular because we believe the misunderstanding of text may impact groups that are less studied, different from the gender and race categories we typically discuss in algorithmic bias, and that's extremely interesting and important to us.

To conduct our analysis we queried the natural language processing services of four large tech companies: IBM Watson, Microsoft, Google, and Amazon. This was done using public endpoints that can be used by anyone, including people with no machine learning, and certainly no bias mitigation, expertise. We passed sentences to these services programmatically using what's called an API. We focus here on sentiment analysis: a numerical value expressing whether an opinion expressed in the text is negative, neutral, or positive.

Our first of two datasets concerns non-native English speakers and comes from the Treebank of Learner English: a little over five thousand sentences written by adult non-native speakers during an English certification exam. It was collected at the University of Cambridge and annotated with corrections at MIT. The dataset consists of an original sentence, annotations of things like spelling errors, missing words, and out-of-order words, and a corrected sentence; the annotations were done by graduate students at MIT.

The next thing we did was pass these to the APIs, as I mentioned, and we found that spelling and grammar mistakes influence performance in many cases. In this first example we have two sentences that we would expect to be treated very similarly. The original sentence, written by a non-native speaker, was "that was very disappointed": a couple of things wrong, a misspelling and a slightly different word form. It was corrected to "that was very disappointing." What you find is a large difference in the results from some of these APIs, and, very interestingly, the differences aren't even consistent across the different companies and services. Google finds the corrected sentence more positive, but IBM, Microsoft, and Amazon find the original sentence more positive.

Here is another example, and this one is not a spelling error, which for lots of reasons you might expect natural language tools to handle poorly; this is a grammatical and word-choice issue. The correction replaces the word "satisfying" with "satisfactory," and there's also a small grammatical fix. Here we actually see something we would hope to see for every single example in our dataset: Microsoft and Amazon find the same sentiment for both sentences. Unfortunately that's not the case for the other two APIs, and in addition their results are flipped: Google finds the first sentence more positive, IBM finds the second more positive, and in the IBM case the difference is by a large margin.

Our second dataset is where we investigate these four proprietary services with the Equity Evaluation Corpus. This is an existing corpus built on research into gender and racial bias in sentiment analysis systems, and we extend that work to investigate proprietary APIs like Google's and Amazon's, which were not explored in the original study. The authors created the dataset using templates like the one above: "<person> made me feel <emotion>." They have lists of terms they substitute for slots like "person." On the left we see the list they use for analyzing gender.
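To make the method concrete, here is a rough Python sketch of the template expansion just described, plus the kind of minimal-pair brittleness check applied to the learner-English sentences. The template, word lists, and `get_sentiment` scorer are illustrative placeholders, not the actual corpus or any vendor's real API.

```python
from itertools import product

# Sketch of the two-step method: (1) expand a template into minimally
# different sentences, in the spirit of the Equity Evaluation Corpus, and
# (2) compare the sentiment a service assigns to each variant. A robust
# service should give near-identical scores when only the person changes.

TEMPLATE = "{person} made me feel {emotion}."
PERSONS = ["my daughter", "my son", "she", "he"]   # illustrative subset
EMOTIONS = ["irritated", "happy"]                  # illustrative subset

def expand(template, persons, emotions):
    """Fill the template with every (person, emotion) combination."""
    return {(p, e): template.format(person=p, emotion=e)
            for p, e in product(persons, emotions)}

def get_sentiment(text):
    """Placeholder scorer on a -1 (negative) .. +1 (positive) scale;
    in practice this would call a vendor API (IBM, Microsoft, Google, Amazon)."""
    return -0.5 if "irritated" in text else 0.5

def brittleness_gap(scores):
    """Spread (max - min) of scores across person substitutions; an
    unbiased, robust service should keep this near zero per emotion."""
    return max(scores) - min(scores)

sentences = expand(TEMPLATE, PERSONS, EMOTIONS)
for emotion in EMOTIONS:
    scores = [get_sentiment(sentences[(p, emotion)]) for p in PERSONS]
    print(emotion, brittleness_gap(scores))
```

With a real API behind `get_sentiment`, a nonzero gap for a fixed emotion word is exactly the kind of inconsistency the talk describes: the score moved even though only the person changed.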
They might replace it with a gendered subject: "my daughter," "this boy," "she," "he," "him." On the right, they explore both gender and race, using traditionally African-American names and European names.

One example from this preliminary analysis shows the sentiment for a number of sentences built from this particular template. What I find really interesting, if you look at the right (sorry, it's hard to see): "my uncle" gets the most positive sentiment when you say "my uncle made me feel irritated," "my mom" is next, and "she" gets the least positive score. This mostly illustrates the brittleness and messiness of these systems: seemingly very similar sentences, whose meaning shouldn't really change between "my mom" and "my mother," get different results all the way across. With that I'll pass it on to Joseph to speak a little about the pipeline.

Thank you. So who's responsible for this brittleness and this set of really odd, inconsistent results? I investigated through interviews with 20 companies that have revenue-generating operations in this space, asking what they do, when they build their models, to check for bias and these kinds of results. What we discovered first is that this is a very complex ecosystem. There's a severe shortage of NLP scientists, so at the very top, companies like Comcast, Hipmunk, and Amtrak want to build these things but don't have the right people. They're either motivated to build their own API engine or they use the existing API engines out there. But even that is hard, so we end up with a lot of platform vendors, third-party consulting companies, and work-for-hire companies trying to help these companies develop chatbots and other such vehicles.
By the way, these are economically important: there are rankings in terms of Net Promoter Scores that customer VPs use to earn their bonuses, and chatbots are a way to get the metrics that drive those NPS outcomes. So what we have is a very extended ecosystem, not a lot of expertise, and a reliance on the API providers. When you ask, "do you care about bias?", they all sort of say, "we don't really think about it; our focus is on developing a chatbot or something that actually works." Does it work? Functionality matters more than examining bias. And when you interview further and ask who should be responsible for bias, is it you?, they all do the same thing: they point to the API providers and say, "it should be Google or Microsoft; we expect they'll handle bias, so we don't really worry about it." What we ended up with is an ecosystem that really isn't thinking about this at all. And with that I'll pass to Eric.

Thanks. I'm going to summarize and then give some recommendations, because coming out of this we obviously have some things we'd like to recommend for folks to do. One of the questions I, as a product manager, always ask is: does it work? But for whom does it work, and for whom does it not? So, our key findings.
Three key findings. First, based on what we've seen and on the articulation of harm these systems can cause, we believe real harm is happening, or can happen, when these systems are used blindly. We believe that because of the second finding: the APIs and systems we tested produced wildly inconsistent, what we're calling brittle, responses. Based on that inconsistency and brittleness, going back to the first piece, we believe harm is happening. And the third finding, as Joseph just mentioned: nobody is thinking about this, and when they are, they're assuming somebody else is taking care of it. That's not a good way to build a responsible system.

So we have some recommendations. The first set is for the API providers. Number one: transparency. Could you tell us a little about your training data? Maybe you can't tell us exactly what it is, but can you tell us: is it news, and was that news corpus collected over the last five years? Is it Twitter? Where is it coming from?
Widely different sets of people use and create that training data, and that impacts who is able to use these systems effectively, so tell us a little more about what's going on. Number two: give us some expectations of when the system should work and when you expect it to fall over. You have tested this stuff; you know where it will work; please tell us. And three: please audit for specific biases and publish the results, so you can tell us "this works well for these communities and less well for these others." Especially in a market with choice, help your customers make an informed choice.

Second, third-party developers: if you're anywhere in the stack above the API providers and you're doing engineering and development, here are some recommendations for you. Please be bias aware: understand that these API results can be biased, and take responsibility for mitigating that in the products you build. Especially think about the language of the humans who will use the thing you're building. Are they English-as-a-first-language or English-as-a-second-language speakers? Do they use particular dialects or accents that may show up in their written language? Test against that. And the third one: incorporate vulnerable groups into your testing. If you're building a government services system for a broad population, understand what groups exist within that population and test against them. That also folds into the second recommendation: think about your users, who will actually use this, and how that might challenge the APIs you're relying on.

Third, for researchers and folks in academic institutions.
There are also recommendations for folks in this space. We would like to see the machine learning fairness conversation expand to consider the full stack. Often, and I'd say we did this to some extent, we look at a single layer, but what you see across the stack is that the opportunity for bias exists throughout, and it may not be totally transparent, so we have to look at the whole system, from training data all the way to the users. We'd like to see more of that happen, potentially with our group; many other people can certainly do this too. And we'd like to see the creation of templates for disclosure: even if I work at one of these big companies and want to tell the world what our API is and isn't good for, there's no standard format for doing so. I think the Data Nutrition Project has done a great job of putting something out there, but there could be more of this: telling companies, and helping them understand, how to talk about the things they're building in ways that the practitioners implementing them can understand.

I'd like to take my moment at the end to give a big thanks to Hilary specifically for guiding us along this path, and to all the MIT Media Lab and Berkman staff who've helped this program exist. If you'd like to talk with us, we have a poster out there with a little more data, and we'd love to tell you about our project. Thank you very much.