Hi, I'm Sean Goggins. Welcome to this talk about machine learning ethics and its use in open source software analytics. I'm going to start by sharing a little story. When I was 16 years old, my best friend's brother was really into cars, and he had restored a 1963 Corvette. I asked him if I could drive it, and he said no. And that's because it's a very powerful machine, and I almost certainly would have wrapped it around a telephone pole, because I would have tried to go really fast, and it was more machine than I could possibly handle. When I think about machine learning as a computer scientist, it scares me like giving a 16-year-old a 1963 Corvette. It is a very powerful set of tools, and it can easily be misused if you don't understand all of its parameters. So that's the parable. But if anybody does want to give me one now, I promise I can handle it, or at least take it for a drive. It won't be a problem.

I find this graphic useful for talking about machine learning in general. We have the machine on the left and the person on the right. When we think about machine learning, sometimes our expectation is that we can train the machine to do some thinking for us. What we can actually do is train the machine to filter through a bunch of noise and tell us where to focus our attention. I don't think it can give us the same kinds of answers a person can. For the person on the right, we have disciplines: things we become good at, developed through experience. Any time there's a new problem, we apply our disciplines and our experience and speculate about how we might solve it. This is the kind of mental activity we do when we're debugging, and the human brain does it very well.

There's a similar cycle with machine learning, but it has some danger points. First of all, instead of a new problem, you're just giving it new data. And the historical data, the stuff from the past, is what it gets trained on. That historical data is limited by whatever is in it. So as new phenomena emerge, if you've trained a model on a set of phenomena or interactions that don't cover the new data, it's not going to give you a very useful answer, and it requires a human to look at the output and interpret it. And instead of a set of disciplines developed over a lifetime, machine learning has a computational model, sometimes saved on disk, through which new data is funneled; that model was created from the historical data. Then it makes some kind of prediction, and the quality of the results is unknown. But it can serve that purpose of telling you where to look: in all the noise of activity, it can identify something worth your attention.

I want you to keep this distinction between how the human brain works and how machine learning works in mind as we talk about the different kinds of biases you can codify. Machine learning can be very efficient at multiplying and amplifying the biases in the current way we do things. So if you're trying to create change or reshape an organization, machine learning can actually get in the way, by causing you to emphasize and amplify exactly the things you're trying to change.
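To make that train-on-history, score-new-data cycle concrete, here's a minimal sketch. It isn't Augur code, just a toy illustration using scikit-learn's IsolationForest with made-up activity counts:

```python
# Minimal sketch of the train-on-history / score-new-data cycle.
# Illustrative only -- not Augur's actual pipeline.
import numpy as np
from sklearn.ensemble import IsolationForest

# Historical data: e.g., weekly [commit_count, comment_count] for a repo.
historical = np.array([[12, 40], [9, 35], [15, 50], [11, 42], [10, 38]])

# The model is frozen knowledge of the past, like a file saved on disk.
model = IsolationForest(random_state=0).fit(historical)

# New data is funneled through that frozen model.
new_weeks = np.array([
    [11, 41],   # resembles the historical data
    [2, 180],   # unlike anything the model was trained on
])
# predict() returns 1 for "looks like the past" and -1 for "look here";
# out-of-distribution phenomena only get flagged, never explained.
print(model.predict(new_weeks))
```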
I want to step back now and talk about the data in open source software health and sustainability, and where the information comes from that fills in the historical data we use to train models. We have a tool called Augur, and some community reports we build by mining both GitHub and the software repositories themselves, and we can produce a lot of really interesting statistics from them. We also collect all the conversations that take place around a project, which is what the machine learning algorithms we apply operate on. So there's a vast sea of data we use as input, and we can take into account just about any factor you could think of when it comes to open source software health and sustainability. Name a piece of data stored about open source software, and it's probably in Augur, and it can be fed into some kind of machine learning algorithm. These are things like commits, change requests, issues, messages, dependencies, and contributors. This is just a schema telling you a little more about it.

Now, if I think in terms of the universe I'm in, sometimes I'm thinking about what's happening in a particular repository: what are the messages, the issues, the change requests for that repository? If I build a machine learning model using the historical data from one repository, I'm going to get insights based on the training data from that one repository. That might be the scope of what I'm interested in, or it might not be. In another example, if I have an ecosystem of 3,000 or 10,000 projects within the scope of my concern, I could gather the same data about all of those projects and train my models to identify anomalies within the projects, but also across the projects, so I have a richer set of historical data for training. This is important because, as we saw in the brain diagram, your historical data builds the model that new data gets pushed through, and that's what gives you the anomaly or other insight telling you where to look. Broader training data may or may not be appropriate for your question, but in many cases it's more useful than training on a single project.

So coming back to this: it's the historical data that's used for training, and you have to choose where that pool of historical data comes from. If it's from the projects within your scope, is it from all of them, or do you need to build different models for different subsets of your ecosystem? I'll give you an example. Some open source projects are really chatty: there are a lot of comments, a lot of interaction, and everything gets discussed. In other projects, pull requests get brief reviews and things just go forward. In those cases there's less to train on. So how I identify the critical features on one project can differ from another project. If I don't have a lot of conversation, I might be looking simply for the existence of any conversation as the signal of an anomaly; I don't even need machine learning to do that. For that category of project, I can get a "hey, look at this" because there are 10 comments on one issue when the average on every other issue was zero.
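As a minimal sketch of that heuristic (not Augur code; the column names and the two-standard-deviations threshold are invented for illustration), flagging the one noisy issue in an otherwise quiet repo could look like this:

```python
# A non-ML heuristic: in a quiet project, any burst of conversation
# is itself the signal. Toy data and hypothetical column names.
import pandas as pd

issues = pd.DataFrame({
    "issue_id": range(1, 11),
    "comment_count": [0, 0, 0, 0, 0, 0, 0, 0, 0, 10],
})

mean = issues["comment_count"].mean()
std = issues["comment_count"].std()

# Flag issues whose activity is far above this repo's own baseline.
flagged = issues[issues["comment_count"] > mean + 2 * std]
print(flagged)  # issue 10: ten comments where the average is near zero
```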
In other projects where people are very conversant, I can look at things like sentiment, the valence of the conversation, what kinds of speech acts appear in the comments, and so on. The tool we did this in is called Augur. It has a bunch of workers for collecting data, and a set of machine learning workers we use to produce clusters. The clusters look at all the conversation in a project and group projects together if they have a similar level and type of conversation; it's a computational-linguistics-based clustering algorithm. This is how I determine which projects have similar levels of conversation in their issues and pull requests, so I can group those together and maybe build a set of models that fits each cluster.

Discourse analysis gives us essentially seven key types of speech acts: questions, answers, assertions, criticism, and a few others. Knowing the sequence of these different kinds of speech acts in a conversation or text can be helpful, because you can identify the pattern of sequences on a project and see how that pattern looks different from project to project. It's another way of creating a set of features you can use to classify and characterize projects, and to decide what kinds of machine learning algorithms to use on them.

The message insight worker does two things. It produces a sentiment score based on a software-engineering-specific sentiment analysis library, so it doesn't treat words like "bug" or "defect" as negative the way they would be in ordinary English speech; it's sentiment analysis tuned for software engineering. It also does message novelty scoring: how unique or novel is a particular message in the context of the other messages in that repository? That again can be useful for knowing there's something unusual about a particular message so it can be flagged for you.

The pull request analysis worker is a simple predictive worker that looks at the properties of a pull request and the interactions around it, and tries to predict whether a pull request is likely to be merged. For one repository, that's another signal of how welcoming a project is. It's a simple prediction, but it shows that we can use the kinds of events that occur on a repository as another dimension, another factor, in building machine learning models for that repo.

Ultimately, the goal is to build some analysis of health patterns in a particular ecosystem. What are the kinds of interactions, or sequences of interactions, occurring in an ecosystem you think is pretty healthy? You can use that to compare other ecosystems, or other projects, against. So instead of simply training a model and taking the output at face value, you can identify projects that are exemplary in your opinion and experience, train a model from those projects, and compare other projects to them. You know you have this project that's really healthy, with positive vibes and good interaction; how do all the other projects look in comparison? It's one way of identifying, in a large collection of projects, which ones you might want to attend to in terms of advancing the level of communication, the quality of discourse, or the kindness in the messaging.
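Augur's clustering worker is more sophisticated than this, but a toy sketch of the underlying idea, grouping repositories by the language of their messages using TF-IDF and k-means, might look like the following (the message blobs are made up, and this is not the worker's actual implementation):

```python
# Toy sketch: cluster repos by their conversation, in the spirit of
# Augur's clustering worker.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# One concatenated blob of message text per repository (toy examples).
repo_messages = [
    "thanks great patch looks good merging appreciate the fix",
    "could you add tests please see review comments and discussion",
    "wrong broken fails again why is this still not fixed",
    "lgtm merged thank you nice work on the docs",
]

# Vectorize the text and group repos with similar conversation together.
X = TfidfVectorizer().fit_transform(repo_messages)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # repos sharing a label have a similar level/tone of talk
```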
Characteristics. OK, so we talk about toxic messages. "Toxic" isn't the right word, but it's what I started with a long time ago with one of my collaborators, who's sitting in the corner there (hi, Kelly), and it's the word I've continued to use. In discussing this with people, though, it's better to talk about the signs of healthy communication rather than the signs of unhealthy communication: focus on the positive instead. But it's so much more striking to be negative in discourse, as everybody who watches cable news these days knows.

The characteristics of toxic messages can be identified using things like sentiment, novelty, and level of activity. Many times, if a pull request has a lot of activity around it, that's a signal it has some kind of problem the community is unhappy about. Highly active, negatively valenced threads are another signal.

This is how we pull things together. When I have these valenced messages, they're stored in Augur in the repo table, which is where your repository list is, and in a message analysis summary table. The message analysis summary simply provides you with a positive ratio, a negative ratio, and novelty. The machine learning worker, the message insights worker, runs over the messages and stores the results in this table, and it does that continuously for each new message. It also retrains the underlying model every 30 days. So you'll end up with analyses based on models generated a month ago, two months ago, three months ago, and an analysis based on the most recent model. This helps account for the fact that I can't train a model to anticipate a new phenomenon using data from before that phenomenon existed. So decide on some period at which you'll retrain your models, on the assumption that new kinds of problems and new kinds of discourse are emerging. It's important not to just take a model and run with it forever. Every 30 days might be too aggressive, or not aggressive enough; I'm still experimenting.
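As a sketch of that retraining policy (the helper names here are hypothetical stand-ins, not Augur's actual worker code):

```python
# Sketch of the periodic-retraining idea: don't run one model forever;
# refresh it so new kinds of discourse make it into the training data.
from datetime import datetime, timedelta

RETRAIN_EVERY = timedelta(days=30)  # maybe too aggressive, maybe not

def train(messages):
    """Hypothetical stand-in for retraining the sentiment/novelty model."""
    return {"trained_on": len(messages)}

def maybe_retrain(model, trained_at, messages, now=None):
    """Retrain whenever the current model is older than the window."""
    now = now or datetime.utcnow()
    if model is None or now - trained_at >= RETRAIN_EVERY:
        return train(messages), now
    return model, trained_at

# Usage: a month-old model gets refreshed; a fresh one is kept as-is.
stale = datetime.utcnow() - timedelta(days=31)
model, when = maybe_retrain({"trained_on": 10}, stale, ["msg"] * 25)
print(model)  # {'trained_on': 25}
```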
Here's a set of scientific open source repositories with negative ratios and positive ratios. The negatives are kind of high on these, which is why I'm showing them to you. And here's a set of corporate repositories. One thing to note: the sentiment scoring here is an algorithm, so it's not being trained; it's just identifying positively and negatively valenced messages. What jumps out at you is that in the scientific open source projects, the volume of negatively valenced messages is significantly higher than it is in the corporate space. Anybody who's ever taken a computer science course might understand that there's often much more freedom to speak freely in a computing research community, which is where most open source scientific software lives, than in a corporate community. So you do get a visibly lower level of politeness, shall we call it, in open source scientific software. Not across the board, but the peaks of negatively valenced messages are higher.

If I want to look at issues, this is just how issues are structured: I've got a repository, a bunch of issues pulled from GitHub, and messages associated with those issues. I present this to give you an idea of where the data we build machine learning models against comes from. Here again are message threads with negative valence, and you can see the message count there. These are threads with negative valence from a prior collection, and these are the corporate ones; there are simply more in the open source scientific space.

One of the things we can do, if we identify a message we suspect might not be the kind of message we want to see, is identify the specific message on GitHub using the URL GitHub provides, which is a piece of data stored on the issue table. It tells us where the comments are, and we can go look and see whether it's actually, quote unquote, toxic. In this case it's actually not, but it got flagged by the system. So instead of looking at 100,000 messages, you maybe get 30 that you might want to go take a look at. And then you can tell the machine learning algorithm it's not toxic, so it treats that particular kind of message a little differently next time.
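A minimal sketch of that triage loop; the field names, score, and URL below are illustrative, not Augur's exact schema:

```python
# Human-in-the-loop triage: the model narrows 100,000 messages to a
# handful, a human checks each one on GitHub, and the verdicts become
# labeled data for the next retraining. Hypothetical field names.
flagged = [
    {"msg_id": 501, "negative_score": 0.91,
     "url": "https://github.com/example/repo/issues/42#issuecomment-1"},
]

labels = []
for msg in flagged:
    print(f"Review {msg['url']} (score {msg['negative_score']})")
    verdict = input("Actually toxic? [y/n] ").strip().lower() == "y"
    labels.append((msg["msg_id"], verdict))  # feed back into retraining
```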
These are the other kinds of data I mentioned, the other kinds of analysis, the other kinds of machine learning. And here I've got a few slides on the ethics of machine learning algorithms. This is the practice we want to avoid, but it's the first thing everyone does: you pick up a machine learning algorithm, you load a Python library that does some kind of machine learning work, and you just stir the pile until the answers start looking right. That might not be what you want to do; I would advocate that it is not what you want to do. But it is a common first move in machine learning, just to see how things work. And then, with allusions and props to the Google person who drew the original, a sentient toaster that is striking out against you. As I said before, machine learning can be dangerous, and that's the cautionary tale I want to provide.

The examples I showed you are derived from 3,500 corporatized open source projects and 2,300 open source scientific software projects that I identified with the Chan Zuckerberg Initiative. I talked earlier about clusters, and these are the different clusters of activity, generated by looking at communication patterns on projects. If a project is a certain color, it has a communication history similar in volume and nature to the other projects in that cluster. Projects considered technically unstable happened to fall in the red cluster, and more stable projects happened to fall in the green cluster; purple and blue are a little different. So we see a relationship between the ways people communicate and how reliable or stable a particular open source project might be. And this is just one sample; I'm not making a general statement about all of open source, only that in the particular sample I trained here, that's a phenomenon we identified.

You can also just graph things like basic sentiment. Looking at positive and negative valencing, in scientific software you've got a lot of weight over to the right, more negative; in corporatized software, a lot of stuff is really in the middle and there's very little over to the right.

It's also interesting to step out of machine learning a little and just look at how many people are contributing to a project. That can give you an idea of what kind of community you're dealing with. Sometimes, if you're going to classify projects initially and you don't want to run the machine learning or cluster them first, you can simply look at the number of contributors. It's a good, useful heuristic for an initial grouping of projects, and you can assert as a hypothesis that projects in the same group are similar in some way: at a similar scale and number of contributors, they likely face similar collaboration tensions.

To answer the questions: I'm looking at all time here; I haven't windowed this particular data at all. And yes, those numbers are repo IDs. If you're not familiar with Augur, everything has a repo ID; that's how I can give a presentation about valencing without always naming the projects, although I did in a few slides, so bad me.
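The contributor-count heuristic from a moment ago is simple enough to sketch directly; the bucket thresholds here are invented for illustration:

```python
# A cheap first grouping of projects before any clustering or training.
# Thresholds are made up; tune them to your own ecosystem.
def community_bucket(contributor_count):
    if contributor_count <= 3:
        return "small"   # a handful of people, likely informal process
    if contributor_count <= 30:
        return "medium"
    return "large"       # similar collaboration tensions at scale

repos = {"libA": 2, "libB": 18, "libC": 240}
groups = {name: community_bucket(n) for name, n in repos.items()}
print(groups)  # {'libA': 'small', 'libB': 'medium', 'libC': 'large'}
```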
That's a basic introduction to the kinds of things we apply machine learning to in open source software health and sustainability metrics. As you go forward, I want to bring you back to this slide, because if you're new to machine learning and haven't applied it in a particular domain before, it's hard to get your head around the limits of what you're doing. The limit, in open source software health statistics and machine learning over the conversations in these projects, is that the historical data drives the content of the model. As your scope expands or changes, keep in mind that whatever models you're running your new data through will filter that data and amplify the properties of the historical data you used. If the historical data is relevant, the model can be helpful and you can get useful insights from it. As the relevance of the historical data drifts from whatever is happening on a project currently, it becomes harder to rely on that model; the predictions become less predictable and potentially less useful. So there's a human lens you need to apply to what you're seeing, to identify what's useful and what's not.

When you've got thousands of projects, it's very helpful to have something like this to point you at where to pay the most attention. Machine learning on open source projects is at least helpful at telling you to look at one collection of projects in a large set as opposed to another. It's helpful at that, but it doesn't answer all your questions, and sometimes you get false positives for whatever you're looking for. Really, much of what we call artificial intelligence is simply unsupervised machine learning, and I use this slide to point out that, in my opinion, there just isn't anything sentient in computing. As people, we look at problems with disciplines and experience, and we can apply our knowledge and experience to solve things. Machines can do highly repetitive tasks and help us know where to look, but they can't replace us. At least, that's my hypothesis. I'll take any questions you might have. Hopefully I've given you something to think about and you're not entirely confused. Any questions? Yes?

Audience: The thing that drew me in here is that I've used the language analysis part, but there are all these other aspects. The question I asked earlier was: how do I figure out who gets things done in a project? If I have an idea and I want to see it through to completion, whose shoulder do I tap? How do you know who the really core maintainers are? A lot of projects don't maintain that kind of information anywhere, so figuring out how to actually get something done can feel like magic. And it just occurred to me: as you're pumping through all of these issues and pull requests, you could see who shows up first and who tends to follow through. So that's the other side of it, something that could be identified.

So, for identifying the contributors who might be core but aren't labeled as such in a project README: I don't think I'd need machine learning for that. I would look at who's making pull requests that get accepted at a higher volume than other people; who are your high-volume pull request creators? I might also look at who has permission to merge pull requests, because that's an indirect signal of people who've been trusted by the maintainer group to handle some of that activity. If I'm looking for people who are already significant in a project, those are the things I'd look at, and I wouldn't even need a machine learning algorithm.

Where I think machine learning can be useful is with somebody brand new, a newcomer. This is not something I've looked for, but I can imagine applying these approaches to ask: what are the patterns of communication and interaction demonstrated by people who moved from their first contribution, gradually, into some kind of greater contribution role, or even a maintainer role? What are the properties of the people who've been successful at that before? I suspect they're different for every project, but there are some things we already know are the same. You're more likely to continue contributing to a project if your first pull request is accepted; as simple as that. You're more likely to continue contributing if somebody responds to your pull request or your issue in a timely manner. So some of what we'd use to identify and cultivate promising new contributors isn't even data; it's practice. There's a small set of practices we know work for drawing new contributors in: responding quickly, working with them to get a first pull request merged.

Audience: And that would show who's responding quickly, and what the types of responses are. So beyond the general question of who gets things done, you could look at each type of activity: who tends to get things merged, who gets reviews done, who's good at giving feedback or commenting on issues. Breaking it down a little more.

Yes, exactly. Thank you.
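A toy sketch of the non-ML approach just described, ranking people by merged pull requests (the data below is made up):

```python
# Surface de facto core contributors by counting merged pull requests.
from collections import Counter

# Toy (author, was_merged) pairs standing in for real repo data.
pull_requests = [
    ("alice", True), ("alice", True), ("bob", False),
    ("alice", True), ("carol", True), ("bob", True),
]

merged = Counter(author for author, was_merged in pull_requests if was_merged)
for author, n in merged.most_common():
    print(author, n)  # alice 3, carol 1, bob 1 -- alice looks core
```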
Thank you. Is there anybody online? No? Okay, fair enough.

Yeah, I probably should have obfuscated the names of the projects. Well, one thing I would do with this: I know that in some subset of the open source scientific software, I have messaging that's significantly more negatively valenced than the most negative messages in a corporatized open source project. And as a human being who operates across both worlds, I understand why that is. Scientists just have less energy for being polite. They're trying to get things done, and often they're communicating with peers or graduate students with whom they already have an existing relationship. So it comes across to a machine learning algorithm as negative, but the context I wrap around it is: these are probably people who know each other, and I might just be more direct with someone I know. We all hopefully know that when you're more direct, you're often less polite; but if you have a relationship where you can be more direct, it's also more efficient. I think scientists, the people who create open source scientific software, operate more in that headspace. That'd be my hypothesis. And there's a way to do that politely, but PhDs aren't always good at that.

Audience: Just to add to that, there's an attitude of: if you're moving as quickly as possible and you see a hole, go for it. You're blocking the other traffic by not going for the hole.

That doesn't sound like defensive driving. No, I know exactly what you're talking about; it's how I drive. And as I try to imagine teaching that to the 15-year-old twins in my house, I don't model the best behavior, but I want them to learn to drive defensively. So maybe you don't want to give me your Corvette. That reminds me: has anyone ever lived or driven in Minnesota? Every place I've ever lived or worked extensively has its own little driving weirdnesses, its own anomalies. In Minneapolis-St. Paul, you'll get into a lane where two lanes go straight or you can turn left or right, and people don't turn on their left turn signal until the light turns green. It happens to me all the time there and never anywhere else; every time I drive in Minneapolis, I encounter this behavior. It's an anomaly of that local culture, however it developed; it would be fun to study, to your point. I know from experience that when I go there, that happens. In Buffalo, New York, people go through red lights all the time, because there are red lights where there's no traffic. Buffalo used to be more of a going concern than it is today, and it still has all the stoplights from 30 years ago, when Buffalo had industry. So people just drive through the red lights.

So if I was going to go beyond the sentiment valencing you're asking about, I would definitely build computational models specific to each ecosystem, because they have other differing characteristics. For example, there are significantly fewer contributors on most widely used open source scientific software than on a similarly widely used corporatized project. So you have smaller contributor communities and fewer maintainers in general. There are exceptions to all these rules, but on average, these are very significant structural differences between the two ecosystems.
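As a sketch of that ecosystem-specific idea (toy message data; in practice you'd fit whatever models you're actually using per grouping, and nothing here is Augur's real implementation), fitting a separate text baseline per ecosystem might look like:

```python
# Rather than one model for all repos, fit one per ecosystem grouping,
# so each group is judged against its own conversational norms.
from sklearn.feature_extraction.text import TfidfVectorizer

corpora = {
    "scientific": ["this is wrong rerun it", "the solver still diverges"],
    "corporate":  ["thanks, lgtm", "nice work, merging now"],
}

# Learn a separate vocabulary/baseline for each ecosystem.
models = {eco: TfidfVectorizer().fit(msgs) for eco, msgs in corpora.items()}

# New messages are scored against their own ecosystem's baseline, so
# "direct" scientific discourse isn't judged by corporate norms.
print(models["scientific"].transform(["the solver is wrong"]).shape)
```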
So it's not just machine learning we can use to get that insight, but we would definitely want to develop models and train them against these different groupings. Beyond the sentiment analysis, which is just an algorithm rather than a trained model (useful to know), I would build models for things like discourse analysis and novelty that are specific to the two different ecosystems.

Yes, I would definitely do that. The number of contributors is lower. I haven't looked into that question directly, but I can tell you from experience talking to and working with people across these different ecosystems, and from maintaining Augur and some of the CHAOSS repositories: as the number of contributors grows, the maintainer load, as you probably know, becomes much greater, and the possibility that I will be inordinately brief goes up. Speaking just for myself, the possibility that I'll handle something less than perfectly increases as more people make contributions, because now I have a significantly higher workload.

One of the things I think we're all concerned about there is maintainer burnout, because there's a degree of success you can have that ends up eating your life. Maintainers don't always know when it's time to bring on, or to cultivate, additional maintainers who can help the project continue to be successful without one person having to be its overlord. If I'm thinking about how to educate maintainers to help their projects not rely so heavily on them, to be more sustainable with or without them (and not every maintainer has that goal, by the way), it would be helpful to give maintainers guidance, to help them develop a sense of how to cultivate community. One of the things we do on the CHAOSS project is actively help projects understand how to cultivate community; we look at community as the center of things. If we cultivate a community and have many people involved in different parts of the project, the load on a maintainer diminishes, because we're also identifying future maintainers. That's a set of practices that has evolved from trial and error. Some projects are very good at it, and the more you have corporations and organizations involved that have done this before, the more likely you are to have that kind of project culture from the outset.

If there are no other questions: thank you very much. I really appreciate you spending part of your last afternoon here with me. Thank you.