It's nice to be here, and I would also like to tell you this shirt is perfect for this talk. I'm very excited to be wearing this today, and that's why my slides are yellow: it's specifically to go with the shirt. I'm very into that. You know, I got this at a thrift shop. They were putting it on the mannequin and I had just gotten this talk accepted; I was like, "You take that off, that is mine."

We are t-minus one minute. I'm probably gonna blast through these slides, and then I'd really like to just open it up for discussion, because I think that this is one possible approach, but I also know that I'm one human being, which means I probably haven't thought of the best idea. There's a lot more of us in this room today, and there's a lot more of us outside of this room participating in this conference, and I'd like to see if we get better ideas for how to solve this problem with engineering.

So with that: it's 5:05, packed room, very excited. Today we are going to talk about how the gender gap in open source is a metadata problem. And because I do believe that it is a metadata problem, it's something that we can solve through the engineering of our interfaces and the way that we provide contextual information around developer decision-making.

My name is Sal Kimmich, and I do care about diversity and open source, but I've never given a diversity talk before, so this might be a little bit unhinged. I'm gonna have a lot of fun, and I'm very excited to be giving a talk without a tech demo for once in my life. If you want to follow along with these slides or keep them on your own, you can go to tinyurl slash os metadata, and reach out to me on Twitter if you want to continue this discussion; I would absolutely love to.

Before we get started, I'm going to start with a call to action that's absolutely unrelated to this talk but also important. I care a lot about cyber security. I also care a lot about incentivizing developers well. If you are familiar by now, we've been really
talking about it: there is a new Open Source Security Mobilization Plan that is going into action now. We are developing a tiered plan for that. There is both the printed document available from openssf.org and the running document that's openly available for you to look at; the plans and the tasks that we are actively deciding on right now are at this GitHub. So the OpenSSF has a large plan of 10 tasks that they're taking on. I'm specifically taking on one element of the education, and that is the rewards and incentives for maintainers and developers, because one of my agendas in life is making sure that we all say thank you to the maintainers who keep this all running. I have been a maintainer myself, and thank-yous mean a lot. It's a very hard job. If that really interests you, get engaged: the more people that get involved, the better the ideas will be. So definitely look into it. That's the end on OpenSSF for now.

Here's what we're going to talk about today. I'm going to describe exactly what this problem is and exactly what it's scoped to; it's a very specific problem I think we can solve. I'm going to tell you why I care so much. I'm going to talk about who helped me to figure this out and who's worked on this problem before, the benefits to open source security, what happens if we fix this, and how we can start to think this way to build a better digital future in open source.

So what exactly is the problem that we're trying to fix here? I think that open source is a deeply human activity. It can be, it should be, it must always be. It's the community effort and the trust between individuals that allows open source to thrive. I do not want to get rid of that. What I do want to get rid of is unnecessary contextual information at one very specific decision point. When you are first reviewing a pull request, I do not believe that there should be any contextual information about the individual: their image, their handle, their email. It should only be the code that you are looking at in that
initial review. There's a couple of good reasons for this that we will discuss. At the moment in which you decide to accept or reject a pull request, all of that metadata should immediately become available. You should know who that individual is, you should be able to interact with them normally, but in that very specific moment we should be removing that data. And there's great statistics to demonstrate why I want to do this.

Open source is deeply human, but we have not succeeded in fully opening it up to all of humanity, and that means we do not have, statistically speaking, the best decision makers in the room. When I really like to motivate people around this problem, I'd like to remind you that if we only have six percent female representation in open source, that means that we are missing 44 percent of the global population. What is the chance that the best contributor to your open source project can't contribute today, doesn't feel open to be able to contribute today, possibly doesn't know that your project exists? We really need to be able to engage with those populations and make sure that they are welcome in a way that they can verify.

A little bit more about what I'm talking about today. My background is in neuroscience, so I love to look at the ways that our brain understands information. What color is this can? Yell it out. Does anyone see this can as anything other than red? Well, we have no red blindness in the room. That can is not red. There are no red pixels in that image. This is called cortical fill-in. It is a higher level of processing in your brain that is filling in the contextual information, because it has seen this object so many times that it is giving you the assumption, the visual illusion, that that is red. I can send you this image file to zoom in if you would like. And this is to say that you are biased. Even when I explained to you that that Coke bottle was not red, did you cease to see it as red? No. So when we're trying to take approaches to
diversity and inclusion, when we're trying to educate individuals about how to be more open and inclusive, that's excellent, and that may work on some level. But humans are very good at categorizing and prejudicing information which has been useful to them in the past. That's excellent, it makes us really good computers, but it makes us really bad at new contextual environments. As we begin to better globalize open source, we need to make sure that those biases that we bring with us from our lived experiences do not make their way into the decision making, which should be exclusively about the quality of the code contributions that we take in.

I'd like to talk about what I'm not talking about today: bigotry. We can't solve it. It's a pretty natural part of the way that human cognition exists. There's positive and negative bias all over the place. We're not going to be able to remove that from the human brain, but we can contextualize the human brain in an environment which makes its decision making easier. I'm going to be talking a lot about cognition today: about the way that the human brain takes mental actions or processes to acquire knowledge and understanding through thought, experience, and the senses.

Not sure why that's coming up. But with that said: some sad statistics about gender diversity. When I got interested in this concept, honestly, I'll tell you the real reason why I was interested in it. At the last Open Source I was sitting in this diversity chat, and I was getting a lot of those "let's hire more women, let's train more women, let's improve mentorship," but no one was talking about how we systematically engineer for it. Then I went to karaoke afterwards and talked to some of the individuals from The New Stack, and they said they were really interested in taking on this research. So we started diving into it, and this study really stood out to me. This is a peer-reviewed academic study from 2017 which reviewed real GitHub pull requests
from men and women. What's fascinating here is that when they removed the contextual information (we're speaking to the image, the handle, and the email of that contributor), women's PRs were accepted at a rate 4 percent higher than the average of men's. That's when there's no identifying information. Now, when you take those exact same pull requests, the exact same code, and you put that metadata back in, the decision making around that then inversed: female contributions dropped to 9 percent below the acceptance rates of men. That is a 15 percent gap, a 15 percent gap in the best contributions getting accepted, in this case as we measure that by mergeability.

There's a couple of reasons for this. Women contributing to open source do tend to contribute at a later stage in their careers, when they tend to have production experience, so they are tending to provide statistically higher quality code just for that reason. I found this incredibly fascinating. There has not been another study looking at what it looks like to do this systematically. We need to see this, and we need to see this for other demographics.

I also think that we need to be considering masking metadata for what is really close to my heart, a cyber security issue. There is absolutely an increasing supply chain attack on open source projects where hackers will take on the identity of a known contributor. They will submit code as the identity of that known contributor because they are aware of the halo effect: that code will not be as highly scrutinized as an unknown contributor's. They'll often provide a decent amount of code that looks like a typical commit, and they'll add a little backdoor into it. It merges just fine, but if you weren't looking for those four lines of code, that's your problem; that's in your code now. So if we remove that metadata for just one moment in time, and review the code on the quality of the code alone, we can make sure that we reduce some of the effects that are going both positively
and negatively in this space. There's other solutions around cyber security that I'll talk about later. We should put verification into all of these massive ecosystems, but we need to put a band-aid on this bleeding wound right now. We need to remove the ability for these hackers to use these social vectors with ease. And if we allow this to go into place, this possibly will have a positive effect that we could track.

I'd like to talk a little bit about why I care so much. One, a little bit because of where I came from. I am Native American. I come from a very underrepresented background in technology. Only about 23 to 25 percent of us ever go to college, and then almost half of us don't get out of college. In order for me to get through college, I had to get funding through all of these institutions, meaning I was spending most of my time in college studying and writing grants so that I could continue studying. It's a very different experience.

I eventually got into a PhD program through the National Institutes of Health, where I was studying metacognition, which is quantitatively understanding how excellent performers, high performers (we're talking like Olympic performers), know when they've gotten something right. How do they make good decisions? A really, really fascinating space. When I jumped out of that, I had a lot of supercomputing experience, and I was pretty good at scalable Kubernetes. I then went to go work for the Missile Defense Agency and Kessel Run on really mission-critical projects where we could not have failure, so it was very important in those spaces to have very hygienic decision making across those developing that code.

I then made a permanent move to the UK and started working fully in open source. I contributed to and maintained the Chaos Toolkit and also Reliably. That's where I got my original maintainer experience, and why I insist that we say thank you more often. I now work at Sonatype because I've become really interested in these supply chain attacks and
seeing what we can do to engineer that experience to remove those. At Sonatype we take a really interesting approach to cybersecurity. You cannot be doing this at the enterprise level for these supply chain attacks. We instead were the original developers of Maven Central, which is the largest repository for open source projects in Java. We put an immune system around that: we remove the vulnerable packages there so they never go downstream. That is a really powerful approach, and it's why I work with them now. We're also doing the same for Python, which I'm very excited about, because Python's also a little bit on fire. So you're welcome.

But before I really moved into tech, these are the things that I was thinking about. I really care about how humans make important, critical decisions in moments, particularly under stress. One of the first jobs that I had was watching pilots in Boeing 737 aircraft, seeing how they interacted with the cockpits and with each other, and finding what the missed information in those dashboards was that we would need to engineer into the new cockpits. I put this up because this is a fondant version that is incredibly accurate, that took us like three days to make, and I'm just really proud of it still. So you all get to look at that now.

The metacognition that I was mentioning before is a really interesting space where we are finding out that brains actually work differently when they are in a highly contextualized environment. This can be PRs, for example. If you have reviewed a thousand PRs, you're probably better at it than someone who's doing it for the first time. You're probably relying on some implicit contextual information that you're not highly aware of. This is all being done at the level of the subconscious. I'm not going to talk much more about my PhD today, but my advisor just put out this book, and it's very, very good. The last chapter of it goes into the implications for artificial intelligence, and I just think everyone
should read it. It's very good. I'm very biased, very biased towards it, but read it. It's a good book.

I also care about this because I accidentally solved this problem before. So I'm not standing up here saying, "Hey, maybe this will work." I've proved it once, and I proved it to myself by accident. Here's what we did. I was working on real-time fMRI, which means that my machine has to be working in order for me to get real-time data. One time my machine broke, and it took three months to fix it. I was like, "Oh, what am I gonna do? I have no data to put on my beautiful supercomputer. I am so bored." And this is what I thought about. I had three months without my beautiful machine, which I do miss. I was in a field with 18 percent female representation, and there were zero online training courses for big brain data. And I was really interested in starting to get involved in the cloud. So I was like, "All right, fine, I'll do it."

So I created a thing called the Online Brain Intensive. It was called "intensive" because I knew I had a limited amount of time, so I scrunched all of this into a two-month period and made everyone do this at a really accelerated rate. Everyone complained, but it was all volunteer. Welcome to open source. We were preparing them for a brain hack, which is a really cool part of computational neuroscience. We contribute a lot, primarily to Python. My first contributions to Python were rewriting a bunch of R into Python. How boring, but it needed done. And we got AWS Educate to fund it, on the condition that those who completed the course would receive computing credits to be able to test their hypotheses.

Now, when I ran this the first year, this is what we got. There were almost a thousand participants, 748, from 42 countries. At the end of it they consolidated into 12 teams and produced seven peer-reviewed papers from this. And in a field with 18 percent female representation, my course had 56 percent. I found that was interesting, but you
know, it didn't really highlight any curiosity in me at the time. I was also interested, with my signal processing background, in seeing what we could do to make sure that the exchange of information between individuals was structurally as equal as possible. So when we designed the Slack, I said to each of these experts, "Even you have to be anonymous." You can go in there and have conversations with individuals, but because no one knew who the expert was in the digital room, those conversations were much more cordial; they were much more interesting. We then organized by topic, and we prioritized by value. I also wanted this to be as accessible as possible, so I made sure that all of these educational resources were asynchronous, and they provided a video and a repository for individuals to use. This is just to say how cool this was: we got all the way up to generative intelligence on SageMaker. It was awesome. Then we put the teams together.

But this is what really interested me. Every week I asked the experts who, on this anonymous platform, had the most interesting unanswered hypothesis. That's what we were trying to engineer to find. And over 80 percent of the time they identified a female participant. They did not know this; I was the only one who did. When I investigated this further, at first I was like, "Oh no, what's happened? Have I inverted my statistics?" It should be 20% or 50% in this case, but it was 80%. That was because I had accidentally created, as a PhD student, a resource that individuals who were much more senior in their career, either in their first or second postdoc, were utilizing, specifically stating they needed to use my pathway to AWS computing credits to be able to have equal access to computing to test ideas. They were not receiving equal access to funding or equal access to computing, and when we made it equitable, they were able to get this work done and move their careers. So what's interesting here is that it wasn't gender equity that we got; we were signal processing out,
in the way that we were traditionally doing computational neuroscience, we were signal processing out the best ideas. And when we removed those social biases, we were able to surface them immediately.

So who helped me figure this out? Jenny Riggins, the best, took this on, was super curious about this, and has been helping me reach out, get interviews, find contextual information. We've interviewed people who have created versions of this already. Emma Humphries created zombie, which is a Mozilla integration. Brian Allure created a Chrome extension. This is a problem that's been around. People have known about it; they've tried to solve for it. And The New Stack has been putting out a series of articles if you want to read into this in a little bit of depth. We cover this on The New Stack, on "can this boost security and diversity." There's one coming out just today all about that zombie and first attempt: their motivation and some of their findings from using it internally. Definitely go check that out. In a couple of weeks we're gonna put our final part of this article out, and I am putting a call out to anyone: if you have any interesting data around this already, any ideas, let us know. Let's talk to you, and let's get the word out.

So what are the benefits to open source security? The problem that we're trying to solve here is called malicious code injection. It is when someone takes on the identity of a known contributor or a known maintainer. It's incredibly hard to track if you're not using something like PGP, although even then you might have some issues. Now, this is a good approach, but again, it's like a band-aid on a big, big wound. On these major projects, I still want you to actually be using best security practices. Definitely use Sigstore, although Sigstore still goes to your email, which means it's basically the same security surface area that we already have a problem with in GitHub. Safe and signed Git commits with PGP: much better idea. There was
just a talk on this today. If you missed it, look at that name, make sure you get that recording, and instantiate that on your open source projects. My god, please do. Our supply chain attacks are increasing. But my problem here is that people tell me that there's these solutions ("I don't work in cyber security"; I'm aware), but these aren't instantiated on some of the biggest open source projects we have. We need them there. But that also does not solve the Coca-Cola problem. It does not solve the fact that maintainers are sometimes seeing red when there's no red to be seen in the pull request. So if we want to solve this problem now, we need to create this layer of social engineering on top of open source so that code quality is the focus.

Now, what happens if we fix this? I want us to quantitatively test what it looks like to design these systems well. To design systems: I mean, we all use GitHub, and GitHub was designed to process code. I mean, 85 percent of us do. But GitHub was not designed thinking about human cognition, and as these systems get bigger, we need to be much better in designing these. If we can instantiate this on some major projects, we can do a pre and post evaluation, and I would like to see this amongst several other metrics. Will we see an improvement in the quality of community engagement, in geographical diversity, in gender diversity, and in the quality of code and mergeability? I suspect that we will. It also demonstrates, at a qualitative level, to your community that you are serious about inclusive community engagement.

I also just got this statistic from this conference, so I don't know exactly where to find it from, but I'll find it. I think this is fascinating: if we make these spaces genuinely more inclusive, it means that more people engage. If we get only a 10 percent increase in the contributors to open source, we could see a 95 billion dollar, well, billion euro, increase per year. These kinds of ways of removing... essentially, we're still just removing individuals from being able to
participate in something even when they have great ideas. Remove that barrier, and people are going to get more engaged and stay more engaged in your communities.

So how do we build this better digital future? Number one, we need to engineer this. This needs to be integrated directly into the GitHub platform, and the other platforms as well. It's way too hard to maintain these external integrations, and I say that having spoken with the maintainers of those integrations. Number two, we need to implement: maintainers and community managers should take this on as a code quality best practice. The Linux Foundation is already interested in possibly instantiating this as a best practice that can be badged. Number three, we need to reward. We need to make sure that we find the communities where they're taking social engineering of open source seriously, where they're taking quantitative metrics seriously, and reward them for being genuinely inclusive environments.

Now, I reached out to GitHub several times, and I'm not here to yell at GitHub, but it took a change.org petition for us to be able to vote on issues. Do you remember this, like eight years ago? They just wouldn't do anything until we did a change.org petition, and maybe this is what we need to do to get the interface changed. Is anyone disappointed that we can actually vote for issues? That needed to happen. So we might take this approach to it.

But what I really need is maintainers. I need maintainers to take this seriously, to want to test this, to see if this makes a difference on large open source projects. And again, this solution isn't intended for the six-person project or even the 15-contributor project. This is really for projects like Prometheus, when you have so many contributors that you have to, right now, lean a little bit on your social biases while taking these ingestions in. Those are the spaces where this is going to have the best impact. And this is not exclusive to automation. Let's put more automation in,
let's put these security best practices in, but let's remove the Coca-Cola problem at the same time.

So this is what I need. I've got two calls to action, and I really need your help. Number one, I need your help in petitioning GitHub to consider putting this into their platform, so that maintainers have a button to push in order to put this best practice into place. Number two, we need to revitalize the efforts on the extensions that already exist. This is open source; I hate seeing the creation of wheels when wheels already exist. Let's just blow up the tires on the wheels that exist. So if you're interested, these two repositories are the best ones. These maintainers are interested in having more engagement, and we can put these back into place and sort of extend their utility. If you are interested in doing this, I would love to talk to you, either now (I've made sure that we've got plenty of time to chat) or reach out to me with questions and comments. Again, I would like to not be the smartest person in the room. It is very likely that someone else here has a better idea than I do, and I want to hear it. And that's my talk. Thank you.

So with that, do we have questions, comments, concerns?

That comes from a slide from another talk at OSS. I just thought it was a great statistic. I don't even know where to cite it, but I will find it and I can send that to you. The belief here is that when we're working in open source, let's just strip it down to what it is: it's intellectual property, it's the creation of new utilities. So when you are developing new utilities in this space, it implicitly makes it easier for economies to produce new products. When we see these advancements, these small advancements in open source projects and their interoperability as ecosystems, that allows for new things to be developed: new products, new industries, new sectors. That's really what that is speaking to. And my argument here is that we're not going to see that kind of increase if we continue to do diversity in the way
that we're doing it. We need to systematically rethink the way that we've structured engagement between individuals digitally. Yeah?

If all of the pull requests were anonymized, do you think that it would affect... I would guess that some people like to do contributions because they get internet points: they get some stars or some extra credit, like you mentioned, social credit for doing that. Of course, we can discuss if it's a good thing that people do it only for the social clout. But do you think it would affect the amount of pull requests that people would submit? Do you think some people wouldn't want to contribute because they would know that it ends up as an anonymous contribution?

Yeah, so this is scoped to a very specific aspect of this. I think that the moment you decide to accept or reject a pull request, all that data becomes identified. You should know exactly who you're speaking to, and you should continue that at a personal level. Open source doesn't work unless we give each other internet high fives; we all know that. I'm not arguing for any kind of anonymization anywhere else. Immediately begin to have a conversation on an individual level as soon as you are ready to have that conversation. But the statistics are showing us right now that that metadata is biasing that decision making away from the highest quality of code, which I find really, really concerning. So no, do not make the internet anonymous. Just remove the data in this literal one snapshot of a decision, nowhere else.

Excuse me, I apologize, this may be slightly outside of the context of the current talk. You mentioned 86 percent of us use GitHub. Obviously GitHub is a lot, but it's not everything. There are other, fewer projects, but significant projects like the Linux kernel, that use other development mechanisms that
are a lot harder to be anonymized, like mailing lists. Do you have any ideas or similar concepts we could apply to fixing the problem in that space?

Sen, you know how I feel about mailing lists. As a signal processing engineer, I think that mailing lists are incredibly inefficient. There's a lot of information loss in handling intellectual property as a chain instead of a centralized source. There's also benefits to that; that decentralization is part of the argument for it. In those spaces, yes, possibly. Well, it's a little bit different. We've got to speak to the kind of communities and the kind of communication that's typical in them. What I'm really premising this problem on are very large distributed open source projects, primarily CNCF; I'm very biased. But when we're speaking to the mailing list for the kernel, or the mailing lists for a lot of the Apache software projects, for example, those tend to be smaller clusters of highly embedded, highly engaged individuals, so anonymization may not even be as effective in those spaces. This is not going to be very effective if you know the digital handwriting of the 17 contributors that are always contributing to your project. But that is something that I should think a little bit more about. I imagine it could be, right when you instantiate the first issue or discussion around production, that possibly anonymizing that is valuable. But then immediately you're going to be in a long email thread, and then inevitably someone's going to lose that email thread, and it's just a very inefficient way to work.

Thank you for the talk. I have a data question. You had mentioned that what you would anonymize would be the email, the handle, and the picture. Does it have to be all three? Could it be just the picture? Because if you pull the data from profiles that maybe have a default photo or an avatar, could that potentially tell you something about the current state without the other two?

Yeah, I mean, in my dream of
dreams, when we're doing this metadata masking, we replace those with nonsense titles like "giraffe emperor" and a fake cartoon, and just make it fun to engage with. But I think that's what we started out with, like avatars. Yes, right, yeah, exactly. So it's interesting: for years, I've put out a call about two or three times a year to take on two or three early-career data scientists. This is meaning they are still in school, and some of them haven't even started up their GitHubs yet. And when I speak to them, I do have to encourage them: if you have a strongly femme-identifying first name, just include your last name. Do not put a picture of your face in there, because you are statistically significantly less likely to have your first pull request accepted. I really do not want to have to keep making that recommendation. So yeah, make your images whatever you'd like to, but right now, if you have a female-identified face, you're going to have a 15% less likely chance of getting your PR accepted than a male-identifying face, and it's for no reason.

I was also thinking about whether GitHub is enough to make the change, because I am also participating in the Linux kernel using the mailing list. From that sense, I was thinking: would it make sense to make the anonymous change not in GitHub but in Git itself? What do you think?

Yeah, that's an interesting idea. That's a pretty good idea. So still being able to track the chain of intellectual property, but removing all of the metadata at all. That's really interesting, and this is a terrible way to put it, but I was gonna say "hard typed anonymization." Yeah, I think that might be a reasonable approach. But at the end of the day, Git is just a versioning technology. The reason why we've built these platforms on top is to embed the social element. I don't want to remove the valuable conversations. I do not want to remove the Socratic method that follows
for days and weeks after you submit a questionable PR. I think that that's super valuable. It's why I participate in open source. A few years ago I was working on these massive distributed Boolean networks, and I had contacted this maintainer like five years before with some weird, weird questions about the way it was working. Then I contacted him half a decade later, and I was like, "Hey, I'm working on this and it needs a hundred nodes, is that even possible?" And he just sends me a thing and says, "Oh no, it's you again." I was like, "Yes." And he's like, "Good to see you, hope you're doing well. Yes, technically, but give me two weeks." So yeah, that's what makes open source fun to me. I don't want to remove that, and if we remove that, I wouldn't participate. That's why, the second that you decide to ingest a snippet of code, you go right back to it being humanity based. That's really what I want to see and experience and continue to have: being there, verifying that trust. But also verifying it with PGP, please. So I don't think I would argue for that level of anonymization, although it's conceivable and testable. It would be interesting to see what that does to a contributing community.

We're all good? Are we all going to go do this on our open source projects now? Yeah? Cool. Well, get in touch with me, let's see if we have better ideas than me, and let's see what we can get done.
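
The masking described in this talk, hide the handle, email, and image only until the accept-or-reject decision, then show everything again, can be sketched in a few lines of Python. This is only a rough illustration, not how GitHub or any existing extension works: the dictionary shape, the field names (`handle`, `email`, `avatar_url`, `decision`), the word lists, and the functions `pseudonym` and `review_view` are all assumptions made up for this example.

```python
import hashlib

# Hypothetical word lists for playful placeholder names ("giraffe emperor").
ANIMALS = ["giraffe", "otter", "heron", "badger"]
TITLES = ["emperor", "captain", "professor", "wizard"]

# Assumed names of the identity fields on a pull request record.
IDENTITY_FIELDS = ("handle", "email", "avatar_url")


def pseudonym(pr_id, salt):
    """Derive a stable nonsense name from the PR id, not from the author."""
    digest = hashlib.sha256(f"{salt}:{pr_id}".encode("utf-8")).digest()
    animal = ANIMALS[digest[0] % len(ANIMALS)]
    title = TITLES[digest[1] % len(TITLES)]
    return f"{animal} {title}"


def review_view(pr, salt="example-salt"):
    """Return a copy of the PR as a reviewer should see it.

    Before a decision exists, identity fields are masked and only the
    code remains. Once a decision is recorded, nothing is hidden.
    """
    if pr.get("decision") is not None:
        return dict(pr)
    masked = dict(pr)
    for field in IDENTITY_FIELDS:
        masked[field] = None
    masked["handle"] = pseudonym(pr["id"], salt)
    return masked


if __name__ == "__main__":
    pr = {
        "id": 42,
        "handle": "octocat",
        "email": "octocat@example.com",
        "avatar_url": "https://example.com/octocat.png",
        "diff": "+ return a + b",
        "decision": None,
    }
    print(review_view(pr))                             # identity masked
    print(review_view(dict(pr, decision="accepted")))  # identity restored
```

Keying the placeholder name on the pull request rather than on the author is a deliberate choice in this sketch: it keeps a reviewer from learning to recognize the same contributor across several masked reviews.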