 Hi, my name is Colin Eberhardt and I'm going to talk to you today about the software crisis or what the financial crash can teach us about open source. Apologies for the slightly less than cheery title. By way of introduction, as I mentioned, my name is Colin Eberhardt and I'm the Technology Director of a company called Scott Logic. We're a growing UK based software consultancy that works predominantly within the financial services community and I've been working there for many, many years. And some of the experiences that I've had from working with our financial services, our banking clients have very much reflected in this talk. But also to tell you a little bit about myself, this talk is very much an intersection of my day job and my personal passion. Open source is something that I'm heavily involved with. I've been working in open source for more than 25 years now. I think my first project or my first open source project was one written in PHP about 20 years ago and before that I used to write a lot of freeware and shareware software. I also really enjoy data mining and exploring data sets and as you'll see in this talk some of these interests are certainly coming together here. So back to the topic of the talk itself, what the financial crash can teach us about open source. What I want to do is tell you a bit of a story here. For a while I've been getting a little bit concerned about the overall complexity of our open source software. And it's not just open source software that's getting more complex. Software architectures in general seem to be getting more complex. We're gravitating more towards microservice architectures where we take a monolith and we pull it apart into numerous components. Also cloud architectures tend to be really quite complex involving a great many different components and generally speaking this is a good thing. However complex architectures, complex software can be hard to understand. It can be difficult to use safely. 
Before I use a particular open source project or a library I often ask myself who writes this code? Who maintains it? Is this a sustainable project? And ultimately am I going to regret using it? Is it going to start to fall apart underneath me? And that does happen all too often. So that's the complexity and also the fragility. Complex structures have weaknesses. They fall apart. Security and maintenance problems are prevalent within open source. And finally sustainability. And I'm not just thinking about the sustainability of my own projects that use open source. I'm thinking about the sustainability of the ecosystem as a whole. How do we create a sustainable open source economy? So I decided to do a bit of an exploration myself. I wanted to learn more about this problem. And the best way to learn about it is to start exploring the data and exploring the community. So I decided to do a bit of a deep dive. And the project that I picked is Express. And the reason I picked it is because it's a project that I've used a number of times myself. If you're not familiar with it Express is basically the de facto web server for Node. If you want to create an HTTP server with Node, you'll almost certainly end up using Express. It's clearly an important project. On GitHub it has 50,000 stars. It's downloaded 14 million times every single week. It's also 10 years old. So quite a mature project. So my initial impressions as I started this journey, it's fair to say I was clearly happy at the beginning. I'm a repeat user of Express. And the reason for that is I'm happy with the functions and the features that it provides for me. So what I wanted to do was start learning how is Express constructed? What are the component parts of Express? So I started to look at its overall composition. I started to look at its dependencies. So Express is composed of 49 separate modules. When you install Express, the top level module is installed. 
And this is accompanied with effectively a bill of materials. And another 13 modules are installed. And then further transitive dependencies are installed. And the total composition of Express is 49 separate modules. This is really quite typical. And it's not specific to the JavaScript ecosystem. We see similar things happening in Rust, Python, and other sort of modern software tool chains. So I was also interested to see how this complexity evolves over time. So I downloaded all of the 163 releases of Express over the past 10 years. And as you can see, the complexity of Express in terms of the number of dependencies that it has, the number of modules that form its total composition is growing over time. If we look back at version two, which is almost 10 years ago, it was composed of four or five modules. Version three, more like 10. And then the number of modules grew and grew and grew over time. Interestingly, at version four, clearly there was some sort of consolidation effort. The number of modules drastically reduced. And then it started to increase once again. And this is reflected throughout the open source community. From my personal experience, our open source software is becoming more modular. It's more sort of composed of disparate parts. So this is clearly the way things work at the moment. But I must admit, I'm starting to feel a little uneasy. If I want to gain an understanding of the maturity or the sustainability of Express, I'm not checking just one project anymore. I'm checking 49. Now, if I want to do something pretty simple, say for example, I need to check that the licenses for each of these modules comply with my own policies. That's relatively straightforward to check. I can automate that. And I must admit, I've used automated license checkers on a number of projects. And it's quite surprising how often you'll find one that perhaps doesn't have a license at all or has a particularly problematic license or is public domain. 
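An automated license check of the kind described here can be sketched in a few lines. This is a minimal sketch, not the tooling used for the talk: the dependency graph, the module names, the license strings and the allow-list below are all hypothetical. It first walks the graph to collect the total composition (the top-level module plus every direct and transitive dependency), then flags any module whose license is missing or is not on the allow-list.

```python
from collections import deque

# Hypothetical dependency graph: module -> modules it depends on directly.
DEPENDENCIES = {
    "web-framework": ["router", "body-reader"],
    "router": ["path-matcher"],
    "body-reader": ["path-matcher", "charset-tool"],
    "path-matcher": [],
    "charset-tool": [],
}

# Hypothetical license metadata; None models a module that declares no license.
LICENSES = {
    "web-framework": "MIT",
    "router": "MIT",
    "body-reader": "ISC",
    "path-matcher": None,
    "charset-tool": "GPL-3.0",
}

# Hypothetical policy: the licenses this organisation has approved.
ALLOWED = {"MIT", "ISC", "BSD-3-Clause", "Apache-2.0"}


def total_composition(root):
    """Return the root module plus every direct and transitive dependency."""
    seen = {root}
    queue = deque([root])
    while queue:
        module = queue.popleft()
        for dependency in DEPENDENCIES[module]:
            if dependency not in seen:
                seen.add(dependency)
                queue.append(dependency)
    return seen


def license_violations(root):
    """Map each module that fails the policy to the reason it fails."""
    violations = {}
    for module in sorted(total_composition(root)):
        declared = LICENSES.get(module)
        if declared is None:
            violations[module] = "no license declared"
        elif declared not in ALLOWED:
            violations[module] = f"license {declared} is not on the allow-list"
    return violations


if __name__ == "__main__":
    for module, reason in license_violations("web-framework").items():
        print(f"{module}: {reason}")
```

The answer per module is binary, which is why this check is easy to automate; nothing in it says anything about the quality of the code behind each license.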
However, license checking is relatively easy. The answer is somewhat binary. And yes, you can use this. No, you cannot. Measuring quality is a lot harder and is very challenging. I couldn't conceivably check the overall quality of all of these 49 component parts. However, this is effectively the tip of the iceberg. Whilst Express has 49 dependencies, it actually has 195 dependencies in total. The rest form what are known as the development dependencies. These are the ones that you install if you want to work on Express as the product itself. If you want to alter or change Express, you need to download 195 dependencies. And for other projects I've worked on, you'll find more than a thousand development dependencies. So why should I care? Is this a problem? Well, in some senses, the tools you use to create a framework or a library like Express are very much reflective of the quality. You need to pick quality tools to create a quality product. And it can be challenging with so many different dependencies. But also, probably a bit more worrying is the scope for what are known as software supply chain attacks. Express being an HTTP server is something that if I were a malicious individual, it would be an interesting project for me to target. Now, I could go after Express directly or I could go after one of its 49 dependencies. So that already gives me a number of different places that I could attack. But something which is becoming more prevalent at the moment is supply chain attacks. So these are attacks that insert themselves earlier into the overall software development lifecycle. If I were able to attack and deploy a vulnerability into one of the development dependencies, I could insert a vulnerability or malicious code into Express at compile time. And this, as I said, is a problem that's occurring more and more frequently. So yeah, this complexity is starting to make me a little bit nervous. But let's go a little bit further down the rabbit hole. 
If I install Express, what code is actually being downloaded? And if you install it, do you get the same code? To understand that a little better, we need to understand how these modules are downloaded and resolved. And an important concept here is a thing called semantic versioning. When you first download Express, it comes with effectively a bill of materials. It says, I depend on these other modules. And the way that it declares these dependencies is through semantic versioning. It's a concept that was proposed about 10 years ago by one of the GitHub co-founders. And semantic versioning provides a formality to how versions are expressed. Major version increments indicate backwards-incompatible changes. Minor version increments indicate the presence of new features. And patch versions are incremented when you fix bugs. So this is a way of having a formality regarding how you version your software. And as our software composition becomes more and more complicated, a concept like semantic versioning becomes really quite important. However, I've got some of my own personal concerns about semantic versioning, which I shared in a blog post a little while back, but that's not for now. However, software products like Express rarely depend on specific or explicit version numbers. Instead, they tend to permit version ranges. So I might declare that I want to use this particular library or tool, but rather than saying I want to depend on version 1.2.3, I'll use a semantic version range, caret 1.2.3 (written ^1.2.3), which is equivalent to 1.2.3 and above, but below 2.0.0. So what that means is the author in this case is explicitly saying, I don't mind if this upstream dependency adds new features, I'm happy to bring those into my builds and I'm happy to bring in bug fixes. However, I don't want breaking changes. I don't want major version increments. And this is incredibly common.
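To make the caret rule concrete, here is a minimal sketch of the check a package manager performs when it decides whether a published version satisfies a caret range. The function names are hypothetical and the sketch handles only plain `major.minor.patch` versions, with no pre-release or build tags. It follows the rule that a caret range permits changes that leave the left-most non-zero number untouched, so for versions at 1.0.0 and above it accepts new features and bug fixes but rejects major version increments.

```python
def parse(version):
    """Turn '1.2.3' into the tuple (1, 2, 3) so versions compare numerically."""
    major, minor, patch = (int(part) for part in version.split("."))
    return (major, minor, patch)


def caret_upper_bound(base):
    """First version that a caret range on `base` no longer accepts."""
    major, minor, patch = base
    if major > 0:
        return (major + 1, 0, 0)   # ^1.2.3 stops before 2.0.0
    if minor > 0:
        return (0, minor + 1, 0)   # ^0.2.3 stops before 0.3.0
    return (0, 0, patch + 1)       # ^0.0.3 stops before 0.0.4


def satisfies_caret(candidate, base):
    """True if `candidate` falls inside the caret range declared on `base`."""
    low = parse(base)
    return low <= parse(candidate) < caret_upper_bound(low)


if __name__ == "__main__":
    for candidate in ["1.2.2", "1.2.3", "1.2.9", "1.7.0", "2.0.0"]:
        print(candidate, satisfies_caret(candidate, "1.2.3"))
```

Declaring a range rather than an exact version means the version that gets installed depends on what has been published by the time the install runs.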
I must admit, I'm not entirely sure why you would want to permit new features because, software being software, a new feature in a dependency doesn't magically result in a new feature in your own product. Typically, you have to change your code to accommodate that new feature. This use of version ranges is really quite prevalent at the moment. So I did a bit of analysis. Looking at Express, I looked at a couple of versions of Express itself, version 4.16.4 and version 4.17, the next release. There was a seven month period between them, and I discovered that there were 33 different configurations of Express itself over this seven month period. So whilst Express had only moved forward one version increment, there were 33 different versions due to the semantic versioning of its dependencies and their dependencies. So basically my Express version 4.17 might not necessarily be the same as yours because of the slack or loose nature of this versioning. Yeah, I must admit I'm getting a bit scared now. So we have complex dependency graphs that are ever changing. Also, another concern is, with this ever changing mix of code, who holds the keys? Who is it that is allowed to publish these modules to public repositories that are then downloaded onto my machine according to this sort of recipe? So again I did a little bit of digging into Express and I found that across the 49 dependencies there were a total of 88 maintainers. So what this means is there are 88 different individuals who can create releases which will ultimately affect the software that I install when I download Express. Also, you'll see a funny little colouring scheme here: no Bob and includes Bob. Now what I mean by this is Express has a single core maintainer and, rather than calling him out explicitly, I'm just going to call him Bob for the sake of argument. And what this shows is that for a significant number of the dependencies of Express, Bob, the maintainer of Express, is also a maintainer.
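The way a single release ends up with many configurations can be sketched as follows. This is an illustration under stated assumptions, not the analysis behind the figure of 33: the package names, publish dates, versions and declared ranges are all hypothetical, and the caret check is simplified to versions at 1.0.0 and above. For each install date it resolves every declared range to the highest version published by that date, then counts how many distinct combinations appear.

```python
from datetime import date

# Hypothetical publish history: package -> list of (publish date, version).
RELEASES = {
    "dep-a": [
        (date(2019, 1, 1), (1, 0, 0)),
        (date(2019, 2, 10), (1, 0, 1)),
        (date(2019, 4, 5), (1, 1, 0)),
    ],
    "dep-b": [
        (date(2019, 1, 1), (2, 3, 0)),
        (date(2019, 3, 1), (2, 3, 1)),
        (date(2019, 5, 20), (3, 0, 0)),
    ],
}

# Hypothetical caret ranges declared by one fixed release of the top-level
# package, stored as the base version of each range (^1.0.0 and ^2.3.0).
CARET_RANGES = {"dep-a": (1, 0, 0), "dep-b": (2, 3, 0)}


def resolve(package, base, install_date):
    """Highest version published by install_date that satisfies ^base.

    Simplification: only handles base versions with major >= 1.
    """
    upper = (base[0] + 1, 0, 0)
    candidates = [
        version
        for published, version in RELEASES[package]
        if published <= install_date and base <= version < upper
    ]
    return max(candidates)


def configuration(install_date):
    """The exact set of dependency versions an install on this date receives."""
    return tuple(
        sorted(
            (package, resolve(package, base, install_date))
            for package, base in CARET_RANGES.items()
        )
    )


def distinct_configurations(install_dates):
    """Collect every different configuration seen across the install dates."""
    return {configuration(day) for day in install_dates}


if __name__ == "__main__":
    days = sorted({published for history in RELEASES.values() for published, _ in history})
    print(len(distinct_configurations(days)), "configurations of one unchanged release")
```

In this toy history the top-level release never changes, yet installs on different days receive four different sets of code; the major release of `dep-b` is the only publication that the ranges keep out.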
So what this means is Express is complicated. It's composed of 49 different modules. However, the maintainer of Express also has an element of control over a number of these dependencies, which I'd say is a good thing. However, there are some issues here. There was a survey, with the results published a little while back, and only 9% of npm maintainers had enabled two-factor authentication. So what this means is, for 91%, all you need is their username and their password and you can create a new release, which is a bit worrying. Also, as a result of this analysis I have the email addresses of all of these 88 maintainers, and I took the first one and I typed it into Troy Hunt's very well known Have I Been Pwned, which is hacker-speak for Have I Been Hacked. I typed it into his website and found that the email address had turned up in a great number of different data breaches. So for the very first maintainer email I picked, I could find that that email address and an associated password had been leaked as a result of a LinkedIn data breach a couple of years ago. And we all know that most people are not that good at creating new and unique passwords. Considering that only 9% of these people are potentially using two-factor authentication, it probably wouldn't take too long for me to find a reused password amongst these 88 maintainers, and this is without even looking at the development dependencies, which are four times as numerous. Yeah, this is getting a bit scary, isn't it?
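The maintainer counts can be reproduced with simple set arithmetic. The sketch below is illustrative only: the package names and maintainer names are hypothetical, and applying the survey's 9% figure to Express's 88 maintainers is an assumption, since the survey covered npm maintainers in general rather than this specific group.

```python
# Hypothetical registry metadata: package -> accounts allowed to publish it.
MAINTAINERS = {
    "web-framework": {"bob"},
    "router": {"bob", "alice"},
    "body-reader": {"bob", "carol"},
    "path-matcher": {"dave"},
    "charset-tool": {"erin", "dave"},
}


def distinct_maintainers(maintainers):
    """Every account that can publish at least one package in the graph."""
    return set().union(*maintainers.values())


def packages_including(maintainers, person):
    """Packages where `person` is among the accounts allowed to publish."""
    return sorted(name for name, people in maintainers.items() if person in people)


def expected_without_2fa(maintainer_count, adoption_rate):
    """Rough estimate of accounts protected only by a password.

    Assumes the survey-wide adoption rate applies to this group.
    """
    return round(maintainer_count * (1 - adoption_rate))


if __name__ == "__main__":
    print(len(distinct_maintainers(MAINTAINERS)), "distinct maintainers")
    print("includes bob:", packages_including(MAINTAINERS, "bob"))
    # Figures quoted in the talk: 88 maintainers, 9% with two-factor authentication.
    print(expected_without_2fa(88, 0.09), "accounts with a password only (estimate)")
```

Under that assumption, roughly 80 of the 88 accounts would be protected by a username and password alone.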
I think we've gone far enough down this specific rabbit hole. We've learned about the software bill of materials. The process that determines the code we download and execute is really quite complex. We've looked at this complexity and, I guess, by virtue of that there's an element of fragility. And experience has shown, yes, it is fragile, it does fall apart. You've probably heard of the left-pad incident or the event-stream incident. I just thought I'd look at the most recent incident I can remember, and here's one of them. There's a well-known package called is-promise, and the author made a tiny little change, an honest change. This wasn't a vulnerability, but there was a small error in that change, and that error broke a huge number of other packages or modules. It broke Firebase tooling, Angular, AWS, Create React App, possibly many, many more. This one change was felt widely. Now, in the article that reported this, they made a very good point: the bug didn't crash existing projects, which is a good thing. Just because a new module is released doesn't mean everything will stop working immediately. So there was no actual downtime, but it did prevent developers from compiling new versions of their projects, which is actually a bigger issue than you might expect. It doesn't just prevent developers from compiling new versions, it also prevents continuous integration, so CI/CD pipelines, from compiling the project. So what this means is your whole software delivery life cycle can come crashing down as a result of this tiny little error and this tiny little change. So whilst this article was to a certain extent downplaying the issue, I do think it is a significant issue. Coffee time. So, as I said, I think we've gone far enough down this particular rabbit hole. It's time to come up from there and look at something else. So what about funding? Express is a valuable project. It's used by a great many people and a great many others. How is it funded? How are the people that work on it rewarded? Many
high-profile projects are backed by large companies, corporations. TensorFlow, for example, is a Google project. Electron is a GitHub project. React is a Facebook project. There are numerous examples. So what about Express? Nothing. If you look at Express, there's no obvious funding model, and if you look at the commits and the contributions it's clear that it's pretty much maintained by a single individual. And as you know, beforehand I just called them Bob for the sake of argument. I looked deeper into the dependency graph of the runtime, that's the 49 dependencies, and the development dependencies. Out of all of those dependencies I could only find a single project that had any obvious form of funding, and that's a project called ESLint. ESLint is part of the development tool chain of Express. It's a tool that helps provide consistent code formatting. Again, it's a very popular and very useful tool. Now ESLint uses a website called Open Collective, which is probably the most popular online donation platform for open source. I've used it myself. It's a decent website and I quite like what they're doing. I managed to raise about $50 a month for some of my open source projects, which just about pays my AWS bills. ESLint is the fourth most funded project on Open Collective, so it's quite successful. So does it work?
Well, again I dug into the data. I looked at the 30 most funded projects on Open Collective, and for each of them I turned their annual budget into effectively a full-time employee equivalent. You know, if I was using that budget to pay my bills and support me or others as a developer, how many people would it buy? And as you can see from the graph on the right, which is on a log scale, the funding for ESLint pays for maybe one and a half full-time equivalents, and ESLint is one of the most well-funded projects on the platform. The most funded project can pay for approximately six full-time equivalents. Now, there are more than 2,000 projects on Open Collective and there's a long tail who have similar experiences to myself. And I'm not complaining. I'm not attempting personally to use this to fund my work, it was more of an experiment. However, there are a great many projects on there using it as a way to fund their work, and unfortunately for the long tail, which is the vast majority of them, they get little more than enough money to buy the odd cup of coffee. So really this system isn't working when it comes to making a sustainable ecosystem, and as a result people are trying other models. Some open source projects collect tips or try to use adverts, and again advertising is something that's been tried and has caused quite a stir. However, funding isn't the be-all and end-all of open source sustainability. There's a much more interesting and tricky and challenging side. It's the more human side of open source, and I'm just going to give one little example, again on Express as a project. And again, I really don't want to point the finger too much at the maintainer. I've blanked out his name, even though I guess it's somewhat futile, you could find his name. This was an incident where someone raised a security issue against Express and, to cut a long story short, there were some pretty big differences of opinion around how it should be resolved and
there was also a certain amount of security theater from vendors, in that there were some fairly aggressive flags effectively placed, and black marks placed, against Express, and there was a significant amount of hostility as a result. The maintainer, on the screenshot on the right, basically publicly said: look, I've had enough, I'm sick of the abuse, I'm not going to put up with this anymore, I'm turning everything off for the weekend, I'm deleting my emails, taking a deep breath and coming back. This is a real issue, and yeah, this really is quite worrying. I honestly didn't expect to find this much that concerned me, and to a certain extent upset me, when I started the journey. Express really was something I picked out at random. I wasn't expecting to find such worrying complexity. I wasn't expecting to find such fragility. And most worryingly, I wasn't expecting to see evidence of a maintainer really struggling with their project. So I guess the only conclusion I can really come to is that the only reason this all works is because the vast majority of people are good. It astounds me that this actually works in practice, and I think the only reason it does is because most people are good people. Most people are good actors, but we don't make it easy for them. So this conference, this is the Open Source Strategy Forum. Most of the people listening are not full-time open source maintainers. Most of you are from banks. So what part do you play? Well, before I get onto that, I'll show you the part that we currently play. I'm lucky enough to live on the border of Northumberland, which is a beautiful part of the country. We've got lots of fantastic castles there, and this is Alnwick Castle. If you've watched Harry Potter you'll have seen part of Alnwick Castle, it was used for Hogwarts. Now, these structures date back to medieval times, and in medieval times you'd have the poor working outside, farming the land, and you'd have the rich folk sitting within the castle
and the produce from farming the land would be taken into the castle, and you'd keep the riffraff out, and you'd try to keep a clean, sanitary environment and let the riffraff keep themselves to themselves outside of your castle. And I know you might think it's a little bit damning, but to be honest I think our relationship with open source is quite similar. What we tend to do to tackle these problems is a process of sanitizing and sterilizing. We use security scans and license checking. Once we've cleaned and cleaned and cleaned, we then take that code and we place it in our internal repositories, where it is now safe. Again, it's kind of the medieval model. It's the Wild West in open source, and the best we can do is pluck a code base out, sanitize it, check it to death and then say, yep, it's safe, we'll use that. An interesting quote I read from an open source maintainer who was also struggling: he was at an open source software event and bumped into one of the many security scanning vendors, and he realized, "So this means that they charge a 50-person startup a whopping $30,000 a year to help them feel safe using the code that open source authors like me have given away for free." It's a crazy situation. Also, some of these scans are now pushed out to the wider community. They're not just something that happens within the four walls of some big corporation. Dependabot is something which is turned on by default within GitHub. What Dependabot does is it goes around hunting for people that depend on modules that have got known vulnerabilities, and then it helpfully creates a pull request when it spots one of these vulnerable packages potentially being used. And as you can see here, you can see the semantic version range issue going on. It will let you know that you have to bump it from one version to another. However, there are some big issues here. Firstly, I get lots of pull requests from Dependabot every week. But the bigger issue is all it can ever do is
demonstrate the potential for a vulnerability. Now, for a vulnerability to manifest itself, you have to use a particular module in a specific way. If I have a vulnerability in a templating engine, but I'm only ever using it in my deployment pipeline within a containerized build, that's not the same as if I'm using it on the front end of an application used by my end users. It's completely different. And unfortunately, in my experience of Dependabot, not once has it raised a single genuine security vulnerability. All it's ever done is, to be honest, frustrate me. And again, another quote from that article I briefly mentioned: "If it's not fun anymore, you get literally nothing from maintaining a popular package." And this is a growing concern within the open source community, that the day-to-day struggles tend to get a bit too much for some people. So what is the solution? I think probably the most important thing to do is gain a better understanding of the problem, and I must admit I gained a better understanding of the problem through my investigations into Express and writing this talk. Previously I was of the opinion that funding was the answer, and I'd explored with FINOS different ways that we can accelerate funding, but it's not the answer. You have to understand the ecosystem. You have to understand the actors and their motivations. And one thing you also have to understand is the open source community itself has changed considerably in the past five years, and one of the main reasons why it's changed is GitHub. GitHub has created a centralized community, and actually, if you really want to get into the details of this, I would thoroughly recommend this book, which is screenshotted here. It's Working in Public: The Making and Maintenance of Open Source Software by Nadia Eghbal. It's an amazing book and I'd recommend that everyone read it. What she points out is that GitHub has created a centralized community. In that way it's much more like YouTube, for example. She also describes how the stadium model of
open source is becoming increasingly prevalent. To give you a simple explanation, open source projects are more and more often the work of an individual, and the stadium model is effectively one individual up on stage and tens of thousands who are effectively in the audience. And whilst GitHub makes it easier to contribute, as a result relationships tend to be more transient. You get people having fleeting relationships with projects rather than long-term collaboration and association. And probably the most important thing I learned from this book is that attention is the most prized asset of an open source maintainer. They have a finite amount of time that they can work on the project, and they will seek to optimize that time, or will at least wish to optimize that time, and gear it towards the things that they fundamentally enjoy doing the most. And I've got a great example of what happens when you don't understand this ecosystem. So every year DigitalOcean has run a thing called Hacktoberfest, a very well-meaning concept. Every year they incentivize people to contribute back to open source, and they incentivize them by giving away t-shirts, which on the surface sounds really good. However, this year it fell apart completely, and these are concerns that have been building for a long time. This article briefly highlights why. The author says: "So far today, on a single repository, myself and fellow maintainers have closed 11 spam pull requests. Each has generated notifications to 485 watchers of the repository, and each requires time to visit the pull request page, evaluate, close, tag it." Effectively their attention is being consumed by this activity in a highly negative fashion. They basically call Hacktoberfest a distributed denial of service attack on the open source maintainer community. I really don't want to be too harsh on DigitalOcean. They were fundamentally motivated by the right things, but clearly they misunderstood
the community and how it worked, and the whole thing fell apart. So what should we do? It is challenging. Firstly, don't focus on your own walled garden. One thing I'll point out here is I haven't said don't create your own walled garden. There is still a need to keep and maintain your own copies of open source code for various different reasons, but don't focus on that entirely. Also, don't focus your time on sanitizing and securing. Again, I'm not saying don't do it, but don't focus on these two activities exclusively without considering the wider impact on the community. Do learn about and better understand the open source ecosystem. Read that book that I suggested. Learn about the maintainers. Learn about the people behind the code that you're consuming. Learn how to effectively contribute, and again Hacktoberfest is a great example of what happens if you don't contribute effectively. There is a considerable amount of negative contribution going on in open source at the moment, and the popularity of GitHub has unfortunately increased the noise. Again, GitHub is a fantastic platform and they are trying to tackle this, but it's a real issue. Finally, help the maintainers maximize their attention. A great way to do this is not by opening up a big pull request with lots and lots of code. You can help them maximize their attention by taking away some of the activities that are less desirable to them. Answer questions on Stack Overflow. You can create examples or help with documentation. You can triage issues. You can fix some of the gnarly bugs that might have occurred. All of these things are significant, valuable contributions that allow maintainers to focus on the thing that they want to focus on. A lot of these things are better than just adding a $50 tip every month to some open source collection. Finally, allocate time and budget for this. And the money that you're spending on some of these tools for creating your sanitized walled garden, perhaps you should reinvest
some of these in actually helping the community directly. And again, allocate time and budget. It could be budget or it could be time. You could just allocate a small fraction of time. If, for example, you're using Express in one of your projects, build something into the backlog: let's spend a little bit of time helping fix some of these issues rather than creating our own sanitized version. Thank you for listening. I hope that's been informative and I hope you've learned something from it. I'm going to go back to my day job and back to deleting the Dependabot issues. Thanks so much for listening, goodbye.