Our next speaker is a real treat and a fellow nonprofit executive director, so I'm very, very excited to introduce her. Stella Biderman is a lead scientist at Booz Allen Hamilton and also the executive director of EleutherAI. Over the past several years, EleutherAI's work has focused on making cutting-edge AI technology more widely accessible by supporting the release of several of the world's most powerful generative models, such as GPT-NeoX, BLOOM, OpenFold, and more. Recently, her focus has shifted towards building a better understanding of how and why these models actually work. It turns out we don't actually know that, and that would be good information to have in order to instill desirable behaviors and limit undesirable behaviors in these types of models. Drawing on her experience developing several of the world's most powerful open-source generative AI technologies, Stella will discuss why open-source values remain essential for artificial intelligence and what role the open-source community can play in the continuing unfolding of this technology. Please welcome to the stage Stella Biderman.

My name is Stella Biderman. I run EleutherAI, and I also work at Booz Allen Hamilton. I'm here to talk about generative AI, and, as was just said, I'm here somewhat as a supplicant, to ask the broader open-source community for help. There are a lot of issues the AI community has been struggling with that the open-source community has been working on for years, if not decades, and there's a lot of room, I think, for us to learn from the lessons that you've learned and work together to build more accessible and more widely available technologies. So I'm going to talk briefly about what generative AI is, in case anyone doesn't know.
I'm going to talk about the existing, and now starting to thrive, open-source ecosystem for large-scale AI, and then I'm going to talk about some of the challenges that we're currently facing and how I hope we can work together to build a better future. So first, a little bit about me. I've spent the past two and a half years developing large-scale language models primarily, but also large-scale AI technologies in general. I got into this in the summer of 2020, when GPT-3 had just come out and nobody really had access to these technologies. You couldn't even pay OpenAI to let you use their models. And it became very clear to a number of people I know that this was going to be a very impactful technology and that we wanted to be able to study it, but we couldn't do that if we didn't actually have access to it. So we started building codebases and datasets, and eventually training models, to actually get our hands on these technologies, learn how they work, and be able to use them to effectively study this new paradigm in artificial intelligence. Over the past two and a half years, EleutherAI has released a wide variety of language models, some of the largest language models in the world, as well as multimodal models. VQGAN-CLIP is a text-to-image model. OpenFold is an open-source replication of a protein-folding algorithm that DeepMind created. We've created open-source codebases and libraries, as well as released public datasets that have become de facto standards for training these models. So generative AI is this new and powerful class of AI models that differentiates itself from previous types of technologies because it is semantically controllable. What I mean by that is you can put in a description of what you want in text, and it will give you, or at least try to give you, that thing. We've had, for example, technologies that have been able to generate photorealistic images for five or six years now.
But it's only been in the past two or three years that we've been able to write down the sentence "a photograph of an astronaut riding a horse" and actually get the AI to take that as an input and produce that specific image. And this opens up a lot of usability and a lot of applications for this technology that were previously inaccessible. The current state of the art is kind of weird. It's primarily controlled by large companies. There is a wave of newer startups, but almost everyone in this space is a corporation. And unfortunately, at least from my perspective, most of these technologies are not ready for production deployment. That is an unpopular position, but yeah. So there are a couple of classes of these models. Most of my work is on text models. There are also multimodal models that take stuff from one domain and translate it into another, such as text-to-image and text-to-audio. People have had a lot of success developing domain-specific models. These are models that don't work on normal kinds of inputs, but generate proteins, or computer code, or formal specifications for design. One of my colleagues recently put out a paper about text-to-architecture. So these are the classes of generative AI models out there. And over the past six months or so, there's really been an explosion in the open-source ecosystem. For a while, there were a handful of organizations doing most of the work, but nowadays there is a wide variety. The typical timeline for developing one of these technologies is: you start off with datasets; you train a large model; you then fine-tune it or adapt it to a specific task or a specific context; you evaluate it and determine whether it works appropriately; and then you deploy it. And nowadays, there are options for every stage of this pipeline that are open-source and widely used, which is really wonderful.
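The pipeline described above can be illustrated in miniature. The sketch below is purely illustrative and is not how real large models are built: a character-level bigram model stands in for a large model, the corpus and all function names are invented for this example, and the fine-tuning stage is omitted for brevity. It only shows how the dataset, training, evaluation, and deployment stages hand off to each other.

```python
import random
from collections import defaultdict

def train_bigram(corpus):
    """'Training' stage: count character-bigram frequencies in the dataset."""
    counts = defaultdict(lambda: defaultdict(int))
    for a, b in zip(corpus, corpus[1:]):
        counts[a][b] += 1
    return counts

def generate(model, start, length, rng):
    """'Deployment' stage: sample characters from the trained model."""
    out = [start]
    for _ in range(length):
        nxt = model.get(out[-1])
        if not nxt:
            break  # no observed successors for this character
        chars, weights = zip(*nxt.items())
        out.append(rng.choices(chars, weights=weights)[0])
    return "".join(out)

# Dataset stage: a toy corpus (invented for this sketch).
corpus = "open source models need open source maintainers "

# Training stage.
model = train_bigram(corpus)

# Evaluation stage (crudely): every bigram seen in the corpus is known.
assert all(b in model[a] for a, b in zip(corpus, corpus[1:]))

# Deployment stage: generate some text from a seed character.
print(generate(model, "o", 20, random.Random(0)))
```

The real pipeline swaps each toy stage for heavyweight machinery (web-scale datasets, distributed training, benchmark suites, serving infrastructure), but the stage boundaries are the same, which is what makes it possible for different organizations to own different stages.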
And one of the big benefits of this is that we've been able to share knowledge much more readily than was ever possible before. When I was first getting into this field, I spent many, many hours poring over footnotes and appendices in papers, trying to figure out what exactly people did, because they wrote papers describing their work and then didn't always release code, or didn't always release reproducible code. But now there is a lot of knowledge sharing between open-source-motivated groups, as well as sharing of resources. There are a lot of organizations that share computing resources. For example, it's extremely expensive to train these models, but organizations with more computing resources share them with others to train their models. And another really wonderful feature is that we've built an interoperable open-source ecosystem, where people are able to take stages in that pipeline, or models or datasets, developed by other organizations and just plug them into what they're doing. So this is a very small graph of a piece of the ecosystem that really shows how different organizations are building on and using technologies built by previous ones. Each stage here uses the things above it, and you see a half-dozen organizations, a mix of nonprofits and for-profit companies, representing five different countries. And so there's a large amount of interoperability in the open-source ecosystem now, which is really wonderful. But there are definitely some issues that we're struggling with, and that is, I think, really the key to why I'm here and what I'm hoping we can achieve together. Broadly speaking, there are three issues, or three types of issues, that the open-source AI community is struggling with right now: challenges in maintaining code, challenges in the ethics and deployment of these technologies, and challenges in law, policy, and regulation.
So most people who work in this space right now are researchers with an academic bent, whether they're actually at universities or nonprofits or companies, as well as a growing number of hobbyists. And neither of these groups has a particularly stellar record for maintaining open-source projects over a long period of time. Speaking for my organization right now: one of our widely used codebases is maintained with less than one FTE of work. It's used by a half-dozen companies and has been run on the US government's largest supercomputer to train language models. We use it almost every day, and we really haven't had the bandwidth or the contributions to maintain it as much as we'd like, simply for manpower reasons. Similarly, even when companies are the ones pushing these software libraries, they often have internal and external versions, which is another big problem. When they finally get approval to release their work, it's something they've built on top of what was open-sourced six months previously, and then you have to figure out how to make what everyone else has been building work with the new version that just came out. This is one of my personal nightmares. And another interesting thing is that whenever I talk to people about this, there's a strong perception that there's a big barrier to entry, that you need to know a lot about AI technologies, about large language models, to contribute to these kinds of libraries and to help maintain this ecosystem. And that's really not true. There is a huge amount of work to be done in code maintenance and in evaluation that doesn't require much or any expertise in large language models. A lot of the people who maintain critical pieces of the open-source ecosystem are undergraduate students. People who maintain critical pieces of the ecosystem are hobbyists.
And if you have expertise in Docker, there is a concrete issue that I have not been able to solve for six weeks, because I just don't have someone with the right Docker expertise on my staff right now. So there are a lot of issues that are addressable with a wide variety of software development expertise: database expertise, Kubernetes, Docker, et cetera. Even if you're not an expert in AI, if you're interested in helping us out with maintaining code, there are a lot of different things people can work on where you can make a real contribution. There are also a lot of debates around the ethics of these models and how you should deploy them. There are debates and lawsuits about licensing and about ethical use. The questions of how to evaluate these models, and what standards for reproducibility people should subscribe to, are very hotly debated and really need a lot of work, so that we can actually get to a place where people can take each other's work and adapt it and use it reliably, instead of having to spend weeks, or at times even months, trying to figure out how to make their code do the same thing that someone else's code did. The ethics of open source has played a really critical role in these conversations, and I think it's really important that the broader open-source community, which has thought through these kinds of questions for many, many years, share its learnings, its expertise, and its perspectives on how to license, use, and deploy these technologies ethically, and what standards we should be holding ourselves to as a community. And the last area is law and regulation. Governments are slowly starting to notice that some people are getting excited about these large language models and related technologies.
It's been a slow process for them to begin paying attention and to begin regulating, and oftentimes the regulation is not aligned with what I would view as the philosophical perspectives of the open-source community at large. As one example, there's a law currently working through the European Parliament that doesn't draw a distinction between open-source releases of AI models and commercial deployments of large-scale AI models in terms of the regulations, compliance requirements, and rules that are put on you. This is a really big issue, and especially as larger companies, organizations with a lot of political power and money, are typically centered in these conversations, it's really a challenge for open-source developers, for academic researchers, and for hobbyists to do the kind of advocacy in the political arena that I think is really essential to having the kind of robust legal structures and support we need to be able to continue to make these technologies widely accessible to the public. And, you know, this is something that the open-source community has struggled with for many, many years, and it's an area where I'm hopeful that organizations like the Linux Foundation, the Open Source Initiative, and other organizations and foundations that support open-source development are able to play a critical role: advocating for the needs of our community, helping us reach out to policymakers and lawmakers, and getting our voices heard. Thank you. You know, I think the world of AI is in a really interesting place right now. Six months ago, there were very few people training and publicly releasing language models or related generative AI technologies, and nowadays there are close to a dozen. There's been a huge explosion in public interest in this technology, and there's been a huge explosion in organizations that are willing to invest in the open-source ecosystem.
So I really think that right now we're at a critical time in terms of the developmental work, and in terms of the norm building and community building, and I think it's really essential that we don't go off and try to invent all of this ourselves. Most of the people working in this space right now are, like I said, researchers and hobbyists, people who don't necessarily have years of experience in open-source development and in software development and maintenance. Being able to learn from you all the lessons that you've learned over the past couple of decades is, I think, really essential to making sure that we end up in a place that is sustainable and operates for the public good, because I very much don't want this to end up being a commercial technology that's hidden behind APIs, that individuals and researchers don't have access to and don't have the ability to build on and use. It's really essential that we find ways to make this a sustainable movement and promote its continued existence and success. Thank you.

So first of all, great talk, and you came to the right place to ask for some open-source help. You know, I was struck by a lot of developers telling me, well, can we really keep up with these big commercial efforts and proprietary efforts and all of that? Speaking as someone who's been working at the Linux Foundation for nearly 20 years, every single one of those years I heard: you'll never keep up, it'll never be a big thing, it'll never be commercially successful as an open, totally transparent community. I give you hope on that one. So, you know, backstage we were talking about what's open versus not open. I just want to do one thing. I promise that the Linux Foundation will continue to remain open, that we won't raise venture capital and become a private company. Can you promise the same for EleutherAI, that you will never go private, so to speak, and will remain open as well? Yes, definitely. Yes. All right. Thank you.
Wonderful talk. Thank you so much for having me. Have a great day.