Hey everybody, I'm super excited to be here. Today I want to talk about the DRY principle: don't repeat yourself. I think most people have heard of it, and it's one of the best known design principles for creating software libraries. At Hugging Face, for Transformers, we took a different approach. In fact, we took almost the opposite approach, which I like to call "do repeat yourself*". In this talk I want to challenge the DRY principle a bit, for creating libraries in the sphere of machine learning, and for creating libraries for the open source community.

A quick word about me: I've worked on open source at Hugging Face for almost four years. I'm a core maintainer of the Transformers library, and I recently also created the Diffusers library, two very popular libraries for the machine learning community.

Let's talk about Transformers real quick. Who in the audience has worked with Transformers, who knows Transformers? Maybe a quick check. Okay, all right. For the others, I'll explain it real quick. Transformers is a machine learning library of model architectures. It provides very easy access to thousands of pre-trained models, and by now it covers pretty much every task you can imagine in machine learning: from NLP, with things like text classification and text generation, over to vision, with image classification and object detection. We're also in audio now, with speech recognition models like Whisper and Wav2Vec 2, as well as audio classification.

Transformers is highly used. We're at up to 750,000 daily pip installs now, which is more or less the same as PyTorch and almost as much as TensorFlow. We're over 100,000 GitHub stars; I think Transformers is the second machine learning library that crossed 100,000 GitHub stars, after TensorFlow. And since about two weeks ago, we have over 2,000 external contributors. So it's quite an active and highly used library.

Now let's talk about the DRY principle. It originates from the book The Pragmatic Programmer (a great book, by the way, I highly recommend it), and I think it can best be described by two rules.

First rule: every piece of knowledge must have a single, unambiguous, authoritative representation within a system. Let's dissect that a bit. "Every piece of knowledge", the way I understand it when thinking about software libraries, is an abstraction: it can be a function, it can be a class. A "single, unambiguous, authoritative representation" essentially means: don't copy the abstraction once you've written it; every abstraction, in its sense and its definition, should exist only once. "Authoritative" means it should enforce its API and rules across the library, across the system.

Second rule: elements that are logically related should all change predictably and uniformly; in other words, they are kept in sync. What that means is that the abstractions you create relate to each other, and if you change one abstraction, you should make sure you change the other abstractions so that everything still fits the overall design of your library.

Great, all right. So, fresh out of college, I want to apply this right now to my cool machine learning library: I want to write very clear, very precise knowledge abstractions for my library. What's the problem in machine learning?
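Just to make that DRY ideal concrete before I answer: here is a tiny, hypothetical sketch of what a "single, unambiguous, authoritative representation" could look like in a model library. One shared self-attention abstraction, defined once, and every model reuses it instead of re-implementing it. None of these class names are real Transformers code; they're only for illustration.

```python
import torch.nn as nn


class SelfAttention(nn.Module):
    """The one 'authoritative' attention definition that every model would share."""

    def __init__(self, hidden_size: int, num_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)

    def forward(self, hidden_states):
        out, _ = self.attn(hidden_states, hidden_states, hidden_states)
        return out


class BertLikeLayer(nn.Module):
    def __init__(self, hidden_size=768, num_heads=12):
        super().__init__()
        # Reused, not copied: DRY says this is the only place attention is defined.
        self.attention = SelfAttention(hidden_size, num_heads)


class GPTLikeLayer(nn.Module):
    def __init__(self, hidden_size=768, num_heads=12):
        super().__init__()
        # The same shared abstraction again, so both layers stay in sync by construction.
        self.attention = SelfAttention(hidden_size, num_heads)
```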
Well, the problem, the reality, is that the field is evolving way too rapidly. In the graph here you can see that the number of machine learning papers has been exploding; I think we're up to something like 5,000 papers per month now. That means every month new definitions come out, and every definition builds on previous definitions. So when you build an abstraction, you essentially have to rethink every day whether your abstractions still make sense or not. It's extremely difficult to define unambiguous, clear representations in a machine learning library.

All right, second rule: elements that are logically related change predictably and uniformly. Essentially, every component in a machine learning library relates to the others, and if you change one component, you change all the other components. What's the problem here? Well, machine learning models are mostly static. You don't really have to change old models when new models come out, and you don't necessarily have to change old models when new algorithms come out. The easiest example is maybe GPT: OpenAI released GPT-1, then they released GPT-2. When GPT-2 was released, we didn't want to change GPT-1, because GPT-1 is a model in itself, and the improvements don't have to be applied to the code of GPT-1; we just write a new model for GPT-2. If you're reading the GPT-1 paper, you don't want to see GPT-2's improvements in the code. So we don't have a bidirectional relationship between components: models are mostly static, newer models depend on older models, but older models don't depend on newer models.

Great. Okay, that covers machine learning. What about DRY for open source software? Open source, for us, means the library is built by and for the open source community, so open source contributions are crucial. We rely on open source contributors; we cannot keep up with all the new things popping up in machine learning, so we rely on people who are excited about new methods and want to implement them in our library. The problem with DRY is that once you've written a lot of abstractions, it's actually extremely difficult to contribute. If you want to contribute a small thing to a library that has a lot of abstractions, you still need to understand most of those abstractions. And in machine learning, abstractions are very difficult to understand and can also be very ambiguous; not everybody understands them the same way, and there's a lot of discussion about what different things mean.

To give you some examples: we have natural language processing, natural language understanding, natural language generation. How do they relate to each other? People understand that a bit differently. Self-attention is the mechanism most ML models are now built on, but there have been new improvements to self-attention: we have flash attention, chunked attention, long-sequence attention, cross-attention. How do you bring all of these into one abstraction, and how do people understand it? Very difficult to define, so we don't want to define it.
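To illustrate that last point, here is what typically happens to that one shared, "authoritative" attention abstraction after a couple of years of new papers. This is a purely hypothetical skeleton to show the shape of the problem, not code from any real library:

```python
from typing import Optional

import torch.nn as nn


class SelfAttention(nn.Module):
    """The shared attention abstraction after the field has moved on a few times."""

    def __init__(
        self,
        hidden_size: int,
        num_heads: int,
        is_cross_attention: bool = False,      # added for encoder-decoder models
        use_flash_attention: bool = False,     # added when flash attention came out
        chunk_size: Optional[int] = None,      # added for chunked attention
        sliding_window: Optional[int] = None,  # added for long-sequence attention
    ):
        super().__init__()
        ...

    def forward(self, hidden_states, encoder_hidden_states=None, attention_mask=None):
        # Every new variant adds another branch in here. Every model in the library
        # now implicitly depends on all of these branches, so touching any one of
        # them means re-validating every model that uses the class.
        ...
```

And that is what anyone who just wants to add one small thing would have to understand first.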
The next thing: if you're a first-time contributor to a library, and we see a lot of new people coming into the field of machine learning, it's extremely demotivating if you change one thing or add one model and see a hundred or a thousand tests fail in your PR. So we don't want a lot of dependencies between components. We want to make it very easy for people to enter the field of machine learning and very easy for them to make their first contribution to our library.

Transformers is very active. In this graph you can see the average number of commits to main per week; I think we've been pretty stable at around 40 to 50 for the last two or three years. For us maintainers, 50 commits to main per week means something like 100 PRs to review per week, which is about 20 PRs per day. We're not that many people maintaining Transformers, so we cannot rely on every review of every PR being perfect. Instead, we have to trust contributors not to break everything, and we have to trust that the code that's added is independent and doesn't break core components of the library.

Another very important thing for us, and I think this is very different when you compare open source software to in-house software, is that an open source library has, in some sense, three types of customers. You have the people using the code: they pip install it, use it, and never read it. The way we think about it is: if we have 100,000 people using the code, we have around 1,000 people reading the code. The people reading the code don't just use it, they actually look into the files and maybe tweak them; they clone the GitHub repository and adapt something. And then we have maybe 10 people writing the code every day, which I would say are mostly the core maintainers.

What's very important for us is that we don't write the library just for the people writing the code. We don't want to only make it easy for ourselves to maintain the code; we really want to take into account the 1,000 people reading, tweaking, and adapting the code. That means we really, really like expressive and readable code, and we prefer it way over character-efficient, clever one-liner, lambda-style code. For us it's very important that code stays readable and isn't abstracted away.

Again, why is abstracted code a problem? Because not everybody understands a knowledge abstraction the same way. If you want to take a model like GPT and tweak it, you just want to see what's in the paper; you don't want to see some new knowledge abstraction that we created, you want code that's very easy to understand. And when you tweak and adapt code, or just quickly want to understand it, you don't want to switch between a hundred files; ideally you stay in one file, have a quick read, and then you can move on.

Here are some statistics: Transformers has over 1,000 people watching the repo, which is a bit of proof for us that a lot of people read the code on a daily basis. Transformers has also been forked 20,000 times, which means a lot of people don't just use the code, they adapt it and tweak it.

Great. So what do we do in Transformers?
First thing: essentially, we copy a lot of code. We have a principle called the single model file policy. It means that every model has all of its knowledge in one code file: when you run the model's forward pass, all you need to read is that one file. You don't need to jump between different files and different abstractions; one code file is enough.

The second thing is that we just don't do abstractions. I think we only have two layers of abstraction in Transformers. Every model has, I think, only one abstraction, which we call PreTrainedModel, and all it does is load and save the model. We don't even try to categorize models into things like sequence-to-sequence or classification; every model has just that one abstraction for saving and loading, and all the rest is implemented in the model file.

To give you an example: sequence-to-sequence was the paradigm for translation maybe five or ten years ago, and someone could have said, all right, sequence-to-sequence is a very concise, easy-to-understand abstraction, I'm going to build a big sequence-to-sequence base model that all other models depend on. The problem is that nowadays sequence-to-sequence isn't the only thing used for translation; you use all kinds of models. GPT is not a sequence-to-sequence model, it's a different architecture, so you run into problems. Or classification: in the beginning you would have thought that only BERT-like models classify text and that you don't use text-generation models for that, but what happens now is that people use GPT-2, -3, and -4 a lot for classification.

Next: a unified API is not enforced by abstractions in Transformers. A big problem when you copy code is that the API of every model might drift apart, and that's bad; we want the same API across models. So instead of enforcing that with an abstraction, we enforce it with tests: we simply test that every model exposes the same API. Why do we do this? Because I think it's the best of both worlds: on the one hand we get high readability for the users, on the other hand we ensure a unified API, which is super important for users and for production use cases.

The last missing piece: we copy-paste a lot of code, and that can be problematic if you find a bug. If we've copy-pasted a function, say, a hundred times and we find a bug, we fix it in one place and 99 copies are not updated; that's a big problem. So what we do here is code generation. As I said before, newer models depend on older models, but not vice versa. When we copy-paste code from an old model to a new model, we add a marker, so that if we update the old model, the copied function in the new model is automatically overwritten. In Transformers you see a lot of these "Copied from" statements, and they ensure that our code stays in sync.

So the question now is: have we taken it a bit too far? There are obviously drawbacks to not respecting the DRY principle. Just to show a couple of stats: we have almost 400,000 lines of code in the Transformers code base by now, which is arguably quite a bit. And, I think I took this screenshot two days ago, we run over 60,000 tests on a daily basis. That's a lot, and it's also very expensive.
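To make the single model file policy and the "Copied from" mechanism a bit more concrete, here is a hypothetical, heavily simplified sketch of what a model file looks like. The model name and classes are made up; this is not actual Transformers source, just the shape of the idea:

```python
# modeling_newmodel.py -- hypothetical, simplified single model file.
import torch.nn as nn
from transformers import PretrainedConfig, PreTrainedModel


class NewModelConfig(PretrainedConfig):
    model_type = "newmodel"

    def __init__(self, hidden_size=768, num_heads=12, **kwargs):
        super().__init__(**kwargs)
        self.hidden_size = hidden_size
        self.num_heads = num_heads


# Copied from transformers.models.oldmodel.modeling_oldmodel.OldModelSelfAttention with OldModel->NewModel
class NewModelSelfAttention(nn.Module):
    # The attention code lives right here in this file: a reader never has to
    # leave the file to follow the forward pass. The "Copied from" marker above
    # lets a consistency check rewrite this class whenever the old model's
    # version changes, so the copies stay in sync.
    def __init__(self, config):
        super().__init__()
        self.attn = nn.MultiheadAttention(
            config.hidden_size, config.num_heads, batch_first=True
        )

    def forward(self, hidden_states):
        out, _ = self.attn(hidden_states, hidden_states, hidden_states)
        return out


class NewModelModel(PreTrainedModel):
    # The only shared abstraction: PreTrainedModel, which provides
    # from_pretrained() / save_pretrained() for loading and saving weights.
    config_class = NewModelConfig

    def __init__(self, config):
        super().__init__(config)
        self.attention = NewModelSelfAttention(config)

    def forward(self, hidden_states):
        return self.attention(hidden_states)
```

In the real repository there is tooling that regenerates these copied blocks when the source model changes (if I remember correctly it's run via `make fix-copies`), so the copies don't silently drift apart.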
Now, you probably wouldn't have to run that many tests if you built nice abstractions. But we test every model, and every model has its own code, so we run 60,000 tests every day.

The last snippet I want to show you is how often we write more or less the same abstraction. If you search for a self-attention class in Transformers, so "self-attention" plus "module", I think we now have something like 123 self-attention definitions in Transformers. That doesn't respect the first rule of DRY at all, and it is a drawback, because if you want to add a little tweak to self-attention everywhere, it's not that easy to update anymore.

I also wrote a fun blog post about this, with the same title as this talk, "do repeat yourself", just without the crossed-out part, and you can find it on the Hugging Face blog. I'm very happy to answer questions after the talk outside as well; I really like this topic, so I'm very interested in discussing it. Thank you very much for listening.