Hi everyone, I'm very excited to be here today, although I'd have preferred to be there physically. Today I'll be talking about building models the way we build open source software.

A little about my background: I'm Steven Kolawole, an independent researcher at ML Collective, and I'm based in Nigeria. I also do machine learning engineering at an analytics startup, where we do real-time opinion mining from social media platforms for crypto assets. I tend to go a little philosophical at times, and I'm a bit of a connoisseur when it comes to food.

Alright, moving on. Here's a little background about this talk. Late last year, in December, Colin Raffel, who is a professor at UNC, came up with a very revolutionary idea: a call to build models the same way we build open source software. In fact, most of this talk, and even the slides, are based on Colin's original talk on this topic. Basically, what I'll be doing here is a survey of the idea itself, of the ongoing work in that direction, and of the questions that are left for us to answer.

Okay, great. Let's walk a little through history so we understand where we're coming from and where we're going. This first picture is an antiquated photo of a machine learning engineer fine-tuning an XGBoost model; back then, ML workloads were powered by steam technology. Okay, okay, I'm messing with you. I don't know if anyone got fooled by that.

To be more exact, the first deep learning model was built, I think, in the mid-1960s by a Soviet mathematician. It was essentially an MLP, a multi-layer perceptron, but without backpropagation; it used inductive statistical learning to make predictions. There weren't many advancements between 1960 and 1990, and there are two important factors I want to point out. One is the AI winter, in which funding for AI projects was cut drastically. A lot of factors played into this, including the overhype of AI and economic factors due to the Cold War. But the important factor I want to highlight is the unsuitability of the computational resources we had at that time: computers simply weren't powerful enough to run the kind of AI experiments that needed to be run.

Between 1990 and 2000 there were very few innovations; two of the most notable were SVMs and LSTMs. Then in the 2000s we entered the AI spring, which is the phase we're still in right now. And what was the biggest change in the AI spring? I'd say it's that we got faster computers, faster computational resources to run the kinds of experiments we need to run.

Now, in the modern era there's a lot I could outline, so for consistency's sake I'll focus on the NLP branch of AI. In 2013 we had word2vec, which simply maps the tokens in unstructured text to vectors; then we put a classifier on top of those vectors, and this was the standard procedure for a very long time.
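As a minimal sketch of that classic word2vec-plus-classifier pipeline, assuming gensim and scikit-learn are available (the corpus and labels here are toy data I made up):

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression

# Toy corpus and sentiment labels, purely for illustration.
docs = [["great", "movie", "loved", "it"],
        ["terrible", "plot", "hated", "it"],
        ["wonderful", "acting", "loved", "the", "cast"],
        ["awful", "boring", "hated", "the", "pacing"]]
labels = [1, 0, 1, 0]

# Step 1: learn word vectors from the (unlabeled) text.
w2v = Word2Vec(docs, vector_size=50, min_count=1, seed=0)

# Step 2: represent each document as the mean of its word vectors.
X = np.array([np.mean([w2v.wv[w] for w in doc], axis=0) for doc in docs])

# Step 3: put an ordinary classifier on top of the vectors.
clf = LogisticRegression().fit(X, labels)
print(clf.predict(X))
```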
But in 2015, a team from Google published a paper on semi-supervised sequence learning. With semi-supervised learning we're not just mapping vectors directly to their labels; the model first learns from unlabeled text before being trained on the labeled task, so it's a little bit different. Then in 2017 we had the unsupervised sentiment neuron, and at the time not much attention was paid to it. The idea is similar: we're not providing labels; we let the model discover the concept on its own and make predictions based on what it learned by itself.

Then in 2018, the big stuff started happening. ULMFiT was introduced by Jeremy Howard and Sebastian Ruder, and what ULMFiT basically did was take this semi-supervised pretraining approach and apply a lot of tricks that made it give us very, very impressive results. A little while after, ELMo came out; ELMo was essentially stacking LSTM models and applying a bidirectional technique, and it achieved better results than ULMFiT. Then we had GPT-1, which was different because the authors said: instead of using LSTMs, let's try Transformers. And of course, Transformers gave us very nice results. Then BERT came along, and BERT said: instead of using just Transformers, maybe we can make them bidirectional, and impressive results were achieved.

From BERT until now, we've had what I like to call an explosion of models. We've had DistilBERT, we've had RoBERTa, we've had ALBERT, and lots and lots of other models. That's why I labeled this slide "a lot of stuff." And there are a lot of pain points attached to this that I want to point out.

Let's assume paper A proposes an unsupervised technique called FancyLM, and paper B proposes another technique, FancierLM, and achieves better results. Most times the difference is just the dataset being used: maybe paper A uses Wikipedia, and paper B uses Wikipedia plus BBC data, so the only difference is the data. In a second scenario, the only difference might be the parameter count: maybe paper A uses 100 million parameters in its architecture and paper B uses 200 million, which means paper B achieves better results solely based on the number of parameters. In another scenario, paper A is trained on 100 billion tokens of unlabeled data and paper B is trained on 200 billion tokens, and paper B achieves better results simply because it saw more tokens. And finally, the difference might just be the optimizer or the loss function: paper A might use the Adam optimizer while paper B uses plain SGD, and paper B achieves better results solely based on how it was optimized.

If you look at the list of models on Hugging Face, many of them are different variations of the same technology, the same idea, and the only differences are the ones I just outlined. And what else will you notice?
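To make those axes of variation concrete, here is a hypothetical sketch of how little two such papers might differ; every name and number here is invented for illustration, the point is just how small the deltas are:

```python
from dataclasses import dataclass

@dataclass
class PretrainingConfig:
    datasets: tuple     # which corpora the model is pretrained on
    n_parameters: int   # model size
    n_tokens: int       # how much unlabeled data it sees
    optimizer: str      # optimization / objective details

# "Paper A" and "paper B" often differ on only one of these axes.
paper_a = PretrainingConfig(
    datasets=("wikipedia",),
    n_parameters=100_000_000,
    n_tokens=100_000_000_000,
    optimizer="adam",
)
paper_b = PretrainingConfig(
    datasets=("wikipedia", "bbc_news"),  # same recipe, more data...
    n_parameters=200_000_000,            # ...or more parameters...
    n_tokens=200_000_000_000,            # ...or more tokens...
    optimizer="sgd",                     # ...or a different optimizer.
)
```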
Do you notice that most of these models are built by large corporations? BERT by Google, BART by Facebook, GPT-3 by OpenAI, then DeepMind, Huawei, Microsoft, and Google again with the Switch Transformer and GLaM.

And there's this estimate from Lambda Labs which states that GPT-3, at about 175 billion parameters, would require around 355 GPU-years on V100s to train, costing about $4.6 million for a single training run. This means that if I'm a company spending about five million dollars building a model, I might not do the generous thing, which is to release the model to the public for researchers to examine and improve upon; I might just hide the model behind an API to recoup some of the cost. In fact, even setting the training cost aside, working with these large models, even when they're open source, is going to be difficult, because whoever wants to improve on such a model has to take into account the compute cost needed to make the change. So most of these models end up hidden behind APIs, and we make use of them through those APIs. That's the first pain point: the large cost of building models, doing the same thing over and over and over again with just little variations.

Secondly, I want us to examine another pain point originally outlined by Colin Raffel, using T5 as a case study. T5 was a model introduced in 2020, and I think it was the first model to frame lots of NLP tasks as a single text-to-text problem, which means we're using the same technique to solve many different forms of NLP problems.

When T5 launched, it sort of went viral, and a lot of work has since been built on top of the T5 model. First of all, there's UnifiedQA, which is T5 trained on an additional mixture of question-answering datasets. Then we have Macaw, which takes UnifiedQA and improves on it, getting better results by training on additional data. And we have Unicorn, which is T5 trained on commonsense datasets. Those are additional trainings: we take the model and train it on extra datasets so it generalizes to new tasks. Then there's T5 1.1, which is basically the original authors taking T5 and retraining it with some improvements, producing more efficient results. And then we have T5+LM, which is T5 1.1 trained further with a language-modeling, next-token-prediction objective, and it actually got better results than T5 1.1.
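Here's a quick sketch of what the text-to-text framing looks like in practice, using the Hugging Face transformers API with the standard t5-small checkpoint; the prompts are the usual T5 task prefixes:

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Every task is plain text in, plain text out; the prefix names the task.
prompts = [
    "translate English to German: The house is wonderful.",  # translation
    "cola sentence: The course is jumping well.",             # grammaticality
    "summarize: state authorities dispatched emergency crews "
    "tuesday to survey the damage after severe weather.",     # summarization
]
for prompt in prompts:
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    output_ids = model.generate(input_ids, max_new_tokens=40)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```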
Then we also have T0, which takes T5+LM and improves it further by fine-tuning it on a large mixture of prompted datasets. We have mT5, which is a multilingual version of T5, trained on a larger dataset covering over 100 languages, and it performs better than T5 on those languages. Inspired by that, there's ByT5, which is T5 1.1 trained to operate directly on bytes rather than tokens. And we have CodeT5, which is T5 trained on code datasets, for code understanding and code synthesis.

One thing I want to point out is this: if you try to follow the evolution of T5 over time, it gets very hard to keep track of the changes, to understand the lineage, which model came first and which model came after, right? If you look at T5 and T0 without any background knowledge of the T5 lineage, you'd most likely think T0 came before T5. This makes things confusing, because there is no specific tool that holds all these variations in one place, where you can easily track the different changes made to models over time.

Now contrast that with the software we develop in open source. I don't know if you can recognize some of these open source tools; this is a Python event, so you're likely to recognize this one. Let's take Python, for example. Guido van Rossum created Python and made it open source, and then lots of developers from different parts of the world started contributing to it, suggesting changes and maintaining it, with the maintainers deciding whether to accept a particular suggestion. That way, Python gained lots of features it didn't have in the beginning: booleans, generators, and more recent additions. I think that's the strength of open source: a collaborative community developing tools that we all use universally.

And here is the Linux distribution timeline. The image isn't very clear, and you can check out the references, but you can see the different changes, the different distributions that are spawned from earlier distributions. I think this makes sense, because if you're not okay with a particular distribution, you can move on and use another one, and if you're not satisfied with the distribution you moved to, you can simply fork one of the distributions and make the changes you want to make. Nobody is going to fault you for that.

Over here I thought I'd break down the steps involved in building open source software and how they relate to the way we build models. Initially, say we have a developer, and the developer is using version control, Git, and maybe GitHub, to open source their project. They can make changes to the local files, and these changes are communicated as patches, so anybody making a copy of the project on their local machine can easily see the patches and, of course, go back in time to work with an earlier patch.
Say a new developer wants to contribute to the project. He or she can easily copy the upstream tree and make the changes they want to make, and the maintainer of the project can check out the changes, run tests on them, judge the quality of the changes, and, if everything is fine, agree to merge them. But say another developer comes on board and makes changes, and the maintainer checks them out and isn't convinced by them. There are no hard feelings: the new developer can simply keep maintaining his or her own branch of the project. And anybody who jumps on the project later can easily go back in time to work with an earlier version of it. I think that's the beauty of open source software development.

But this is not the same for the way we build models; I think that's evident by now. Some of you might feel we already have tools that do this version-control stuff for models: you have models on Hugging Face, you can use Weights & Biases, you can use MLflow, you can use DVC, and so on. But I would argue that they don't really provide the most important functionality that version control and open source development give us.

Take DVC, for example. DVC is a platform for maintaining your features, your data, the trajectory of a particular model over time. Say you have data, features, and models, and you're saving them. You can update the features, update the datasets, adjust the input parameters, add new features and new datasets, and over time all these changes are saved, so you can easily go back in time to work with an earlier version of the same project. But that's not everything we want. This way you can work with earlier versions, and another data scientist could join your organization, make changes, and have his or her own version of the changes; DVC keeps a record of all that. But when it comes to what makes open source open source, meaning other developers and communities in other parts of the world making changes to that same project and having those contributions merged back, there's no equivalent. DVC cannot merge updates together: say a different model is being built on top of that same model, DVC cannot merge the updated model with the existing version of the model. It just creates a new version for the changes that have been made. That's why DVC is not suitable for the kind of open source functionality we want in order to build open source models.

So the big question is: how can we enable collaborative and continual development of machine learning models? The two most important requirements are that we need to be able to cheaply communicate patches, and to merge updates from different contributors. Patches here are like the minute changes made to a segment of the model.
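To make "patch" concrete, here's one way it could look. This is a hypothetical sketch of my own, not an established format: diff two checkpoints, keep only the most-changed weights, and communicate just those.

```python
import torch

def extract_patch(base_state, updated_state, top_k=1000):
    """Keep only the top_k most-changed scalar weights as a sparse patch."""
    diffs = {name: updated_state[name] - base
             for name, base in base_state.items()}
    # Rank every scalar weight by magnitude of change; keep the largest.
    flat = torch.cat([d.abs().flatten() for d in diffs.values()])
    threshold = flat.topk(min(top_k, flat.numel())).values.min()
    patch = {}
    for name, delta in diffs.items():
        mask = delta.abs() >= threshold
        patch[name] = (mask.nonzero(), delta[mask])  # indices + deltas only
    return patch

def apply_patch(state, patch):
    """Apply a sparse patch to a model state dict, in place."""
    for name, (idx, deltas) in patch.items():
        state[name][tuple(idx.t())] += deltas
    return state
```

The point is that such a patch is tiny compared to a full checkpoint, so contributors could exchange and review it the way we exchange diffs in software.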
Let's focus on the patches first. The way we train models, we have a bunch of parameters, we compute the gradient of the loss with respect to those parameters, and we update all of the parameters, and this happens over however many training steps we train for. But maybe, instead of updating all the parameters, we can find a way to select the top few parameters and make changes to just those weights, leaving the others unchanged, without hurting performance. There's a method for this by one of Colin Raffel's students: the FISH mask, the Fisher-Induced Sparse uncHanging mask. What it does is compute the top few parameters that actually need to change, and then it focuses training on those parameters while keeping the rest of the weights fixed. This actually works: you can see how well it performs compared to other methods for training under a limited parameter budget, and you can check out the paper for the details.

The second point is: how do we merge updates from different contributors? Let me explain a little how pretraining and downstream fine-tuning work, I mean how we improve a model and then work with that improved version. Initially we have a pretrained model, which has already been trained on a large dataset, and then we have the downstream task, where we fine-tune the model on a smaller dataset to perform a specific task. There's another approach: we take the pretrained model, train it on an intermediate task first, and then take the intermediate-task model into our final downstream fine-tuning, and hopefully this achieves better performance than the direct approach.

But maybe we can do this in yet another way. Maybe we can run the intermediate task and the downstream task side by side: we fine-tune the model separately on each, and then we combine both of them to form the downstream-task model. This should theoretically work as well as, or better than, the sequential approaches. And this way we can take the model, let it be improved on several downstream tasks in parallel, and then find a way to merge all the downstream-task models into a single model. That way we're able to merge updates made to different parts of the model by different people, combining all the different updates into a single, more capable model.

There's a paper on this by Matena and Raffel that uses Fisher-weighted averaging, which merges the weights of the different models together, weighting each parameter by its Fisher information, so that the merged model stays close to what each individual model considers important rather than naively averaging everything. And this also performed extremely well on the benchmarks it was tested on. So those two problems have promising solutions.
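Here's a minimal sketch of merging in the spirit of that Fisher-weighted averaging paper. The simplifications are mine: I approximate each parameter's Fisher information with squared gradients of the log-likelihood over a few batches (the empirical diagonal Fisher), and the function names are made up.

```python
import torch
import torch.nn.functional as F

def estimate_fisher(model, data_loader, n_batches=10):
    """Diagonal Fisher estimate: average squared gradient of the log-likelihood."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for i, (x, y) in enumerate(data_loader):
        if i >= n_batches:
            break
        model.zero_grad()
        log_probs = torch.log_softmax(model(x), dim=-1)
        F.nll_loss(log_probs, y).backward()
        for n, p in model.named_parameters():
            fisher[n] += p.grad.detach() ** 2 / n_batches
    return fisher

def fisher_merge(models, fishers, eps=1e-8):
    """Per-parameter weighted average: weights with high Fisher info dominate."""
    merged = {}
    for name in fishers[0]:
        num = sum(f[name] * dict(m.named_parameters())[name].detach()
                  for m, f in zip(models, fishers))
        den = sum(f[name] for f in fishers) + eps
        merged[name] = num / den
    return merged  # load with model.load_state_dict(merged, strict=False)
```

Note that if you set every Fisher value to 1, this reduces to plain parameter averaging; the Fisher weighting is what lets each contributor's model "vote" most strongly on the parameters it actually relies on.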
But there are still lots of other problems, other questions, that we need to discuss. One of them is that we need to be able to rapidly evaluate proposed changes to a model, to ensure backward compatibility. How do we do this? In open source software we have tests, and any time a change is made, the tests run automatically: they evaluate the proposed change and check that it doesn't break anything the library could do before. The question is how we do this in model development: how do we rapidly evaluate changes and make sure that none of the capabilities the model had before have regressed?

Another question with no answer yet is: how do we combine components of different models to provide new skills? Say I'm trying to build a library where one of its functions takes a JSON file, parses it, and does something nice with the parsed data. I'll most likely just import the json library and let it handle the parsing for me. What I'm basically saying is that I'm taking different components from existing libraries and using them to provide new skills and capabilities. How can we replicate this for models? How do we take different chunks from different models and combine those chunks into a new model with new skills and capabilities?

So, unfortunately, I have more questions than answers in this talk, but the main aim of this talk is awareness. When I first saw this revolutionary idea from Colin Raffel in his blog post, I was hooked on it, and I've been thinking about it a lot since then; this is me trying to bring it to a larger audience. If more people are thinking in this particular direction, maybe we can collectively find answers to the questions we don't have answers to yet. Alright, so finally, full credit to Colin Raffel for the revolutionary idea behind this talk. Thank you very much for taking the time to listen.

Yes, thank you Stephen for your talk.