So, hey everyone, I'm Talia McCormick, and today I'm here to talk to you about Monorepo Madness: Dependencies, License Compliance, and Automation. In my day job, I work at FOSSA, where I write software to hunt down vendored dependencies in your projects on your behalf. In particular, I spent a couple of months working on automated scans for AOSP, the Android Open Source Project monorepo, so I'll make a couple of references to AOSP as an example throughout this talk. To start us off, an overview of what we'll actually cover. First, I'll look at the current state of open-source compliance and automation. Then: what exactly is a monorepo, and why do we care about them? Then the challenges that monorepos present. Then we'll take a slight detour into build systems and how they interact with monorepos and monorepo automation, and wrap up with the evolution of the build system of AOSP, the Android Open Source Project, over time. So, open-source software compliance and automation. Right now, open-source compliance has four main steps. First, you need to figure out what your dependencies are: which projects you're relying on and what their versions are, both your direct and indirect dependencies. So if your dependency has another dependency that it relies on, you need to know those too. Second, you need to look at how you're actually using them. Is it something you're using just for testing and local development? Are you running it in production? Maybe you're packaging it up and distributing it to customers. Third, you need to take each dependency and look at the risks associated with it. Is there a copyleft license that affects how you can distribute it? Is there a security vulnerability or concern? And then fourth, once you have these pieces of information: you know what your dependencies are.
You know how you use them, and you know the risks associated with each one. Then you need to actually identify your mitigation steps. Maybe you need to stop using a dependency, or you need to be distributing your source code properly, or maybe you need to reference it in a distribution notice, whichever it is. Doing this manually really, really sucks, as you can imagine. There are a lot of dependencies in most projects, and there are a lot of repetitive, time-consuming lookups that you need to do. And in particular, even if you get all of this figured out and you know all of your stuff, then a developer goes in and changes the version of a dependency that you're using, or they add a new dependency. Or maybe there's something they've been using this entire time; you've looked at it and gone, all right, they're only using it for testing, and then they start shipping it within your distributed binaries, and you go, yikes, I needed to be aware of that. So, what if we could automate some of this process? The good news is that we can automate some of it, and in particular, I hope many of you already are. The first three steps specifically are the easier ones to automate: identifying your dependencies, how they're used, and the risks associated with each one. For the fourth step, actually knowing what you need to do about all of this information, you do still want an engineering professional or a legal professional to look at it and make that final decision. So if automation's so great and it makes this process suck so much less, why aren't we automating everything already? Well, automation has a cost. Specifically, each type of project requires automation that's specific to its build system, its dependency management system, its file layouts, its naming conventions, and so on. And each time you change one of these characteristics, you need to either create new automation or alter your existing automation for it.
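Those first three steps are mechanical enough to sketch in a few lines of code. This is a minimal illustration, not a real scanner: the dependency records and the copyleft license set below are made-up stand-ins.

```python
# A minimal sketch of automating compliance steps 1-3:
# (1) enumerate dependencies, (2) record how each one is used,
# (3) flag license risks that need human review.
# The license set and dependency records are illustrative only.

COPYLEFT = {"GPL-2.0", "GPL-3.0", "AGPL-3.0"}  # simplified example set

def flag_risks(dependencies):
    """Return the dependencies a human should review: copyleft
    licenses on anything that ships in a distributed artifact."""
    flagged = []
    for dep in dependencies:
        if dep["license"] in COPYLEFT and dep["usage"] == "distributed":
            flagged.append(dep["name"])
    return flagged

deps = [
    {"name": "libfoo",  "version": "1.2.0", "license": "MIT",     "usage": "distributed"},
    {"name": "libbar",  "version": "0.9.1", "license": "GPL-3.0", "usage": "distributed"},
    {"name": "libtest", "version": "2.0.0", "license": "GPL-3.0", "usage": "test-only"},
]

print(flag_risks(deps))  # only libbar: copyleft AND distributed
```

Each new ecosystem needs its own version of the enumeration and usage-detection logic, which is exactly where the per-project cost comes from.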
And this can add up, and engineers are expensive. So right now, a lot of people will use an existing external service (that's what my company, FOSSA, does), or they'll home-grow their own systems: you'll do part of it manually, you'll have some sort of script that works, but this can be messy and error-prone. So, how does this automation process interact with monorepos, and what is a monorepo? Well, historically, we've used polyrepo setups: one code repository, and one project per code repository. And because of this, each project tends to be very individualistic and distinct. Whoever's working on it goes, well, this is the way I want it to be laid out. They decide that. They decide their build system, they decide how they handle dependencies, and so on. And when one of these projects uses another project, it pulls it in in some way: a developer might copy a recent release into their project, they might download it during the build, or maybe they'll just copy the other project's source code in. A monorepo, in contrast, moves away from this. In a monorepo, you have one code repository, and this code repository contains multiple projects, often related projects. And because these projects live in the same place, engineers look at them and go, well, there's a project here already, I'll make mine look similar. So they'll use the same build system, shared infrastructure, shared libraries, and so on. And importantly, if you have projects in a monorepo that rely on each other, it's fairly common for them to be able to import resources from one another directly, and you can see these relationships within the code repository itself. Monorepos have become a lot more popular recently, and a lot of large companies are known for using them: Facebook, Google, and Uber are the big three examples that come to mind. So, we know why we care about automation. We know what a monorepo is.
Why do we care about them together? Well, polyrepo projects require a lot of work to automate. Like I said before, when you create an automation, that automation has to be specific to a project's build system, its dependency management system, its naming conventions, and so on. So if you have a bunch of different projects, you may need to create a new, distinct automation for each one of them, and that, as I've said, is expensive. In contrast, the projects within a monorepo tend to be a lot more similar: they'll have a shared build system, a shared dependency management system, shared file structures. This happens because engineers often default to matching existing conventions. If a project already exists with a build system, and you can take five minutes and add on to it, or you can set up an entirely brand-new build system of your own, it makes a lot of sense to just use the one that already exists. The other reason is that if you're working on multiple projects, it's a lot easier to swap between them when they're more similar. If you know where the files are always laid out in one project, you can just go over to the other one and know where all of the files are, or know how the build system works. So this is good for engineers, they like it, and it's also good for the purposes of automation, specifically because the key characteristics that your automation relies on tend to be a lot more similar. The build system behaves the way you expect it to, the test files are in the place where you expect them to be, all of your externally published packages follow the same format, that type of thing. And because of this, we can reuse the same automation. To be clear, it's not that a specific project within a monorepo is easier to automate in and of itself.
It's that the projects within a monorepo can often share the same automation, the same automating work, rather than having to do it individually for each specific project. So if monorepo projects are a lot easier to automate, why are we here, and why am I saying that monorepos are actually interesting or hard to talk about? Well, monorepos do have a second piece to them, which is their granularity; or rather, you need a much more granular, detailed understanding of how the projects within your monorepo work. In particular, the dependency graph for your monorepo is a lot more important. If you have one small project in a repository somewhere, you have your list of dependencies, and you can just look at all of them and say, yep, all of these dependencies tie to the single output of this project, and you know exactly how those dependencies and that output are being used. You know if your project's being distributed, if it's just for testing, if it's running in your own production instance, whichever. But in a monorepo, you can't just take the global list of all of your dependencies and say, yep, all of these dependencies are used in all of these projects. For example, let's say you have a copyleft dependency in your monorepo, which means any project that uses that dependency needs to have its source code distributed. But you have five projects in your monorepo, and for one of them you really, really don't want to distribute your source code. Then you need to know whether that project is using that specific dependency or not. And similarly, you don't want to pull out a dependency that one project is using in a safe manner just because it would be risky elsewhere; you want to be able to leave it as is for that one project. This makes both automation and manual analysis a lot harder, because, as I've said before, you have a lot of dependencies per project.
So if you have a hundred dependencies, you can sort of work your way through them manually, go through them by hand, and it'll be tedious and it'll suck, but you can do it. But let's say you have 2,000 dependencies, or 5,000, or 10,000 dependencies across your projects. It's not just that it's hard to do; there is no reasonable way to do that manually without making mistakes, while keeping up with that workload, and without your entire engineering team or legal team leaving because they're really frustrated about having to go through that process. So yeah, monorepos are harder because you need a better understanding of your code base and dependencies, and because they're a lot larger, so you can't just use the same systems or strategies that you've used before. The unfortunate news here is that most out-of-the-box automation systems don't actually support monorepos. You can do a strategy where you say, all right, I know these are the types of projects, I use this tool in my polyrepo setup where I have a single project, let's just run it across the entire monorepo. You'll get your dependency information, but you won't necessarily know which dependencies relate to which projects, so it's not quite as useful to you there. All right, so I'm going to walk through an example with AOSP now, the Android Open Source Project, or rather a general layout that's common in most monorepos. Here we have a file tree: a folder A, and inside A are folders B and C. B contains a notice file (a piece of legal information) and some source code; C likewise contains a notice file and some source code. So first, we know that the notice file in B and the source code in B relate to each other, and similarly the notice file in C and the source code in C relate to each other. Second, we know that the source code in B actually uses the source code in C, so it depends on C, or it imports C.
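Those two relationships can be stitched together mechanically. Here's a minimal sketch, assuming the hypothetical A/B/C layout just described; the directory paths and the single B-imports-C edge are hard-coded for illustration.

```python
# Two pieces of information from the file tree: which NOTICE file
# belongs to which project directory, and which project imports
# which. Both tables are the hypothetical A/B/C example, hard-coded.

notices = {"A/B": "A/B/NOTICE", "A/C": "A/C/NOTICE"}
depends_on = {"A/B": ["A/C"], "A/C": []}

def required_notices(project):
    """NOTICE files that must accompany `project`: its own, plus
    those of the projects it imports."""
    return [notices[project]] + [notices[d] for d in depends_on[project]]

print(required_notices("A/B"))  # B's notice plus C's, because B imports C
print(required_notices("A/C"))  # just C's own notice
```

In a real monorepo the depends_on table would come out of the build system's dependency graph, and the propagation would need to be transitive rather than one level deep.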
So now we have two pieces of information: we know the dependency graph between B and C, and we know which notice file relates to which source code. We need to be able to take these two pieces of information together and know that the notice file for source code B relies on the notice file for source code C, or should contain the same content from it. So that's the general overview of monorepos and automation. I'm going to go into a quick aside on build systems and monorepos and how those also affect automation. Not all build systems are created equal or behave in the same way. All of them have different trade-offs, which can make them better or worse in different situations. The first type of build system is an imperative build system, where you give instructions on how to run a build. You tell your build system: here is a set of tasks, go through step one, step two, step three, and the build system actually does those tasks in order. It can be compared to a recipe. You go, hey friend, I want an omelette, here's an omelette recipe. Your very lovely, patient friend takes the recipe and follows the steps: they do step one, step two, step three, exactly as you've said, and they make you a tasty omelette. The second type of build system is an artifact-based build system, or a declarative build system. In this type of build system, you describe the shape of your build. You say: these are the things that go into each artifact, and these are the binaries or results that we expect out of it. It can be compared more to a menu, where you know these are the ingredients going into your omelette and this is what an omelette looks like. Taking the same analogy as before, you go to your lovely friend who's very generously willing to cook for you and say: hello, lovely friend, here are some eggs, here's some cheese, here are some green onions, please make me an omelette.
Your friend knows what an omelette looks like, and they figure out the steps themselves. To be clear, imperative versus declarative looks a lot like a spectrum: build systems vary from very imperative to almost declarative to actually declarative. But at its core, a build system is one or the other; it can't be both. Two examples of this are make, or Makefiles, which are strongly imperative, and Bazel BUILD files, which are strongly declarative. So why do I care about build systems? Why have I brought you all here to listen to a talk about monorepo automation and then started talking about build systems? Well, there's a relationship between the type of repository layout that you have and the build system itself. In a polyrepo setup, each project can use whatever build system it wants. It can be imperative or declarative; it just depends on whoever's working on the project, what they prefer, and what your needs are. But monorepos almost always use declarative build systems. The reason for this is that declarative build systems work a lot better at massive scale in general. For example, if you're working at the scale of an omelette, or three ingredients, it's a five-minute process; it doesn't really matter how you do things. But let's say you go to your very lovely, very patient friend and say: hey friend, I want you to make me a five-course meal. You're going to make me macaroni and cheese, and pizza, and nachos, and an omelette, and this and that. They want to be able to take all of the things that they're doing, all of the food that they're making, or all of these build artifacts, and reorder or optimize them. And because they know what they're doing and they have all of this flexibility and freedom, they can cook your meal, or run your build, in a more efficient or more effective way.
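To make that Makefile-versus-Bazel contrast concrete, here are two hypothetical fragments that build the same small C library. Both are illustrative sketches, not excerpts from a real project.

```make
# Imperative: an ordered recipe of shell commands. To know what this
# build actually produces, you more or less have to run it.
libfoo.a: foo.o
	ar rcs libfoo.a foo.o

foo.o: foo.c foo.h
	$(CC) -c foo.c -o foo.o
```

```python
# Declarative (a Bazel BUILD file, Starlark syntax): you state the
# inputs and the target, and Bazel works out the build steps and
# their ordering itself.
cc_library(
    name = "foo",
    srcs = ["foo.c"],
    hdrs = ["foo.h"],
)
```

An automated scanner can read the name, srcs, and hdrs attributes of the second fragment directly, without executing anything; the first fragment only fully reveals its behavior when make runs it.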
So once again, why do we care? This ties back to automation. Imperative build systems are really, really hard to analyze with automation, and in general you pretty much have to run the build itself in order to see what it's actually doing. They can depend on environmental state, they have complex control flow, and in general, like I said, you can't statically analyze them. There's an urban myth floating around at work about someone who ran into a Makefile that invoked a curl command; that curl command went to some university professor's website, pulled down some C and C++ source code, compiled it into the build, and then deleted the original source files. So it's really hard to just look at a bunch of scripts and go, all right, what weird, janky things can people get up to like this? In contrast, a declarative build system, the type of build system that is almost always used in monorepos, tends to be a lot easier to analyze with automation. In particular, this is because the build system is responsible for its own control flow: you know what the inputs are, you know what the outputs are, and the build system figures the rest out itself. This means that certain aspects of the build can be statically analyzed. In particular, you should be able to look at the relationships between different pieces of code, or different parts of your project, and know which ones are used by which others, because that's what you define within your build files, and the build system just takes that information and puts it together. You don't need to run the build to know all of this information. So because monorepos tend to use declarative build systems, they tend to be easier to automate. This isn't guaranteed; I'm sure there's some monorepo out there that uses an imperative build system.
The maintainers are very happy with it and shaking their fists at me right now. But in general, monorepos do tend to use declarative build systems. So now I'll wrap this up by looking at the AOSP monorepo, the Android Open Source Project monorepo, and how its build system has evolved over time, as a good example of this tendency of monorepos towards declarative build systems. AOSP initially started off as a big collection of projects stitched together, and this was a fairly big undertaking. Most of the projects already used Makefiles, or make, as their underlying build system, so they continued to do that. Then this started to break down as the project got larger: it became error-prone, hard to debug, hard to maintain, hard to test. So AOSP started to move towards Android makefiles, which are like Makefiles but a little more structured, a bit more declarative. And then, once again, this started to break down, and AOSP started to move towards the Soong build system, using Blueprint files to define your build. This is, again, much more declarative than before. But AOSP, like I said, is a monorepo, and it's a massive monorepo. Right now they have over 300,000 files, and I think it's something like 200 to 250 gigabytes to be able to build it yourself. It's really hard to take 300,000 files and figure anything out of that, and it's also hard to instantaneously swap from one build system to another. So right now they're wrapping up being in a hybrid state: they have Android makefiles, which are fed into Android make and then converted to Ninja files for the Ninja build system (that's the old path that still exists), and they have Android Blueprint files, which are fed into the Soong build system, which generates Ninja files that go into the Ninja build system, and then that compiles everything.
So this is a fairly complicated system to look at manually and know how things relate. It's also really hard to look at in an automated way, because you need to understand so many pieces: you need to know how Android make works, you need to know how Soong works, you need to know how Ninja works, and you need to be able to get the pieces of information from those three systems and actually stitch them together. So, like I said, this is an example of a massive monorepo moving towards a declarative build system. We started off with Makefiles, which are very imperative, moved to Android makefiles, which are less imperative, and then moved to Blueprints and Soong, which are less imperative still, with Ninja as a sort of hybrid. And they've actually announced that AOSP is moving towards Bazel, a fully declarative build system, which is a very common build system for monorepos and is being adopted much more frequently right now. So, like I said: a clear progression from imperative build systems to declarative build systems for monorepos, and a clear progression from a really hard-to-automate type of build system to a much more manageable-to-automate one. To give a quick recap: open-source compliance requires a really good understanding of your code base. You need to know what your dependencies are, you need to know how they're used, you need to know the risks and mitigation steps, and you need to keep track of any changes that happen within your code base. And so we really want to automate this. However, automation needs homogeneous projects to be effective, and to get a good payoff for the amount of work that goes into the automation.
And so monorepos are both a challenge, because it's much harder to get this granular, in-depth understanding of the code within your monorepo, and an exciting opportunity for automation, because the projects are more similar, so it's more manageable to automate them, and because we now have a lot more information on the relationships between the projects in your code base, so you can understand them a lot better. I'd like to give a special thank you to my manager and mentor Eric Morris for his help preparing this talk, and also list a couple of resources that I found really good. In particular, the book Software Engineering at Google: Lessons Learned from Programming Over Time is excellent; it has a very, very good section on build systems, in particular on Bazel and its evolution and all the cool things they're doing there. And the FOSSA blog itself has a couple of good articles on monorepos and on automation for license compliance. So, does anyone have any questions? Yep, so, going over why imperative build systems are harder to automate; I'll go back a couple of slides so I can point at it. In an imperative build system, you have a set of steps where you'll say: first do this, then do that, then do whichever, and they tend to be very, very flexible, so you can do very much whatever you want with them. The example I think I gave was that urban myth of someone who has a bash command within their build system that just downloads source code, builds it, and then deletes it afterwards. So unless you're actually walking through and looking at what it's doing, you can't quite tell.
But there also tends to be a lot more flexibility in terms of how your build system files themselves are structured: you can lay them out differently, order things differently, use different naming conventions. In declarative build systems, you don't really have a set of steps. You just sort of have a list, like a module definition where you say, this is my list of inputs, this is my list of outputs, and the build system figures it all out. In general, they often do the same types of things, and you can find ways to make declarative build systems and imperative build systems behave in similar manners. But declarative build systems always push you towards, hey, define your structure with inputs and outputs, whereas imperative build systems are like, oh, go figure it out yourself, do whatever you want. So they tend to be less standardized too. Thanks for that. Yep, repeating that back for virtual listeners: it sounds like monorepos are a lot easier, or a lot better, to automate because of these declarative build systems. Next question: has the use of monorepos changed your processes around branching and merging and pull requests? Oh, yes. I worked with the massive monorepo at Facebook for a while, and they had to build up this entire system around just getting any of your code in, because when you have thousands of people trying to get a commit merged in any given hour, everything just breaks down: a lot of automated testing on merges breaks down, stuff gets really slow, and it's a bit of a pain to work with. At my current job, we use a very small monorepo, with I think four or five projects inside of it, and that hasn't had major impacts on us yet. Our pull request process is still the same. We have had to define different ways for engineers to handle code review.
Specifically, saying: hey, we have this round-robin setup for picking a code reviewer on this project or that project. But otherwise it's fairly similar for the smaller monorepo and for individual projects there. I think it can be very dependent on your team, your project setup, and just generally what you're working on. Follow-up question: are you aware of any resources on changing your process, like a modified Git flow for monorepos or something like that? Yep, so I don't have any good resources off the top of my head. Yeah, I don't know any off the top of my head. Any further questions?