Hello, welcome back. Brian here. I hope you had a great dinner and your belly's full, ready for some more talks, because I am. I didn't have dinner, but I'm ready. Are you ready, Benji? I'm ready. Great to be here. Great to have you. You're talking about monorepos? That is correct. Yes, a controversial topic on which I have strong opinions. Amazing. Your slides are ready, so I'll leave you to it. Appreciate that, thanks.

So, hi everyone, thanks for attending my talk. I hope you've all had a great conference. My name is Benji, and I'm one of the core contributors to Pants, which is an open-source build system. Today we're going to talk about Python codebase architecture, and specifically about monorepos. I know that this is a controversial topic on which there are many opinions. I am going to present mine, and I fully acknowledge that there are others.

A little about me, and why it is that I came to have a lot of strong opinions here: I've worked as a software engineer for many years, and I have had the good fortune to work at some really great companies. As I mentioned, I'm a maintainer of the Pants open-source developer workflow system, and in the last couple of years I have been a co-founder of Toolchain, a startup in the developer tools space.

A quick overview of this talk: basically three parts. I'll start by defining terms. What is a monorepo? What do I mean when I use that term? The bulk of the talk will be: why would I want one? And the last part will be: say you're on board with wanting a monorepo, what kind of tooling makes working in one effective? So let's jump right in to: what is a monorepo?
So, there is one common characteristic of basically every codebase that any of us are working on, and that is that they grow over time. They grow because you have some set of developers adding code over time, but also, if you're hiring and your team is growing, then you're adding more developers who are adding code, and so your codebase can grow faster than linearly over time. And there is a very common consequence to this, which is that builds get slower and less manageable. They become unstable. They become unbearably slow and clunky.

Now, I can already hear the objection forming in some people's minds: what do you mean by builds? Python is not a compiled language. I'm using the term "build", and if you don't like that term, you can substitute "developer workflow", in the general sense. It's every step that you take from when you hit save in your editor to having an artifact that's ready to deploy. So, for example: resolving and downloading your external dependencies, generating code, type checking, running tests (obviously a big one), debugging in a REPL, linting, formatting, and actually building and packaging those deployable artifacts. Python definitely has a build; it just doesn't have a build-time compiler. That's what I mean by build. And again, all of these get slower and less manageable as your codebase grows.

So what do we do? As your organization and your codebase grow, we have basically two architectural alternatives for how to manage that codebase in a scalable way, and those two are the multirepo architecture versus the monorepo architecture. And I should mention that obviously this is a continuum.
You don't have to be at the extreme end of either of them, but generally these are the two poles between which this continuum runs.

Now, let's start by talking about multirepo. Multirepo means that as your codebase grows, you split it up into a growing number of small, or at least manageable-size, repos, and typically you split them along team boundaries, project boundaries, or library boundaries. Very often this kind of happens naturally, because it's the path of least resistance: the easiest way to get a handle on scale is to just take an axe to your codebase, carve it up somehow, and worry about the consequences later. We will obviously talk about those consequences soon.

But there is an alternative, and the alternative is what I refer to as a monorepo. With a monorepo, you keep a single, unified, growing codebase that contains code for multiple projects and multiple services that share underlying dependencies, tooling, and best practices. A monorepo may, for example, contain code in multiple languages, or, even in the same language, code relating to multiple frameworks. Typically you'll have different parts of your team working on different but overlapping parts of the monorepo, and very often they will overlap on exactly those shared dependencies. As your codebase grows and your organization grows, your single unified monorepo grows along with them.

Now, I should emphasize that we are talking about codebase architecture, not deployment architecture. A monorepo, as an architecture, is agnostic to whether you deploy a few large monoliths or many microservices. In fact, there are often many advantages to deploying microservices out of a monorepo, because when you have a great many services, they share dependencies, and in particular they often share protocols, because that's how they talk to each other. So it is often very helpful to have those be in the same
repo. And that leads me on to the second part of this talk, which is: okay, monorepo defined; why should I want one? I will freely admit that multirepo does sound better at first. It's more decentralized. It allows you to make local decisions. You can sort of do your own thing in your own repo; you can put a boundary around it, maintain order inside it, and keep all the barbarians from other parts of your team out of your code. It sounds good, and these are good buzzwords, and there are cases where I think that is actually valid. I'm not a total fanatic here.

But there is a set of core problems in codebase management that multirepo not only doesn't solve, but hides. It obscures the problems until you encounter them down the line, whereas a monorepo makes these problems explicit, so you can reason about them. There are various ways to convince yourself of this. Because of shortness of time, I've boiled it down to really one main point that I want to make, which is that, from my experience, the hardest codebase problems are managing changes and managing dependencies. And the intersection of these two is particularly tricky. It is where so much of the pain of codebase management lives, and to the extent that we can pick a codebase architecture that makes handling these problems easier,
I would argue that we should consider doing that.

So when we think about managing changes in a world of dependencies, or managing dependencies in a world of changes, let's look first at how these challenges are handled in a multirepo world. Multirepo relies on publishing. If you have a bunch of repos, and some of them consume some code, some library or utility, from repo A, then repo A has to publish an artifact; in Python that would be an sdist or a wheel or whatever. Now, when I say publishing, this could mean to a private internal corporate repository, not necessarily to public PyPI, because remember that we're talking about an organization's internal codebase here. So we rely on publishing.

But unless repo A never changes, which, let's be realistic here, it will, it's not just that you need to publish: the artifacts you publish need a versioning scheme. When someone makes a change to repo A, they have to republish it under a new version. Why? Because otherwise those changes might break the existing consumers of repo A at the old version. To look at a more specific example: if we have repo B that depends on repo A, it does so at a specific version. So now say you're an engineer working in repo B, and you need some change to this upstream dependency. First you have the organizational task of finding out who owns repo A, and figuring out whether you are even empowered to make a change there. But let's say that you are, or that you've convinced the owner to make the change. That change has to be published under a new version, and then you need to consume that new version, presumably in a new version of your repo. And now something needs to happen, and you have two choices when it comes to change management through these dependencies.

First of all, you can decide to be a good citizen and make the virtuous choice. The virtuous choice is: you find all the other consumers of repo A, and that could be many. You ensure that their
code still works with your change. You modify those repos as necessary until their tests pass, or until you've otherwise qualified that your change is good for them. And that is a lot of work. Because, first of all, how do you know who all the consumers of repo A even are? Remember that dependency consumption metadata lives on the consuming side. There's no metadata in repo A that says, "here is the universe of other repos that depend on me." So you have to find them somehow. Then you have to figure out: how do I test the change? How do I know enough about each of these repos to even run its tests, or figure out how to qualify that this change is good?

But let's say I've figured all that out and done it. I'm still not done, because I actually have to do this recursively if I made any changes to the direct consumers of A. Sure, I made changes to B, because that was why I started this whole laborious piece of work, but say there are C, D, and E that also depend on A. If I've made any changes there, then I have to recurse: I have to find all the consumers of those repos and do the whole thing over again. There is a lot of friction here; it is a tremendous amount of work.

So even though being a good citizen is good, very often you end up making what I would call the lazy choice, which is: I'm not even going to worry about the other consumers of repo A, because that's what versioning is for, right? They're safely pinned to that earlier version, and when those other consumers need to upgrade, I will let them deal with that problem. But this isn't very nice, because by the time they have to do that upgrade, they may have no context for your change. You may have lost context for your change. Maybe you're not around anymore. So essentially what you've caused is the problem famously known as dependency hell.
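As an aside, the diamond situation described here can be made concrete with a small sketch. This is purely illustrative: the repo names and pinned versions below are hypothetical, and real resolvers are far more sophisticated, but it shows how one package required at two different versions through two paths becomes unresolvable.

```python
# Hypothetical multirepo dependency pins: each repo pins the versions
# of the internal packages it consumes.
PINS = {
    "repo_b": {"repo_a": "1.0"},
    "repo_c": {"repo_a": "2.0"},
    "repo_d": {"repo_b": "1.0", "repo_c": "1.0"},
}

def required_versions(repo, package, pins):
    """Collect every version of `package` required along any
    transitive dependency path out of `repo`."""
    versions = set()
    for dep, ver in pins.get(repo, {}).items():
        if dep == package:
            versions.add(ver)
        versions |= required_versions(dep, package, pins)
    return versions

def has_conflict(repo, package, pins):
    # More than one required version means no single choice can
    # satisfy every path: "dependency hell".
    return len(required_versions(repo, package, pins)) > 1
```

Here `repo_d` reaches `repo_a` both through `repo_b` (pinned at 1.0) and through `repo_c` (pinned at 2.0), so no single version of `repo_a` satisfies both paths.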
You've effectively caused a dependency resolution problem for these other repos: if they don't upgrade, but they depend on A through two different paths, they may end up requiring two different versions. That is impossible; one of them has to be picked, and it's possible that neither of them will work. And this is exactly what I meant by saying that multirepo hides problems. It allows you to essentially push responsibilities off onto other people at a future time. You've left a little time bomb in the codebase, and that is not a great way to be part of a productive and cohesive team.

So that was multirepo. But in a monorepo, there's no versioning or publishing. Again, we're talking about a continuum, but in the extreme case at least, all the consumers are right there in the same repo. You can find them using ripgrep or whatever your codebase comprehension utility of choice is. You can run all the tests in the repo, or just the relevant tests if you have the right tooling, to ensure that your changes are good, and any breakages are immediately visible. You're essentially making self-contained changes that are, if you are scrupulous about running tests and so on, guaranteed to be good at head. You are keeping everything in lockstep. So you can see here an example, a really important one in my opinion, of how codebase architecture actually enforces good teamwork and responsibility.

That was a big example, but more generally, I would argue that, even though it seems counterintuitive, monorepos can often be more flexible than the alternative. For example, they are easier to refactor. They're easier to debug, because you can transit easily through all the sources that affect the code under debugging. You have a unified change history.
So it is very easy to reason about the mutation of the code over time. It's easier to discover and reuse code, because everything's right there in the single repo. Putting repo boundaries between different parts of your codebase reduces flexibility.

My last and more general point here is that I have seen from experience, without naming names, that the codebase architecture is very often a reflection of the structure and functionality of your organization. There are many cases where localized decisions and creative chaos are desirable, sort of a wisdom-of-crowds type thing. But an engineering organization is not a bottom-up kind of thing; by its very name and nature, it's organized. In a well-functioning engineering team, priorities, decisions, and effort allocation flow top-down; some sort of top-down organization is required, and the codebase very often reflects those organizing principles. I think this is why many large companies, Google, Facebook, Twitter, and many others, have adopted the monorepo architecture: it helps keep the organization unified even at huge scale.

So that's what I had to say in my advocacy for monorepos, for Python and in general. But focusing a little more on Python, I wanted to spend the remainder of this talk on tooling for a Python monorepo. For the sake of argument, we've accepted that we want a monorepo. How do we work in one effectively?
So, the observation is that standard Python tools, the tools that you all use every day, from pytest to pip to mypy to flake8, you name it, generally are not really designed for a monorepo architecture. The reasons are that they rely on global state, and they rely on side effects in the file system: they leave virtualenvs and files all over the place, and those are expected to be there when you need them. As a result, small changes tend to trigger full reruns of whatever task it is you're doing, whether it's running tests or running, say, mypy for type checking. Linters and formatters, you tend to run universally. The tools are generally designed to expect to run on an entire package hierarchy, on an entire repo. Now, some individual tools may have special-case mitigations for some of these issues, but it's very ad hoc. So typically, when you use these tools naively in a growing monorepo environment, they do a lot of repeated work, and that slows things down.

So how do we speed things up? There are really two ways to speed up work: one is to do less of it, and the other is to do more of it concurrently. To do less work, in this context of developer workflows, you really need two important features.
You need fine-grained invalidation, namely the ability to analyze, at a pretty fine grain, what the effects of changes are. And you need caching, so that if work has been done before, you don't have to do it again; done by you, or potentially done by someone else on your team, if you have shared remote caching. When it comes to doing more work at once, first of all you need to be able to reason about concurrency. You need some system that can say: these pieces of work are allowed to run at the same time; they can run concurrently, because one does not depend on the output of the other. And you want to at least have support for remote execution, so that instead of concurrency being limited to the number of cores on your laptop or your CI machine, you could potentially have dozens or hundreds of cores working for you at the same time.

So what kind of tooling has these features? To work effectively in a monorepo, unsurprisingly, you need a build system designed for that. Now, these build systems, or more generally developer workflow systems, don't reinvent the wheel, for the most part. You want one that sits on top of existing standard tooling but orchestrates it for you: it figures out when to run which tool, and on which inputs. These tools tend to be very different from things like make or tox, or from running the underlying tools directly, like literally running pytest on the command line, and there are good reasons for those differences, which we will get into.

Fortunately, such tools exist. Here are some examples of them, and there are several others. Now, I am freely declaring my bias here: as I mentioned, I am one of the core contributors to Pants, and have been for many years, through two major rewrites of it.
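To make those features a bit more concrete before going on, here is a minimal, purely illustrative sketch in plain Python of content-keyed caching plus concurrent execution of independent tasks. All the names here are hypothetical; real build systems use persistent, often shared remote caches, and much finer-grained units of work.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory cache; a real build system would use a
# persistent local store and optionally a shared remote cache.
CACHE = {}

def cache_key(task_name, input_files):
    """Fine-grained invalidation: key work by the *content* of its
    inputs, so unchanged inputs mean a guaranteed cache hit."""
    h = hashlib.sha256(task_name.encode())
    for path, content in sorted(input_files.items()):
        h.update(path.encode())
        h.update(content)  # file contents as bytes
    return h.hexdigest()

def run_task(task_name, input_files, action):
    key = cache_key(task_name, input_files)
    if key in CACHE:  # do less work: reuse a previous result
        return CACHE[key]
    result = action(input_files)
    CACHE[key] = result
    return result

def run_independent(tasks):
    """Do more work at once: tasks with no dependency between them may
    run concurrently (locally or, in principle, on remote machines)."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(run_task, *t) for t in tasks]
        return [f.result() for f in futures]
```

Because the cache key is derived only from the task and the content of its inputs, unchanged inputs are a guaranteed hit, and because independent tasks share no state, they can safely run in parallel.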
So I'm naturally biased towards it. But to back up that bias for this audience: the most recent iteration of Pants, which we launched last year, was specifically designed with Python use cases in mind. It's not tacked onto a C++ system; it was designed for Python. But each of these systems has its strengths, and they tend to work in similar ways if you squint, which is what I will go into for the remainder of the talk.

There are many interesting aspects to these tools. I'm going to look at three that, to me, stand out, and that make them different from other types of tools you might be familiar with. The three are: they have a goal-based command interface; they rely on build graph metadata; and they implement a workflow that has no side effects and that you can extend with custom logic. I will now go into what I mean by all of this.

Let's start with what I referred to as a goal-based command interface. With a monorepo build system, you don't say "run pip on this" or "run pytest on that". Instead, you specify a goal you want to achieve, and these goals are generic verbs like test or package or lint. The system translates that into the execution of underlying tools: pytest, setuptools, pylint, flake8, mypy, whatever tools are necessary. This is important for a couple of reasons. One is that it's necessary in order to support invalidation and caching: you need a conceptual layer between what the user wants to achieve and the expensive part, which is actually running processes, because maybe you don't need to run a process at all. Maybe you can get a result from cache, or maybe you only need to run it on some subset of files. You let the system figure all that out for you. And so you can see some examples here of the types of command lines you can run. You can run on this file. You can run on this glob of files.
You can run on just the files that have changed in Git since some tag, that kind of thing. So that's what I mean by goals.

The next point I mentioned was the build graph: the idea of code dependencies. Monorepo build systems need extra metadata to figure out the packages, the dependencies between them, and what the structure of your code is. Some tools require this metadata to be very explicit and handwritten; other tools, Pants included, can infer it by looking at import statements and other aspects of your code. Having this data allows the tools to do fine-grained invalidation: for any file change, we know which transitive dependents are affected, so we know, for example, which tests might need to be rerun and which tests can be skipped or resolved from cache.

Another example of how this data is used is that the system knows which internal and external dependencies need to be packaged into each of your deployable binaries, because in a monorepo you may have multiple binaries that you're deploying. With standard tooling you sort of have to pull in an entire requirements.txt, for example, and you might need many requirements.txt files if you have different requirements for different servers, and keeping them in sync is difficult. A monorepo tool, by contrast, can slice out just the dependencies you actually need.

So those are code dependencies.
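As a rough illustration of the inference idea, the heart of import-based dependency inference can be sketched in a few lines of plain Python using the standard `ast` module. This is a toy, not how Pants actually does it: a real implementation must also map module names back to files, handle dynamic imports, and so on.

```python
import ast

def inferred_imports(source):
    """Infer a module's dependencies from its import statements,
    similar in spirit to how a build system infers build graph
    metadata from the code itself."""
    deps = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            deps.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            deps.add(node.module)
    return deps
```

For example, a file containing `import os` and `from myapp.util import helper` would yield the dependencies `os` and `myapp.util`, which the build graph can then resolve to third-party requirements or to first-party source files.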
There's another type of dependency that these systems care about, which is the dependencies between units of work. This is a static mapping that you can construct at startup time, before any work is actually run. It is just a set of rules that knows how to transition between inputs and outputs, and these rules form a graph that maps the initial inputs, which are things like files on disk and their current state, to the final outputs, which are the goals the user requested. You can imagine work flowing through this graph, from input to output, that output becoming the next input, and so on, until you achieve a final result.

And this is where the extensibility comes in. Very many organizations have little custom scripts and build steps, and in all of these systems you can plug those right into this rule graph, where they run exactly the same as any built-in rule.

And this is where we get to the workflow. Based on the code dependencies plus the task dependencies, both of which are static, you can construct a dynamic workflow at runtime, and what actually runs depends on invalidation, on what's available in the cache, and so on. The important point here is that this workflow causes no side effects other than outputting the final results, and it has no global state. Why is that important? Because that is what enables concurrency and remote execution. Think about it.
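As a toy illustration, assuming a purely hypothetical rule set, such a rule graph can be evaluated as a memoized walk from the requested goal back to its initial inputs:

```python
# A hypothetical rule graph: each output names its input dependencies
# and a pure function that computes it from those inputs.
RULES = {
    "sources":  ([],                     lambda: "def f(): return 42"),
    "lint_ok":  (["sources"],            lambda src: "ok"),
    "test_ok":  (["sources"],            lambda src: "passed"),
    "package":  (["lint_ok", "test_ok"], lambda l, t: f"artifact({l},{t})"),
}

def evaluate(goal, _memo=None):
    """Walk the rule graph from the requested goal back to the initial
    inputs, computing each node at most once. Because every rule is a
    pure function of its inputs, independent nodes could safely run
    concurrently, or on remote machines."""
    memo = {} if _memo is None else _memo
    if goal not in memo:
        deps, fn = RULES[goal]
        memo[goal] = fn(*(evaluate(d, memo) for d in deps))
    return memo[goal]
```

Asking for the `package` goal pulls linting and testing through the graph on demand; a real system would also consult its cache before computing any node.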
If you rely on local state or on side effects, you cannot take work and parcel it out to a remote system, to a cluster, with any kind of confidence. So this explicitly modeled workflow enables those four features which, if you remember from a few slides back, were exactly the features that speed up your builds and let them scale as your monorepo grows.

So, to sum up: I claim that monorepos are an effective codebase architecture in many cases, that they require appropriate tooling, and that, fortunately, this tooling exists; I have been fortunate to have had a hand in creating some of it. Thank you all so much for listening. I'm very happy to take any questions. You can find us at this URL. Pants is part of a very friendly open-source community, and we're always happy to help out with any build questions you may have. So thank you again.

Thank you, Benji, for the great talk. There are quite a few questions and not a lot of time, so I'll just go quickly through them. Have you tried having dependency versions for all projects in their own separate repos? Like the example of a main, parent project which defines the actual versions?

So, having dependency versions for all projects in their own separate repos... I'm not sure I understand the question. Is this where there is one project that specifies all the dependencies? I'm not super familiar with that, to be honest, and I don't want to give a glib answer here. I suspect that you still end up with the same problem: the problem is less about where you maintain the version numbers and more about how changes flow through your dependencies, and I'm not sure that this would solve that. But again, I'm not entirely sure I understand the question.

Fair enough. Next one: do you need to have the same release cadence for all products in the monorepo? Oh, great question.
No, you actually don't, because, regardless of architecture, you always need to make sure that if you're deploying servers and clients, services that talk to each other, they are capable of talking to each other through the upgrade. That is, in many cases, a nuanced problem. But no, you can actually release different parts, different services, at whatever cadence you like. There's no requirement to have the same release cadence.

Good to know. And how do tools like Pants compare to tools like Poetry? Do they have different purposes?

Yes, an interesting question, because the most recent version of Pants, the one we will be releasing in a few days, now has Poetry support. I mentioned that these tools orchestrate underlying tools, and Pants, as of this most recent version, is able to consume Poetry dependencies from your pyproject.toml. And I think they do have somewhat different purposes. Again, Pants is much more about orchestrating a wide variety of tools when you may have multiple binaries, multiple distributions, in the packages that you're releasing from the same repo.
I'm not a regular Poetry user, but I'm more familiar with it as a tool that you use predominantly to manage a set of dependencies in a single world. Poetry, again, is one of those tools that expects to own the world, and, as I said, we now support it.

Yeah, so you have many tiny worlds inside the same repo.

Right. The example would be: if you have ten different services that you're deploying from your monorepo, a tool like Pants can say, okay, Poetry determines the universe of dependencies I can select from, but Pants will actually slice out just the ones that you genuinely need, based on your actual dependencies. So if only one server needs some big heavy library, only that server will be deployed with that big heavy library.

Right. Last question then: in a monorepo, how do you distinguish Git changes from different parts of the repo?

So again, I'm not entirely sure I understand the question. You can...
Um, you can So I can explain Yeah, please So just because I put the question in I'm assuming that that there's different services, uh inside the same repo Yeah So like so there will be like a change that's unrelated to the to the whole product by itself But it's just specific to a service for example I see so I would actually distinguish between looking at git and just looking at changes in general so, uh Pants for example has you have the ability to say I don't really know what changes have happened I the I don't know which parts of the code base I care about I care about the ones that have changed in git So just figure that out for me and you can you saw an example of a command line like that earlier But the way change management is done in general is through content hashes So essentially if a result has been cached You can say, you know run these tests or run these subset of tests And work has been cached based Entirely and purely on content hashes So if no changes have propagated if no changes where no changes have happened The fingerprints will remain the same and the work can be recovered from cache where Anything has changed the fingerprints will change and so essentially these tools do their own change detection Git is a layer on top At the sort of user interface level where you can say I happen to know as a user that the changes I care about are these changes in git But that is not what any of these tools use to detect changes For robustness for for If that makes sense Yeah, it makes sense makes sense right so change detection has gone through fingerprinting Sure, I'm happy to take more questions in chat or come find us online Yes, go to the breakout Group and if you want to ask Benji some more questions. Thanks again. Cool. Thanks