So, our next speaker is Mr Angad Singh, presenting on Pants, a Python build system for large code bases with multiple dependencies. Angad leads the infrastructure and operations team at Viki, a Rakuten company, and previously worked as a site reliability engineer at Twitter. He's passionate about infrastructure automation, monitoring, distributed system design, and team development. Please welcome him.

Thanks for joining. I know it's right after lunch and you might be slightly sleepy. That's okay, I allow sleeping in my talk. If you feel sleepy, just raise your hand and let me know. Okay, let's start. My name is Angad, as the introduction mentioned, and today I'm going to tell you about build systems for large code bases. I graduated from NUS Computer Engineering in 2013 and attended many lectures in this room. I was an SRE at Twitter for a year after graduation, and then I joined Viki in Singapore as a DevOps engineer.

Let's get the agenda straight. First I'm going to talk about code organization for large code bases, and how we can solve some pain points to improve developer productivity. We'll go over how Pants can be used as a build system and how it generates easily portable PEX files. Then we'll look at some code examples with Pants, and some repository examples of how we structure code. I'll be talking about Pants here because it's the build tool I'm most familiar with and it's written in Python, but there are other build tools as well, such as Buck from Facebook and Bazel from Google. I urge you to explore all of these tools; the main point is that using a build tool is better than not using one.

So, let's start with code organization. Let's say you're a small team or a small startup.
The most intuitive way to organize code is to have one project repository per service. If you have a user service, you have a repository called users; if you have a payment engine service, you have a payment-engine repository. You divide your code into blocks just as you divide your microservices. The idea behind microservices is great: all your teams can work independently on different pieces of the infrastructure, and you can deploy at different times, deploying the user service's code or some API changes in the frontend separately. But that does not mean that microservices should become micro-repositories as well; your microservices should not map one-to-one to repositories. This is an opinionated topic, so take all of this with a pinch of salt: these are opinions, and different opinions work for different people in different scenarios.

A lot of companies are now splitting a monolith into microservices, and the most intuitive first thought is: I have these services, so let me pull out the code for each one and put it into an individual repository. That's the simplest answer. In fact, you've seen recently that GitHub is offering unlimited free private repositories, so why not create thousands of them? As a small team, that works fine: you have separation of concerns. But as teams grow, both in age and in size, managing those thousands of micro-repositories becomes a pain. Managing microservices can be automated, so why not automate the management of your code organization as well?
One of the main reasons I say this can be painful is that a lot of these services share code. For example, some service might have a special way of connecting to Redis; it might decide that after three retries it doesn't want to retry again. But these are configuration parameters that can be supplied to a library, so all of these services should not be re-implementing or restating that basic logic. That's what I mean by shared code. If shared code is copied into multiple repositories, and multiple teams manage those repositories, it turns out everyone does it differently and you end up with a lot of messy standards. That's one failure mode.

The other option is to publish the shared code as artifacts: jar files, Ruby gems, or in the case of Python, eggs. The other services then import them by version, but now you end up in version hell. Say I maintain a library with 50 downstream services using my code, and I want to make a tiny change, one that doesn't alter the API but adds some improvement. Now I have to go into all the other repositories and bump the version. That's a productivity nightmare. So my point is that this approach does not scale well for a large number of microservices; you end up with complex methods of sharing libraries, publishing artifacts, and versioning hell.

Instead, I invite you to think of the unit of code not as a project but as a library. There's no official definition of a unit of code, but the one I'm using here is: the library is the unit of code.
So now say we divide our code into a bunch of libraries, and each service is composed of these libraries, with some libraries overlapping among multiple services. If you start with this mindset, you promote the idea of writing reusable, shareable code from the very start. Even if a module is not intended to be reused later on, you begin with the concept that it will be individually testable and exportable to another project that might need it. So we think of libraries as units of code.

There are lots of other benefits to this approach. You get a single lint, build, test, and release process that can be baked into the repository. It's easy to coordinate changes across multiple modules. It's easier to set up development environments, because the whole development toolchain is present inside the repository and can set up all of these services together for you. Say you change one shared library: the build system can then run the tests for all the services that use that library.

This is not a new concept I'm introducing here; it's an existing concept a lot of companies have used in the past. It's called the monorepo. There are pros and cons, and a lot of people argue both for and against it. I define a monorepo as a repository with a defined structure for organizing reusable components of your code. It started with Google. Google has a famously large monorepo, along with a build system called Bazel, with which developers extract and build the pieces of code they want to work on. The good thing is that they have standardized their tooling all around it, and it prevents re-implementing or reinventing the wheel every time you start a new service.
You get all the benefits that other services have. For example, if you start building a service within this monorepo, you get monitoring for free and profiling for free, because they're already part of the build process and the tooling. The success of large engineering teams depends heavily on building on top of the work of the people who came before you in the company; you keep building on top of other people's code.

So now we know there are benefits to thinking of code as libraries and organizing it in one place. In summary, we need a system that allows easy code sharing among multiple projects; fixing a bug in a function should not require changing versions across downstream projects; and we need standardization in the testing and building process.

Now let's talk about Python. I'm pretty sure a lot of you have used virtual environments. Managing the dependencies of a single Python project with a virtual environment is easy. But if you have tens of projects in your repository, managing all those virtual environments becomes painful. We need some way to automate building them. That's where Pants steps in. Pants is a build tool with support for multiple languages, itself written in Python. It was developed at Twitter and Foursquare to manage multiple build targets in a single repository. The dependencies are managed in BUILD files; think of BUILD files as makefiles that live alongside the code. I'll touch on BUILD files shortly.

First, some history. Why is it called Pants? Why such a weird name? It started out as a Python wrapper around Ant, the famous Java build tool.
It used to generate build.xml files that were then consumed by Ant. Python plus Ant gives you Pants; it's just a weird name somebody came up with. It was later completely rewritten to be an independent tool with support for JVM languages and Python, and other languages are supported through plugins.

Let's go over some basic Pants concepts. First, you define a source tree. The source tree is basically your monorepo: inside it you have a source folder, src, and inside that a folder per language, say python, scala, java. Each of the leaf nodes of the source tree is a target, and the targets are what you build. BUILD files define the targets at each leaf node. I'll talk more about BUILD files later, so if you don't follow this part yet, that's fine. BUILD files are written in a specific DSL that looks very similar to Python. It's very simple: it has some basic functions that essentially invoke Python constructors in the Pants project.

What do we mean by targets? When you have multiple services, what is a service? A service is a process, and a process starts from a binary. So a target can either be a binary, or a library that can be referenced by other targets.

For Python, I mentioned PEX, so let's talk about what a PEX is. PEX stands for Python executable. It's similar in idea to a virtual environment: a PEX file is a specially crafted zip file with a Python interpreter directive at the top, so it can run anywhere Python can run. It's one single file containing all your dependencies, just like a virtual environment, except instead of a folder you have one compressed, executable file. It's basically a compressed version of your virtual environment.
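The "zip file with a Python directive" idea can be demonstrated with nothing but the standard library. This is a minimal sketch using Python's `zipapp` module, not Pants or PEX themselves: it builds a zip archive, prepends a `#!` interpreter line, and runs the resulting single file.

```python
# Sketch of the PEX concept using only the stdlib zipapp module:
# a zip archive with an interpreter directive prepended, so one
# file runs anywhere a Python interpreter exists.
import pathlib
import subprocess
import sys
import tempfile
import zipapp

with tempfile.TemporaryDirectory() as tmp:
    src = pathlib.Path(tmp) / "app"
    src.mkdir()
    # __main__.py is the entry point the directive will invoke.
    (src / "__main__.py").write_text("print('hello from a self-contained zip')")

    archive = pathlib.Path(tmp) / "app.pyz"
    zipapp.create_archive(src, archive, interpreter=sys.executable)

    # The file starts with a shebang line, then ordinary zip data.
    first_line = archive.read_bytes().split(b"\n", 1)[0]
    print(first_line.startswith(b"#!"))  # True

    # Running the single file executes __main__.py inside the archive.
    result = subprocess.run([sys.executable, str(archive)],
                            capture_output=True, text=True)
    print(result.stdout.strip())  # hello from a self-contained zip
```

A real PEX additionally bundles third-party dependencies and a bootstrap runtime inside the archive, but the executable-zip mechanism is the same.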
It's an immutable artifact, so you can put it in your Docker container and run it anywhere Python runs. This follows the same ideology as Docker: you create immutable containers and destroy them whenever you want to change code. Here we're creating immutable artifacts in Python that will run on any server that can run Python. As I said, it's a zip file with a Python directive, and you can also use it to run targets locally without maintaining any complex virtual environments.

Pants also supports pinning the versions of third-party dependencies, for example Flask or Requests. All of the projects in the repository can use one standard version of Flask and one standard version of Requests, so you don't run into surprises because of version mismatches.

Let's look at what a BUILD file is. As I mentioned, it's a Pants DSL, but it's essentially a function call in the background. Here's how you define a Python library. Say my folder contains src/python/sharedlib/lib.py. In the BUILD file, I declare the dependencies (say, fabric), list lib.py as the source, and give the target the name sharedlib, which can be referenced by other projects or targets. Now say I want to build a wrapper on top of this library, a command-line interface to it. All I have to write is a python_binary named cli, and in its dependencies you can see that it references sharedlib. You can define complex dependency graphs, and Pants figures out the dependency paths and builds the targets in order. That's the BUILD file example.

Now let's go over some real examples. The most popular public example is the Twitter Commons repository.
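The two targets just described might be sketched in BUILD files roughly like this. This is a sketch in the v1-era Pants syntax; the target names, paths, and the fabric dependency come from the talk's example, and the exact DSL varies between Pants versions.

```python
# src/python/sharedlib/BUILD -- a reusable library target
python_library(
    name='sharedlib',
    sources=['lib.py'],
    dependencies=[
        '3rdparty/python:fabric',  # third-party dep, pinned in one place
    ],
)

# src/python/cli/BUILD -- a binary target wrapping the library
python_binary(
    name='cli',
    source='main.py',
    dependencies=[
        'src/python/sharedlib:sharedlib',  # first-party dep by target address
    ],
)
```

Because dependencies are declared as target addresses rather than versions, changing lib.py immediately affects every target that lists sharedlib, with no artifact publishing step in between.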
The Twitter Commons repository contains basically all the common boilerplate code used inside Twitter, open-sourced: things like ZooKeeper shared libraries. We'll go over it in a bit. First, let me give you a quick demo of how a simple Python Flask application can be built using Pants. Can everybody see this? Is it big enough? Yeah.

I have a repository here. Ignore the dist folder, because that's where the built PEX files are stored; it's actually in .gitignore. As I mentioned, there's a source root: src/python. The goal here is a small app that serves Hello World, plus a shared library that translates Hello World into different languages.

First, let's look at helloworld.py. I'm pretty sure a lot of people here have used Flask; this is very simple. The default route, /, just returns Hello World. There's another route, /<lang>, where you can pass any two-character language code, like de for German or zh for Chinese, and it translates the greeting. You'll see that I use "from translator import translator".

Let's look at the translator. It's very simple: it maintains a counter of how many times you've translated, and it uses a third-party library called TextBlob to do the translation. It returns the counter and the translated message that was passed to it.

Now let's look at the BUILD file for this translator. As we saw earlier, this is the syntax for a shared library: python_library, named translator, with the third-party dependency textblob. In the third-party folder I've listed all my dependencies. If you notice, that's a requirements.txt, very similar to what you're used to doing with pip.
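As a hypothetical sketch of the translator library described in the demo: the real code delegates to the third-party TextBlob library, but here a tiny lookup table stands in so the sketch is self-contained, and the class and method names are assumptions.

```python
# Hypothetical sketch of the demo's translator library. The real code
# uses TextBlob; a lookup table replaces it here for self-containment.
class Translator:
    # Two-character language codes, as in the demo (de, zh, ...).
    GREETINGS = {"de": "Hallo Welt", "zh": "你好，世界"}

    def __init__(self):
        self.counter = 0  # how many translations have been served

    def translate(self, text, lang):
        self.counter += 1
        # Fall back to the original text for unknown codes.
        return self.counter, self.GREETINGS.get(lang, text)


t = Translator()
print(t.translate("Hello World", "de"))  # (1, 'Hallo Welt')
print(t.translate("Hello World", "mo"))  # (2, 'Hello World')
```

The point of keeping this in its own library target is that the Flask app depends on it by target address, and any other service in the repository can do the same.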
In there I've listed the two dependencies, flask and textblob. I can also restrict them to specific versions, so all my Python projects use exactly that version of Flask or TextBlob. The target address is src/python/3rdparty:textblob, and globs() basically finds all the *.py files inside the directory and concatenates them into a list.

Now I want to use this translator library in my hello world app. In its BUILD file you'll see that it depends on third-party flask and also on the translator target that lives inside this repository, with source helloworld.py.

Now let's go back to the command line and run Pants. Pants ships as a bash script; if you open the pants file, you'll see it bootstraps Pants via bash, and you can read the documentation for the details of how it's invoked. I specify a goal, binary, and then the path: src/python/helloworld, and the helloworld target. This generates the PEX for me inside the dist folder. Now I run the PEX file and go back to the browser at localhost:5000. That's Hello World at the root. Now let's translate: zh for Chinese; mo, which I don't know as a language, falls back; de for German. So it's a simple app running as a single binary.

Now I can put this binary inside my Docker container, and my Dockerfile will be very simple: FROM ubuntu, add helloworld.pex, run helloworld.pex. That's all.
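The Dockerfile dictated here is only three lines; as a sketch, with the base image and PEX filename as stated in the talk, and the CMD form an assumption (the talk says "run", but starting the app at container launch is CMD rather than a build-time RUN):

```dockerfile
# Minimal sketch: ship the prebuilt, immutable PEX and nothing else.
FROM ubuntu
ADD helloworld.pex /
CMD ["/helloworld.pex"]
```

The image stays small because all Python dependencies are already inside the PEX; only the system-level Python interpreter and any native libraries must come from the base image.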
So my Dockerfile is very simple; it just ships helloworld.pex. Now let's see what the PEX file actually is. I can unzip it, since, as I said, it's a zip file, and it extracts right here. The PEX is basically just a zip with a directive at the start that says: invoke Python on the main .py entry point, which then eventually calls your code. If you look, all the dependencies are packed together in the .deps folder: TextBlob is here, and NLTK, which is a dependency of TextBlob. So all the dependencies are packed together in one file, in one executable format. This is very useful when you're deploying services using Docker containers.

I've built a sample Docker container. This Dockerfile is just an example where I actually build the PEX inside the docker build command; ideally, your CI environment should build the PEX and you should just put it inside the container. Here it's very simple: install some dependencies, run the Pants binary goal for helloworld, and run helloworld.pex.

So that's one very simple example. I'll show you one more. This is Viki's actual production code, the operations repository where all the infrastructure-related code lives. Let me show you src/python; I've taken some parts out of here. We have the deploy tooling, which is written with Fabric; haproxy, which is a set of helper tools we've written around HAProxy; and opsmaster, which is a control application, a Flask application that uses all of these libraries. The Postgres tooling uses these libraries too, and the SignalFx tooling uses the HAProxy library. So we have a lot of code sharing here, and it's facilitated by Pants. Each of these could have been individual
repositories, but I think that would have been very difficult for us to manage. There are more projects inside this repository that I've left out.

Now let's also look at Twitter Commons. This is another monorepo-style project that you can reference if you're trying to build applications using Pants. You can see that the source tree has support for multiple languages; Twitter mainly uses Python and Java, so there's Java code it can build, and Python code as well. The benefit of having checkstyle inside the repository where all the code lives is immense: you can enforce standards across all the code that developers commit. Under common, there are libraries that all the other projects use, for example an API around Confluence and some decorators. And this is probably 10% of the size of the actual repository inside Twitter; there are hundreds more projects that use all of these common libraries.

I think that's all for my talk. I hope you have a good idea about Pants now, and I'll be happy to take some questions.

On the question of system dependencies: no, it does not package them; system dependencies need to exist on the system. Just like a virtual environment, it only packages the Python dependencies. So the container needs to be set up: for Postgres, I think the postgresql common dev package needs to be installed so that those header files are available. PEX does not fulfill the complete role of Docker; it's not a drop-in replacement. It does not package the full system. It packages the dependencies of the languages it supports, but if those have native dependencies, the system needs to provide the native libraries.

What server?
A WSGI server? I'm not sure. WSGI serving would come as a separate process, so I would say that has to be handled separately, outside of PEX, unless the server itself is native Python, as mentioned earlier. Thank you, everyone.