So welcome everyone, this talk is about Bazel, and here we are with Klaus Heling; he has been a software engineer at Google since 2011, and since last year he has been working on Bazel. Please welcome Klaus.

So thank you very much, and thank you for the opportunity to present Bazel, the open source build system I am involved in. The purpose of this talk is to give you an idea of the ideas behind Bazel and a bit of its look and feel. So what is Bazel? Short answer: it's a build tool. That is, like Make and many other tools, it organizes how we get derived files from source files, so typically compiled stuff. It is a tool that has been used at Google for over a decade, or rather it is the core of that tool; Google has some specific extensions compiled in. And given that it's been in use at Google for so long, you can already derive that it's mainly focused on the use case we have at Google: whenever there is a design decision to be made, it is made in favor of that use case, which is that you have a large single code base. The majority of code at Google is in one big repository, that repository is under active development, and basically everything is built from head all the time. That is usually called a large monorepo, or a monorepo that happens to be large: at Google it has tens of millions of files and tens of thousands of engineers working on it. And it has been open source since 2015, so it has only recently been open sourced.

Okay, so why yet another build system? As mentioned already, it is optimized for large repositories. What Bazel tries to do is to be fast, not only by running things in parallel but also by aggressively caching, but to do so without losing correctness. That is the main focus of Bazel: to build correctly, in the sense that if you now clean up everything and compile again, you get the same result that Bazel gives you with caching enabled. And "the same" ideally means not only a functionally equivalent program, but really byte-for-byte the same thing, which actually is a bit of a challenge, with timestamps entering the compiled output all the time. Okay, and the other thing that follows from the fact that it is meant for a large code base is that you go for a declarative style. You separate the concern of writing the application from the concern of how you build it: you say "this is a C program", and there is specialized knowledge elsewhere about building that. For a company like Google that is necessary, because you have engineers specialized in building an application and others who know best how to optimize for the architecture we run things on. And it is also generally useful to have a central point where you maintain the rules for how you build your files.

Okay, so how does it work? We will go through an example later, but the general overview is: you give it a target, "I want to build that thing"; then it reads all the build files, or actually all the build files it needs. There is nothing like recursive invocation; you just read and construct the whole graph of all the dependencies you have, recursively, but all in the same process. From that it decides which actions need to be done, that is, which invocations of the compiler and so on, and then executes them. Except if the same command has already been done with the same inputs, because then you don't have to do it twice. Okay, so in general: a generic build tool. And now let me give you a bit of a feeling of how the whole thing looks.
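Just to fix the notation first, since targets are named by labels throughout: asking Bazel to build something looks like the line below, where //main:hello is a made-up example label naming the target "hello" in the package "main".

    bazel build //main:hello

bazel test works the same way for test targets.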
Let's go through an example. Oh, before that, one of the main design features. The repository is big, and we said we read all the build files; that can be a lot of them. And the assumption, which is definitely true for Google, is that you work in the same code base for an extended amount of time: you compile, run tests, change code, run tests again. So it is worth optimizing for subsequent invocations of the build, not only for the first one, and therefore not computing the whole dependency graph over and over again. Bazel has a client-server architecture: once you start Bazel for the first time in a working directory on a code base, it will start a server in the background that constructs the dependency graph and keeps it in memory. And once you ask for another build or another test in the same working directory, in the same checkout basically, the same workspace, then the information will already be in memory and is only updated with the changes made to the files since you last invoked Bazel. That is also how we can handle the whole dependency graph in memory and read everything, without losing too much time on recomputing the build graph over and over again.

Okay, but as I said, the declarative style: how does it look? Let's look at a simple example. We want to write Hello World, which is a simple program, except that, to demonstrate a point, I chose to make a call to a library. So this program depends on a library, as is usual with bigger pieces of software as well. In this case it is a simple C library, so you have a header file and you have an implementation. That could be a typical part of a source tree, but it already shows all the relevant parts in that simple case. Now we want to tell Bazel how to build something useful from that. The first thing is that we have a workspace file, which in this simple example, and actually in many examples, is just an empty file. It serves two purposes. On the one hand, it marks the border of the code base, the reference point that all absolute paths, or absolute names of targets to be built, refer to. And it also allows you to specify external dependencies that are built in, that is, external repositories. As I said, Bazel wants to construct the whole graph, so it also needs to build external repositories that are part of the build. But for the main use case, where you have everything in one big repository, you won't have external repositories, so very often it is just an empty file, simply to anchor the namespace.

And then you have the actual build files. In this case, starting from the top: next to the C program you have a build file saying, look, I have a binary, it is written in C, it has a name, this is the source file, and by the way, it depends on that library. And for the library it is similar: this is a library, it has a name, all the C files in that directory are its source files, and all the header files belong to that library. The important thing to notice is what is not in that build file, and that is all the details about what my C toolchain is, whether I do a cross-compilation or not, what my source and target architectures are. All of that is specified elsewhere; or, for C, there is also some knowledge built in. The whole point is that you don't specify, say, the location of the compiler here, but at a separate place. Okay, so now let's try to build hello world and see what happens. The first thing is we want to build a target, so we need to know what the target is.
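A minimal sketch of what those files could look like (the names hello and hellolib are invented for illustration; the slides may use different ones):

    # WORKSPACE: empty, it just marks the border of the code base

    # BUILD, at the top level next to hello.c
    cc_binary(
        name = "hello",
        srcs = ["hello.c"],
        deps = ["//lib:hellolib"],
    )

    # lib/BUILD, next to the library sources
    cc_library(
        name = "hellolib",
        srcs = glob(["*.c"]),
        hdrs = glob(["*.h"]),
        visibility = ["//visibility:public"],
    )

Note what is absent: nothing in here names a compiler or an architecture, and bazel build //:hello is all it takes to build the binary.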
From the name you see it is in the top-level package, so we need to know about that package; we find the build file and read it. Once we have read that build file we see: okay, it is a C binary, and we discover the dependencies; there are two declared dependencies, one is the source file and the other is the library. Okay, now we need to build the library, and in doing that (again, the important thing is what is not shown on the slide) Bazel also implicitly records which toolchain it is building things for, so that if we later change it, it knows that all the binaries have to be rebuilt. Okay, the library is also straightforward: it is found in a package, so you read the build file in that directory, and, as already mentioned, the whole dependency graph is built at once and kept in memory in the server. Then you see glob expressions, which means you have to really read the contents of the directory, and you discover the files you need. Now that you have discovered all the things you need for your build, you can evaluate the rules and know which actions you have to perform, that is, which compilation steps. In this case: compile the C file to an object file, build the library, compile the other C file, and then link everything together. So this is the part of the graph that is actually productive; this is how we get derived files from source files; these are the actions.

The important thing is that there are also all the other dependencies that we read during the planning of the execution, and we need to keep track of them in order to discover if something changes; that is what Bazel is careful about. It really records all the dependencies, not only those that are source files. For example, if you add a file to that directory, just simply adding the file, then of course nothing has changed for the targets we have built, but we discover that the contents of the directory have changed. And by "discover" I mean that, where the operating system makes it possible, Bazel tries to be notified by the operating system about changes, but it can always fall back to checking everything again. So we definitely try to avoid reading all the files again and instead actively find out what has changed. In any case, the contents of the directory have changed, which means that whole part of the graph gets invalidated; that is not very visible there, sorry, but basically the directory has changed, so everything that is reachable from it needs to be redone. So, since we discovered that a large part of the graph is invalid, we walk through the graph again, and once we hit the library rule we see that, since the contents of the directory have changed, we need to set up the action graph differently; we add the missing part of the action graph, and then we know what needs to be done again. So, again, the point I am trying to make is that by really recording everything that went into the build, you can detect changes; we are not missing updates, so we can rely on it being correct, and we don't have to clean everything and start all over again.

Okay, on to actions. As I already mentioned, these are the productive part. These are the invocations of compilers and linkers and so on; they actually do something, they generate artifacts, and they tend to take the biggest part of the build time. That is why it is particularly interesting to avoid redoing them.
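As an aside, you can inspect that recorded dependency graph yourself with the query command; for the sketched example above, something like

    bazel query 'deps(//:hello)'

lists the binary, the library, their source files, and also the implicit toolchain dependencies (the exact output depends on the Bazel version).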
So we have seen the dependency graph, and it shows when something has changed and needs to be redone. But there is also caching of the actions themselves, in the sense that if the inputs haven't changed, we don't need to redo the action; the output will be the same, so we don't do it. Of course, for that to work correctly, Bazel needs to know all the inputs and all the outputs of an action. Bazel needs full knowledge there, because if an action reads more files than are declared, you might accidentally fail to redo an action that needs to be redone. Conceptually, that means there is nothing like a stamp target where you touch an empty file saying "yes, I have done all the prerequisite tasks"; you actually declare all the inputs and outputs that go into a target, and all your actions are supposed to only read the inputs that are declared. Now, you might say that sounds like a huge burden on the person writing the build file. Well, there are some tools that help. Bazel has a built-in concept of a sandbox, to make it easier to write correct build files which only access the inputs they declare. A sandbox is basically an isolated environment where only the declared inputs and the declared tools of an action are present, and where only the declared outputs are moved out. Depending on the operating system, that can be implemented in different ways: with a changed root, or there is also the implementation where you just link into a temporary directory everything you need and nothing else. So it is not a security feature; it is just a tool to help you detect incorrect declarations in a build file, and for that purpose it works really reliably.

And the advantage of having full knowledge of all your input and output files is that you can then send an action to be executed at a remote place. So you don't have to build everything on your own workstation; you can use a build cluster, if you happen to have one. And using a build cluster with fully declared inputs and outputs is really powerful in a situation like Google's, where basically everyone works on the same code base, because it allows having a cache that is shared between different developers. So you are not only not compiling a file if you already compiled it, but also if someone else working on the same code base did, because then the shared remote execution will tell you: ah, I know the answer, here it is. And as I said, if a lot of people work on the same code base, that can save a lot of time and be very efficient.

Okay, that is the general idea of how Bazel works, and so far I have shown a C program. What about other languages? First of all, Bazel has specialized knowledge about a lot of programming languages, including C, C++, Java, Python, and a bunch of others. It also has generic rules, in particular a rule called genrule, which is there to, well, generically generate artifacts. It basically is a shell command: you say, execute this shell command, with these inputs and these outputs, and to help you specify the shell command you have variables that might look familiar, $@, $<, and so on. So that rule should look pretty familiar; it is basically the only kind of rule you have in make files (a small sketch follows below). But at least it means Bazel is generic enough to build arbitrary things, because in Unix everything is a shell command, so you can build it. Nevertheless, as I mentioned, the idea is to have a central declarative place with the knowledge of how to build things.
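Here is the promised genrule sketch (the file names are invented); it generates a header file from a plain text file:

    genrule(
        name = "version_header",
        srcs = ["version.txt"],
        outs = ["version.h"],
        # $< is the single input, $@ the single output; $$ escapes $ for the shell
        cmd = "echo \"#define VERSION $$(cat $<)\" > $@",
    )

Other rules can then list the generated version.h among their sources like any other file.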
So yes, for every language you want a central place where you have the knowledge of how to build things. And it doesn't scale to add all that information to Bazel itself, because there are more and more programming languages coming up all the time. I mean, it worked as long as it was a tool just for one company, where, for all the languages you had, you could add that knowledge and control the tool yourself; but in general it is not a good approach to put specialized knowledge of each and every language in the world into a single tool. So there is a need for a way to extend the build language, and there actually is one: it is called Skylark. It is an extension language. It has a syntax that looks quite similar to Python, and the semantics are also quite similar; but it is basically Python restricted to a core where you don't have far-reaching side effects, so that you can evaluate things locally, don't depend on global state, and have the declaration of what you want to build in a nicely isolated way. And of course it is deterministic, because you don't want surprises in what gets built.

So in the simple case (and, as I hinted with the genrule, that is quite a generic case already) you can code up the knowledge of how to build a language that isn't built into Bazel, and describe how to compile that language by means of already existing rules; it could be genrules all over the place. The typical example: yes, to build that language you just have to run that shell command, and, by the way, from that one source you can derive several targets. A typical example is documentation: you have your, whatever, RST file, and from it you can generate a man page or a web page and so on, and you want to have that in one rule, and you have the commands for it. So that would be a case of saying: yes, here is a genrule, I generate these kinds of things from that source file, and whenever I write that declarative statement, please add all the rules to build those things. And here is an example of how such a thing looks, and, as promised, the language really looks like Python, including the fact that you can pass parameters by name and not only positionally, which is very useful; you can have default values for parameters; you can do some simple computations; and then you can map it to already existing rules, like the genrule, which is a native rule. And in the typical case you not only do some computations on the parameters to set up your command and set up the environment correctly; you can also do that conditionally, depending on the parameters, or map to multiple native rules, so that with one declaration you declare a bunch of targets that semantically belong together.
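To make the documentation example concrete, here is a sketch of such an extension; the file name docs.bzl, the macro name rst_docs, and the rst2man/rst2html commands are all invented for illustration and assume those converters are available on the build machine:

    # docs.bzl
    def rst_docs(name, src):
        # One declaration expands to two genrules: a man page and an HTML page.
        native.genrule(
            name = name + "_man",
            srcs = [src],
            outs = [name + ".1"],
            cmd = "rst2man $< > $@",
        )
        native.genrule(
            name = name + "_html",
            srcs = [src],
            outs = [name + ".html"],
            cmd = "rst2html $< > $@",
        )

A build file then uses it with a single declaration, loaded as described next:

    load("//tools:docs.bzl", "rst_docs")

    rst_docs(
        name = "manual",
        src = "manual.rst",
    )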
You write such an extension in a separate file with the ending .bzl, and then in your build file, in every build file that needs that extension, you load it, by saying which file the extension is described in and specifying the symbols you want to import from that file. That way, by reading a build file you actually know what is in your namespace and don't suddenly, accidentally, use declarations you are not aware of. I don't have an example for this, but Skylark also has full access to the action interface, so in the more complicated cases you can really not only refer to native rules but specify actions in detail, including all the things that native rules can do, like checking parameters for types, checking that parameters are present, and these kinds of things.

I won't go into more details of Skylark; instead I will answer a question that comes up occasionally: why does the whole process of open sourcing take so long? I said it was open sourced in 2015, and in a sense the process is still going on: there are still some parts which are not open source, and there is some functionality which we intend to open source but which is not there yet. The thing to remember is that it only became an open source project after many years of internal usage on a single repository; a large one, but a single one. And in fact, as I said, it is basically not a fork of the internal tool, it is the same tool; Google just has some extensions compiled in. That also means that everything has to work all the time, also for the non-published use cases: Google can't afford to have its engineers unable to build software for an extended period of time. But from that history of years of purely internal use, a lot of properties of the code base arose which made it hard to open source, or which still block open sourcing to some extent. One is that if you have one huge code base with a lot of useful libraries, they tend to get used; so there are a lot of dependencies, including libraries which solve problems that open source libraries would also solve, but which are Google-specific internal technology and so on. And of course, if you want to make a tool open source, you need all its dependencies open source, so cutting dependencies or moving to open source libraries is a big task; internally you have all those libraries, it is just easy to use them, so you do. And historically, as there was only this one code base, there was a big focus on the languages used there; the whole extension interface, Skylark, is something that came only very late in the process, and there is still a bit of a focus on the built-in languages, while we are trying to make it a generic tool. Oh yes, and if you have only the one code base where things are supposed to work, and you build everything from it, then you have the advantage that for all interfaces you know all the call sites, so you can change them easily; which is not that nice for an open source project, if things change all the time. And of course, if it wasn't originally intended as an open source project, you find hardcoded paths everywhere, "I just know where my compiler is", and a lot of small details like that; each of them is easy to fix, but in sum that is why the process of open sourcing is still in parts going on, even though what is open source is a useful tool in its own right. And that brings me to the roadmap, to where Bazel is trying to move. The big goal is
Bazel 1.0, and, as with any roadmap, we can't really say when; next year, hopefully. What does it mean? The first thing is that we want the public repository to be the primary one. At the moment the Google-internal repository is the primary one, and at least once a day it gets exported to the public one. That sounds like a mere technical detail, where you commit first and then export, but it has a lot of consequences. In particular, since we want to first push to the public repository, the interfaces need to be well defined; you can't, as at the moment, just run all the internal tests that you care about and then see whether something fails. That is a big commitment: you need to get the interfaces well documented and well tested, because Google is still using the tool, and we have to balance between not leaving the engineers without a working build tool for a long time and being a proper open source project. We also want all the design reviews in public; we are working towards that, but there are still some use cases which are Google-only, so sometimes things come up internally, but more and more is going in the direction of having generic interfaces designed in the open. Oh yes, and at the moment it happens that the whole core team are people employed at Google, which, for a true open source project, we hope to change; we hope to extend that base. And, as I said, the main prerequisite for all of that is to get things stable and well documented.

And while that is the big goal we are aiming for, there are a lot of improvements we hope to make on the way there. For example, we would like to improve the remote execution API. I said remote execution is very powerful, especially if you can share caching between different persons working on the same code base. Bazel does have an API and a prototype implementation, but we hope to make that a more standard API that is well documented and used by a lot of people, and get it out of this prototypical state. And, as I said, for it to become a community project and to be generically useful, a really big aim is to have more repositories of specialized build rules for languages which aren't that important within Google, and there already is a community contributing; we hope that this increases and that there will be a good collection of rules, so that Bazel can become more of a language-agnostic tool that knows how to organize compiling, how to do caching correctly, how to do dependency tracking, with the specialized knowledge about individual programming languages in a separate place. And the other thing on which there is currently work going on is improving the story of remote repositories.
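For reference, declaring such a remote repository in the WORKSPACE file looks roughly like this (the name, URL, and hash are placeholders, and depending on the Bazel version http_archive may first need to be loaded from @bazel_tools):

    http_archive(
        name = "somelib",  # made-up name
        url = "https://example.com/somelib-1.0.tar.gz",  # placeholder URL
        sha256 = "...",  # the expected hash of the archive goes here
    )

Targets inside it can then be referenced as @somelib//some/package:target.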
As I said, that is something which only became an issue with open sourcing, as within Google there is only that one repository. It is working: you can specify remote repositories as a dependency in your workspace file, they get fetched, they get compiled. But there are a lot of things to improve. In particular, recognizing that if you have fetched something and the declared hash value hasn't changed, it doesn't matter that you are now fetching it from a different place; you can still cache it, or even say: I know these are the artifacts coming out of that remote repository, I don't even need to recompile it, the repository hasn't changed. Then, what about recursive remote dependencies, if some remote repository depends on another remote repository, and what if two of those point to the same thing? So it is working in principle, but there are a lot more improvements to be made, and we are working on them, and on a lot of small details. Also something which I didn't mention on the slides, but which is also very big, is other platforms: making sure Bazel runs well on OS X and on Windows; there is a lot of work going on there.

So, to sum up, what are the approaches of Bazel? It has a declarative approach: you say what you want to build, and there is a separate place with the knowledge of how to build it. It really tracks all the dependencies, including tools, including things that are implicitly read by a tool, and that is what ensures correctness; and you get some support, the sandboxes, to help you really know what you are reading. That full knowledge enables fast builds, because you can cache more aggressively, and you can execute remotely very easily, because you know what is needed, and can share that between different people working on the same code base. And it is open source, and there is a commitment to that: the plan is to get a fully useful open source tool (it is already useful) and to make it more of a community project, not a project dominated by one company. And you are invited to try Bazel yourself: there is a home page, the whole code is on GitHub, there are two mailing lists, one for people who just want to use the tool and a separate one for discussions on how to extend Bazel further and how to develop it; we have an IRC channel; and all release artifacts are signed. So that is the end of my presentation. We also have a stand next door where you can meet the Bazel developers, today and tomorrow the whole day. Thank you.

Thank you. Thank you for the presentation. Are there any plans on making Bazel more modular? Right now it is one big binary which contains a lot of dependencies, a hundred-plus megabytes. I know it could be necessary to make it reproducible, but...

So, I am not sure what you mean by more modular. The code itself is organized in a modular fashion; there is also an internal interface where you can add more functionality; and another thing that goes in the direction of modularity is moving knowledge out of the tool into rule files describing it. But it probably will always be kind of a biggish binary, and we certainly won't get rid of the dependency on the JVM, as it is written in Java. Or what precisely do you mean by more modularity?

I mean more modular in the sense that you can remove some of the dependencies that are in the binary.

I know the reason why the binary is so large. There are some plans in that direction, but I don't think they are the highest priority at the moment. So there are plans to make it a more generic tool and not bundle everything, but
on the other hand, it is very useful if you have one binary that just works, which also makes deployment easy, and that is a big thing. But there are plans in that direction. I think there are other questions.

Thank you for your talk. One question: since this tool comes with such a history, why did you decide to open source it now and not, like, five years ago; was there a particular reason? And another one: will we see Android and Chrome using this tool?

Okay, let's start with the answer to the second question. Well, you know that far better than I do: we are talking to the Android teams and the Chrome teams. If you look at the Android source code, there is actually some support there now, but not all of the SDK currently builds with Bazel; we hope that we will get there, we will see. For Chrome, they have built a lot of their own infrastructure; they use remote execution extensively, and our remote execution API is not stable yet, so it will take a while before we are at a point where they could switch over, but we certainly hope that that will happen eventually. And the question why open source now and not earlier: you probably know that better than I do, because I only joined last year, and I am happy that I could join an open source project, because I like working on open source and having contact with the community. So, why now? It simply seemed to be the right time.

How does it identify the toolchain it is using, and how does it detect changes in the toolchain, for C for example?

Sorry, one moment. Okay, so we are still running the question and answer session; if you could keep it more quiet, we would appreciate that, and if you want to leave, please leave quietly, from both sides. Also, please repeat the questions for the other people, because the other microphone is not working.

Okay, the question was how Bazel identifies changes in the toolchain. To be fair, it wouldn't notice if you changed the contents of a compiler that is external; but the toolchain is declared. I mean, there is a declared toolchain even if you don't declare it explicitly; internally Bazel records which compiler it used (you know this better, correct me if needed): there is always a declared set of tools that are used to build a target, and that information is kept in the graph. If the declared compiler changes, then all the actions get invalidated and have to be redone with the new toolchain. So, as far as I know, it doesn't track the contents of external tools; if you just sneak in a new compiler, that wouldn't be noticed, but if you declare that you want to build with a different toolchain, then that is noticed.

Okay, thanks. So you manually say, I am going to use GCC 5.4 for example, and if you change that, then it recompiles everything? Yeah.

So, I don't know how it works: how does it work together with external dependencies and build tools; like, for Java, or we have got npm and all these things?

Okay, so the question was how it works together with external dependencies, in particular specialized ones for languages. What you can declare is: this is an external source tree, an external repository, and, by various means, download it at a git hash, or download it as a file and check that it has the declared hash value; and you can also drop in a build file. But it doesn't have knowledge about the various package formats. It has some generic knowledge, say about Java packages or zip files and so on, but beyond that you have to describe how you build it. You can drop into a given source tree your
own build file, which then has the specialized information, in case that source tree doesn't bring a build file of its own.

Okay, two quick questions. The first one: it is implemented in Java, so what do you think about a project where you need to compile on a platform where Java either is not available, or is too big a dependency and doesn't run properly on the machine?

So the question was what to do if Java is not available. Bazel itself is written in Java, so on the host platform you need to be able to run Java, otherwise you can't run Bazel at all. But if you have a cross toolchain, you can cross-compile for a target platform that doesn't need to run Java; unless the program you are trying to compile is written in Java, obviously. Second question?

Okay, so: does committing code into Bazel require signing a CLA?

So, does contributing to the code require signing a CLA: yes, contributing to Bazel requires a CLA. That was the question, and the answer is yes, it does. But the code itself is under the Apache license, so if you say, I don't want to assign this to Google, you have the ability to do that; we would just prefer to have it as contributions to the project.

One question here, down here. Hello. How do you implement the sandboxing; is it a separate process that runs for each action, and some form of IPC between the actions?

Good question. Okay, so for the sandbox we use a user namespace, and then we create a mount namespace and a network namespace, and we set those up to show just the files that you have declared; it effectively runs as your user, and the writes are mapped out in some way. On macOS we use the existing macOS sandboxing mechanism, which you probably know about. On Windows we don't currently have a sandbox, but we hope to add one at some point. We have a question at the back.

Thanks for the talk. At the start of your talk you mentioned that at Google there is a source tree with on the order of 10^7 source files, which is quite a lot. That also means that for more complex applications the dependency graph itself might already not fit in memory. How does Bazel deal with that; does it do some graph partitioning, or does it only partially evaluate the graph?

So the question was what it does if the dependency graph itself doesn't fit in memory. It assumes it fits in memory, and it only looks at the part of the dependency graph that you need for your target. So you don't look at the whole repository, only at your target, and then recursively discover what other files you need. But the approach is to hope that it will always fit in memory, and to have big enough memory.

Is it possible, after all the building is done, to use Bazel to generate something like Docker images, ready to deploy, with all our binaries and dependencies?

So the question was whether Bazel can package the artifacts created, into some form of container or packages, like Docker images or RPMs or debs. The answer is: Bazel has knowledge about certain package formats, as far as I know definitely including Docker images, RPMs and tarballs; I am not sure what else already exists. But the plan is, if more packaging formats are needed, to rather have that as an extension instead of compiling more specialized knowledge into the binary. And, I mean, whenever you know how to build a binary, you can also use a
genrule to generate a new artifact which is the package; that is just an artifact like any other artifact.

You said one of your goals was support for Windows, but you also said you can use bash in your extensions; how does that fit with your goal of supporting Windows?

Okay, so the question was Bazel on Windows, and how that fits with having rules that explicitly call bash. I mean, of course you always need the toolchain to be present on the machine you build on. There is bash for Windows, and if you want to write a rule with bash as your build language, as your compile language, then you had better install bash. But there are rules that are specific to your language, like compiling C code directly with a native toolchain. If you need a tool, you have to have it on the machine where you are building; there is no magic way around that.

There is a question over there; we are going to have one more question after that.

Thank you for your talk. What is the status of IDE integration, like IDEA or Eclipse or that set of IDEs?

You know that better, because I only use plain Emacs and command line tools, and that works. So: we have an experimental plugin for Eclipse; we have a supported plugin for IntelliJ, which also works with Android Studio; we have a plugin for Xcode, which is also supported; and we are looking at providing plugins for Visual Studio as well, but currently our focus is to make things work really well on Windows first, before we start integrating with the IDEs. We don't have any plans for NetBeans right now, but feel free to contribute.

There are no more questions, so it is time to thank everyone.