Thank you, Georgius, for inviting me. My name is Doug Krieger. I'm here from GitHub. I'm not going to talk about left-pad or event-stream, unlike most of the previous talks in this room. Instead, I'm going to be talking about some code navigation work that we've been doing recently at GitHub. I'm the manager of the semantic code team, a team that, as Georgius says, does static analysis at scale on basically the entire corpus of code that's posted on GitHub. So that poses some interesting challenges that we have to solve. My team, back in November at GitHub Universe, announced the GA launch of code navigation, fuzzy code navigation, on github.com. If you've used any kind of modern editor or IDE in recent years, you've probably encountered code navigation before. This is not a fundamentally new concept. It's exciting for us because we get to launch it directly on the code view of GitHub itself. So you're looking at a particular code base on GitHub, this in particular is an example from Kubernetes, so some Go code. You're looking at it. You're trying to figure out how it works. You see a call to a method. You can click on it. It pulls up a hover card that shows you a link to the definition of that method. Clicking on the link takes you to the source code that defines the method. Once you're there, you can also click on its definition and that'll pull up a list of all of the other references to that method. So again, it's jump to definition and find all references just like you would see in an editor or an IDE, but directly in the code view on github.com, okay? So we're very excited by it. We're very happy to have gotten this launched. It was a big milestone for my team, and we were happy with the result. There are limitations in what we've shipped, so I want to talk about a few of those and what we're doing to address them moving forward.
One of the first limitations to talk about is that it's only GA launched for three programming languages right now: Python, Go, and Ruby. We do have three others that are in beta, so JavaScript, TypeScript, and PHP. There's about a 10% chance that your repository has code navigation activated for it and a 90% chance that it doesn't. We do have other languages that we're working on as well, queued up behind those, but right now it's only GA launched for these three. It is important to call out, though, that it's GA launched for all public and private repositories on GitHub that have any code in any of those three languages. So it's still a pretty substantial launch for us even though it's just a relatively small number of individual programming languages. A more jarring limitation, I would say, is what I called fuzzy code navigation when I first talked about it; we also like to call this ctags-like. And what this means is that when you click on a reference to a function, we're not doing anything clever to figure out which particular function you're calling; all we're doing is matching up the name of the symbol. So this is why we call it ctags-like. So if you have a lot of functions or methods in your repository that have the same name, we're gonna present an ambiguous result, where you click on a method call and we show you all of the possibilities based on that symbol name match. Similarly, if you're looking at references for a particular definition, we're gonna be showing you all of the references to any function with that name in your repository, not necessarily the calls to this particular function. So this is, like we said, fuzzy or ctags-like. We're not actually using ctags itself as the way of extracting this data. In particular, we are using real parsing of the languages in question, so we're not doing just regular expression-based matches.
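To make that ctags-like behavior concrete, here's a minimal sketch in Python of a name-keyed symbol index. The class and method names are mine, purely for illustration — this is not how the production service is built — but it captures why results are ambiguous: both lookups resolve by symbol name alone.

```python
from collections import defaultdict

class FuzzySymbolIndex:
    """Name-keyed symbol index: ctags-like, no type or scope resolution."""

    def __init__(self):
        self.definitions = defaultdict(list)  # name -> [(file, line), ...]
        self.references = defaultdict(list)

    def add_definition(self, name, file, line):
        self.definitions[name].append((file, line))

    def add_reference(self, name, file, line):
        self.references[name].append((file, line))

    def jump_to_definition(self, name):
        # Every definition with a matching name is a candidate,
        # which is why the result can be ambiguous.
        return self.definitions.get(name, [])

    def find_all_references(self, name):
        # Likewise: all references to *any* symbol with this name.
        return self.references.get(name, [])
```

With two functions named `parse` defined in different files, jumping from any call to `parse` returns both candidates — exactly the ambiguous result described above.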
That's interesting in that we do get a little bit more syntactic information about the symbols. We can see if you're talking about a function call or a local variable reference, and we are able to tweak the user interface slightly based on that, but we are still limited by the fact that we're just doing this textual match of the symbols. The last limitation that I wanna talk about today, and I think the one most relevant to this audience in the dependency management room, is that this is only code navigation within your repository. So you can see the links to the functions that you're calling that are defined in the same repository, but not links that cross a repository boundary. Now, I am being very careful to use the word repository here. You do get cross-dependency code navigation for those languages or projects where you're vendoring your dependencies, where you're actually placing a copy of the code into your repository with you. And in Go, that's actually relatively prevalent. So for many Go projects, you actually are getting cross-package code navigation with what's currently launched just because of that tendency to do vendoring. But most language communities don't consider that to be a best practice, and so we really do wanna think about what a cross-repository solution to this would look like if we wanna do this for real for the entirety of the corpus of code on GitHub. Okay, so we've talked about these three limitations: a limited number of languages so far, fuzzy symbol matching instead of precise, and the fact that this is just within a repository instead of across repository boundaries. And we are actively working on extending the code navigation implementation that's been launched to address all three of those limitations. And what's really interesting is that for our purposes, the main bottleneck, the limiting factor here, is really just a question of scale.
There are techniques out there that let us extract symbol information for more languages than just the three that have been launched, that let us extract precise symbol information, not just fuzzy, and that can even do that across repository boundaries, because we have the ability to parse manifest files and download all of the code and just run our implementation across all of your dependencies. The issue is one of scale. So we've only launched these three languages so far because the database that we're storing all of this data in is only large enough to hold that set of three languages, and we have to sort of increase our database size to be able to move on to the next one. I'll also come back to that. So what we're processing right now is the most recent commit on any named branch. So it's not just master, it's any named branch, but it is just the most recent one. And when you resolve a link, it stays in the same branch, in the same commit. So you're looking at a particular commit. Because we're doing within-repo right now, we can do this at the commit level. So when you follow a link, it keeps you within the current commit. We GC old commits, but on a sort of ad hoc basis. So you might still have lingering data, but there's no guarantee that it'll stick around if it's not a named branch tip. Okay? So yeah, scale is the limiting factor here. We want to figure out how to do this efficiently for the entire corpus of code. So there are three main takeaways I want to hopefully convince you of, related to addressing these limitations. The first is that local development is not the same as a hosted service. And so the way that you would implement code navigation for editing locally in your IDE on your laptop is different than how we would possibly implement code navigation for a large hosted service like GitHub. There are different constraints, and that might lead us in the direction of a different implementation.
In particular, because of the scale that we're operating at, incremental processing is a must. What this means is that as a new push is coming in from one of our users, there's a very good chance that it's only touching a small number of files in the repository and that a lot of the files are unchanged from the previous commit. We really, really, really want to be able to reuse the results from that previous commit, not have to reanalyze the contents of a file that we've already seen, not have to store additional copies of the data for the files that we've already seen. We really want incremental processing. And then lastly, and this is maybe a little bit interesting and counterintuitive, if we can figure out how to generate precise symbol mappings incrementally within a repository, that same solution will work for generating incremental precise symbol data across repository boundaries. And that one maybe is, like I said, a little bit counterintuitive. So once we see some examples later on, hopefully I can convince you of that. And it's nice because we can sort of solve this once for the in-repository case, get precise symbol mappings, and then once we start to connect in with things like the dependency graph and other ways of expanding out and figuring out your package's transitive closure, then we can sort of extend this and get cross-repo code nav as well. Cool? Okay, let's jump in. So the first takeaway: local development is not the same as a hosted service. Otherwise known as, why aren't we using LSP? So can I do a quick straw poll here? Who here is familiar with the Language Server Protocol? We're not using it for code nav as shipped on GitHub. And we often get questions about why we're not. So I'm gonna drill into some of the implementation constraints that led us in a different direction. So what is true of local development?
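Before that: the incremental-processing requirement described above can be sketched as a content-addressed cache. This is a toy model, not the real pipeline, and `analyze_file` here is a hypothetical stand-in for real per-file symbol extraction — but the key idea is accurate: results are keyed by the file's content hash, much like Git blob IDs, so files unchanged between pushes cost nothing to reprocess.

```python
import hashlib

def analyze_file(content):
    # Stand-in for real per-file symbol extraction (hypothetical).
    return {"symbols": sorted(set(content.split()))}

class IncrementalAnalyzer:
    def __init__(self):
        self.cache = {}    # content hash -> cached analysis result
        self.analyzed = 0  # count of files we actually had to process

    def analyze_commit(self, files):
        """files: dict of path -> content for one commit's tree."""
        results = {}
        for path, content in files.items():
            key = hashlib.sha1(content.encode()).hexdigest()
            if key not in self.cache:  # only reanalyze changed blobs
                self.cache[key] = analyze_file(content)
                self.analyzed += 1
            results[path] = self.cache[key]
        return results
```

A push that changes one file out of thousands then costs one file's worth of analysis, not a full reprocess of the repository.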
So what is true that makes the Language Server Protocol a very, very good implementation of code navigation for local development? And it is, by the way. I'm not at all trying to disparage LSP. So the first is that user choice is paramount, right? There's the stereotypical vi versus Emacs debate, and it's only expanded since then as we have things like VS Code and Atom and other kinds of editors, right? User choice is important. They're gonna be using whatever editor they want. We don't get to dictate what tools they're gonna use on their laptop for local development in order to bring about something like code navigation as a useful feature. And on the flip side, the programming language that the user is using is something that we have no control over. We want to provide code navigation as a useful service regardless of whether you're using Go or Python or Ruby or Rust or any of those. So we have this very interesting M-by-N problem, this matrix of capabilities where every editor needs to be able to support every single programming language in the worst case. It's also interesting that your context is sort of like a single workspace. It's almost like a tunnel vision kind of thing. There's one user, or maybe a pair or something like that if you're working with somebody, but you're looking at a single screen at a single thing at a time. And so as the editor is shifting, as the user is shifting what it is that they're looking at, the tool that is figuring out what the symbol data is and how it should be exposed needs to be shifting and changing as well. So there's this single tunnel vision focus where they're really only looking at one thing at a time. And lastly, it needs to be very, very interactive. It looks like I misspelled that. Maybe I didn't. Interactive. So for local development, as you're actively editing the code, we want your editor to be able to respond quickly as you type to the individual keystrokes.
So that means we need a very tight turnaround time. We need to be able to react quickly to the changes that you're making, and the tool that we're using to analyze the content of that code as you're editing it needs to be executing quickly enough that we can present the results to the user with very, very low latency. So all of these constraints sort of steer you in the direction of a solution. And LSP is sort of like the natural direction that you get to, the natural shape of an implementation for those constraints. And the way that it works, if you're not familiar with it, is that it provides a nice abstraction boundary between the part that knows about the programming language that you're using and the part that knows about the editor that you're using. So if any editor supports the LSP protocol, it can consume code nav data from the language tooling for any programming language that supports the Language Server Protocol, and vice versa. So it's a really, really good solution to this M-by-N problem, because that abstraction barrier turns it into an M-plus-N problem. And that's great. We love that. The other thing that's interesting about how LSP works, though, is that because of those single-context and interactivity constraints, your language server is like a long-running sidecar process. It's living right next to your editor, and the connection between the two is very, very chatty. The editor is sending messages into the language server to either give it the updates directly or to tell it that a file has changed and it should reload. But it's a very chatty back-and-forth communication pattern. So that means that the language server is able to maintain state in memory about its current view of the code that's being edited. And that's what allows it to get the very tight turnaround, the low latency that supports the interactivity.
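To make that chatty editor-to-server pattern concrete, here's roughly what one round trip looks like. The message shapes follow the Language Server Protocol specification (JSON-RPC with a `Content-Length` framing header), though the file URIs and positions here are invented for illustration.

```python
import json

# The editor asks where the symbol under the cursor is defined.
# LSP positions are zero-based line/character offsets.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "textDocument/definition",
    "params": {
        "textDocument": {"uri": "file:///project/main.go"},
        "position": {"line": 41, "character": 8},
    },
}

# The server answers with the location of the definition.
response = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {
        "uri": "file:///project/util.go",
        "range": {
            "start": {"line": 9, "character": 5},
            "end": {"line": 9, "character": 12},
        },
    },
}

# On the wire, each message is framed with a Content-Length header.
body = json.dumps(request)
framed = f"Content-Length: {len(body)}\r\n\r\n{body}"
```

Messages like this flow constantly in both directions as you type, which is why the server keeps its model of the code hot in memory beside the editor.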
Okay, what's different about a hosted service? So on the one hand, we don't have a single context. It's not one user with a single window in their editor looking at a single thing. It's millions of users looking at millions of repositories at any given time. And we don't really know in advance what it is that they're gonna wanna look at. So in some ways, we kind of just have to lean into the idea that what we're looking at here is something like a database. We just need to be able to store the symbol data for all of the repositories that our users are creating and browsing and interacting with, and make sure that it's available for them whenever they come and want to query it about any of those repositories or branches or whatever. So it needs to be immediately available, but it's also something where we don't get to have that single focus that allows us to do one thing at a time with a nice in-memory representation. We don't need the as-you-type interactivity, because you're not actually interacting with the hosted service as you're editing the code. We are instead responding to pushes as they come in. So they make some edits, they make a commit locally, and then they push it up to GitHub. We need to react to those pushes. So we don't have quite the same tight latency bounds, quite the same interactivity requirements, but we do actually still care about latency, because what we're aiming for internally, our team goal, is that once you save locally and push up, we want the data to be available on github.com before you have a chance to alt-tab over to your web browser from your Git client, whether it's command line or editor or whatever. So that's really tight. We still have to be able to process really, really quickly. And that informs that incremental processing that I was talking about earlier. Another constraint, or another feature, of a hosted service here is that everything is hidden behind an API.
So we don't actually have quite the same M-by-N problem that you do for local development. We have one web UI and we have to provide data for it, but we don't have to provide data in a way that's directly consumable by lots of different clients; the way that we do that is through an API, which again is a single thing that we just get to implement. And then the last thing is that we carry a pager. This is a hosted service. My team is responsible for making sure that it's up and running 24/7. So we really need to have a deep understanding of all of the different libraries and pieces of software that go into the implementation of our service. And that means that adopting external code like a language server, even if it's very, very capable and accurate, is adding operational and maintenance work on top of the work that my team is already doing, even if it is saving development work. So it's not necessarily a free ride. So how are we gonna do this instead? We don't have the M-by-N problem where we have to worry about supporting lots of editors, but we do still have the part of it where we have to support lots of programming languages. I'm gonna go through this part relatively quickly because I am starting to get limited for time, but we want to make it as easy as possible to add support for a new programming language to our program analysis framework. And the way that we do this is through code generation. We wanna still limit the amount of development work that it takes. So for Semantic, the open source project that we've created for all of this, the first thing that's required for adding support for a new language is to create a parser for it using the Tree-sitter parsing framework. That's another open source project that my team is responsible for. It's used for other interesting use cases like syntax highlighting on github.com and in an increasing number of editors.
So we're hoping that the act of creating a Tree-sitter parser will be something that there's enough value to that language communities will take that work on themselves. And it's not that hard, it's a parser, right? It's relatively straightforward, and we think that there's decent documentation for this. So this is an example from the Tree-sitter Go parser. This is the rule for a function call, the syntax for a function call. What's interesting is that Tree-sitter, in addition to producing the actual code that implements the parser, given that parser description, will also create a nice machine-readable description of the parse states that go into the implementation of the parser. Why is that important? The fact that it's machine-readable means that this is the entirety of the code in Semantic that allows us to use a Tree-sitter parser for Go and have the parse results available in the rest of our library. So we're using Haskell. This is using a feature of Haskell called Template Haskell. It's basically just code generation running inside the compiler instead of as a separate script as part of your build process. But this is just running some code that says: take in that machine-readable description of the Go parser, and it will automatically generate all of the Haskell data types that we will use to store the parse results. And that includes generating API documentation. So if you go look at Hackage, the Haskell package repository, you can see the linked documentation for all of this generated code, even though these data types don't exist directly in the code itself. So we still have documentation for it. Given those data types, you can write an instance of some pattern matching rules. So if you haven't programmed in Haskell, this might look kind of esoteric and hard to follow. But if you have, I think it's actually relatively readable. What it's doing is it's basically taking apart those parsed AST representations of the code.
And what this is saying is that if you have a call expression, that is a definition. No, sorry, a reference. Call expression, not a function definition. If you have a call expression, this is a reference. And so it's pulling out the parts of the call expression that define the symbol name for that reference. And there are a few different possibilities depending on whether it's a qualified name or not. But it's a relatively small number of cases that you have to think about for this particular parse node. There's a little bit of boilerplate where you have to go through and say that for all of the other nodes, there aren't any definitions or references there. So it's a little bit of boilerplate, but you're at least not having to say how to recurse down into the parse trees. So it could be worse. But once you've done that, we've extracted all of the data that we need for Go to do fuzzy code matching, okay? So what's really nice about that approach is that we can produce the data incrementally. I called out earlier that that's really important to us. If you use something like this to pull out the list of definitions and references in a file, that list depends only on the content of that file. It doesn't depend on any of the other files in your repository. And that's exactly the incremental property that we care about. A new commit comes in. If the file is unchanged, we get to reuse the list of definitions and references that we had already calculated. That's in our database already. We just leave it there. We're good to go. So incremental processing is absolutely a must. The only reason we were able to ship fuzzy code nav in the way that we did, for the set of repositories that we did, for all of the Python, Go, and Ruby repositories on GitHub, is because of that incremental processing. If we didn't have it, the CPU requirements would have been too high. The data storage would have been too high. And the user-perceived latency would have been too high.
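The per-file extraction just described can be loosely imitated with Python's own `ast` module standing in for the Haskell pattern matching over Tree-sitter nodes — an analogy, not the actual Semantic code. Definitions and references fall out of a single pass over one file's syntax tree, and the output depends only on that one file's content, which is the incremental property.

```python
import ast

def extract_symbols(source):
    """Extract fuzzy definitions and references from one file's AST.
    The result depends only on this file's content."""
    defs, refs = [], []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            defs.append(node.name)
        elif isinstance(node, ast.Call):
            # A call expression is a reference; pull out the symbol name,
            # handling both plain and qualified (attribute) calls.
            func = node.func
            if isinstance(func, ast.Name):
                refs.append(func.id)
            elif isinstance(func, ast.Attribute):
                refs.append(func.attr)
    return defs, refs
```

Note there's no cross-file lookup anywhere: an unchanged file yields the same lists on every commit, so its stored results can simply be reused.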
Like we talked about wanting to have the data ready when you alt-tab over to the browser: if you have to process all five or ten thousand files in your repository every time, it's not gonna happen. So the important question is, can we do this also for precise symbol matches? If we wanna actually say that this particular function call refers to this particular method, and not any of the other ones that happen to have the same name, can we do that incrementally? And to show why it's kind of an issue, right, this is just a very toy example where you've got three Python files. If somebody wanted to click on that reference to foo, we would love to be able to show them that it refers to this definition down in b.py. And if you just sort of look at this, hopefully you can see that that is the case. There are a couple of import statements that sort of propagate the reference through a couple of files. But this is the example that highlights how it's not incremental, at least in the naive case. Because we have to look at files that aren't the one that holds the definition or the one that holds the reference to be able to figure out what it actually points to. And if a new commit comes in that changes that file in the middle, right, we've changed not just what foo refers to, but which file the thing that foo refers to lives in, in a way that depended on basically the entire contents of your repository. And it gets even worse if you have to trace things through function calls, right? So this is trying to say, I want this reference to foo to refer to the field down here in this class, but that's being passed through a function call in a really weird way, and that function is living in a different file. So how can we do this in an incremental way as well? It's not just, syntactically, do the things live in different files, but are you tracing things through control flow that bounces through different files at runtime in different ways?
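Reconstructing that toy example in code — the file names and import structure here are invented to match the description — shows why precise resolution isn't a per-file property: resolving `foo` means chasing import statements through a file that holds neither the reference nor the definition.

```python
# Each file maps the names it exposes to where they come from:
# a local definition, or a re-export via an import statement.
files = {
    "a.py":   {"foo": ("import", "mid.py")},  # from mid import foo
    "mid.py": {"foo": ("import", "b.py")},    # from b import foo
    "b.py":   {"foo": ("def", "b.py")},       # def foo(): ...
}

def resolve(files, start, name):
    """Chase the import chain until we hit a real definition.

    Note how this walks through files other than the one holding
    the reference -- precise resolution is not a per-file property.
    """
    current = start
    while True:
        kind, target = files[current][name]
        if kind == "def":
            return target  # the file that actually defines the symbol
        current = target   # hop through the re-export
```

Here `resolve(files, "a.py", "foo")` lands on `b.py`. But if a commit changes only `mid.py` so that it defines `foo` itself, the same call now lands on `mid.py`, even though neither `a.py` nor `b.py` changed — which is exactly the non-incremental behavior described above.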
And this is really hard. And I don't have an answer that I can tell you about. We have experiments that are starting to work, and hopefully we can tell you really soon about them working. But this is why it's really, really difficult. This is another reason why we can't rely on something like the Language Server Protocol: it will give us something like this if your LSP implementation is really good, but it's gonna require a lot of CPU, and it's gonna have to look at and analyze the contents of the entirety of your repository every time. Okay, last one. I hope that I can really quickly, in my remaining time, convince you that within-repository is the same as cross-repository. So if we can find an answer that works for this, here's the shortest version of the argument. It doesn't matter whether these two files live in the same repository or in a different package in another repository. The only thing that changed is the set of files that we have to look at when we try to stitch things together. If we can figure out how to do this on a per-file basis, the location of the files, what packages they live in, what repositories they live in, is no longer the bottleneck concern that we have to worry about. So that's the end. I think we probably don't have a lot of time for very many questions, but thanks very much for your attention. Oh, okay, cool. Any questions? I'm sorry, can you say that again? Yes, it absolutely will. So the question was, for the precise code matching, are we gonna be doing anything with graph theory, basically, and the short answer is yes, yes, very much yes. And briefly, the part that's hard is that if you tried to construct a call graph that encoded that information, the problem is that the portions of the graph that come from this file might live in remote parts of the graph, because you had to follow those links into other files.
And that's sort of a nice succinct description of why it's not easily and obviously incremental: the structure of that graph is not something that obviously depends only on the contents of your file. But there will definitely be graph structure involved in order to make it work. So, will we be analyzing code from third-party package managers? And so I guess to clarify, you mean a package uploaded to npm where the source did not come from a Git repository, and probably not from GitHub in particular. So there are a couple of answers to that. One is that all of this is in a couple of open source libraries, so Semantic and Tree-sitter are open source, and so you could use the tools to analyze that yourself. It's not something that, at least for the first revision, we plan on including, not because we don't want to, but just because we're trying to limit scope and have something that's a little bit more tractable to begin with. That's a really good question. And so the question was, if we can do this incremental precise extraction in a cross-file way that works for our hosted service, will any of those benefits translate over to code nav in local development? And yes, in that the techniques would certainly work there, and in that you are still doing things like creating commits and doing edits as you type and stuff like that, we can probably still have that incremental thing work. In local development, in the editing case, it's actually even a little bit more fine-grained. It's not just that you want to care about the edits to the file; it's that you want to care about the edits to the particular part of the file. So a lot of the incremental things in editors recognize that I changed this line, and so I don't have to re-syntax-highlight the seven pages that occurred before. And that's something that wouldn't directly translate across, because we really are operating on the file boundary with what we're doing right now.
Yeah, so are we doing things where we can do code generation based on extracting ASTs? That is effectively what we're doing. It's just that, at least right now, again for limiting scope, we're locking down the way that you would implement the parser that generates those ASTs. But that is what we're doing. So we get the source code, we extract an AST from it, and then we run that pattern matching code that I showed you the example of to extract the definitions and references out of that AST. Okay, I think with that I'm done. Thanks very much.