Hi, everyone. I'm Max. I work at GitHub on the Atom text editor, and I'm going to talk about a parsing library that I've been working on for about four years now, initially as a side project, and now as part of some production systems at GitHub. It's called Tree-sitter. I'm going to explain what it is and why I chose to write it, then I'll talk about some of the things we're doing with it today at GitHub, and finally I'll talk about how it works.
So first, what is Tree-sitter? Tree-sitter is a library for parsing source code. It's written in C and C++, and it's designed to be used in applications like GitHub or Atom that have to deal with code written in many different languages. The idea is that you can use Tree-sitter to parse files written in a variety of different languages, and it'll produce syntax trees that all have the same API and the same format. The really unique thing that Tree-sitter does is what's called incremental parsing. That means that once you parse a file with Tree-sitter and you have the syntax tree representing that file, you can then edit the file and Tree-sitter can quickly update the syntax tree to reflect the change you made, without having to reparse the whole file. It'll take something less than a millisecond, and this is the feature that makes it possible to use Tree-sitter in a text editor like Atom to parse in real time while the user is typing.
So now I'll explain why I chose to write this. There already exist good parsing tools today that are specific to any given language. If you just want to parse C, you can use libclang; if you want to parse Go, you can use the Go AST package, and so on. And yet I would argue that most of the tools that we rely on the most as software developers still don't have good source code parsing capabilities. For example, let's take a look at some syntax highlighting that you might see in your text editor today.
So here's some Go code. It's kind of hard to read, but it defines a Person type. You can see that the types appear in three different colors in this example: Person appears first in yellow and then in white, and then this other type, string, appears in pink for some reason. This is a screenshot from Atom, but you can see the same thing happen in Sublime Text or Visual Studio Code. This isn't good behavior, and syntax highlighting is something that we all use all day, every day, whether we're looking at code in our editor or on a website like GitHub or Stack Overflow. So my question is: with all the good parsing tools that exist today, why isn't this super important feature implemented better?
I think there are three main reasons why the applications I mentioned don't use the standard parsers that we know of. The first one that comes to mind is performance. In a text editor, for the editor to feel lightweight and fast, syntax highlighting should update on every keystroke, and it needs to take, as I said, less than a millisecond or so. That rules out, right out of the gate, most parsers that we would normally use otherwise. Aside from that, for a lot of languages, to use the standard parser that comes with the compiler you need to know some other information about the file in order to parse it. For example, if you're parsing C, you would need to know all the other source files that are included via the preprocessor, what macros are defined, and so on. In an environment like Atom or GitHub, where we're trying to work with the code of strangers, we don't always know this information. And then finally, each language's standard parsing toolchain has its own dependencies and is often written in that language. So for apps that need to work with tons of languages out of the box, having to integrate with all those toolchains would be very complicated for our system.
And so what I set out to do with Tree-sitter was to build one parsing system that could not only handle all major programming languages uniformly, but could also do so within these constraints. So it's fast enough to run on every keystroke in a text editor, it doesn't require any additional information about a source file in order to parse it, and it has no dependencies. In fact, the runtime is implemented as a pure C library, so you can really easily use it no matter what language your application is written in. All right, so that's the pitch.
Now I'll show you what we're currently doing with Tree-sitter at GitHub. The latest beta release of Atom this month will include some new functionality that uses Tree-sitter. I'm still developing it and it's a bit unstable, so it's behind a feature flag: you'll have to check a box if you want to try it. But if you do, you'll get this new, improved syntax highlighting that I'm showing here. This is the same piece of Go code that I pointed to earlier, but now you can see that the problem I mentioned has been fixed really nicely. All the types consistently appear in this blue-green color. On top of that, all the struct field names are now differentiated: they're red, whereas before the syntax highlighter couldn't even tell struct fields from local variables. So this is Go. We can also do the same thing in other languages like C, C++, TypeScript, JavaScript, Python, Ruby, and Rust, and we're in the process of developing support for a bunch more languages too.
Aside from those improvements, another benefit of doing syntax highlighting with Tree-sitter is the handling of long lines. If any of you have ever opened up a minified JavaScript file in almost any text editor, like Vim or Atom, you'll often see a little tiny bit of syntax highlighting at the very beginning of the file.
And then mostly a black and white file, because of this performance limitation of conventional syntax highlighters. By re-implementing syntax highlighting with Tree-sitter, the layout of the file no longer matters. It's using a proper lexing and parsing toolchain, so we can syntax highlight files like this. And aside from the handling of long lines, the overall performance of syntax highlighting is just dramatically improved as well. On this slide I'm showing an example of parsing, at the command line, a 20,000-line JavaScript file like the development build of React. It takes about 69 milliseconds, which is, I want to say, not quite an order of magnitude, but much faster than the syntax highlighting systems used in normal editors like Atom and Visual Studio Code.
Aside from syntax highlighting, I've also re-implemented Atom's code folding system to use the syntax trees provided by Tree-sitter. As you might be aware, in most text editors code folding has been based on some combination of regexes and, often, indentation. So if you have a C function like this, which is very common, where the indentation of the file doesn't match the syntax of the file exactly, code folding doesn't work as you would intend. But now that code folding is based on the syntax tree, it works regardless of what you do to the formatting of your file. If you copy and paste code and the indentation gets messed up, you can still use folding as a guide to the structure of your code.
And then finally, I've added a new feature to Atom in this latest beta. It's called Extend Selection. You might be familiar with it if you've ever used a JetBrains IDE. It's a command for selecting larger and larger pieces of your code based on the syntax, so it's really powerful for editing efficiency. It combines really well with multiple cursors, as in this animated GIF here.
I'm reformatting this data structure in this really powerful way using a combination of Extend Selection and multiple cursors, which would be really time consuming if you had to move each cursor around individually.
Aside from Atom, we're also using Tree-sitter on github.com. Some of you may have seen a feature come out a few months ago on GitHub where, when you're looking at a pull request and you open up the list of changed files for that pull request, you can now see within each file a list of the functions that have changed. This is done by a team in GitHub's data science organization that is doing a lot of cool research on algorithms for analyzing and comparing syntax trees. They use Tree-sitter for all of their code parsing because it gives them a uniform way of parsing many different programming languages. So between the work that they're doing on github.com and the work that I'm doing on Atom, we're developing a larger and larger set of languages that Tree-sitter can parse. And I think that once the Atom features I showed you a moment ago go to the stable release of Atom, that set of parsers will grow even faster, because this community of millions of Atom users will be able to contribute support for their favorite languages. The dream is that you'll be able to use this tool to parse any programming language that you can think of.
So now I'll talk about how Tree-sitter works. When you want to add support for parsing a new language with Tree-sitter, as you might expect, you have to write a grammar for the language. With Tree-sitter, you write the grammar in a simple JavaScript DSL that I made. The advantage of this is that grammars are represented as plain JavaScript objects, so you can manipulate them programmatically and you can extend them.
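To illustrate why "grammars as plain data" makes extension easy, here is a minimal Python sketch. It is not Tree-sitter's JavaScript DSL: the helpers `seq`, `choice`, and `extend`, and the grammar names `mini_c` and `mini_cpp`, are hypothetical stand-ins invented for this example. The only point is that a derived grammar can be built by copying a base grammar and overriding or adding rules, with each override able to refer to the rule it replaces.

```python
import copy

# Hypothetical rule constructors: a rule is just a nested tuple.
def seq(*members):
    return ("seq", *members)

def choice(*members):
    return ("choice", *members)

# A tiny base grammar, represented as plain data.
BASE = {
    "name": "mini_c",
    "rules": {
        "statement": choice("expression_statement", "return_statement"),
        "type": choice("int", "char"),
    },
}

def extend(base, name, rules):
    """Return a new grammar that inherits every rule from `base`.

    Each entry in `rules` is a function that receives the base
    grammar's version of that rule (or None for a brand-new rule)
    and returns the rule to use in the derived grammar.
    """
    derived = copy.deepcopy(base)
    derived["name"] = name
    for rule_name, make_rule in rules.items():
        original = derived["rules"].get(rule_name)
        derived["rules"][rule_name] = make_rule(original)
    return derived

DERIVED = extend(BASE, "mini_cpp", {
    # Widen an inherited rule with one more alternative.
    "statement": lambda original: choice(original, "try_statement"),
    # Add a rule the base grammar does not have.
    "try_statement": lambda original: seq("try", "block"),
})
```

The base grammar is left untouched, and rules that the derived grammar does not mention (here `type`) are inherited as-is.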
So it makes it really intuitive to write a grammar that is based on another grammar, which I do in order to define the C++ grammar in terms of the C grammar and the TypeScript grammar in terms of the JavaScript grammar. Once you've written your grammar, Tree-sitter will generate for you a single C file that, as I said, has no dependencies and has two main pieces of data in it. It has what's called the tokenizing function, which reads the sequence of characters from beginning to end and groups them into tokens, as well as what's called the parse table, which is a data structure that tells the parser, when it's in a given state, what to do when it sees a given token. You can then use these parsers in combination with a small pure C library called the Tree-sitter runtime, which defines a few types that you can use to parse files. It gives you a simple API for dealing with the syntax tree that's kind of like the DOM in JavaScript. This shows an example of using the raw C API, but usually we use it through bindings to either JavaScript or Haskell. So that's what it looks like to use the system.
I'll now talk about some of the algorithms that it uses internally. They're mostly based on some research that was done at UC Berkeley in the 1990s about IDEs, and in particular on one PhD thesis called Practical Algorithms for Incremental Software Development Environments. It outlines, if you're building an IDE, what the best basic parsing theory to rely on is, and then how to augment that theory so that it works incrementally to handle these fast edits. Then it proves that this incremental version has good performance properties. The basic parsing theory that it uses is LR parsing, which many of you are probably familiar with, but I'll go over it quickly here, since some of the more unique parts of Tree-sitter build on LR parsing.
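As a rough sketch of how a tokenizing function and a parse table fit together in an LR parser, here is a small Python example. It is not Tree-sitter's generated code: the grammar (sums of products of identifiers), the state numbers, and the names `RULES`, `ACTION`, `GOTO`, `tokenize`, and `parse` are all assumptions made for illustration, and the table was written by hand for this one grammar.

```python
import re

# Hypothetical grammar:
#   1: sum     -> sum "+" product      2: sum     -> product
#   3: product -> product "*" id       4: product -> id
RULES = {1: ("sum", 3), 2: ("sum", 1), 3: ("product", 3), 4: ("product", 1)}

# Parse table: (state, token kind) -> what to do next.
ACTION = {
    (0, "id"): ("shift", 3),
    (1, "+"): ("shift", 4), (1, "$"): ("accept", None),
    (2, "*"): ("shift", 5), (2, "+"): ("reduce", 2), (2, "$"): ("reduce", 2),
    (3, "+"): ("reduce", 4), (3, "*"): ("reduce", 4), (3, "$"): ("reduce", 4),
    (4, "id"): ("shift", 3),
    (5, "id"): ("shift", 7),
    (6, "*"): ("shift", 5), (6, "+"): ("reduce", 1), (6, "$"): ("reduce", 1),
    (7, "+"): ("reduce", 3), (7, "*"): ("reduce", 3), (7, "$"): ("reduce", 3),
}
# After a reduction: (exposed state, new node kind) -> next state.
GOTO = {(0, "sum"): 1, (0, "product"): 2, (4, "product"): 6}

def tokenize(text):
    """Group characters into (kind, lexeme) tokens, ending with "$"."""
    tokens = []
    for lexeme in re.findall(r"[A-Za-z_]\w*|[+*]", text):
        kind = lexeme if lexeme in ("+", "*") else "id"
        tokens.append((kind, lexeme))
    tokens.append(("$", ""))
    return tokens

def parse(text):
    """Read tokens left to right, never backtracking; return a tree."""
    tokens = tokenize(text)
    states, nodes, pos = [0], [], 0
    while True:
        kind, lexeme = tokens[pos]
        action = ACTION.get((states[-1], kind))
        if action is None:
            raise SyntaxError(f"unexpected token {lexeme!r}")
        op, arg = action
        if op == "shift":
            states.append(arg)
            nodes.append(lexeme)
            pos += 1
        elif op == "reduce":
            name, size = RULES[arg]
            children = nodes[-size:]
            del nodes[-size:]
            del states[-size:]
            # Single-child rules pass the child through unwrapped,
            # which keeps the resulting tree small.
            node = children[0] if size == 1 else (name, *children)
            states.append(GOTO[(states[-1], name)])
            nodes.append(node)
        else:  # accept
            return nodes[0]
```

Calling `parse("x * y + z")` returns `("sum", ("product", "x", "*", "y"), "+", "z")`. Unlike Tree-sitter, this sketch simply raises on a syntax error.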
So the idea with an LR parser is that we read the sequence of characters from beginning to end without ever backtracking. As we go, we group the characters together into tokens using the tokenizer function, which I mentioned earlier in the generated code. Then we group those tokens together into larger and larger subtrees, which we store on a stack. At every step of the way, we decide what to do with the stack by consulting the parse table, which was the second piece of generated code I showed. In parsing a simple math expression like x times y plus z, it would progress in this way. We start out with an empty stack, and then we push the first three tokens onto the stack. But when we get to the plus, we do a different type of action called a reduction, where we pop three nodes off of the stack, group them into a new parent node called a product, and push that back onto the stack in their place. Then, similarly, we push the plus and the z onto the stack. At the end of the file, we do another reduction: we pop off the product, the plus, and the z, and group them together into a new parent node. And that would be the final tree.
So you might ask: one of the stated goals of Tree-sitter is to parse every major programming language, so is this one parsing framework going to work for all languages? And the answer is no, not quite. Most languages are designed to be fairly easy to parse using the parsers that were available ten or more years ago, and LR parsing is actually a very powerful technique that can parse a much bigger set of grammars. But languages often have some weird quirk that needs to be handled via ad hoc code. Let me give you an example of one of those quirks. Here are two JavaScript statements. In the first one, we are assigning to the variable x the value of the variable y, which we have in parentheses just because we can.
And in the second example, we're assigning to x an arrow function that takes a parameter called y and returns z. The reason this is a problem for an LR parser is that, like I said, an LR parser never backtracks. So before the parser can process the right parenthesis, it needs to decide what it should push onto its stack in order to represent the y. But it can't know whether the y is an expression or not, meaning whether we are evaluating y or not, until it sees the arrow token that comes later.
I'll show you the basic technique that Tree-sitter uses to deal with this. It's called GLR parsing. These are some more diagrams of the parser's stack, but they're in a different format now, one that's generated by Tree-sitter itself in order to help you debug the parsing. The idea is that the stack can fork into multiple branches in order to explore multiple possibilities at the same time without ever backtracking. So in this example, when we get to the right parenthesis, we'd have to fork the parse stack and read in a few more tokens until we see the arrow, at which point we can discard one of the forks, because we can tell which one of the two is valid. This is the technique that allows Tree-sitter to be so general and to handle languages like C++.
Another problem that I wanted to show you is how Tree-sitter deals with errors. Most parsers that are built for compilers simply halt when they see an error and report an error message. Tree-sitter can't do that, because it needs to provide a tree no matter what, for use in an application like Atom that needs to be able to provide functionality even if the user is in the middle of typing something. So in this example code, there's an if statement, and then you started typing a for loop before it.
And so what Tree-sitter would do is produce a syntax tree in which the if statement is still fully parsed, but there's just an extra error node that represents the fact that the word for should not be there. Similarly, if you had a for loop but you accidentally started typing an if statement in the middle of it, Tree-sitter could still parse the for loop completely and give you an error node in the tree that tells you that the word if shouldn't be there. The way that it does this is similar to the way that it deals with the ambiguities that I mentioned a second ago. Upon seeing an error, Tree-sitter can split the parse stack into multiple branches in order to simultaneously explore multiple ways of recovering from the error, and then decide which one is the best according to this cost metric. You can see here the two sequences of what would happen when parsing these two erroneous pieces of code that started out the same. As far as I know, that technique is novel. I haven't seen it in the literature or in commercial parsers, but it allows Tree-sitter to produce these very natural-seeming syntax trees even in the case of severe errors like the ones I showed.
And then finally, the last algorithm I wanted to show you is how Tree-sitter does the incremental parsing that makes it performant enough to use in a text editor. The idea here is: say you had these three statements in JavaScript, and then you edited right there, so you inserted an argument to this function call. The way Tree-sitter would process that is it would walk the tree that it had, and any syntax node that contained the site of your edit it would mark as having had a character inserted into it. Then it would begin parsing as if it was going to parse the file from scratch, starting at the beginning with an empty parse stack. But this time, since it has the existing tree as a reference, it can skip a lot of intermediate steps in that parsing process.
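The marking step described above can be sketched in a few lines of Python. This is a simplified model, not Tree-sitter's implementation: the `Node` class, its `has_changes` flag, and the functions `apply_insertion` and `reusable_subtrees` are hypothetical names, nodes are assumed to carry plain character offsets, and only nodes that strictly contain the insertion point are marked. A real incremental parser has more to check before reusing a node, which this sketch ignores.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str
    start: int                      # offset of the first character
    end: int                        # offset just past the last character
    children: list = field(default_factory=list)
    has_changes: bool = False

def apply_insertion(node, position, length):
    """Record that `length` characters were inserted at `position`."""
    if node.end <= position:
        return                      # entirely before the edit: untouched
    if node.start >= position:
        node.start += length        # entirely after: same shape, new offsets
    else:
        node.has_changes = True     # strictly contains the edit site
    node.end += length
    for child in node.children:
        apply_insertion(child, position, length)

def reusable_subtrees(node):
    """Yield the largest subtrees that the edit did not touch."""
    if not node.has_changes:
        yield node
    else:
        for child in node.children:
            yield from reusable_subtrees(child)

# Old tree for the text "x; f(); y;" (a made-up example).
tree = Node("program", 0, 10, [
    Node("first_statement", 0, 2),
    Node("call_statement", 3, 7, [
        Node("call", 3, 6, [
            Node("identifier", 3, 4),
            Node("arguments", 4, 6, [Node("(", 4, 5), Node(")", 5, 6)]),
        ]),
    ]),
    Node("last_statement", 8, 10),
])

# Insert one character between the parentheses: "x; f(z); y;".
apply_insertion(tree, position=5, length=1)
reused = [(n.kind, n.start, n.end) for n in reusable_subtrees(tree)]
```

In this model, only the chain of nodes from the root down to `arguments` is marked as changed. Everything else, including the whole first and last statements, stays available for reuse, with the nodes after the edit shifted by one character.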
So upon starting, it can immediately push directly onto its stack the entire variable declaration that was the first statement in the program. Then it can reuse smaller pieces from there, like this member expression that preceded the edit and the left parenthesis right before the edit. Then it has to parse from scratch for the tiny little piece where you inserted. But then it can resume reusing more trees. So it can reuse the right parenthesis and do some more grouping here, and then reuse the whole final statement of the program and group that into a new root node that shares elements of the old root node. In that way, the time it takes to do this is not proportional to the whole size of the file. It's only proportional to the amount of text that you inserted and the number of changes that you made. So that's all I'll show you.
Next steps for me are adding support for more languages. There are about 10 supported right now out of the set that Atom supports, and I plan to basically continue adding support for new languages until all the languages that Atom ships with can be parsed with Tree-sitter. At the same time, the github.com team I talked about is busy developing support for other languages that they want to support on GitHub. Thanks.