I'm Rick Olson. I'm talking about Git, the stupid NoSQL database. I'm a programmer at GitHub; I work with Ruby a lot, and JavaScript, CoffeeScript, lots of cool stuff. So what is Git? I assume everyone here knows: Git is a distributed version control system. I'm not going to go into how most people use Git for versioning their source code, because at GitHub we use Git more as a database. If you look at the man page for Git, it's called the stupid content tracker. Reading that struck me as odd, because everyone uses Git the same way: as a source control tool. But really it's a generic tool for storing revision data that just happens to be good at storing source code. Working at GitHub, I saw how we use Git, and it seemed more like a NoSQL database. I really hate that term, though. NoSQL turns into this us-versus-them thing: if you use a relational database, you're stuck in the old world, while other people are using Redis and MongoDB and things like that. So I'm not trying to sell you on Git. It's not going to make your app sexier or anything. It's just one option. At its core, Git is a key-value store. Every object in it you access through a SHA, which is the key, and you get back a value, which is the content. This is what it looks like to write data to the Git database. The git hash-object command is probably not used very often, but all it does is take some data, write it to the Git object database in Git's format, and give you back a SHA. That SHA is basically just the SHA1 hash of the content, which then gets compressed and stored on the file system. If you want the data back, you call git cat-file, which reads it out of Git's on-disk format and prints it back out.
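The hash-object step can actually be reproduced without Git at all, since the key is just a SHA1 over a tiny header plus the content. A minimal Ruby sketch (the helper name is mine):

```ruby
require 'digest/sha1'

# Git's key for an object is the SHA1 of a short header plus the raw
# content. For a blob the header is "blob <byte length>\0". This is the
# same SHA that `git hash-object` prints.
def git_blob_sha(content)
  Digest::SHA1.hexdigest("blob #{content.bytesize}\0#{content}")
end

puts git_blob_sha("hello\n")
# => ce013625030ba8dba906f756967f9e9ca394464a, matching
#    `echo hello | git hash-object --stdin`
```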
That's basically the core of how Git works internally. When you do that, you get a blob. A blob is just content of any type; Git doesn't really care what the format is. It can be JSON, it can be an image, it can be whatever. The interesting thing, though, is that a blob knows nothing about itself. It doesn't know its file name; it doesn't even know its size. It's just content. But Git is also a graph database. A graph database stores nodes and tracks the relationships between them. With Git, we have trees, and trees track relationships with blobs. A tree is basically a directory listing: each tree is a list of file names, each with a pointer to the blob it refers to. So that's what the graph looks like: we have a tree here, and it's pointing to multiple blobs. At a higher level, we have commits. A commit is a record of changes: it tracks the committer and when it was committed, plus the original author and when the commit was authored. All a commit does is point to a tree, and that tree points to sub-trees that point to other blobs and trees, so you start getting this graph of data. The next thing we have is references. They're basically post-it notes that point to a commit object. They're literally just files in the Git database containing a SHA; you look that up to find your commit, and from there you can look up the tree and any blobs underneath. So you get this whole structure where, at the top, references like branches point to commits, which point to trees, which point to blobs. You also have tags. When I started at GitHub, I really didn't know a lot about how this stuff worked, and tags were one thing I knew nothing about. I only knew they were references that pointed to arbitrary commits.
People use tags a lot for releases, and one of the first projects I worked on at GitHub was updating the downloads page. Someone filed a support issue saying they didn't like how the downloads page only showed the commit, and they mentioned we could use annotated tags to provide more information about it. This is a screenshot of one of my projects, and I clearly don't know how to use tags: the description isn't very helpful. It's either a version bump or just the last commit I made, like updating the readme. So, looking at that support issue, I dug into annotated tags. Annotated tags are actual objects in the Git database. They track who tagged and when, and you can also sign the tag if you want to do all that. The cool thing is you can use this to pack in more information about your release. This is the Rails project: if you look at the bottom, for version 2.3.5 they have a changelog of the big changes in that release. Doing this, we were able to pack more information into the downloads page, so if project maintainers want to add more context about their releases, they can. I had a lot of fun doing this; I like the idea of encouraging our users to learn Git better and improve their projects. So, we use Grit. Grit is the Ruby library we use to access Git, originally written by Tom Preston-Werner. It started out as a wrapper that shelled out to the git command, and since then a lot of performance improvements have been added by reading and parsing the Git format in Ruby directly. Here's a quick example of getting commits: the commits method just runs git log, and Grit parses the output into objects, commit objects and tree objects and blobs and all that. The next project is something Paul Dowman showed me about a week ago.
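An annotated tag is a real object, so running git cat-file -p on one prints its headers (object, type, tag, tagger), a blank line, and then the message. A hedged sketch of parsing that output; the sample tag below is entirely made up:

```ruby
# Parse the text that `git cat-file -p <tag>` prints for an annotated
# tag: header lines, a blank line, then the free-form message.
raw = <<~TAG
  object 9fceb02d0ae598e95dc970b74767f19372d61af8
  type commit
  tag v2.3.5
  tagger Jane Coder <jane@example.com> 1259269611 -0800

  Bundled changelog for the 2.3.5 release.
TAG

headers, message = raw.split("\n\n", 2)
fields = headers.lines.to_h { |l| l.chomp.split(" ", 2) }

puts fields["tag"]   # the release name
puts message         # the changelog text shown on a downloads page
```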
It's GitModel, an ActiveModel-compliant adapter for Git, and it uses Grit internally. You can see here we're creating a model. It looks just like an ActiveRecord model, except you define various properties and blobs, and then you access it like an ActiveRecord model. It basically stores JSON files in the Git database, and you can reference other blobs, like images and things like that. Ribbit is a Ruby binding to libgit2. libgit2 is a C implementation of Git; it was a Google Summer of Code project that Scott Chacon mentored, written by Vicent Martí, and Scott works on Ribbit, the Ruby bindings for it. One of the things we can't use Grit's Ruby code for, where we still shell out, is walking the Git graph: getting a commit, then walking down to the tree and all the blobs. Doing this in Ruby is extremely slow, so Grit still shells out to git log and ls-tree and some other commands like that. Ribbit does it all in C, so it's really fast. This is what the walker interface looks like: you add the commits you want, and then you just traverse through them. Each commit points to its parent commit, and you keep going up. Wheat is an interesting Git blogging tool. It's written in Node.js and it's similar to Jekyll, but the really cool thing about it is that it's a live server. You don't have to build a bunch of files; you just commit to your Git database and push it to a server, and all the caches get updated, because all the caching is keyed on the SHA. As soon as you update the reference to the new SHA, the old cache entries stay in place and all the new content gets displayed.
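The walker idea is simple enough to sketch in a few lines: start at a tip commit and keep following parent pointers. This is the part libgit2 does in C; the commit data here is made up:

```ruby
# A toy history: each commit holds a pointer to its single parent.
Commit = Struct.new(:sha, :parent, :message)

first  = Commit.new("a1", nil,    "initial import")
second = Commit.new("b2", first,  "add README")
tip    = Commit.new("c3", second, "fix typo")

# Walk from the tip back to the root, newest first.
def walk(commit)
  history = []
  while commit
    history << commit
    commit = commit.parent
  end
  history
end

walk(tip).each { |c| puts "#{c.sha} #{c.message}" }
```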
One of the cool ways it uses Git is to track revisions of your articles: it's basically running git log on a specific file to get the revision data. It also lets you embed source code files, images, and other things in your article. Each article gets its own directory, with files in it that you can embed in the blog post. This is what it looks like on the site: in the sidebar you can see the links to the code samples, and you can see the revisions. Another interesting thing about Wheat is that it uses Git socially. It runs community sites, and the author is very open about it: hey, if you want to contribute to howtonode.org, just fork the repo, add your content, we'll edit it, and once it's good enough, we'll add it to the main repo. I thought that's a really cool way to foster a community. Gollum is the backend to the new GitHub wikis, which Tom and I wrote based on a spec Tom wrote. With Gollum, we're taking advantage of Git's strong points. For one thing, Git scales down really well. It's not complicated to install: you install the command and you have it, with no server to run, all the data right there, and it's really portable. You can have your wiki data locally, run Gollum, and it spins up a web server around it; you can edit locally and then push up to GitHub or any other host that supports it. You can see here we have the full editing interface locally, and it's the same thing that's on github.com. Git is also obviously good at tracking versions of files, which is really important for wikis too. Wikis are very open and anyone can modify them, so it's very important to keep a log of who does what, and that's what Git excels at, so it makes a lot of sense.
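Wheat's caching trick deserves a sketch: because cache entries are keyed by SHA, they can never go stale. Publishing just moves a ref to new SHAs, and old entries are simply never looked up again. A minimal simulation, with made-up SHAs and a render counter added so the effect is visible:

```ruby
# Content-addressed caching: the key is the content's SHA, so a cache
# hit is always correct and nothing ever needs invalidating.
CACHE   = {}
RENDERS = Hash.new(0)

def render(sha, markup)
  CACHE[sha] ||= begin
    RENDERS[sha] += 1                 # the real work happens once per SHA
    "<article>#{markup}</article>"
  end
end

render("abc123", "hello")
render("abc123", "hello")             # served from cache, no re-render
render("def456", "hello, edited")     # an edit means a new SHA, new entry
```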
So you can see here I'm just calling the log method to list the versions of a file, and we render that on the site like that. Git is also really good at diffing. This was one of the feature requests on the old GitHub wikis, which were just backed by a database. Rather than write our own diffing system, we just take advantage of Git: it spits out diffs in a format we like, and we already had code on GitHub for displaying diffs, so it was really easy to drop that in. You get nice diffs like that, so you can see exactly what changed between versions. We can also take advantage of other Git features like grep: you can use git grep to set up a really basic search interface without having to run Solr or maintain indexes. I think it literally cats the contents of all the files and runs them through grep, which obviously won't scale, but most wikis are pretty small, so this works pretty well. With Git you also have hooks, which open up other possibilities: updating caches, publishing to other formats, and generally integrating with other backend systems. For instance, every time you push a wiki to GitHub, we kick off jobs that update caches and index the data. But Gollum presents some challenges running at GitHub's scale, with all the wikis on there. There are some Git limitations, because Git is really designed for hacking on code and maintaining big projects, not for gigantic wikis and things like that. One thing I wanted was to generate an Atom feed of recently updated wiki pages. I asked Scott how to do this, because I had no idea. Rather than getting the log through Git and parsing it to pull out the updated files, there's the git diff-tree command: if you pass --name-only, you get back just a list of files, and I can use that to build our Atom feed of latest changes.
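The page-history idea above comes down to parsing git log output into version records. A hedged sketch of how a wrapper like Gollum's might do it, assuming a pretty-format such as `git log --pretty=format:'%H|%an|%at|%s' -- page.md`; the sample output is made up:

```ruby
# Turn one log line per commit into a Version struct. Using a delimiter
# the subject won't start with keeps the parse trivial.
Version = Struct.new(:sha, :author, :time, :subject)

def parse_log(output)
  output.lines.map do |line|
    sha, author, time, subject = line.chomp.split("|", 4)
    Version.new(sha, author, Time.at(time.to_i), subject)
  end
end

sample = "f00dfeed|Rick|1290000000|Fix typo\n" \
         "deadbeef|Tom|1289900000|First draft\n"

parse_log(sample).each { |v| puts "#{v.sha[0, 7]} #{v.author}: #{v.subject}" }
```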
So now I know which files were updated most recently, but there's a problem: in this query I'm going back 10 commits from master, and if I try to go back more commits than exist in the repo, I get back this ugly error. Right away I know that if I want to do this reliably on the site, I have to account for the fact that some people don't have 30 updates yet, so I'd have to keep some kind of internal counter of how many commits there are. There are also a few obscure leaky abstractions in Grit. GitRuby is the component inside Grit that implements parts of Git in pure Ruby, which is a really confusing name. At the top I'm calling ls-tree with -l for long and -r for recursive. What I want is a listing of every single file in the repo, including inside subdirectories, and the long format means I also want the size. But when I call it, I don't get the size. I thought there was some weird bug in my code, and after digging through it, it turns out the GitRuby interface just doesn't implement that, because again it would have to walk objects. Think about calling ls-tree on master: you get the commit and keep walking down all the trees, and for a big repo that can take some time. If you're asking for sizes too, you have to open up every single file and pull out its size, and you can imagine that's a lot of work. But if I use the native method to call right out to Git on the command line, it does all that in C and it's super fast, so that's what we use on the wiki. Again, Gollum is suited to maybe small to medium sized wikis. This is the Radiant wiki, which actually brought up some scaling issues early on when we were in beta.
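For reference, the native call gives you back lines where the columns are mode, type, SHA, size, then a tab and the path. A hedged sketch of parsing `git ls-tree -l -r` output; the sample line is made up:

```ruby
# One Entry per line of `git ls-tree -l -r` output. The path comes after
# a tab, so split on that first, then on whitespace for the metadata.
Entry = Struct.new(:mode, :type, :sha, :size, :path)

def parse_ls_tree(output)
  output.lines.map do |line|
    meta, path = line.chomp.split("\t", 2)
    mode, type, sha, size = meta.split
    Entry.new(mode, type, sha, size.to_i, path)
  end
end

sample = "100644 blob ce013625030ba8dba906f756967f9e9ca394464a       6\tREADME\n"
entry  = parse_ls_tree(sample).first
puts "#{entry.path} is #{entry.size} bytes"
```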
You can see it has only 116 pages, which is a lot for GitHub wikis but not very much in the grand scheme of things. I had to rewrite some of the code to make it a bit faster, and as soon as I pushed that code, I got a ticket on the Gollum project from someone with a 30,000-page wiki for whom the stuff I'd pushed was insanely slow, so we still have a little bit of work to do there. So yeah, Gollum wikis will probably not scale to something like Wikipedia. They have tons of servers and I don't even know what's going on there, but it's way out of scope for Gollum, at least right now. One of the limitations of Git is that it's really designed for single coders on a machine: they work locally and then push up to other repos at a higher level, but this doesn't happen concurrently. If you try to commit to a repo concurrently, you start seeing issues like this with lock files. This error message actually came from a Dropbox support thread. Someone put their Git repo on Dropbox to share it with their coworkers, and they ran into all these weird issues: they would commit, Dropbox would see the lock file and try to sync it to everyone else, but the lock file was removed too quickly for Dropbox, so it stayed on everyone else's machine and then they couldn't commit anymore. Also, if you look at the way you make commits, you have to pass the parent SHA, which is the commit before the one you're updating. You can imagine that with lots of people updating, say, a centrally located wiki, there are possible race conditions. Maybe I edit from one SHA and you come in and edit from the same SHA: we get two commits that have branched, and one of them gets lost, because the reference for the master branch has to move up, and basically the last write wins.
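The race can be modeled as compare-and-swap on the ref (git update-ref accepts an expected old value for exactly this reason): if someone else moved the ref first, the update should fail rather than silently drop their commit. A minimal sketch with made-up SHAs:

```ruby
# A ref store with compare-and-swap semantics, like
# `git update-ref refs/heads/master <new> <expected-old>`.
REFS = { "refs/heads/master" => "aaa111" }

def update_ref(name, new_sha, expected_old)
  return false unless REFS[name] == expected_old  # someone beat us to it
  REFS[name] = new_sha
  true
end

puts update_ref("refs/heads/master", "bbb222", "aaa111")  # first writer wins
puts update_ref("refs/heads/master", "ccc333", "aaa111")  # stale parent, rejected
```

Without the expected-old check, the second writer would clobber the first, which is exactly the last-write-wins problem described above.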
When you look at the commit SHAs, it's very similar to how vector clocks work in the distributed data stores built on the Dynamo design: they use vector clocks so that if two people write to two different nodes at the same time, you can detect the conflict and figure out some way to resolve it. It's pretty complicated to work with, though. This happens all the time in Git: we have merge conflicts. You do everything locally, so you never have conflicts with yourself, but when you try to push to the main repo, someone else beats you to it, and then you have to pull their changes in and work out any conflicts. The thing is, Git doesn't help you out with that; it only detects the conflicts. It requires you to work with the other coder, figure out the best way to merge the code in, and resolve the conflict yourself. So if you need to use Git at high scale, you have to come up with some interesting way to work this out. Right now the wiki doesn't try to do any of this; it really just hopes you don't commit at the same time as someone else. So the solution, if you want to run Git at scale, is to find Git a wingman: some other tool that can work alongside it. A common one is caching. We use memcached a lot to make Git fast, and there's an article Tom wrote a while ago, after the big host switch, about how they made GitHub fast. Instead of accessing repos on the local file system, they built Smoke to route requests to multiple file servers. Basically, we have a mapping of repositories to file servers, and a proxy that routes the connection over there, and right there Git becomes just like any other database. We start running into issues similar to the ones we have with MySQL and queries: you start worrying about how many times you're calling Git, and things like that.
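The vector-clock comparison is worth a sketch, since it's the Dynamo-style analogue of Git's diverging parent SHAs: if neither clock dominates the other, the writes were concurrent, and someone has to merge, much like a Git merge conflict. The clock values here are made up:

```ruby
# A vector clock is a map of node => update count. One write descends
# from another if its clock includes everything the other has seen.
def descends?(a, b)
  b.all? { |node, count| a.fetch(node, 0) >= count }
end

# Concurrent writes: neither clock descends from the other.
def conflict?(a, b)
  !descends?(a, b) && !descends?(b, a)
end

puts conflict?({ alice: 2, bob: 1 }, { alice: 1, bob: 1 })  # false: one history
puts conflict?({ alice: 2 },         { bob: 1 })            # true: concurrent
```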
One of the issues we're running into is that the code was originally written with the repos on the same file system, so things were reasonably fast. Once you move over to Smoke, some pages call Git five to ten times per page. That's basically like an unoptimized Rails app where a certain page fires off ten queries: you want to work that down to as few queries as you can. One thing that's helping with that is building simple wrappers around Git, something similar, I guess, to the data mapper pattern if you're into patterns. You have simple methods that call out to Git, and then you can add caching and things like that around them to make them fast, and you can replace Git with other backends on the server if you need to. This is an example of how you might use memcached or Redis for caching: you put the actual raw Git command inside a cache block, the results go to memcached, and future hits are served from memory, which helps out a lot. This is an example of how I'm using Redis for the last-updated wiki feed I talked about earlier. I didn't want to keep an internal counter of how many commits a wiki had, so we set up a post-receive hook that fires every time you push and updates a Redis sorted set, sorted by the committed date and keyed on the name of the page. Then I can ask Redis for the last three updated pages for a wiki, and it just lists them out. I can also get the last date a certain page was updated with ZSCORE, which gives the time as an integer. There are other possible wingmen I've looked at or played around with a little. One is Riak, which is a distributed key-value store similar to Cassandra and some others.
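The sorted-set feed comes down to two operations: the post-receive hook does roughly `ZADD <wiki>:pages <commit timestamp> <page name>`, and the feed reads a reverse range (plus ZSCORE for a single page's last update). A hedged sketch, simulated with a plain Hash so it runs without a Redis server; the page names and timestamps are made up:

```ruby
# A Hash standing in for a Redis sorted set: member => score.
feed = {}

def record_update(feed, page, committed_at)
  feed[page] = committed_at            # like ZADD: re-adding just rescores
end

def latest(feed, n)
  feed.sort_by { |_, score| -score }   # like ZREVRANGE 0 n-1
      .first(n).map(&:first)
end

record_update(feed, "Home",    1290000000)
record_update(feed, "Install", 1290000500)
record_update(feed, "Home",    1290000900)  # edited again, moves to the top

puts latest(feed, 2).inspect   # most recently updated pages first
puts feed["Install"]           # like ZSCORE: the page's last commit time
```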
It's based on the Dynamo white paper from Amazon. It has some interesting features that other key-value stores don't have, like a simple link system where keys can link to other keys, which I thought fit really well with Git, so I'm playing around with that a little. CouchDB is another NoSQL data store that's very similar to Git: it has similar replication properties, and it automatically versions all your content. And then VertexDB or other graph databases would be a good fit, because they deal with relationships between nodes, and there's a lot of that in Git, where you have commits and trees and how they're related. One of the challenges with Git is that the relationships are all one-way. Blobs have no idea what trees point to them; commits only know their parent commits, not the commits ahead of them; things like that. Storing that in a graph database would be helpful if I wanted to get more meaningful information out of these relationships. I've seen articles where people throw Twitter or Facebook social graphs into graph databases and get interesting things out, like what your friends are following, so there are some possibilities there with Git that I'd like to investigate. All right, so for this talk I wrote a basic Twitter implementation on Git. I'm solving all of Twitter's scaling issues with Git. Not really. Anyway, it's on my GitHub if you want to get it. Basically, each tweet you make is stored as a commit in Git. The commit has no content, it doesn't really point to a tree, but it uses the commit message as your tweet. This is what the object model looks like: you create a repo timeline. Each Git repo can store multiple timelines, and when you add a tweet, it creates a commit inside a branch named after that user. So this is me opening up my timeline and tweeting. You can also do retweets in Madriks.
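The one-way problem can be made concrete: each commit knows its parents, so answering "what are this commit's children?" means building the reverse index yourself, which is exactly the kind of query a graph database answers directly. A sketch with made-up SHAs:

```ruby
# Forward edges as Git stores them: commit => its parent commits.
parents = {
  "c1" => [],
  "c2" => ["c1"],
  "c3" => ["c2"],
  "m1" => ["c2", "c3"],   # a merge commit with two parents
}

# Git has no reverse edges, so build commit => children by hand.
children = Hash.new { |h, k| h[k] = [] }
parents.each { |sha, ps| ps.each { |p| children[p] << sha } }

puts children["c2"].inspect   # the commits that point back at c2
```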
Basically, a retweet is a commit with the same message as the original tweet, but with the author set to the original author. In that retweet I'm still the committer, but the original authorship and the time it was authored are maintained. You can do favorites, too; they're similar to retweets, but right now I put them in a separate branch. So if I'm favoriting one of Scott's tweets, it goes in the technoweenie-favorites branch, and all the timelines stay in their own branches. Then if I want to show them, I just run git log and get back all the commit messages, and I can display them. This is what it looks like from Git: you can see the commits in order, well, reverse chronological order, like Twitter. The interesting thing is that now I can merge timelines together. If I'm following Atmos, I pull his repo down, create a local branch based on his branch, and merge them together, and then my timeline feed puts everything in the right order, because even though the commits have different parents and things like that, they're still ordered by commit date. This is what the log looks like. If you notice, in Madriks I automatically skip any commit with more than one parent: the first commit has no parents because it's first, regular commits have one, and merge commits have multiple, so I skip the merges and only display the tweets in the branch I'm looking at. From Git, you can see the merge commit at the top, but the tweets are still ordered by commit date. These are some articles and books I used a lot in making this presentation. They're really good if you're interested in getting more into the nitty gritty of how Git works, plus some links to the projects I talked about.
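The merged timeline described above can be sketched in a few lines: the feed is every commit sorted by commit date, skipping merge commits (more than one parent) while keeping the root (no parents). The tweets and dates here are made up:

```ruby
# A tweet is a commit: author, parent SHAs, commit date, message.
Tweet = Struct.new(:author, :parents, :date, :message)

commits = [
  Tweet.new("rick",  [],         100, "first!"),
  Tweet.new("atmos", ["a"],      200, "hello from my branch"),
  Tweet.new("rick",  ["b"],      300, "merging you in soon"),
  Tweet.new("rick",  ["c", "d"], 400, "Merge branch 'atmos'"),  # skipped
]

# Drop merge commits, then show everything newest-first by commit date.
def timeline(commits)
  commits.reject { |c| c.parents.size > 1 }
         .sort_by { |c| -c.date }
         .map(&:message)
end

puts timeline(commits).inspect
```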
There's a link to the slides on GitHub and on Heroku; I'll be pushing those later this morning. And that's it. Thanks. Any questions? Totally. Yeah, I'm not saying that Git is better than other things. I don't like the whole NoSQL banner, but I like that all the NoSQL stores do things their own way, they're all very different, and I like evaluating them. Git is definitely a challenge because it's not designed for this, and we run into that every day. I think certain projects, like the wiki, really make sense for Git, and that's why those work: basically anything involving publishing and versions and things like that. People ask for an issue tracker on Git, which would be awesome because you could import and export really easily, but you can't do things like querying. So, the question is, have we looked at replacing the Git backend with other things? I think Scott was working on something that replaced the backend with Cassandra, but I don't know how that went. We're not using it, so I guess it didn't go that well. All right, thanks.