Okay, Eric is going to tell us all about Bitbucket, with a focus on Git. Eric? Is it, judging by your shirt? Just a bit, yeah, with a focus on Git. Well, that's a lot of faces. More than I think I've ever seen in one room staring at me. We'll see how this goes. So I'm Eric and I am with Atlassian. I work on Bitbucket; I'm one of the back-end developers on Bitbucket. And I'm going to tell you all about Bitbucket's architecture and infrastructure, or at least as much as I can in 30 minutes.

Before I do that, though, I want to share with you this photo, for those who don't instantly recognize the rocket here. This is a Saturn V rocket. It's the rocket from the Apollo program; that's the Moon rocket, the one that got Armstrong to the Moon and back. And I want to show it to you because the whole Apollo program is, I find, a fascinating piece of history, and I'm sure I'm not alone here. This rocket, when they built it, and I guess the program around it, were really sort of the pinnacle of innovation and engineering at the time. And the goal that they set out to achieve was so ridiculously ambitious in the '60s: sending a man to the Moon and bringing him back when, I guess, the state of the art was the Russians having just flung a chunk of metal into orbit. That's quite something. An enormous undertaking; I think at some point like 500,000 people were working on it. Ridiculously large. Billions of dollars. But it worked. And so you'd assume that only really the smartest people worked on it and were able to pull it off. Quite literally rocket science.

And I'm a bit of a nerd, and earlier this year I actually went to Florida and visited the Kennedy Space Center at Cape Canaveral. They've got one of these things on permanent display. Here it is: an actual remaining Saturn V rocket that they've taken apart into the separate rocket stages, so you can see it up close. You can see sort of what's inside, right? And what struck me when I was there and looked at this for the first time is that it looked sort of, I don't know, simple. Maybe that's not the right word; rudimentary, perhaps. As in, it was very functional. Look at this thing. It's like a sheet of rolled-up metal around a massive gas tank. There's really not much more there. I mean, there's some plumbing, but even that is limited. And I guess I never really considered what would be inside a rocket like that to be able to do the things that it did, but I sort of expected something more complex, more ingenious, I don't know. It's a similar story at the back, or the bottom: it just ends, there's a flat surface, and they bolted some engines onto it. If you're there you can actually see the engine mounts, the screws and everything. It's not really polished; you see bolts protruding everywhere.

Now, I don't mean to disrespect the Apollo program, by the way. It's still as amazing as I thought it was. But seeing this stuff up close, I don't know, made it more approachable. It brought it down to earth, if you will. And I think that is representative of how we tend to perceive technology that we hold in high regard but don't really know much about. We tend to assume that things are more complicated than they really are and that the people working on them are by definition much smarter than we are. You know, the whole grass-is-greener thing.
And it is that potential perception that I want to debunk today by laying out the architecture behind Bitbucket, and at the same time sharing some anecdotes and, I guess, some of the instances where we screwed up. So if you are a little bit like me and you tend to assume that other people are smarter than you, then you'll be glad to hear that there's really no rocket science behind Bitbucket, and everything that is running now is built around the same tools that you would use yourself.

So let me try to break it down a little bit. This is roughly the architecture of Bitbucket. I've separated it into three logical areas. There's the web layer, which is responsible for load balancing, high availability, that kind of stuff. Then there is the application layer; that's where our code is, that's where all the Python stuff is. Bitbucket is almost exclusively written in Python. And then lastly the storage layer, where we keep our repository data and all that. We'll talk about each layer individually and, time permitting, I'll share some anecdotes.

So the first layer, the web layer, really consists of only two machines. There's no virtualization in Bitbucket: we run real hardware, we manage it ourselves, we have a data center in the U.S., and we have two load balancer machines. They own the two IP addresses that you see when you resolve bitbucket.org. And these machines basically run Nginx and HAProxy. Web traffic that comes into the load balancer first hits Nginx. Nginx, for those who don't know it, is an open-source web server. It's pretty good at SSL, and it can also be used really well for reverse proxying, and that's what we do here on this layer. So when a request comes in, it is encrypted; everything on Bitbucket is always encrypted. The first thing we do is strip off the encryption, and that's done using Nginx. Once it's decrypted, we forward it on to HAProxy, which runs on the same machine. HAProxy is also an open-source reverse proxy, but it's really good at doing load balancing and failover when you have a whole bunch of backend servers. So HAProxy inspects the request and, based on some properties, decides how to forward it on. Ultimately, it will forward it on to one of our many actual application servers. And on there, there is another Nginx instance. This Nginx instance is also just a reverse proxy; it's not our actual web server. It takes care of things like request logging, response compression, and asynchronous request and response buffering. That's why logically it's part of the web layer: it doesn't actually process the request. Ultimately it forwards it on to the real Python web server on the application server.

Now, that's HTTPS. We also do SSH. SSH takes a bit of a different path. SSH is a different protocol; we can't easily decrypt it first, but we do still need to load balance it. So it goes through HAProxy just as a TCP connection, and HAProxy then forwards it on to the least loaded backend server. So that path is a lot simpler. But make no mistake, it's not necessarily easier to run reliably, as we found out really just recently when users started to complain about SSH connections dropping out sometimes. Users would say that they'd get hung up on. And looking at the error messages, it seemed like that was indicative of a capacity problem; as if we didn't have enough capacity on the server side to handle the request rate.
But our monitoring tools told us a different story: that we had plenty of capacity. And so we were stumped for a little while, until we started analyzing the network traffic on the load balancers. In particular, we looked at the frequency of SYN packets that were arriving. A SYN packet is part of TCP; it marks the start of a new TCP connection. And so timestamping those, every single one of them, gives you a really good, accurate view of the incoming traffic (there's a small sketch of that kind of measurement a little further down). You see that here. What you see here is an interval of 16 minutes over which we captured every SYN packet. And you can see right away that it is ridiculously spiky. And these spikes, aside from being very high and very thin, are also very evenly spaced. If you count them, you'll see that there are 16 spikes in an interval of 16 minutes. And that's no coincidence: these spikes occur at the start of every minute, like precisely at the start of every minute. They last about one to two seconds only, but you can see that the rate at that point is ridiculously high; three to four times higher than our average load. Our working theory is that this is the result of thousands of continuous integration servers all around the world that are configured to periodically pull their Bitbucket repos. Combine that with NTP, which I guess everybody uses these days so clocks are really accurate, and this is what you get. And that was a bit of a problem, because even though we have enough capacity for the average rate, during these spikes we actually don't have enough capacity.

Now, solving this... we can't really quadruple our SSH infrastructure just to be able to deal with the large spikes. So what we did instead is we went back into the web layer, into HAProxy, where we basically have a hook into the traffic that comes in. And we configured HAProxy to never forward traffic at a rate higher than what we knew our capacity could take, without making any changes on the ingress side. So during these spikes, HAProxy will happily accept all the incoming traffic, but it won't actually connect or forward all of the connections at once; it sort of spreads them out over a few seconds. And now this graph on the application servers is a lot smoother than it is on the load balancer side. You could probably notice it if you have a cron job that fires at the very start of the minute: move it to any other second and you'll probably have a few seconds less lag. So it's a bit of a funny problem. We never really considered it until it crept up on us. You probably won't really see it with websites where humans click on links, but if you operate a popular public API that people script against, you might see similar issues.

So, that's the web layer done. On to the application layer. This is where all the magic happens, sort of. This is where the website runs. And this layer is distributed across many tens of servers, real servers. They all run a whole bunch of stuff. They run the website. The website is a fairly standard Django app, really; Bitbucket started out as pretty much a 100% Django app and that's still very much at the core. We run that in Gunicorn, a relatively simple Python web server. We run it in perhaps the most basic configuration: we use the sync worker, meaning that each process handles one request at a time, and so we have a whole bunch of processes and use multi-processing to get concurrency. And then there's SSH.
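Before we get to SSH, to make that Gunicorn setup a little more concrete: a minimal sketch of a configuration in that spirit. The numbers, paths, and addresses here are made up, not our actual settings.

```python
# gunicorn.conf.py: illustrative only; worker counts and addresses are made up.
# The sync worker means each worker process handles one request at a time,
# so concurrency comes purely from the number of processes.
import multiprocessing

bind = "127.0.0.1:8000"          # the local Nginx reverse proxy talks to this
worker_class = "sync"            # the most basic Gunicorn worker type
workers = multiprocessing.cpu_count() * 2 + 1  # a common rule of thumb
timeout = 30                     # seconds before a stuck worker is recycled
accesslog = None                 # request logging is handled by Nginx in front
```

With synchronous workers, anything slow ties up a whole process, which is part of why slower jobs get pushed to the background system described in a moment.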
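And the sketch I promised about the SYN spikes: roughly this kind of analysis, run over a packet capture, is what produced that graph. The capture command and file name here are illustrative, not our actual tooling.

```python
# Count incoming TCP SYNs per second from a tcpdump text capture, e.g.:
#   tcpdump -tt -n 'tcp[tcpflags] & (tcp-syn|tcp-ack) == tcp-syn' > syns.txt
# The command and file name are illustrative, not our actual tooling.
from collections import Counter

syns_per_second = Counter()

with open("syns.txt") as capture:
    for line in capture:
        # With -tt, each line starts with an epoch timestamp like "1383909120.123456"
        try:
            timestamp = float(line.split()[0])
        except (IndexError, ValueError):
            continue
        syns_per_second[int(timestamp)] += 1

# Compare seconds that sit right at the start of a minute with everything else.
on_the_minute = [n for sec, n in syns_per_second.items() if sec % 60 == 0]
off_the_minute = [n for sec, n in syns_per_second.items() if sec % 60 != 0]

avg = lambda xs: sum(xs) / len(xs) if xs else 0.0
print("avg SYNs/sec at :00   ", avg(on_the_minute))
print("avg SYNs/sec otherwise", avg(off_the_minute))
```

On the load balancers, the :00 buckets stood out by a factor of three to four; after the HAProxy rate cap, the same view on the application servers looked much flatter.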
We handle SSH using really just the standard OpenSSH server daemon, the same one that you all run on your Linux machines and laptops, with one difference: we made a small change to it, a small patch that allows us to look up public keys in the database. OpenSSH is hardwired to look at the file system to find public keys, and that's not practical for us, so we have a little change to make that happen. Other than that, it is the standard OpenSSH server, so we don't need to maintain it ourselves.

We also do background processing. Any sort of job or process that we can't guarantee will complete in a few milliseconds, we dispatch off to our background system. That's comprised of a cluster of highly available RabbitMQ servers; RabbitMQ is an open-source Erlang implementation of an AMQP broker. And to consume jobs, we use Celery. So we have a whole farm of Celery workers distributed across all these machines to process those jobs. An example: if you fork a repo, there's actual copying of files involved, so it might not complete immediately. That gets dispatched to the queue (there's a small sketch of that shape a little further down).

So it looks very basic. At the end of the day, it's the same components that you all run, distributed statelessly across multiple servers. There's nothing really special about it. Simple is usually good. We've had this setup for years. Bitbucket is now, I think, over 35 times bigger than it was when we started, when we acquired it, I should say, and this has held up really well. However, you can still screw up, as we do from time to time. One of those examples was when we decided to upgrade our password hashes.

We have never stored plain passwords in our database; up until that time, we stored salted SHA-1 hashes, which is very common. It means that if somebody, for some reason, gets hold of our database, they only have hash values and they still don't have your password. However, SHA-1 hashes for passwords are slowly being phased out and replaced by stronger, more secure hashing algorithms. And the reason for that is that even though you can't reverse a SHA-1 hash and get the password, what you can do is think of a word that might be the password, compute the SHA-1, and then compare it with what's in the database. And if you just think of enough words and try enough combinations, you might brute-force the password. If you have a strong password, the chances of anybody brute-forcing it through SHA-1 are, I'll be careful with cryptographers maybe here in the room, but let's call it negligible. However, we have millions of users on Bitbucket and not everybody has a strong password. If your password is a word in the dictionary, then I don't have to tell you, I guess, that it's a whole different story, because there really aren't many words in the dictionary. Certainly not when it comes to a computer computing SHA-1 values for them. So you're really at risk.

Now, short of forcing people not to use simple passwords, another thing you can do is upgrade to a stronger hash. What these things do is nothing special, really: they're hash algorithms that are deliberately more expensive, rehashing the hash value over and over again thousands of times, deliberately spending more CPU cycles. And that's what we wanted to upgrade to. Let me show you just how big that difference is. We wanted to upgrade to bcrypt hashes; bcrypt is one of the more modern, iterated-style cryptographic hash algorithms, and we compared it with SHA-1.
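First, that background-processing sketch I mentioned: dispatching a fork to Celery instead of doing the copy inside the web request looks roughly like this. The task name, function names, arguments, and broker URL are made up for illustration; this is not our actual code.

```python
# Minimal sketch of pushing a slow job onto the queue instead of doing it
# inside the web request. Names, arguments and the broker URL are made up.
from celery import Celery

app = Celery("bitbucket_sketch", broker="amqp://guest@localhost//")

@app.task
def fork_repository(source_repo_id, target_owner):
    # The actual copying of repository files happens here, on a Celery
    # worker consuming from RabbitMQ, so the web request that triggered
    # the fork can return immediately instead of waiting for the copy.
    copy_repository_files(source_repo_id, target_owner)

def copy_repository_files(source_repo_id, target_owner):
    ...  # placeholder for the real file copy

# In the Django view handling the fork, we would only enqueue the job:
#   fork_repository.delay(source_repo_id=42, target_owner="some-user")
```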
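Now, back to the hashes. The comparison on the next slide came from a small timing script; a rough sketch of that kind of measurement, using Django's password hashers, might look like this. It is not the exact script, and it assumes a reasonably recent Django with the bcrypt library installed.

```python
# Rough sketch of the kind of benchmark shown on the slide: how many
# password hashes can each algorithm compute per second? Not the exact
# script; requires Django plus the bcrypt C extension to be installed.
import time

import django
from django.conf import settings

settings.configure(
    PASSWORD_HASHERS=[
        "django.contrib.auth.hashers.BCryptPasswordHasher",
        "django.contrib.auth.hashers.SHA1PasswordHasher",
    ]
)
django.setup()

from django.contrib.auth.hashers import make_password

def hashes_per_second(algorithm, seconds=1.0):
    deadline = time.time() + seconds
    count = 0
    while time.time() < deadline:
        make_password("correct horse battery staple", hasher=algorithm)
        count += 1
    return count / seconds

print("bcrypt:", hashes_per_second("bcrypt"))
print("sha1:  ", hashes_per_second("sha1"))
```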
So this script measures how many hash values you can generate in one second. It uses Django's code, Django's hashing algorithms, with the optional C extensions to make it as fast as it can be. On my laptop, that amounts to three hashes per second for bcrypt versus about 160,000 for SHA-1. So it's roughly five orders of magnitude more expensive, purely in CPU cycles. And that is absolutely huge. And it's great, because it means that even your weak password may stand a chance. But you have to realize that as a server, you have to incur the cost of that massively expensive calculation every single time somebody uses a password for authentication. We run a really popular, high-volume API, and a lot of people use basic auth for authentication. It's all SSL, so there are no plain-text passwords, but it means that we have to compute bcrypt for every single request. And our API requests are relatively quick; on average, in the tens of milliseconds. So you can imagine that if you add a 300-millisecond password check to every single one of those requests, you have a problem.

And we did, because we naively rolled this out and the website instantly went down. All cores on all the machines went to 100%, calculating bcrypt. We realized our mistake fairly quickly, obviously not quickly enough, but fairly quickly, so we rolled it back and were able to keep the downtime minimal. But then we had a bit of a problem, because we still wanted to move away from SHA-1. Now, you can't really make bcrypt cheaper. Actually, you can, but then you no longer have an expensive algorithm, which defeats the point. What we could do, however, is do less of it. When people use the API and write a client, they typically do more than one request in quick succession. So they'd be using the same password over and over again, and we'd be computing the same bcrypt hash over and over again.

And so we decided to implement a sort of two-stage hashing system. When a request comes in, instead of computing the expensive bcrypt hash, we now compute an old-fashioned salted SHA-1 value, and then we use that as a key to look up the bcrypt value in an in-memory dictionary. If that lookup comes up empty, as it will in the beginning, we compute the bcrypt value ourselves, check it against the database to see if your password was correct, and then store that SHA-1-to-bcrypt mapping in the in-memory table. The next request you make with the same password is then able to look up the bcrypt value from the in-memory cache. That way, we're able to cut out, I'm guessing, 99% of all the bcrypt calculations.

But it's important to understand the ramifications of this system, because you might be tempted to think that, well, you've now weakened your bcrypt authentication back down to SHA-1 strength. It's not quite that simple. The important thing, the main thing I guess, is that SHA-1 values never hit cold storage anymore. The database is all bcrypt. So if you get hold of the database, you still only have bcrypt. And even if you were somehow able to tap into our servers and copy memory, you'd only get the SHA-1s of the users that are active at that very moment, because these cache entries, and that's essentially what they are, are expunged very, very quickly. So we were able to get this thing running and upgrade to bcrypt.
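To make that two-stage scheme a bit more concrete, here is a minimal sketch of the idea. It is illustrative only: the real implementation differs, and the cache here is a plain process-local dictionary with no expiry, whereas ours expunges entries very quickly.

```python
# Illustrative sketch of the two-stage check: a cheap salted SHA-1 of the
# presented password is the key into a short-lived in-memory map whose
# values are bcrypt hashes we have already verified. Only bcrypt ever
# reaches the database; SHA-1 only exists briefly in memory.
import hashlib

import bcrypt  # pip install bcrypt

_verified = {}  # {sha1_of_presented_credentials: matching bcrypt hash}

def _cache_key(username, password):
    # The salt scheme here (the username) is an assumption for the sketch.
    salted = ("%s:%s" % (username, password)).encode("utf-8")
    return hashlib.sha1(salted).hexdigest()

def check_password(username, password, stored_bcrypt_hash):
    key = _cache_key(username, password)

    # Fast path: we already paid the bcrypt cost for this exact password.
    if _verified.get(key) == stored_bcrypt_hash:
        return True

    # Slow path: do the real, expensive bcrypt check once.
    if bcrypt.checkpw(password.encode("utf-8"), stored_bcrypt_hash):
        _verified[key] = stored_bcrypt_hash  # expunged quickly in practice
        return True
    return False
```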
But even then, the remaining sort of 1% of bcrypt calculations is still very significant. Very significant. Just look at that ratio, right? 160,000 versus 3. Right now, today, if you look at one of our servers and you run perf top or something, you can see that the bcrypt cipher routine is the most expensive code running on that machine; at any point in time, I think it eats something like 12% CPU. So it's still hugely expensive. In the future, I guess we should probably be looking at offering an alternative to basic auth; maybe, you know, relatively standard HTTP auth tokens, which are revocable and have a limited privilege set.

And then let's move on to the storage layer. Here we keep track of all your data, obviously. The biggest amount of data that we store, of course, is the contents of your repositories, and there are millions and millions of repositories. We decided to keep the storage of that as simple as we could, in line with everything else you've seen so far. So we decided to just store that stuff on file systems, just like you do on your local machines, right? Git and Mercurial were designed for file systems; they work really well there. So as opposed to, for instance, modifying Git and Mercurial to be able to talk to some kind of distributed, cloud-based object store, we decided to keep it simple. The file systems live on specialized appliances from NetApp, a commercial company, and they're accessible from the application servers simply using NFS.

Aside from that, we have some NoSQL storage, distributed key-value systems: we've got Redis and memcached. We use Redis for your newsfeed and the repository activity feed that you see. We use memcached for basically everything that is transient, that we can afford to lose. And then the data for the website is all stored, traditionally, just in SQL. We use PostgreSQL, and the data is manipulated and accessed basically exclusively through the Django ORM, and that works pretty well. The only thing is that SQL databases, and Postgres is no exception, are generally kind of hard to scale beyond a single machine; transparently, I should say. So unless you go and implement application-level sharding to spread your data across multiple databases, transparently scaling an SQL database across multiple machines isn't entirely trivial. And so, so far, we've kept things simple. We are running a single PostgreSQL database. It's a very, very big machine and it has no trouble with the load at this point. And then for high availability, we have several real-time replicated hot slaves on standby. But yeah, in the future, should that thing ever become a bottleneck, which hopefully it will, because that means the service is popular, I guess we'll have to look into sharding.

I can talk about this stuff all day long, and I wouldn't mind doing so either, but there's only 30 minutes and there are only a few minutes left at this point. So I want to leave it at this. If you have any questions, I'm happy to take some now; we don't have a lot of time, but I'll take a few. Otherwise, come chat with me afterwards. We also have a booth on the lower level, so you can just find us there. And otherwise, I'd like to invite you for a drink tonight. We are hosting a drinkup in a bar nearby, starting at 7, at Mainhausamsee. So I'd like to invite you over: come have a drink on us and you can talk all about this stuff. There are two of my colleagues there too.
We're also hiring, so if you want to talk about that, that is also possible. With that, I want to thank you very much for listening, and I hope to see you all tonight. Thank you.

Thank you, Eric. While the next speaker comes up and gets his slides set up, would you like to take any questions over there? Any questions?

A question about HAProxy and Nginx at the beginning of the request path: HAProxy actually has SSL support. Have you tried that? Yeah, it does. Our setup on the web layer is maybe a little convoluted; it's a lot of components, as you saw, and that's not strictly necessary. Part of that is sort of organic growth; it's historical. HAProxy hasn't always been very good at SSL, at least not in our experience. We've experimented with a ton of different SSL terminators; we've used Stunnel and a bunch of others, and at some point we found that Nginx was, at least for us, the most reliable. So we've left it there, and I know that it's less of an issue for HAProxy these days, so it is something that we intend to revisit at some point in the future. Okay, thank you.

Hi. So, I noticed that Bitbucket uses quite a lot of JavaScript on the website, and I wonder if you use WebSockets, and if you do, what do you use for them on the server side? So, do we use WebSockets? No, we don't currently use WebSockets. We could for things like real-time notifications in pull requests, for instance, those kinds of things, but no, we're not currently using WebSockets. Okay, thank you.

Anyone else? Yep. What's PgBouncer? What do you use it for? PgBouncer, you said? Yeah. So, I had a whole spiel about PgBouncer, but there was no time to go into it, and there isn't enough time to go into all of it now either, so do try to catch me afterwards. But PgBouncer is a Postgres connection-pooling daemon. We use Django, and Django by default doesn't come with any connection pooling, so getting Django to talk to a database efficiently is a bit of a challenge. Well, it's not a challenge, but you need something else. PgBouncer is part of the Postgres project, and basically what it does is maintain a limited number of stateful, long-lived connections to the database, and then you configure Django to talk directly to PgBouncer; it acts like a database. So if Django opens and closes connections at a very high rate, because you're serving a lot of requests, that is a lot cheaper than opening and closing actual database connections. It bridges between the two to make that more efficient, and also lets you limit the total number of connections that you end up having on your database. There's a lot more to it, by the way. If you noticed, we actually have two layers of PgBouncer, and there's a good reason for that. Yeah, as I said, we'll have to talk about that afterwards, because there's no time. Cool, thanks. Yeah, no worries.

Hi. Hi. You talked about the machine with the database, that really large machine. As far as I could understand, that's a physical machine. Yes. So, what happens if that goes down? Yeah, so if the physical machine goes down, we have several real-time replicated hot slaves. We use streaming replication for Postgres to have a bunch of slaves. If the machine goes down entirely, then we steal the IP, basically, and hopefully move over to another one almost instantly. Yes. It's never happened, by the way, but yes, it's configured that way. All right. Thanks a lot.
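For anyone who wanted the PgBouncer answer spelled out a little more: on the Django side it is mostly a matter of pointing the database settings at the pooler instead of at Postgres directly. A minimal sketch, with made-up names and credentials, assuming PgBouncer listens locally on its default port 6432:

```python
# Sketch of Django settings that talk to a local PgBouncer instead of
# connecting to Postgres directly. Host, port, names and password are
# made up; PgBouncer itself is configured separately to pool connections
# to the real database server.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql_psycopg2",
        "NAME": "bitbucket",
        "USER": "bitbucket",
        "PASSWORD": "secret",
        "HOST": "127.0.0.1",   # local PgBouncer, not the database host
        "PORT": "6432",        # PgBouncer's default listen port
    }
}
```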