So, today's first talk is "Rewriting Parse.com". I'm Abhishek Kona, a software engineer at Parse, which is a subsidiary of Facebook. I'm an ex-Flipkart engineer, my Twitter handle is shaky, and I've worked on RocksDB, the database engine, for a bit. So that's about me.

What is this talk about? It's roughly organized into three sections: what scaling problems we had at Parse, how we solved them, and whether we learned anything from it, any principles we found useful. It's basically a 101 for a back-end team, sort of.

Let me begin by introducing Parse.com. How many of you have heard of it or used it? That's not bad for a DevOps conference. Parse.com is a platform to build mobile apps. The idea is that as an app developer you do not write any back-end code. All your abstractions live in the client-side code: Android and Java, iOS, REST, JavaScript, and PHP. You do not manage servers, there is no back end as such; you just start building an app. The marketing term for it is "backend as a service". Parse was acquired by Facebook in 2013, and that's how I ended up working there.

So what was Parse like in 2013? Parse was one big Ruby on Rails app, like every other startup coming out of Y Combinator: just one massive monolith. It had 60,000 active apps on it, we had 10 engineers working on the product, the request rate was around 3,000 requests per second, and our deploy times were around 30 to 35 minutes. We've been growing a lot in the last two years. Right now there are around 500,000 apps built on Parse, we have 100% year-on-year traffic growth, so our current traffic is around 15,000 requests per second, and our stack is primarily built in Go; we have replaced most of the Ruby. That is what this talk is generally about, and how it helped.

So what were the main issues with Parse in 2013? The biggest issue was that our uptime was around 90%. For any infrastructure service, that is pretty bad; I'll get into the reasons later on. Another big problem was that a single app could take down Parse. What happens in the app world is that an app gets featured on the App Store, on the Google Play Store, or on TechCrunch, and a lot of people start installing it. A typical example is Yo: it just went crazy in a day, with everyone installing it. If an app hosted on Parse goes viral, it takes up all your capacity. That is all right for that app, because it mostly keeps working, but it hurts every other app on the platform, and as an app developer you really do not want to see that. So we had a tenancy issue: we could not silo apps very well. And our code base was unmanageable. We did not know what was happening; at times we were guessing. This was mainly because the code base was in Ruby, dynamically typed, hashes all the way down. We had to guess: yeah, this is a hash here, this is an array here. It was basically a mess.

So we started by writing our problems down into a list we called the beast list. This was a list of 10 to 12 items, saying: these are the reasons why our uptime is not 99%. Then we went about asking how we could slay each of these beasts. What software or tool can we build so that this problem goes away? And we went from there.
I think a beast list is a great idea because it also helps with onboarding. If someone new joins your team, you can point them at the beast list and say: these are the problems we're having, X is working on this, Y is working on that, so pick whatever is left. I think any infrastructure team should have a beast list of all its problems. We used ours a lot and it was pretty good.

Some of the concrete issues on our beast list: Unicorn, our Ruby HTTP server, was a resource hog. Has anyone dealt with Unicorn here? What do you think of it? For our case it was not very good, because it is a forking-process-based server, so it had an upper limit on the number of requests we could serve per box. As you scale out, one of the trade-offs you want to make is giving up a little latency for higher throughput, which we could not do with Unicorn because of that per-box ceiling. It was also not very efficient by modern standards, so we could not just keep adding nodes; it became very expensive for us. So one of our biggest problems was actually Unicorn. Another big problem was that our deploy times were very long. Taking 30 to 35 minutes to roll back a bug puts a huge dent in your uptime: if a nil-pointer exception ships in Ruby and it takes 35 minutes to roll back, you lose 0.1% of the uptime of your service. That is pretty bad. There were various other problems on the list, but those were the two biggest ones.

So, yeah, we decided to rewrite our stack in Go. I know it is not very revolutionary, but we did. There is a lot of literature against rewrites; a lot of people would point at the famous blog post that says the one thing you should never do is rewrite your stack. But sometimes rewriting is not that bad, especially when you do not understand your current code base. So I'll go with this other blog post, the one that says that sometimes understanding means rewriting. Now I have a blog post to counter yours.

Why rewrite? We could not understand the Ruby code base. The estimated performance win of Go over Ruby was fairly large, five to seven times in our dry runs, which is big. Another big reason was that the new code base would be statically typed, compared to Ruby's maps of maps. A statically typed language is also good for onboarding new team members: when they come in they can see what an API does, it is easier to read, and you can extend APIs in a much nicer way. Especially in large companies it is almost tribal knowledge that as your software grows, you move to a statically typed code base to maintain it. We had reached that phase, and we wanted a statically typed language. That is why we decided to rewrite.

Why did we pick Go? As I said, it is statically typed. It has great concurrency features, or at least working concurrency features. It outperformed Ruby in all our tests, not only in execution times but in build times, which was critical for shortening our deploy time. Honestly, we did not want to write Java or C++; the memory management there is hard to get right. Our second choice was actually .NET, which is a very stable platform.
Back then we could not run it easily on Linux. Maybe today we could; maybe we would pick .NET today. But yeah, our second-best choice was .NET and C#.

Status of the rewrite: three to four engineers worked on it, and it took us about one and a half years to complete. This is the traffic graph, which shows how the rewrite went. We did some experiments in August 2014 (you can see some traffic spikes), and then we started in earnest in September. As we launched more endpoints we moved more traffic to Go, and we completed it in May of 2015. That was way over our estimate, by the way.

So how did the rewrite help us? It got rid of Unicorn, our biggest problem, because Go's HTTP stack is great. It is asynchronous, and you can again make the trade-off of sacrificing latency for throughput, because the model does not put an upper bound on the number of requests you can serve per machine. Our deploy speeds went up, because Go's build times are pretty sweet. We could add capacity very quickly, which helped us scale out on traffic spikes and use the elasticity of the Amazon cloud fairly well. And for now we have a readable, statically typed code base where you can look around and see what is happening. For now, though.

Now I'll get into the things we figured out along the way; this is where I'll spend a bit more time. The first thing we learned is that there is no silver bullet. Just rewriting does not actually help. You also have to build tools around your problems; you have to build other software. Replacing a single binary with a binary in another language mostly does not solve your problems; you have to think about the problems in a different way. What a rewrite gives you is time to figure out: I did not think about this problem the first time, how would I do it differently? That is what a rewrite helps with. It is more of a thinking exercise. And most problems will not have an immediate solution. The best thing you can do in those situations is kick the can down the road: okay, I cannot solve this, so what is the minimum I can do so that the client is not disrupted? Then move on. So the rewrite was not magical, but it helped us think, and we built a lot of software around it, and that is what alleviated our problems.

The next thing we learned is that monoliths are all right. Does anyone disagree with me? So what is a monolith? A monolith is the opposite of microservices. Microservices are all the rage nowadays; everyone is building small services, and there is a lot of tooling for it, Go kit for example. But having a single binary, a single monolith, is usually better if you are a small team. You can build and test the single binary on a single machine, you can compile it locally, and it works great; development is much faster than in a microservice landscape. So we built Parse.com mostly as a monolith: one binary talking to a bunch of backends like Mongo, Cassandra, and V8 servers, serving a bunch of different APIs. There are not that many services we rely on. It is partially inspired by how Facebook does things internally.
Facebook's binary is one big PHP code base compiled into C++ using HipHop and deployed everywhere. There are different terms for this, cookie-cutter deployment, monorepos, but it is the same one-code-base, one-binary philosophy. Another thing we found is that microservices work great if you have different teams managing each of the services. If one team maintains the whole stack, it is just easier to build a monolith.

Another thing we found out is that proxies are a very useful resource. You do not usually assume this, but opening a lot of connections to the database consumes precious memory on the database, memory that could be better used to cache indexes and data and make your queries faster. Every connection you open to the database has a memory cost on the database side. To give a concrete example, our database, MongoDB, would consume one MB of RAM for every new connection. At 20,000 connections that is 20 GB of memory, which is a lot on a 144 GB box; that is a lot of memory that could have gone to index caching. Proxies effectively help you manage your connections across app servers: they reduce the number of connections open to the backend and make it more efficient. So invest in a proxy for your database fairly early on, is what I would say. A side effect of having a proxy is that you can monitor the performance of your database from it. Once we deployed our proxy, we had a single view of how each of our databases was performing: what the most expensive operations were, what the cheap ones were. You get all this information because you watch it from the proxy's point of view. We wrote our own proxy for Mongo in Go; it is available at that URL. Modern databases also make it easy to write a proxy. Databases like Mongo, RethinkDB, et cetera have nicely typed protocols, so you can write serializers and deserializers more easily than for MySQL, for example, and write a proxy in two to three weeks. Go makes it performant too; the network stack is pretty good. So proxies are not that hard. Invest in one fairly early.
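To make the connection story concrete, here is a minimal sketch of the core trick behind such a proxy: accept any number of app-server connections on the front while holding only a capped pool of connections against the database. This is not our actual proxy, just an illustration; the addresses and the cap are made up, and a real proxy decodes the wire protocol instead of blindly pumping bytes, which is also what makes the per-operation monitoring described above possible.

```go
// A sketch of a connection-limiting database proxy: many clients in front,
// a small capped set of upstream connections to the database behind.
package main

import (
	"io"
	"log"
	"net"
)

const maxUpstream = 100 // cap on connections held against the database

var slots = make(chan struct{}, maxUpstream) // counting semaphore

func handle(client net.Conn) {
	defer client.Close()

	slots <- struct{}{}        // acquire an upstream slot (blocks when at the cap)
	defer func() { <-slots }() // release it when this session ends

	db, err := net.Dial("tcp", "127.0.0.1:27017") // hypothetical Mongo address
	if err != nil {
		log.Printf("upstream dial failed: %v", err)
		return
	}
	defer db.Close()

	// Pump bytes in both directions. A real proxy would decode the wire
	// protocol here, which is also where per-operation metrics hook in.
	go io.Copy(db, client)
	io.Copy(client, db)
}

func main() {
	ln, err := net.Listen("tcp", ":6000")
	if err != nil {
		log.Fatal(err)
	}
	for {
		conn, err := ln.Accept()
		if err != nil {
			log.Fatal(err)
		}
		go handle(conn)
	}
}
```

The semaphore is the whole idea: the database only ever pays for maxUpstream connections' worth of memory, no matter how many app servers pile on.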
Metrics. I know this comes up a lot, but when in doubt, measure everything. For example, we measure the latency of every incoming and outgoing request on our stack, and we do it automatically with a wrapper. Measuring everything is useful. We started with Ganglia, and it froze after around 100 graphs; the UI just stops loading. So we moved to Facebook's internal tools, Scuba and ODS. I'm sorry they are not open source; they are great tools. There is a nice blog post explaining what Scuba can do, and there is a startup trying to replicate it; I hope they succeed. So metrics are important. Find a metrics service, and hope you do not have to write one yourself; that is really tricky and painful.

Another thing we found very useful in our rewrite was shadowing live traffic. Real bugs show up under live traffic. However many integration tests you have, however many unit tests you have, the real bug shows up in live traffic. So we built a mechanism to capture live traffic and play it back on a test cluster, and we found a lot of bugs that way and went about fixing them. This was an important tool for our rewrite. We would bring up a new Go endpoint, play the same traffic against the Ruby stack and the Go stack, look at the differences, and go fix bugs around them. The mechanism is built into our stack now: it takes one config change and we get 1% of the traffic on a new node. We find it really useful. The setup is fairly simple. We have a Go HTTP proxy which sends requests to both test and production clusters, measures the delta, and puts it in a log stream, and we watch what is happening. It works great for read APIs. Doing this for write APIs is trickier, but we have a solution for that too. It is not great: we take a snapshot of the database, silo the two versions of the code base, run the load against both, and compare the database deltas. It works, but it is fairly complicated, a lot of scripting, and kind of brittle. But we do have a mechanism for it.

One of the interesting bugs we found this way: the Go time library truncates trailing zeros in the fractional seconds, so a timestamp ending in .000 came back from Go with the zeros dropped, while the Ruby stack would return the three zeros, and the Android SDK would crash on not finding them. We never knew what was happening; we just saw a 1% error rate every day. Then we started shadowing traffic and saw the difference: there are no three zeros here. Those kinds of bugs are hard to find, and I think shadowing traffic is the only way to go about it.

Throttles. One of the things we did in the second version of the software, the Go version, was to build throttles throughout. We have the capability to throttle any client or any backend instantaneously. If an app is going crazy, the first thing we do is throttle it while we figure out a nicer permanent solution, so that, first, it does not take out the whole stack, and second, instead of 500s or a server crash the app gets a nicer error message saying: your requests are being throttled, please try again in X seconds. Throttling works in both directions. We use simple memcache-based counters for our throttling service. We tried ZooKeeper here; it was pretty complicated. Memcache is simpler and it works for us. We have by now evolved this into an auto-throttling service: a daemon monitors request rates, and if there is a spike it throttles and then notifies the ops person, saying this app has gone crazy, can you look at it, instead of an ops person coming in and doing it manually. So throttles are great. Always build throttles around all your network operations. If you can build this into your client libraries, even better; if you have a Thrift library, for example, you can build it right in. In our case it is automatic in our Mongo code path, but in other places we added it manually. Auto-throttling is great; invest in it.
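Here is a minimal sketch of what a memcache-backed throttle can look like, using the github.com/bradfitz/gomemcache client. The key scheme (one counter per app per one-second window), the TTL, and the fail-open behavior are assumptions for illustration, not our exact production logic.

```go
// A sketch of a memcache-backed throttle: one counter per app per second.
package throttle

import (
	"fmt"
	"time"

	"github.com/bradfitz/gomemcache/memcache"
)

var mc = memcache.New("127.0.0.1:11211") // placeholder memcache address

// Allow reports whether appID may make another request this second.
func Allow(appID string, limitPerSec uint64) (bool, error) {
	// One counter per app per one-second window.
	key := fmt.Sprintf("throttle:%s:%d", appID, time.Now().Unix())

	n, err := mc.Increment(key, 1)
	if err == memcache.ErrCacheMiss {
		// First request in this window: create the counter with a short TTL.
		err = mc.Add(&memcache.Item{Key: key, Value: []byte("1"), Expiration: 2})
		if err == memcache.ErrNotStored {
			// Lost the race; someone else created it. Count ourselves in.
			n, err = mc.Increment(key, 1)
		} else {
			n = 1
		}
	}
	if err != nil {
		// If memcache is down, fail open rather than reject everyone.
		return true, err
	}
	return n <= limitPerSec, nil
}
```

A caller that gets false back would respond with a polite "try again in X seconds" message rather than a 500.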
Gatekeepers and deciders. Gatekeepers, or deciders, are feature flags: production hooks to control the rollout of new code to a fraction of users or traffic. They are a good way to gain confidence when you are rolling out risky changes. Deploy the code with the feature at 0% of traffic, then keep bumping it up, observe your graphs, see what is happening, gain confidence, and roll it out. We have our own in-house system, again written in Go and based on Redis; it is called Decider. The Facebook one is called Gatekeeper and is based on ZooKeeper, things like that.

Another thing we found is that decider code piles up very quickly. When you launch features this way, you tend to have four or five rollouts in flight together, you do not know what is happening, and the code base becomes a huge mess of if-else statements. You cannot follow the code while reading it, because you do not know what the decider value is at that point in time, et cetera. So the most important thing is to clean up your decider code once you have rolled out to 100%. I have seen a lot of people not do it, and I think it is fairly important. We have a linter in place which checks after one week and says: this gatekeeper has expired, or this gatekeeper must be fully rolled out by now, so you should probably remove this code. Keeping your code base healthy in the land of gatekeepers is important; people tend to miss that.
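A decider check can be as small as this sketch. Our real system refreshes the percentages from Redis in the background; here they sit in a map so the example is self-contained, and the feature name is made up.

```go
// A sketch of a decider: a named percentage knob consulted at request time.
package decider

import "hash/fnv"

// In production this map would be refreshed from Redis in the background.
var percentages = map[string]uint32{
	"new_query_planner": 25, // hypothetical feature, rolled out to 25% of apps
}

// Enabled reports whether the named feature is on for this app. Hashing the
// app ID makes the answer sticky: an app stays in or out of the rollout
// rather than flapping on every request.
func Enabled(feature, appID string) bool {
	pct, ok := percentages[feature]
	if !ok {
		return false // unknown features default to off
	}
	h := fnv.New32a()
	h.Write([]byte(feature + ":" + appID))
	return h.Sum32()%100 < pct
}
```

Request code then reads as `if decider.Enabled("new_query_planner", appID) { ... }`, and the cleanup chore is deleting that branch once the knob hits 100%.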
Deploys. As I said earlier, our biggest problem was deploys: we had 35-minute deploys, and we wanted to bring that down. We also used to have a fixed release schedule. It was Monday, and for a deploy later in the week you had to talk to the ops engineer or the release engineer. That was bad, because we would always be blocked on one person when a critical fix had to go out; you would have to wake someone up at night and say, hey, I need to deploy this, this is buggy. So we moved away from that and took the human out of the deploy process. Our philosophy is that every engineer should be allowed to deploy at any point in time. It is the engineer's responsibility to figure out whether the deploy is safe and whether it rolled out well; we have a lot of graphs to give you a good high-level overview, but it is the engineer who deploys. We also try to deploy as many small changes as often as possible. On my flight here I deployed six times, because I had six diffs going out and I wanted to see what was happening. So deploy as often as possible. The marketing name for this is continuous delivery; I just call it deploying often. We have a tool for this too, called deployctl. It is written in Python; it compiles the Go code, makes static binaries, and ships them to all the machines. It works very well. It is probably our only critical service that depends on ZooKeeper.

Another important feature I found in our tool is deploy locking: the ability for an engineer to lock deploys for a particular binary. If you are making risky changes, you do not want someone else deploying a code base you have in flux, so you should be able to lock deploys. We have a nifty feature where you can put in a message saying "I'm locking this for X"; it goes out, and nobody can deploy that binary until I unlock it. The ability to canary your software is also pretty good: you should be able to roll out just one binary and see how it performs against the rest of the binaries. Is there any difference? Those features are built into our tool, and it works well for us. I think those are important concepts.

Another thing is the cockpit. A cockpit is an admin service which runs in every binary; in our case it is an admin HTTP service. We use it for debugging. It exposes health checks, git versions, build times, how long the server has been up, and GC stats. We run it on a thread pool with higher priority than the main service's pool, so it stays available even when there is a lot of load on the regular service and we can still get in and debug. Because we mostly wrote Go, we can connect pprof over it. We did not do much here; it is just available in Go and we built some glue around it. It works very neatly. Another thing we use the cockpit for is activating verbose logging. We can go to one server, log in via the cockpit, and enable logging for a single app, and then all the incoming requests and response bodies for that app get dumped to a log file. This matters when you want to debug a particular app: what is happening with this app, why is it crashing so much? You enable debug logging for that app, you get the whole trace, and you go from there. You can also change logging verbosity levels through the cockpit, which we found fairly useful: when you want to figure out what is happening, you set the logging level to debug without restarting the binary. So cockpits are useful, and you can hang a lot of things off them. I think every binary should generally have a cockpit.
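A minimal cockpit in Go really is mostly glue. This sketch is illustrative (the port, paths, and log-level scheme are assumptions), but the pprof part comes for free: importing net/http/pprof registers the debug handlers on the default mux.

```go
// A sketch of a cockpit: a side admin HTTP server every binary runs,
// separate from the main service port.
package main

import (
	"expvar"
	"fmt"
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* on the default mux
	"sync/atomic"
)

var (
	gitVersion = "deadbeef" // would be stamped in at build time in a real setup
	logLevel   int32        // 0 = info, 1 = debug
)

func main() {
	expvar.NewString("git_version").Set(gitVersion)

	http.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "ok")
	})

	// Flip verbosity without restarting the binary, e.g.
	//   curl 'localhost:8081/loglevel?level=1'
	http.HandleFunc("/loglevel", func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Query().Get("level") == "1" {
			atomic.StoreInt32(&logLevel, 1)
		} else {
			atomic.StoreInt32(&logLevel, 0)
		}
		fmt.Fprintf(w, "log level is now %d\n", atomic.LoadInt32(&logLevel))
	})

	// The main service would listen elsewhere; the cockpit gets its own port.
	log.Fatal(http.ListenAndServe("localhost:8081", nil))
}
```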
Context. A context is an object you tag along throughout your stack: an object you can attach additional information to, which is then available in any part of the stack. In our case we put the request ID and the app ID on the context together, so you can find out who made a request and take decisions based on it way down in the stack, not at the API level but maybe at the Mongo level. The App Engine people use the context concept to do billing: they pass a context along and decide, I am going to bill X, because this request has already done these things. So contexts are an important tool. One of the nice things we did with the context was to carry the request ID on it, pass it down to our Mongo driver, and add it as a comment on the Mongo query, so that it ends up in Mongo's slow query log tagged with the request ID. You use that request ID to figure out which app is the culprit, and then you can get in touch with the app and do some more debugging. This is really neat: because we pass it on a context, we can see in our slow query log what is happening and which app is the culprit. So contexts are super useful, and context objects are a great idea if you are writing a new library; please allow for them. Go has a great context package, golang.org/x/net/context. It is pretty neat.
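Here is a small sketch of that request-ID trick, under a few assumptions: the standard context package (at the time of this talk it lived at golang.org/x/net/context), a made-up request-ID header, and Mongo's $comment query modifier as the vehicle that gets the ID into the slow query log.

```go
// A sketch of tagging a request ID onto a context at the edge of the stack
// and attaching it to a Mongo query deep in the data layer.
package web

import (
	"context"
	"net/http"

	"gopkg.in/mgo.v2/bson"
)

type ctxKey int

const requestIDKey ctxKey = 0

// Outermost layer: stash the request ID (and, for us, the app ID) on the
// context before calling into the rest of the stack. The header name is
// a placeholder.
func withRequestID(r *http.Request) context.Context {
	return context.WithValue(r.Context(), requestIDKey, r.Header.Get("X-Request-Id"))
}

// Way down in the data layer: no API-level plumbing needed, the ID rides
// along on the context.
func buildQuery(ctx context.Context, filter bson.M) bson.M {
	q := bson.M{"$query": filter}
	if id, ok := ctx.Value(requestIDKey).(string); ok && id != "" {
		// $comment is echoed into Mongo's slow query log, which is how slow
		// operations get traced back to the app that issued them.
		q["$comment"] = id
	}
	return q
}
```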
Owning your database. When your app gets successful, the biggest problem is usually the database, or you are probably doing something wrong. Databases are hard, and databases make different trade-offs. So it helps if you understand the internals of your database from the beginning. Modern databases have a lot of concepts in them: query planners, DB caches, indexing trade-offs. There are also a lot of major locks in them, and each has a different tuning parameter. It is important to understand how these tuning parameters affect your app and your workload, and to tune them. Every database comes with a really bad config out of the box, so it is very important to find a good config for yours. One nice way to get there is to spend some time on the internals of the database, reading the code to see what is happening. Modern databases are not MySQL; the code is much more readable, so you can figure out what is going on. We started that way, basically hacking on Mongo. We added metrics around the major locks to see what was happening, and then we found out that it is actually not that difficult to add a new storage engine to Mongo. So we started a project, Parse plus Rocks: a RocksDB-based storage engine for Mongo that three people built in six months. We announced it, I think, the day before yesterday, and it works great. We could only get there because we already had a good grasp of how the database code base was arranged, what the layouts were, where the major locks were, et cetera. Poking around the internals of your database is a very useful investment. Keep doing it; do not shy away from it.

I am going to talk a little bit about our code base. We use dependency injection a lot, but only at boot time; there is no runtime dependency injection. While the service boots we run the dependency injection graph once, and that is it. You cannot run it again, or the binary panics. Dependency injection is great if you only use it at boot time and do not do magical things at runtime. It also helps you boot test instances very quickly: organizing the object graph for tests is fairly painful, and dependency injection makes it easier. We write lots of smaller libraries instead of one complex library. We have around 50 libraries at github.com/facebookgo, and a lot of them are used outside Facebook too, which is pretty sweet. We try not to fork any software; we always try to submit patches upstream. A bunch of us are committers on the Mongo driver package, and some of us are committers on the Mongo database itself. It is very hard to maintain forks of a code base. We had pretty bad experiences with this in Facebook land, where MySQL was forked, then sent upstream, and it is in kind of a messy state right now. We do not want to get into that state with the software we use, so we always submit our patches upstream and make sure they are accepted, or find a workaround.

Our tests. Tests are one of the important things that made our rewrite easy. Every time we found a bug, we wrote an integration test for it, and that is how we gained confidence. Our test suite is basically our spec: if we have an argument about behavior, we go look at the test. This test was written because of this, so that is probably the right behavior. One of the things we found is that integration tests beat unit tests. Unit tests are nice and at times easy to write, but integration tests are the real deal: you sample from your live traffic, put it in an integration test, and it works better. We are very proud of our test run times: under two minutes for a code base of around 200,000 lines. Go has parallel test runs built in, and we arranged our code base so it can run in parallel; there is no global lock anywhere. Parallel test runs are a great idea; I have not seen a lot of languages do it. You should always try to get your tests to run in a few minutes, so that you can run them locally. Another neat thing we do is boot multiple Mongo and Memcache instances in memory while running our tests. Every test gets an individual Mongo or Memcache instance booted in memory, a fresh start with no old data on it. This quickens our test times. We have two packages that let you do this in Go: facebookgo/mgotest and facebookgo/mctest, the latter for Memcache. At one point a single binary booted 30 Mongo instances to run our tests: within those two minutes we boot 30 Mongos, run all our tests, clean up, and exit. That works pretty well, and it is possible because modern databases are very lightweight and have good config options; you can run them in tens of MBs of RAM. I have not seen a lot of frameworks and tools do this, so I am just putting it out there.
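For flavor, here is a sketch of what those packages do under the hood: boot a throwaway mongod per test on a free port with a fresh data directory, wait for it to come up, and tear it down afterwards. The flags and the polling are a simplification, and mongod flags vary by version; mgotest and mctest handle readiness and cleanup more carefully.

```go
// A sketch of booting a throwaway mongod per test.
package storage

import (
	"fmt"
	"net"
	"os/exec"
	"testing"
	"time"
)

func startMongo(t *testing.T) (addr string, stop func()) {
	// Grab a free port by listening on :0 and releasing it.
	l, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		t.Fatal(err)
	}
	port := l.Addr().(*net.TCPAddr).Port
	l.Close()

	cmd := exec.Command("mongod",
		"--port", fmt.Sprint(port),
		"--dbpath", t.TempDir(), // fresh data dir; no old data from past runs
	)
	if err := cmd.Start(); err != nil {
		t.Fatal(err)
	}

	// Poll until mongod accepts connections.
	addr = fmt.Sprintf("127.0.0.1:%d", port)
	for i := 0; i < 100; i++ {
		if c, err := net.Dial("tcp", addr); err == nil {
			c.Close()
			break
		}
		time.Sleep(50 * time.Millisecond)
	}
	return addr, func() { cmd.Process.Kill(); cmd.Wait() }
}

func TestSomethingAgainstMongo(t *testing.T) {
	addr, stop := startMongo(t)
	defer stop()
	_ = addr // connect your driver here; every test starts from a clean slate
}
```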
So I think I am at the end of my talk. This has been a lot of gyan (advice), I guess, but here are the closing thoughts. Rewriting software is not the worst idea if you reinvent parts of the stack, if you look at what you can do better. Blindly porting code over is not the thing; you have to be smart about what you are rewriting, and you have to re-architect things. Go is a great programming language; I am a big fan, and I would use it at my next startup or next company. It has a great standard library, and you do not depend on that many external libraries, which is good. So Go is great; thanks, Google. And have you used Parse.com? If you have not, you should. It is a great platform for building your app; a bunch of apps are already on it, 500,000 of them. So, use Parse.com. That is it; I will leave it open for questions.

Q: Great talk, thanks. I had a couple of questions. One: did you consider OCaml as an alternative when you were choosing languages? It is also statically typed and pretty fast. The second question: you talked about microservices. Would the rewrite perhaps have been easier if you had smaller services that you could replace one at a time?

A: We did not try OCaml, and I do not know why; nobody mentioned it, I guess. Nobody on my team had ever written any OCaml. We did talk about Erlang, and we tried it, but it is, again, dynamically typed, so we gave up on it. No idea why we did not try OCaml. About the second question: with microservices we probably would not have had to rewrite all of it. We would have rewritten only the services that were problematic, and that would have been cooler. But I do not think any successful startup, any successful company, builds microservices at the beginning; you are just trying to bootstrap something. Parse had been running for a year, and they just wanted to ship an app platform, and they were very successful at it: they got 60,000 apps on it, which is what they were looking for, and they built it as a monolith. Microservices done from the beginning are good, but one of the biggest problems I personally have with microservices is testing them. To build and test on your local box you have to boot up three or four services, and you probably need some sort of service discovery running locally if you do not want to keep typing port numbers. Monoliths get rid of that problem. I have seen some newer tools around Docker that take away most of this pain, but I think monoliths are still simpler: you have one binary, you poke at it, you see what is happening.

Q: Good talk; thanks for sharing your knowledge. Could you share how you were able to replay live traffic for your testing? What is the solution?

A: We had an HTTP proxy. Parse.com is, at the end of it, a REST API for the client SDKs. The proxy had two modes, log mode and live mode. Log mode would record the incoming and outgoing bytes for every request and put them in some sort of log file; then another tool, called flashback, would read this log file and play it against another server. Flashback would also measure the deltas and produce nicely colored output, so you would watch it on a screen: there are this many errors, this is the error, go work on it. We also had to write a library to do nice JSON diffs; it is called spewdiff or spewdump or something, and it is there in the facebookgo org. The other mode was live mode, where we would spawn two goroutines, send the request to both the Ruby and the Go stacks, and get both results back. If they were the same, just return the response and do nothing else. If there was a delta, return the Ruby version, because that was still the correct version, and log the difference. And it was not just a log file: internally we have this concept of a scribe stream, which is basically a log stream; I guess Kinesis is the Amazon equivalent. So we would put the delta into something like Kinesis, it would flow down to an aggregator, and an hourly job would tell us: these are your deltas in the last hour, your deltas are super high in the last hour, so something is probably wrong. Things like that are what we built around it, but the crux of it was finding the deltas. Thank you.
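Here is a sketch of that live mode in miniature, with placeholder upstream URLs and GET-only handling; a real version forwards methods, headers, bodies, and status codes, and ships deltas to a log stream instead of the process log.

```go
// A sketch of shadowing in live mode: fan the request out to the old and new
// stacks in parallel, serve the old stack's answer, log any delta.
package shadow

import (
	"bytes"
	"io"
	"log"
	"net/http"
)

func fetch(base string, r *http.Request, out chan<- []byte) {
	resp, err := http.Get(base + r.URL.RequestURI()) // sketch: GETs only
	if err != nil {
		out <- nil
		return
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	out <- body
}

func Handler(w http.ResponseWriter, r *http.Request) {
	rubyCh, goCh := make(chan []byte, 1), make(chan []byte, 1)
	go fetch("http://ruby.internal", r, rubyCh) // placeholder upstreams
	go fetch("http://go.internal", r, goCh)

	ruby, goResp := <-rubyCh, <-goCh
	if !bytes.Equal(ruby, goResp) {
		// In production this went to a log stream and an hourly aggregation
		// job; a structured JSON diff beats raw bytes here.
		log.Printf("shadow delta on %s", r.URL.Path)
	}
	w.Write(ruby) // the Ruby answer stays authoritative during the migration
}
```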
Q: Hello. You have different stacks and different applications and services running. My first question is: what was your main basis for deciding which programming language to pick? And the second one: how do you handle errors? When there are 60k apps running, talking different protocols, how do you get the events for a particular thing, where do the errors go, and do you automatically rectify them with some technique?

A: Yeah, I did not speak about that; we have an interesting part of the stack there too. Let me talk a little bit about the errors. For errors we have a system called logview; I guess the open-source equivalent is Sentry, if you have used it. What logview does is track every non-200, non-400 response on the server into a stream on one service, and aggregate all the similar ones into a collapsible UI, so we can see: okay, we have a spike of this, a spike of that. That is how we track errors. Another important thing we did in this system was compare errors with past history, which we found very useful. If the error rate is much higher than at the same time last week, something is wrong. There will always be some errors, and you cannot go looking at every single one, so we only go look when there is a major delta between the errors now and the errors at this time last week. That was very important, and our system allowed us to do it.

Q: How do you manage the threading when you are writing in Go? Goroutines: how did you handle them?

A: We try not to spawn any goroutines in a request. That is the idea; we try to keep it dumb. The HTTP stack underneath does the goroutines, and we just rely on it. Within a request we write code as if it is blocking: I am going to do this, then this, then this, and then respond. That is generally the beauty of Go: you write code as if it blocks. We have libraries which do things asynchronously, like metrics and log reporting; our billing is asynchronous too. Those are generally library components, and in our request code you would not see that they are concurrent. They feel like regular calls, and there is some buffering in the background inside the individual library. That is how we organized our code. We have a neat little library called muster, which we wrote for this kind of generic background batching: you put in a bunch of things, it batches them together, and executes them at once. If you have to write to a log service like Scribe, you do not want to make one write per request; so we put the items into a muster, it groups 50 objects, and then sends the batch over to Scribe. Our request code ends up looking dumb, do this, do this, do this, and that is the point. Muster is open source.
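Here is a stripped-down sketch of the muster idea, not the real library's API, just the shape of it: request code calls a plain-looking Add, and a background goroutine flushes batches by size or by timer.

```go
// A sketch of background batching: callers Add items synchronously, a
// background goroutine groups them and flushes a batch at a time.
package batcher

import "time"

type Batcher struct {
	in    chan string
	flush func(batch []string) // e.g. one RPC to the log service per batch
}

func New(flush func([]string)) *Batcher {
	b := &Batcher{in: make(chan string, 1000), flush: flush}
	go b.loop()
	return b
}

// Add looks like a regular call to request code; the batching is invisible.
func (b *Batcher) Add(item string) { b.in <- item }

func (b *Batcher) loop() {
	const maxBatch = 50
	tick := time.NewTicker(time.Second)
	var batch []string
	for {
		select {
		case item := <-b.in:
			batch = append(batch, item)
			if len(batch) >= maxBatch {
				b.flush(batch)
				batch = nil
			}
		case <-tick.C: // do not sit on a partial batch forever
			if len(batch) > 0 {
				b.flush(batch)
				batch = nil
			}
		}
	}
}
```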
Q: Hi; sorry, I joined late, so pardon me if you have already answered this in your presentation. What engineering process do you follow? You are doing tests, you keep doing experiments: how do you prioritize? How is the roadmap laid out?

A: My boss calls it cowboy engineering, and that is what we generally do, I guess; there is no single answer. We have a goal for six months and we try to hit it, but apart from that it depends on what your day-to-day problems are. Parse has basically operated in two modes. The first six months were "the house is on fire, how do we fix it", and that was very chaotic: okay, I am doing this thing for the next week, I am figuring out what is happening, I need to put in this band-aid so that we do not break when this big app goes on a Japanese TV show. The second stage is: we need to build out products now, so how do we go about it? That is the stage we are in currently. How do we prioritize? I do not know; we do not have a process for it. But one thing I think we do differently is put a timeout on discussions and meetings. We say: we are going to discuss this for four days, and at the end of it we are going to do something. It might be wrong, but there is a timeout; at the end of four days we decide, okay, we are going to write this new API because I felt like it, and that can be the reason, and then we go do it. And we try to minimize the time to do it: we will do this for two weeks, launch it, see how it goes, and try to get some information out of it. An example: we launched a slow-query tool recently. It took, I think, three weeks from concept to rollout, and there was a lot of planning around it; the UI needed to look shiny and neat. We launched it, it got 2,000 page views a day, and it did not really work out, so we gave it up. That is what we follow. Another thing we do at Parse differently from the rest of the company, I guess, is that because we are a developer platform, we focus on not rolling things back. We do not deprecate anything. If a product is not doing very well, we just leave it at that stage; we do not try to kill it, which I think Facebook does a lot, kill a lot of products. Because we are a developer platform, we give that a little more respect; we do not want the developers using those things to be disturbed. Those are the only two considerations we have. I do not know if this answers your question; it is kind of generic.

[In response to a question about issues with Go:] Yeah, we have faced a few issues. One of them, I think it was in Go 1.2, was that the goroutine stack size was fixed. When we were writing our push service, we would open a bunch of connections to the Apple push service, APNs, and it was consuming, I guess, four megabytes per connection. That was okay for most applications, but our application was a very dumb one that just opened a lot of connections, so it was not doing very well, and we found that irritating. We actually recompiled Golang with a lower value: it was a constant in the code base, we changed it to two megabytes, recompiled, called it our own version, and deployed our own binary. That worked out well, but I think in 1.3 or 1.4 they introduced resizable stacks, and that made our problem go away. That is one of the problems we found initially. We also found a few bugs in the HTTP library; we started discussions, and some of them got fixed, some are not fixed. And Go is opinionated, at times kind of stupidly: a nil slice gets encoded as null when you print it out as JSON. It caused a bug for us, and we did not realize what was happening for a long time. Then we figured out: oh, in Go you have to allocate a zero-length slice, or else it is going to do weird things. There are tiny, tiny things like that, but overall we are generally happy.
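That nil-slice behavior is easy to demonstrate:

```go
// encoding/json turns a nil slice into null, while a zero-length slice
// becomes []. A client expecting an array will choke on the former.
package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	var nilSlice []int
	emptySlice := []int{}

	a, _ := json.Marshal(nilSlice)
	b, _ := json.Marshal(emptySlice)

	fmt.Println(string(a)) // null
	fmt.Println(string(b)) // []
}
```

Allocating `[]int{}` instead of leaving the slice nil is the fix the talk alludes to.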
Q: I think you said the quota system in your multi-tenant setup was managed by the memcache counters. Was it on a per-app basis? Because if it is per-app, then some app may be underutilized but still hold a high quota, and another app cannot really use up those resources. And the second question: you talked about teeing production traffic to your test cluster, but that is a lot of traffic. Did you have a test cluster equivalent to the production cluster in size?

A: No, we only ever ran a percentage of the traffic; we did not run the whole thing. Or we could run one app's traffic; that is something we would do. We could set up a cluster that could take one app. We had the concept of siloed app clusters, so we could say: there is an app called Shorts; bring one more Shorts cluster up, and then compare Shorts versus Shorts. We could do things like that. The first question is about multi-tenancy and co-tenancy, and it depends on your business model and what you are offering to developers. Amazon offers elasticity as a service. We generally do not: we are not elastic, we are plastic. Every app has an upper request limit; the minimum is 30 requests per second, and you can go up from there to thousands of requests per second. As soon as you hit that limit, I am sorry, you cannot get more until you pay for it or do something else. So we are not elastic in that sense, and we do not really have the problem of "hey, we have some free resources, take them" versus "no free resources, sucks for you".

I guess this is the last question. We have a slow rollout process, in the sense that it depends on the size of the cluster at that point, but n nodes go out at a time: a batch of n each time, and then it just advances. It is linear, so the n is fixed; we do not do exponential ramp-ups and things like that. It is fairly simple and it works for us. There is no math; it is just because it was easy to code.