So, my name is Vedang. I work as a platform architect at Helpshift, and this is going to be a crisp talk on simple scalability patterns. Helpshift is a three-year-old startup in the mobile CRM space. As a startup, your greatest weapon is your agility: shipping faster is how you compete with established companies. At the same time, you want your system to be resilient. So any structural and architectural processes you implement need to take care of both of these constraints, and today I want to talk about certain simple things you can do within them. This is the agenda for my talk.

First up is monitoring using logs. Monitoring your system is really important; I hope we all agree on that. Without monitoring, you're just flying blind. Having said that, monitoring systems can be complex to set up. So, surprisingly or maybe not surprisingly, logs are very effective for monitoring system behavior. You can add logging around every network call you make and record stats like the ones on this slide. In a runtime like Clojure, which is what we use at Helpshift, you can even do this on the fly: collect a reasonable amount of data from production machines, then turn the logging off to avoid the performance penalty. Once you have logs like this, simple, widely available Unix tools like sort, cut, uniq, and grep will give you deep insight into what your system is actually doing. These surface the low-hanging fruit that you can quickly fix to gain performance boosts. If you know awk and sed as well, then you are basically a wizard and you can do whatever you want. This slide shows a simple macro to do something like this in Clojure, and it's very effective for the effort that goes into implementing it.

Next up is database access patterns. The big problem that goes undetected, and should not go undetected, is unbounded DB calls. These are calls where you don't have a bound on the size of the response you're expecting, or on the number of requests a certain call will make. No matter how hard you try, your dev and QA environments are never going to be the same as your production environment. Real-world usage is really hard to predict, and we rarely think about the effects of data piling up over time. That is when we see problems with unbounded DB calls: give production enough time and your data sets will grow really big, and then suddenly everything falls over because it runs out of memory. The way to counter this is to use batch sizes in your DB request abstractions. Check that your DB clients provide sane default batch sizes, and make sure your requests use batch sizes. In real-world code I have seen programmers explicitly override batch sizes and say "give me back all the data for this query", because that's convenient. Catch that in code review; it should never go into a production system. And give your developers abstractions over your data so that they can make chunked requests without having to think about this problem again and again.
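To make the logging idea concrete, here is a minimal sketch of that kind of macro. This is not the exact macro from the slide; the name `log-call` and the log format are just illustrative choices:

```clojure
(defmacro log-call
  "Evaluate body, log the elapsed wall-clock time tagged with call-name,
  and return body's value unchanged."
  [call-name & body]
  `(let [start#  (System/nanoTime)
         result# (do ~@body)
         ms#     (/ (- (System/nanoTime) start#) 1e6)]
     (println (format "call=%s elapsed-ms=%.2f" ~call-name ms#))
     result#))

;; Usage: wrap any network call and grep the timings out of your logs.
;; (log-call "fetch-user" (http/get user-url))
```

Because Clojure lets you redefine things on a live system, you can switch this kind of logging on and off on the fly, which is the point made above.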
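And on the batch-size point, a sketch of the kind of chunked-request abstraction you can hand to your developers; `fetch-page` is an assumed helper that takes a query, an offset, and a limit:

```clojure
(defn fetch-in-batches
  "Lazily walk all the results for query, batch-size rows at a time,
  so that no single DB call is unbounded."
  [fetch-page query batch-size]
  (letfn [(step [offset]
            (lazy-seq
             (let [page (fetch-page query offset batch-size)]
               (when (seq page)
                 (concat page (step (+ offset batch-size)))))))]
    (step 0)))

;; Callers can no longer ask for "all the data" in one go:
;; (fetch-in-batches fetch-page-fn {:status "open"} 500)
```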
Another big problem: I don't know if it's a functional programming thing or what it is, but a lot of code gets written without any thought about what the code's side effects are. These are all problems you should catch in code review. Fetch the minimum amount of data you're going to require; don't fetch the whole object just to return its ID. Batch your requests: if you're making repeated requests to the same collection or the same table, combine them and make them in one go. And don't fetch the same data again and again, especially within the same request. Maybe across requests you don't have the bandwidth to implement local caching, but within the same request I have seen people fetch the same thing again and again and again. Catch these problems in code review and they'll solve a lot of performance issues for you. Facebook has recently open-sourced a library called Haxl, which gives your developers safe abstractions over access to remote data. That is basically exactly what I was talking about, so you can check it out.

Next up on our agenda is serialization and deserialization. When you're building your data structures, think about how your data is going to flow through your system, not just for your current use case but also for the planned future. Every message queue you use, every remote cache you use: when you put data through it, it imposes a serialization and deserialization penalty on you. This slide shows an example of two data structures, one which stores dates as Date objects and one which stores dates as longs. The second data structure can be serialized and deserialized more than twice as fast as the first, and the fix is as simple as that: don't store Date objects, store longs.

Next up: network calls. The network is scary. When we are young and we write things like "became core Java expert in six months" on our resumes, we also believe that the network just works. But given enough time, it is going to burn you. There are full sessions at The Fifth Elephant this year dedicated to network flakiness and what happens in the face of network partitions, and I trust everyone here will attend them. For the purposes of my talk, I would just like to say: avoid network calls. Avoid them wherever possible. Network calls are slow, and they are the number one reason for cascading failures in your system. If your data is small and doesn't change too often, just cache it in memory. If your data is too big to cache in memory, compute it, put it on local disk, and use that. Do everything you can to remove every single redundant network call in your system. And this is the best gif I could come up with for it.
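Going back to the code-review point about fetching the same data repeatedly: a minimal sketch of a per-request cache, assuming a `fetch-doc` function that does the real DB call:

```clojure
(defn cached-fetch
  "Fetch id through fetch-doc at most once per cache, i.e. at most
  once per request if the caller creates a fresh cache per request."
  [cache fetch-doc id]
  (or (get @cache id)
      (let [doc (fetch-doc id)]
        (swap! cache assoc id doc)
        doc)))

;; Per request:
;; (let [cache (atom {})]
;;   ... (cached-fetch cache fetch-doc id) ...)
```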
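And to make the serialization example concrete, the two data structures on that slide look roughly like this; the field names here are my own illustration:

```clojure
(import 'java.util.Date)

;; Dates stored as objects: every serialization round trip has to
;; rebuild a full java.util.Date.
(defrecord EventWithDate [id ^Date created-at])

;; Dates stored as epoch millis: they serialize as plain numbers.
(defrecord EventWithLong [id ^long created-at])

;; Converting at the edges of the system is trivial:
(defn date->millis [^Date d] (.getTime d))
(defn millis->date [^long ms] (Date. ms))
```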
All right: integration points. An integration point is anywhere your system makes a call over the network or to some external resource, and these are the number one cause of failure in your system. For example, say your database becomes unresponsive. All the threads on your app server are making requests to the database and waiting for responses that are just not coming, so all of those threads get blocked. And when the app server threads are blocked, the web server that made requests to the app server is waiting for its responses, so all of its threads get blocked too. Slowly your system reaches a state where it cannot do any work at all, and then it just falls over.

There are two powerful patterns we can use to combat this kind of problem, and I hope everybody uses them: timeouts and circuit breakers. Every call you make to any resource should be configured to time out. This waiting-indefinitely business is just wrong. You should have default timeouts for everything, and if you're using a database client, make sure its default timeouts are present and sane. For example, Monger, the Clojure library for accessing MongoDB, supports a lot of timeouts, but the default value for many of them is to wait indefinitely, which is not very helpful.

The other half of timeouts is circuit breakers; the two patterns go really well together. Circuit breakers track the health of a resource and avoid badgering it once it has become unresponsive. If you know the resource has failed, you can fall back to a secondary source, or you can fail that particular operation immediately, without actually going all the way to the resource. This allows you to degrade your system gracefully. The idea here is that if you know a request is eventually going to fail, you might as well fail immediately. Any time you take a request, you already know all the resources you will need to satisfy it, so you can check the circuit breakers on those resources. If all the circuit breakers are operational, your request can proceed; if any circuit breaker has tripped, you know something is going to fail anyway, so you might as well fail the request. This means your system fails really quickly, it doesn't hold up resources for a long time, and it gives you a chance to recover piecemeal, without the failure cascading. Netflix has open-sourced a library called Hystrix, which gives you a nice implementation of circuit breakers along with a whole host of other scalability patterns; you can look into it if you're interested in these kinds of things.

Along with timeouts, circuit breakers have another really nice application: health checks. You should always have health checks for your system. A health check is just a way of querying a system, and the system answers "am I alive or not". It tells you whether your production service is responsive, and it is essential to support features like auto-scaling: the idea is that you bring up a machine, wait until the machine passes the health check, and only then send production traffic to it. Circuit breakers can act as a poor man's health check. Any time your machine comes up, you can query it to check whether all the circuit breakers are operational; if they all are, you can say, okay, fine, my service is up and it can take requests.
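For clients that don't expose a timeout knob, you can at least bound how long the calling thread waits. This is a crude sketch of my own using `future` and the timeout arm of `deref`, with `find-user` in the usage line standing in for any real call; note the caveat in the docstring:

```clojure
(defn call-with-timeout
  "Run f on another thread and wait at most timeout-ms for its result,
  returning fallback on timeout. Caveat: this bounds the caller's wait,
  but the underlying call keeps running on its own thread, so it is a
  stopgap, not a substitute for real client-level timeouts."
  [f timeout-ms fallback]
  (deref (future (f)) timeout-ms fallback))

;; Usage: (call-with-timeout #(find-user db id) 200 ::timed-out)
```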
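And here is a sketch of the fail-fast gate and the poor man's health check just described, assuming each breaker is an atom holding a map with a `:state` of `:closed`, `:open`, or `:half-open`; a fuller breaker sketch comes up again in the Q&A below:

```clojure
(defn breakers-operational?
  "True when none of the given breakers has tripped open."
  [breakers]
  (every? #(not= :open (:state @%)) breakers))

;; Fail the request immediately if any dependency's breaker is open.
(defn handle-request [breakers do-work]
  (if (breakers-operational? breakers)
    (do-work)
    {:status 503 :body "dependency down, failing fast"}))

;; The same predicate doubles as a poor man's health check endpoint:
(defn health-handler [breakers]
  (fn [_request]
    {:status (if (breakers-operational? breakers) 200 503)}))
```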
So, with all of these things in mind, if you revisit our stack: you have now implemented all of these patterns, so you can make sure that failures will not cross these boundaries. You can now be sure that if one part of your system fails, it doesn't bring everything else down; certain parts of your system can fail while the system keeps working, which is the aim in the end. You don't want one useless feature bringing down your essential, critical stuff.

So that's basically what I wanted to say today. The takeaway I want people to leave with is that designing for scalability is better than testing for scalability, because the testing is really hard: like I mentioned, it's really hard to simulate production environments. In fact, it is a really, really interesting problem. For some reason, if you say "QA", people get bored, as if QA is never as good as dev, but distributed-systems QA is, I would say, harder than distributed-systems dev. The other thing is that scalability is an important perspective: if you're building a feature, anything for production use, you have to think about how it is going to behave under load, and you have to design it so that it doesn't fall over. That's all I had to say today. Are there any questions?

What is the circuit breaker again?

Oh, okay, let me go back to this slide. I thought I would run out of time, so I went past it without explaining. A circuit breaker is basically something that trips when your resource becomes unresponsive, and this slide shows the life cycle of a circuit breaker. Any time you make a request, it goes through the circuit breaker to the resource, and the circuit breaker tracks the resource's responses. If a request is successful, the circuit breaker knows that everything is working fine. If requests fail a certain number of times, the circuit breaker knows that this resource is not responsive and there is no point in making the request, so it trips right there. The next time anybody asks for that resource, it just says: no, this is gone, not working. It then waits for a certain amount of time, and after that it moves into a state called half-open. For the next caller, the circuit breaker will actually make the request to the resource. If it's still failing, the breaker goes back to "nope, this is not working". If it succeeds, the breaker goes to the closed state, where requests flow through normally. So it's a simple pattern where you avoid making a doomed call, and you get the benefit of not having your resources locked up. And if you know that a circuit breaker has tripped, you can implement things like: okay, my database is not working, query this secondary, older, slower source instead. Your request might be slower, but it will still get some response. Or, when you're about to start handling a request, you can just check: I need to access the DB, the search engine, the cache; are all of these circuit breakers operational? If they all are, great, go ahead. If something has tripped, I'm not going to be able to serve the request anyway, so it might as well fail immediately, without making any calls at all. That's the essential logic here.
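Here is a minimal sketch of that life cycle in code. It is deliberately simplified: the read-then-swap is not race-free, and a real implementation like Hystrix does much more, but it shows the closed, open, half-open loop just described:

```clojure
(defn make-breaker [max-failures cool-off-ms]
  (atom {:state :closed :failures 0 :opened-at 0
         :max-failures max-failures :cool-off-ms cool-off-ms}))

(defn- now-ms [] (System/currentTimeMillis))

(defn call!
  "Run f through the breaker: return ::tripped without calling f while
  the breaker is open, otherwise record f's success or failure."
  [breaker f]
  (let [{:keys [state opened-at cool-off-ms]} @breaker
        ;; once the cool-off has passed, an open breaker lets one
        ;; trial call through (the half-open state)
        state (if (and (= state :open)
                       (>= (- (now-ms) opened-at) cool-off-ms))
                :half-open
                state)]
    (if (= state :open)
      ::tripped                         ; fail fast, no call made
      (try
        (let [result (f)]
          (swap! breaker assoc :state :closed :failures 0)
          result)
        (catch Exception _
          (swap! breaker
                 (fn [{:keys [failures max-failures] :as b}]
                   (if (or (= state :half-open)
                           (>= (inc failures) max-failures))
                     ;; trip: a failed half-open trial, or too many failures
                     (assoc b :state :open :failures 0 :opened-at (now-ms))
                     (update b :failures inc))))
          ::tripped)))))
```

A caller does `(call! db-breaker #(find-user db id))`, with `find-user` standing in for any real call, and either gets the result or `::tripped`, which it can turn into a fallback or an immediate failure.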
Anything else? Any other questions?

Hi, I'm Rohit. I think you had back pressure written in one of the slides.

Right, so, I timed myself and I was running over fifteen minutes, which is the time for this slot, so I ran through that part. Back pressure is another pattern for handling scalability problems. The idea here is basically that an analytics consumer is consuming from a queue. If something goes wrong with that consumer and it can't do anything, it just stops: it stops consuming. What this does is build back pressure on the queue. The consumer is saying, all right, I am not going to do anything, now it's the queue's responsibility to do something with this data. And being a queue, it's going to keep buffering the data up to a certain point, and then it's going to start dropping it. The idea is that you die in bits and pieces instead of bringing the whole system down. You don't say, oh, okay, I'm not going to be able to consume, so everything is finished. Instead you say: all right, I can't deal with this now, you deal with it. You give the responsibility to somebody else. That's back pressure.
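To sketch the back-pressure idea from that answer in code: it falls out naturally from putting a bounded queue between producer and consumer, so that when the consumer stops keeping up, the producer is forced to deal with the overflow explicitly instead of memory growing without bound:

```clojure
(import 'java.util.concurrent.ArrayBlockingQueue)

;; Bounded capacity is the whole point.
(def events (ArrayBlockingQueue. 10000))

(defn publish! [event]
  ;; .offer returns false instead of blocking when the queue is full,
  ;; so the producer must explicitly drop (or divert) the event.
  (when-not (.offer events event)
    (println "queue full, dropping event")))

;; The consumer loop just calls (.take events); if it stalls, the
;; pressure shows up at publish! instead of as an out-of-memory crash.
```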
Hi, Siddharth here. You made a very generic statement that we should avoid network calls. That goes very much against the microservices pattern, which is very famous these days. What are your thoughts on microservices?

I personally would say that microservices are something you don't need if you're starting out. Prematurely implementing microservices is, I think, a bigger mistake than going the monolithic way. There are certain important wins in the microservices way of thinking, such as loose coupling, but you can get all of those things in a monolithic architecture as well, because with microservices a lot of problems come up, mainly operational problems. It looks very nice from a dev perspective, but it is hell for ops people: now you don't have to maintain one service, you have to maintain five services, clusters for these services, fallbacks for these services, and the network problems between these services when they talk to each other. It's a lot of pain. So, basically, I would say take the whole microservices hype with a grain of salt. There is a cost to everything, and there is a cost to moving to a microservices architecture as well. There are patterns like channels and queues that you can implement in your monolithic system too; you don't need separate systems for these things.

Anybody else? Any other questions?

Hi, my name is Mohit. My question is: the circuit breaker trips after what duration of time?

That is up to you; you can configure these things.

The thing is, suppose I deploy my app on seven or ten different servers. The deployment has just happened, the first request is hitting a server, and at that point in time I'm checking the health of the application's different resources.

So, you can configure your circuit breakers to handle each kind of failure. I'm not saying that the first time a call fails, you trip the circuit breaker. You can have a counter and say that this must fail at least five times before I consider the resource down. You can also handle different failures differently. For example, if the cluster outright tells you it's down, with something like a 503 or a 504 error, you can handle that differently from slow responses. You can track: I expect this response to come in 10 milliseconds, but it is taking 50 milliseconds; that's a slow response, and I can behave differently for that scenario versus an outright exception.

Suppose I don't get responses in a consistent way: say the health checker runs at the same moment a user is requesting the app, and the user's response gets delayed. How can we handle those kinds of situations?

What I'm saying is that these are all checks; they are not silver bullets. A health check cannot verify the full system that is running in production; you can't check everything that is happening there. A health check is a simulated check: you say, these are my core things, and they are working. The definition of the health check is up to you. You might check whether a certain function is working, or whether a certain resource is accessible; whatever it is, it's up to you. But it is important to have health checks in your system, and if you don't have the time to implement a generic health check, circuit breakers can give you a simple solution there.

Thank you. Any other questions?

Hi. Could you elaborate a bit on graceful degradation?

Graceful degradation is, for example: if you're rendering a webpage, you render a simpler webpage. You take away certain actions the person can do on your site. You decide what is important to your system and what is not, and you keep stepping back until the point where you really can't do anything and your system has to fail; that's when you actually fail. You don't fail the moment the first part of your system crashes; you allow for a certain graceful way to die. Like I said, ideally you would like your response returned really quickly, but maybe the resource you were using has died, so you can fall back to something slower that still works. That is graceful degradation: it's not ideal, but it's working. That's the idea. Does that answer your question?
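As a small sketch of that answer, reusing the `call!` breaker from the earlier sketch, with `primary-fetch` and `secondary-fetch` as assumed helpers: try the fast primary source, and when its breaker has tripped, serve from a slower secondary instead of failing outright:

```clojure
(defn fetch-with-fallback
  "Graceful degradation: a slower answer beats no answer."
  [breaker primary-fetch secondary-fetch id]
  (let [result (call! breaker #(primary-fetch id))]
    (if (= result ::tripped)
      (secondary-fetch id)   ; degraded, but the user still gets data
      result)))
```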
Okay, thank you, Vedang. Thank you so much.

If you have any other questions, please catch up with me.