There's very little code in this talk, if you were expecting that. What I usually do is take up a small concept, something we probably don't associate with our day-to-day coding, but which is elementary and essential for understanding how to write good software. I've been working in the infrastructure and reliability space for over a decade, and what I've realized is that the concepts I learned long back, things Unix systems have had since the 1970s and 80s, are, done right, the same principles we need today to write good quality software.

So I'll start with a question. There are two servers with 8 gigabytes of RAM each, both 55% full. Is this system stable? Is it reliable? Yes or no; there's no third option. This system is actually not reliable, because come to think of it, if just one of them goes down, its 55% of load moves to the other side and makes it 110%. Standard wisdom says "I have two servers, plenty of free space, this is going to work," but it won't. It's just waiting for a failure.

Another question. Say I have three servers. At what utilization threshold should I start panicking? Should it be 66%? 75%? What should that number be? Because if one goes down, its load has to spread across the other two. Here 66% is the answer: the failed server's 66% splits into 33 and 33, the survivors land at 99%, and everything stays just barely stable (the arithmetic is sketched in code below).

Another question. This is a support ticket that came in a long time back. Somebody realized that on an AWS load balancer, traffic was being unevenly distributed across nodes. This bug was reported a year after the system was deployed, after it had served around half a billion credit scores. So the question is: how do we detect these things early enough? Is our software capable of handling these problems before they hit us, before we realize, a year late, that we're not doing this right?

The standard answer: "well, you aren't monitoring your system well," which brings me to a question. I'm not going to talk about monitoring as such, because if I ask people what monitoring is, the answer is that monitoring is what Prometheus does, monitoring is what Datadog does, or maybe what Sensu does, right? The problem with monitoring, and this is where I get a little linguistic, is that when you are monitoring something, you are already looking for something that you know. When I am monitoring for RAM exhaustion, I know I am monitoring for something that fails at X percent. That is the domain of known failures. It's not going to tell you what is unknown to you. What about unknown unknowns, the things you just don't know? You can't set an alert or a monitor for something you don't know about. This gets slightly tricky; you have to spend some time thinking about it. When you look at a system you want to monitor, there is no recipe that says "monitor every possible use case and every possible failure case." You apply monitors and triggers for the failure modes you already know. That is not enough to build a system that may fail in new ways later. And this is exactly where the concept of observability kicks in.
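Here's that capacity arithmetic as a quick back-of-the-envelope sketch (my illustration, not from the talk's slides): with N identical servers, a failed node's load spreads across the remaining N - 1, so the safe per-server utilization is (N - 1) / N.

```python
def safe_utilization(n_servers: int) -> float:
    """Max per-server utilization that still survives one node failure.

    If one of n identical servers dies, its load spreads evenly across
    the remaining n - 1, multiplying their load by n / (n - 1).
    Staying at or below (n - 1) / n keeps the survivors under 100%.
    """
    if n_servers < 2:
        return 0.0  # no redundancy: any failure is an outage
    return (n_servers - 1) / n_servers

# Two servers at 55%: one failure pushes the survivor to 110%.
print(safe_utilization(2))  # 0.5   -> 55% full is already unsafe
# Three servers: the 66% number from the talk.
print(safe_utilization(3))  # 0.666 -> 66% full means 99% after a failure
```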
I'm sure a lot of you have heard this term observability. Has anybody heard it? You have? You have? Everybody else? Okay. So what are we missing here? What we're missing is that systems are distributed in nature now, as we grow. A service no longer runs on one single machine, and failures go far beyond what we usually see: RAM failure, CPU failure, et cetera. What about accuracy? What about latency? What about correctness? What about consistency? These are the aspects of a system we usually don't look at. Say I deploy a new login service, or a payment service, or a cart service: how do I measure its accuracy? And as systems move towards more complex paradigms like eventual consistency, how do we measure that? How do I know whether the response I'm getting is actually correct? You know eventual consistency patterns, right? I wrote something to a database, I read from another shard, and I didn't get the right value back. Is the system correct or incorrect? Is that an acceptable case for the business or not? So the monitors themselves have to get pretty complex as the software evolves, and so does the consistency story around it.

Another problem with monitoring is that it only ever tells you about a problem up to now, or from now on. How many of you are aware of kernel traces? Has anybody ever used strace? ptrace, these things? Or tcpdump; people are aware of tcpdump, right? What does tcpdump do? When you are debugging a network, you start tcpdump and it starts showing you something: log lines keep coming out on the screen. That happens because somebody once wrote software which dumps everything on the TCP stack when a monitor is attached via tcpdump. The ability to debug was baked into the software; only then could you debug it later. It doesn't work the other way around, where you push another build, emit more metrics and more log lines, and then debug, because the problem has already happened. To watch a running system, the hooks and levers have to be present in the software. I'll walk you through all of this, don't worry. But do you agree with me that the standard approach, where we go to Datadog, go to AWS CloudWatch, install another alert, and call it done, is probably not enough? That's what I'm trying to say so far.

I'll walk you through a bit of the monitoring timeline as well. Early on, we monitored servers. As we moved ahead, we started writing complex services, so we started monitoring services. Late 2000s, we moved towards service-oriented architectures, and we had the ability to monitor those as well. Early 2010s brought microservices, then lambdas as nano-services, and now there's "functionless" as well, all of which falls under the same umbrella. So the whole paradigm of monitoring is also getting pretty complex. The real question everybody should ask: if a tree fell in a forest and nobody heard it, did it make a sound? Has nobody ever heard this one? Nobody? Oh, cool. Okay. Anyway, the answer is: all falling trees shall yield log lines. Software by design is opaque; to debug and control a running system, you need observability built in.
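As a minimal sketch of what "hooks baked into the software" can mean in everyday code (my example, not from the talk), here is a request handler that emits a structured event for every call, success or failure, so the data exists before you ever need to debug:

```python
import json
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("checkout")

def handle_request(payload: dict) -> dict:
    """Handle one request and emit a structured event on every path."""
    event = {"event": "request_handled", "request_id": str(uuid.uuid4())}
    start = time.monotonic()
    try:
        result = {"status": "ok", "items": len(payload.get("items", []))}
        event["outcome"] = "success"
        return result
    except Exception as exc:
        event["outcome"] = "error"
        event["error"] = repr(exc)
        raise
    finally:
        # Emitted whether we succeeded or failed, so the trail exists
        # even for the failures you never anticipated.
        event["duration_ms"] = round((time.monotonic() - start) * 1000, 2)
        log.info(json.dumps(event))

handle_request({"items": [1, 2, 3]})
```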
What I'm trying to say here, again: let's take a checkout application, or a payment application. A user comes in, clicks "pay now," and nothing happens. How do I debug this running system? Why is it not working? Where should I go? What should I look for? Should I look at existing stats, which are clearly not enough, because if they were enough I would already have found something there? Debugging a running system is next to impossible unless it has emitted every single data point you will eventually need. This is where tracing comes into the picture as well. When we say request tracing, why do we use it? For the same reason: when something fails tomorrow, there should be ample data points available for me to retrospectively work out what failed. Because after something has failed, you can't go back and fix the past. I mean, if that were possible, I would never have failed my 10th grade, right? I could go back and fix things, but I can't.

And if you look at software, what are the things we track? Things like daily active users. Can you do that without emitting data from within the software? There have to be data points you emit. There have to be trails. There have to be analytics.

Now, are you aware of request tracing, distributed tracing? Whoever doesn't understand this, raise your hand and I'll explain. Do we understand request tracing? Does anybody want me to explain what it is? Okay, good. Have you worked with distributed services, microservices architectures? Okay. Service A calls service B, right? Where do you work? ServiceNow, okay. Do you have an architecture where multiple teams maintain services? Okay. Your API calls another API, which calls another API. Now, in this flow of requests, what if something fails? How do we debug what failed? On top of this: what language do you code in? Python, okay. Has anybody coded in Java? No problem if you haven't. Say there are a million requests coming in. You see a million log lines on service A and a million log lines on service B. You can't read a million log lines, right? You need identifiers. You need to know: this request, in this million, is mine, and this line in that other million is the one corresponding to it. At that scale you need correlation, so you can trace that a request came in, went to service A, service A called service B, and the same request ID that we're catching and trapping here goes on to service C, so you can track it. If the call to service C did not happen, we've isolated our bug. This is request tracing; this is its purpose. Does that explain it now?

This is something you should be doing if you aren't already. A lot of the time, you'll have heard people say "emit logs for tomorrow, when a failure happens." What we usually do instead is see a bug happening and ship another patch with a log line enabled. But that log is for tomorrow, not for today; today's failure you can no longer debug. So the idea is: there are logs, right?
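Here's a minimal sketch of that correlation idea (my illustration; the header name and service URL are made up): generate a request ID at the edge, pass it along on every downstream call, and log it at each hop.

```python
import logging
import uuid

import requests  # assumed available; any HTTP client works the same way

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s request_id=%(request_id)s %(message)s")
log = logging.getLogger("service-a")

TRACE_HEADER = "X-Request-ID"  # hypothetical header name; pick one convention

def handle_incoming(headers: dict) -> None:
    # Reuse the caller's ID if present, otherwise start a new trace.
    request_id = headers.get(TRACE_HEADER, str(uuid.uuid4()))
    extra = {"request_id": request_id}
    log.info("received request", extra=extra)

    # Propagate the same ID to service B, so its million log lines
    # contain the one identifier that ties this flow together.
    resp = requests.get("https://service-b.internal/api",  # hypothetical URL
                        headers={TRACE_HEADER: request_id}, timeout=5)
    log.info("service B answered %s", resp.status_code, extra=extra)
```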
Now, if you look at a standard system: have you used DTrace, the tracing system in the kernel? So how do you debug there? Okay, let's take a simple thing. You have a process which is supposed to call google.com, and it is not calling google.com. How do you debug this? Any thoughts, anybody? That's a good one. Anything else? You could probably use strace for that, and you could use tcpdump as well, to see whether it is actually making an outbound connection or not. These tools work on emissions happening on the TCP stack, and that's why you can debug with them. So: emit as much as you can, collect as much as you can, and only then will you be able to debug something tomorrow when it fails.

Coming back to it: what is the need for observability? The need is debugging and pattern detection. It doesn't really have to be failure-centric. You can actually build patterns. Now, this is important. Why "pattern" here? Because pattern is something I will carry forward into control theory. A pattern is: I see that every time a request goes to service C it starts degrading, but on service B it doesn't. If I had data being emitted, and this data could be as simple as a log line saying "request sent, total time taken was Y," just events like that, and I'm monitoring them, then I know that C is the service where pressure is building. That helps me take conscious decisions: if service C is degrading, I probably need to install another instance of it. That's not debugging a failure; that's building a pattern.

This next bit I've actually taken from a slide by the CTO of Joyent Cloud: designing for debuggability affects true software robustness by differentiating operational failures from programmatic logic failures. Take a moment to think about what's happening here; there's a lot of information in it. Operational failures should be handled; programmatic failures should be debugged. Operational failures, like a disk going bad, a network outage, the CPU getting full, the RAM getting full, can be handled externally. Logic faults cannot be handled externally. Logic faults need to be handled by you, and that only works if you have enough information to know that a logic fault is happening. To get there, you have to write software which emits so much information that tomorrow you can debug it. And interestingly, ironically, the more a software is designed for debuggability, the less time you will spend debugging it. What he means is this: in a software which emitted every single data point, say imperative code where every register access is recorded somewhere, when an issue happens you can go directly to the single line where it happened. Software designed for debuggability costs you no debugging time, because you already know exactly where the fault is. A pretty obvious thing to state, but it takes a while to hit us. And the more you leverage it, the more it gets used to debug the software around it.
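Going back to that pattern point for a second, here's a sketch (mine, with made-up numbers) of how simple emitted events are enough to see where pressure is building:

```python
from collections import defaultdict
from statistics import mean

# Imagine these were parsed from simple emitted lines like
# "service=C request_sent total_ms=2200" (hypothetical events).
events = [
    {"service": "A", "total_ms": 12},
    {"service": "B", "total_ms": 9},
    {"service": "C", "total_ms": 1800},
    {"service": "C", "total_ms": 2200},
]

latencies = defaultdict(list)
for e in events:
    latencies[e["service"]].append(e["total_ms"])

for service, samples in sorted(latencies.items()):
    avg = mean(samples)
    flag = "  <-- pressure building here" if avg > 1000 else ""
    print(f"service {service}: avg {avg:.0f} ms over {len(samples)} calls{flag}")
```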
So let's say a bigger service calls another service, which calls another service. Now, if the innermost service is perfectly debuggable, you will not spend any time debugging it; you will spend time debugging the services outside of it. Example: service A called B, which called C. C has no issue at all, because it's perfectly debuggable. When you look at an issue, there are logs and there are metrics that let you say, "okay, service C worked beautifully." Because C was debuggable, the focus instantly moves to B. C has worked, for sure; let's spend some time debugging B. B has all the logs you want, so the issue can't be in B. Now let's spend time debugging A. So the one which is not debuggable is the one where you spend most of your time debugging. Does that make sense? Okay. Are you finding this a waste of your time? Was that a yes?

So the difference is: unlike monitoring, observability is not failure-centric. One thing you have to understand is that these are not two opposing topics; they are orthogonal. You use monitoring to make something observable. You use logging to make something observable. Observability is like a superset: how do we make something observable? Monitoring, logging, et cetera. It's not a "versus" concept, and this is where people get a lot of confusion. So, is observability "monitoring 2.0"? No, it's not. Observability is an adjective: this system is observable. It doesn't mean somebody is observing it. It doesn't mean it is being monitored. It's just observable. Monitoring is the act of doing something. Even linguistically, these are two different things.

Now, I want to take this time to move towards control theory a little. How many of you know control theory? Did you study it in college? It used to be in those electrical and electronics subjects. Everybody's long forgotten it? Cool, no problem. If you look at a piece of software, its mission is to give you bounded values. What I mean by bounded is that the output should be as predictable as possible. Let's say you wrote a function, def sum(a, b), with two arguments. What if you couldn't predict its outcome? Could you rely on that software? No, right? So you have to build something where the boundary conditions are known: it moves from X to Y. Say a piece of software consumes memory somewhere between 10 gigabytes and 100 gigabytes, and CPU between X and Y. If X and Y are not known, you can't build reliable software. Put it this way: you are building a login service which may require anything from one MB of RAM to one petabyte. Can you build a system that way? It's not possible, right? The boundary conditions are suddenly outside your control, because one petabyte is a different topology and one MB is a different topology. If I said the database footprint for this service is anywhere between 5 connections and 5 billion connections, could you build a system that way? You can't. The point we're making is: any system you build has to operate within bounds. Now, you can ignore this slide; it's just there to look pretty and catch your attention. Basically, you wrap any function in a sine and it starts yielding bounded values. Anybody want me to explain it? I can. No? Okay, cool.
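For the curious, here's that boundedness in a few lines of Python (my illustration of what I take the slide to be doing; the inner function is arbitrary):

```python
import math

def unbounded(x: float) -> float:
    return x ** 3  # grows without limit

def bounded(x: float) -> float:
    # sin() of anything lands in [-1, 1], whatever the inner function does.
    return math.sin(unbounded(x))

for x in (0.1, 10.0, 1000.0):
    print(f"x={x:>7}: unbounded={unbounded(x):>13.1f}  bounded={bounded(x):+.3f}")
```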
So the basic premise of control theory is: you give an input to a system. Most diagrams of this look like electronics, and you think "this is nonsense, this is not for us." Just wait a few minutes. There's an input. The input goes to a controller. The controller passes this input to a system, which yields an output. A lot of you won't associate with this yet, so let me catch your attention and make it concrete. Let's make the system our API. Let's make the controller our load balancer. Let's make the input, this dot, a mobile phone. Does this make sense now? Instagram, load balancer, mobile phone. Do we associate now? Okay, good.

Now, in this system, we have to add bounded behavior; we are going to control it. Say a request came in and an output came out. What if we started measuring the output and capturing it somewhere? The load balancer sends a request to something, your TikTok video or your Instagram upload. The first time, it takes 10 milliseconds. We could be observing any attribute of the system; the assumption is that the system is observable, right? So right now I'm going to observe latency: 10 milliseconds when it goes to server A, 20 milliseconds when it goes to server B. And a pattern emerges. Suddenly I realize that every request reaching server A completes within 0 to 10 milliseconds, but requests to server B take around 2200 milliseconds. Just data being recorded; we're not taking any decision yet. We just capture those data points.

Now, what sort of decisions can I take on top of this? If I start feeding this back into my load balancer somehow, I can take decisions like: do not hit server B at all, because it will result in degraded performance. These are decisions I can take. I'm not saying I have to. These are tools and levers that allow you to write better software. Now you ask, how would I do that? That's a different concept altogether; there's another talk I did about load balancing techniques, and you can watch that. Right here, though, you can already connect to a different endpoint. You say, "but this is my mobile phone; how would my phone take that decision?" DNS takes that decision for you. On AWS Route 53, remember there's weighted balancing? Ever wondered what it is? Something like this is happening there at the back. Your phone makes a DNS query because it has to connect to the API endpoint. The IP address the DNS resolver returns takes into account your load and other such statistics, and it always tries to give you the best one. So this ability to record data, to make the system observable, starts allowing you to make better decisions and write better software. This is already happening in a hundred other places you may never know of. DNS is one example; I'll give you another very shortly. Are you able to associate with this better now?
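That latency-feedback idea, as a toy sketch (mine; server names and numbers are placeholders): record observed latency per backend, and route the next request to whichever backend looks healthiest.

```python
import random
from collections import defaultdict, deque

class FeedbackBalancer:
    """Pick backends using observed latency as the feedback signal."""

    def __init__(self, backends, window=20):
        self.backends = list(backends)
        # Sliding window of recent latencies per backend; old samples age out.
        self.samples = defaultdict(lambda: deque(maxlen=window))

    def choose(self) -> str:
        def score(backend):
            s = self.samples[backend]
            return sum(s) / len(s) if s else 0.0  # unknown backends get a chance
        return min(self.backends, key=score)

    def record(self, backend: str, latency_ms: float) -> None:
        self.samples[backend].append(latency_ms)

lb = FeedbackBalancer(["server-a", "server-b"])
for _ in range(100):
    b = lb.choose()
    # Pretend server-b degrades to ~2200 ms while server-a stays ~10 ms.
    latency = random.uniform(0, 10) if b == "server-a" else random.uniform(2000, 2400)
    lb.record(b, latency)
print("next pick:", lb.choose())  # converges on server-a
```

In production you'd dampen this with some randomization so a recovered backend gets re-probed; the sliding window here is what lets stale observations expire.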
I'm going to take a slightly more complicated example. Have you heard of Envoy, and ever wondered what these things do? Envoy, Kubernetes: is that making sense now? Has anybody heard the word Hystrix as well? Yeah? Cool. So, Hystrix, for those who don't know. Who doesn't know? I'll quickly explain, one minute more. Fantastic, let's spend half a minute on this. It's a set of utilities released by Netflix. You can think of these as design principles which were also coded into an app and made available as a library: Java first, then Golang did it too, and Python may have it as well. What it is, is a collection of good practices. One of the things it does is back off on a failure, like the failure of a request.

So, how many of you write APIs here? Has anybody integrated with a third-party payment service, or any third-party service? Has anybody ever made a requests.get or requests.post call? Good; we have to get to a point where everybody's with me. Now, when you do a requests.post, say x = requests.post(json_body), and this line gets an error, what do you do? Try/except. And what happens in the except? Fantastic: retry. See, there's a decision you're making: I'm going to retry this. You're already doing this on a daily basis, right? Now, this retry: how long do you wait? Do you retry instantly? Do you retry after a while? How does that happen? These are decisions you have to make. So there are algorithms here. Backoff says that if the first retry did not work, the problem is not going to fix itself automatically. Wait for a while, and grow the wait: one, two, four, eight, some such schedule.

Another pattern they use is circuit breaking. What circuit breaking does is handle the case where a domain-level failure has happened. This shows up in slightly convoluted architectures. Let's say you are at Amazon scale, and I'm sure the Flipkart folks do it too. You realize that a certain route path, a domain, or an endpoint is failing, and it's going to fail for everybody. Say some XYZ payment service, call it alphapay.com, I'm not going to use any real name here, has gone down. It's not going to magically work for any user on your website; it's going to fail for everybody. Now imagine there is a Diwali sale going on, and you suddenly realize the payment service has gone down. You can throw an error. You can keep retrying. It's not going to solve anything, right? So retries didn't solve it; what do we do next? Should we let this retrying run for all the million users and transactions happening on the website right now? It's not going to solve anything. What you could do instead, once you realize that a certain critical mass has failed, is cut the domain off: we're not going to try this at all. I'm not going to attempt a payment at all right now; let's process all the orders and do the payments asynchronously. That's a decision you could take as well. Suddenly you've moved from synchronous to queue-based: process everything, fulfillment happens later, because you want a better user experience. These things allow you to build flows where the perception of your software becomes more reliable as well.
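Here is a compact sketch of both patterns together (mine, not Hystrix's actual API; the endpoint and thresholds are made up): exponential backoff around a flaky call, plus a crude circuit breaker that stops trying after a streak of failures.

```python
import time

import requests  # assumed available

FAILURE_THRESHOLD = 5      # consecutive failures before the circuit opens
COOL_OFF_SECONDS = 30.0    # how long to stay open before trying again

_failures = 0
_opened_at = 0.0

def pay(body: dict):
    """Call the payment endpoint with backoff; None means 'queue it for later'."""
    global _failures, _opened_at

    # Circuit open: skip the call entirely and let the caller queue the order.
    if _failures >= FAILURE_THRESHOLD and time.time() - _opened_at < COOL_OFF_SECONDS:
        return None

    delay = 1.0
    for _attempt in range(4):
        try:
            resp = requests.post("https://alphapay.example/charge",  # hypothetical
                                 json=body, timeout=3)
            resp.raise_for_status()
            _failures = 0  # success closes the circuit
            return resp.json()
        except requests.RequestException:
            time.sleep(delay)
            delay *= 2  # 1, 2, 4, 8: exponential backoff
    _failures += 1
    if _failures == FAILURE_THRESHOLD:
        _opened_at = time.time()
    return None
```

When pay() returns None here, the caller can put the order on a queue and fulfill it asynchronously, which is exactly the synchronous-to-queue switch described above.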
And that's exactly what you want to do, right? Because it's not just the technology; the business also needs the product to be seen as reliable. Makes sense so far? Cool.

So I'm going to take an example now. How many of you know TCP flow control as an algorithm? Okay, you can take a nap. Everyone else, do you want me to explain quickly how TCP flow control works? Because this is one of the beautiful examples of control theory in practice, in use for twenty or thirty odd years now, and we just haven't paid attention to it.

So, how TCP flow control works: from A to B, data has to be transferred. There are X bytes to transfer. We take the MTU size and split the bytes into packets. At any point, if you're not with me, just raise your hand; it's okay, silently raise it. The ones at the back too; the people in front can't see you. Let's say there are 45,000 bytes to be transferred, and we'll take each packet size as 1500. So how many packets fill the buffer? 30. Cool. And say the total file is 300 packets. The first thing that happens when A establishes a TCP connection, after the SYN, the ACK, et cetera, all that handshake history is done, is that a window size is transmitted. The window size comes from the receiver's side: there is a TCP buffer there that fills before the application can read from it. Are you aware of this part? If you're not: on the consumer side of a TCP connection, the receiver side, in-memory buffers are created before an application reads from them. When you do a requests.get, underneath there is a connection which has a socket, and your socket.read is actually reading from a buffer which has already received data from the other side of the pipe. Imagine it is a pipe: I pour water in, and before you open the tap on your end, water has already collected on your side. It's as simple as that.

Now B says: I have a buffer on my side of 30 packets. You can translate 45,000 bytes to 30 packets now, fair enough, because 1500 is the packet size we're taking. So B can accept 30 packets into its buffer before the application, Python, Golang, Java, .NET, doesn't matter, reads them; it's just a TCP socket, it's language-agnostic. A realizes it can send 30 packets, so it transmits 30 packets. B then sends an acknowledgement that 10 have been received and consumed. Out of the entire 300 packets, A still has more to send, but now it sends only 10 more. Why? Because only 10 were acknowledged, and 30 packets, which is 45,000 bytes, is the maximum the window can hold.
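A toy simulation of that window arithmetic (my sketch, using the talk's numbers and a fixed window for simplicity):

```python
PACKET = 1500            # bytes per packet (the MTU-sized chunks)
WINDOW = 30 * PACKET     # receiver buffer: 45,000 bytes = 30 packets

sent_unacked = 0         # bytes in flight, not yet acknowledged

def can_send(packets: int) -> bool:
    return sent_unacked + packets * PACKET <= WINDOW

def send(packets: int) -> None:
    global sent_unacked
    assert can_send(packets)
    sent_unacked += packets * PACKET

def ack(packets: int) -> None:
    global sent_unacked
    sent_unacked -= packets * PACKET

send(30)            # A fills the whole window
print(can_send(1))  # False: window exhausted, A must wait
ack(10)             # B's application consumed 10 packets
send(10)            # exactly 10 packets of room had opened up
print(can_send(1))  # False again: the window is full once more
```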
Now suppose the application on B's side, the Python application, has become slow. It's doing some time.sleep or writing to a database, and it hasn't read from the buffer yet. B's next acknowledgement advertises a window of only 30,000 bytes, and A cannot send any more, because the 20 packets already in flight are exactly 30,000 bytes. So A won't send any further. After a while, B acknowledges 10 more packets, but the application is still sluggish, so the advertised receiver window is 0. A cannot send anything, because there is no room left. On its own, that would mean a dead connection, so after a while A sends a probe as well. All of this has been baked into TCP for the past twenty-odd years, and this is exactly what control theory is, right? There is feedback constantly being received by your system. Underneath the everyday code you write, the Python code, the .NET code, the Java code, this is happening. There's a control feedback loop, that feedback helps you write reliable software, and this is how TCP reliably sends data. Fair enough on this? Cool.

Now we're going to look at another aspect of TCP: congestion control. TCP is itself an amazing protocol; you can just read it and understand so much from it. Flow control ensured that we never send more data than the receiver's ability to absorb. Congestion control takes it one step further: it also takes network failures into account. Every time TCP starts a connection (and no, that's just my timer, ten minutes left), the first thing it does is send only a few packets. What it's not going to do is this: if there's one gigabyte to transfer and an MTU of 1500, it won't try to transmit the entire gigabyte just because it has the ability to retry. It's not going to sit there choking your pipe, retrying a gigabyte every time. That's not how it works. It first sends one packet and gets an acknowledgement. Then it tries two packets and gets two acknowledgements. Then four, then eight, then sixteen. There's an exponential growth happening, a growing trust in the other side. This is called slow start, and every TCP connection begins with it. Then, at a certain tipping point, it realizes that this exponential growth cannot be sustained, because networks are unstable. Next time you write an application, I want you to remember this; even if you don't believe me now, trust me, one day you will: networks are going to be unstable. There is no network which is stable, and TCP works on the same assumption. So after that tipping point, it moves into a linear mode: no more exponential, grow linearly until an acknowledgement fails to arrive within the expected time frame. If it expects the acknowledgement back within 10 milliseconds and it doesn't come, it marks that as a failure. On a failure, it drops the rate by 50%. After two or three subsequent failures, it falls all the way back to slow start, and TCP starts again.

So if you ever wondered, and I'm assuming everybody has done torrents: what is rule number one of torrents? The more you watch it, the slower it gets. Many times you're trying to download a file or initiate a TCP connection and it behaves weirdly. What you can do is simply check your ping for packet drops. Because every time there's a packet drop, your connection falls back to slow start, the burst of packet transfer that was happening stops, and it has to build up all over again. A flaky connection is actually worse than no connection: you think "oh, I've got 16 Mbps of bandwidth," but because it's dropping packets every once in a while, it is as good as a few Mbps of connection, always starting again, falling back after a couple of subsequent failures.
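A rough sketch of that window evolution (my simplification; real stacks such as CUBIC differ in the details):

```python
def congestion_window(events, ssthresh: int = 16):
    """Trace the congestion window across a series of 'ack'/'loss' events."""
    cwnd, trace = 1, []
    for event in events:
        if event == "ack":
            if cwnd < ssthresh:
                cwnd *= 2                 # slow start: exponential growth
            else:
                cwnd += 1                 # congestion avoidance: linear growth
        else:                             # a loss / missed acknowledgement
            ssthresh = max(cwnd // 2, 1)  # drop the threshold by 50%
            cwnd = 1                      # fall all the way back to slow start
        trace.append(cwnd)
    return trace

print(congestion_window(["ack"] * 6 + ["loss"] + ["ack"] * 4))
# [2, 4, 8, 16, 17, 18, 1, 2, 4, 8, 16]
```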
I mean, all of this can be changed. There's CUBIC, there are plenty of other protocol implementations, and you can tune these, but I'm describing the default, which is how everybody uses it.

I'm now going to take practical applications, because you're thinking, "okay, a lot of this feels like theory; why should I be using this, why should I even care?" The practical applications are constrained optimization problems. Why do we need control theory, or any such system, to be implemented? And I don't want you to look at it as a complicated thing. "Oh, it's control theory." No, it's just a feedback loop. You're doing something, you're recording the data, and you may decide to act on that data. One is the constraint optimization problem: given a set of constraints, how can I deliver reliably and better? Exactly what TCP does. Then there's the constraint scaling problem: once the feedback tells you that the current constraint is not enough, you have to change the constraint itself. Example: auto scaling. You get feedback that RAM is not sufficient, CPU is not sufficient: let's scale it. This is a feedback loop. Everybody knows this concept of auto scaling in Kubernetes and AWS, and this is what's happening: data constantly comes in, you keep monitoring it, you have a certain set of rules, and using those rules you change capacity. But you can't just keep expanding; you have to shrink as well at some point. That dynamic up and down is just a feedback loop cycle.

Cache. When you use a cache in your system, when you use complex databases like Cassandra or Aerospike, disk caches, SSDs, and all that: underneath, all of them are being intelligent. Whatever you access more often gets cached automatically. This is where all those algorithms come in: least recently used, most recently used, most frequently used, and so on. Every time you build one of these caching systems, you are using a feedback loop.

Gateways and load balancers. When a DDoS happens, when Cloudflare comes and tells you "hey, we're going to prevent DDoS for you," what are they doing? They're just using a feedback loop mechanism underneath. Requests come in, there's an IP address, and what they see is: oh, this is where the request burst is coming from, this is what's inflating my throughput, this is where I see a pattern, so let's block it. There's a certain action they take. Load balancers use this all the time too, because the very concept of a load balancer assumes a definition of load, and load cannot be defined by the sender. It has to be defined by the receiver. It's a feedback loop in which the receiver tells you: this is my current load, don't send any further, back off, don't give me any more. Logstash is built on this as well. That's why, on the Elastic/ELK stack, when you send in logs from Filebeat, Logstash says: hey, back off, don't send me more files, Filebeat.
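A minimal sketch of that receiver-defined load (my illustration, not Logstash's actual protocol): the receiver accepts items only while it has room, and the sender backs off when told.

```python
import queue
import time

class Receiver:
    """The receiver, not the sender, defines what 'load' means."""

    def __init__(self, capacity: int = 3):
        self.buffer = queue.Queue(maxsize=capacity)

    def offer(self, item) -> bool:
        try:
            self.buffer.put_nowait(item)
            return True                  # accepted
        except queue.Full:
            return False                 # feedback: back off, I'm full

    def drain_one(self):
        return None if self.buffer.empty() else self.buffer.get_nowait()

receiver = Receiver()
backoff = 0.1
for line in ["log1", "log2", "log3", "log4", "log5"]:
    while not receiver.offer(line):
        print(f"receiver full; sender backs off {backoff:.1f}s")
        time.sleep(backoff)
        backoff = min(backoff * 2, 2.0)  # exponential backoff, capped
        receiver.drain_one()             # meanwhile the receiving app catches up
    backoff = 0.1                        # reset after a successful send
```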
That's another feedback loop cycle. Lastly, progressive streaming. Everybody watches Netflix. Have you noticed it doesn't stop? No matter how bad the internet you're on, Telegram won't connect, WhatsApp won't connect, but Netflix keeps working. Progressive streaming. What they're doing is, on every single packet transfer, on every adaptive chunk transfer, the bandwidth is being reported: X bytes were sent in Y milliseconds the first time, and now the same amount takes two Y. The bandwidth is fluctuating, so drop the rate at which you're sending. That is progressive streaming. Have you seen it? When you watch a video, the quality setting says "auto." That's exactly this: it adapts itself. Cool.

Converging thoughts: software by default is opaque. To debug and control a running system, you need observability and observation built in. Thank you. I'm Piyush Varma.