All right, so feature flags, they're awesome, right? Like, what could possibly go wrong? Some of you may remember, in November of 2013 at Connect, we had a bunch of new functionality we wanted to roll out. And we said, hey, we wanna keep this stuff super secret, super hidden. The keynote was at, I think, 9 or 10 a.m. that morning in New York. We flipped the feature flags at 8 a.m., right beforehand, and it didn't go well. We had all kinds of problems. It took us a couple of weeks to fully root cause everything. And the worst thing was, this was on launch day, Wednesday, November the 13th, and we sent the system into a tailspin. Because we flipped on so many different feature flags, so many new features, clearly we hadn't understood all the interactions, and certainly not at production scale. We had all kinds of things happen. For example, as we dug into it, it turned out we were running out of SNAT ports on the SLB. When you look at TCP/IP, you've got 16 bits for the port, so once you hit 65,535 ports, you're gonna run out. We had all kinds of problems.

So we realized from that experience that flipping on a bunch of feature flags right before somebody gets on stage is a bad idea. Brian Harry was on the phone; he was in New York, I'm sitting at my desk. He calls me and asks, in so many words, what's going on? So we're trying to sort this out. Thankfully, somebody had the good foresight to put together a deck with screenshots in case things weren't working, so his manager, Soma at the time, could do his talk without the system live. So we learned: don't do that.

And to make it worse, and I'll take a question here in a second, remember we were talking about how we only had one instance at one time? This was that time. We had two things at that point: in the spring of 2013 we had factored out SPS, so we had SPS, and we had one instance of TFS. That was it. So this problem affected the whole world. The blast radius was global.

And so what do we do now? We turn these things on incrementally, and at least 24 hours ahead of any event, we turn all that stuff on. Now, we may hide a few last things, like a button that doesn't show up or something, but everything else is on. We've, of course, solicited some of you to go through and try this stuff out. As a good counterexample, showing that we learned from this: in November of 2015, we rolled out the marketplace. We asked some of you, hey, go try out the marketplace, we're gonna roll this thing out, please don't tell anybody. And you did. That one went really, really well, but we had everything turned on. In that case, I think a full 48 hours before the event, we had it all tested in production. We tried it, had other people try it, et cetera. Totally different experience. Major change, but executed much, much better.

Question: But wasn't ring zero supposed to actually solve this problem and address this problem?

Wouldn't that be great? Yeah, sometimes that doesn't happen. This was a scale problem. We had all kinds of things kind of come together at once, and you can go back and read about it on Brian's blog. Brian wrote an RCA on it, and basically what Brian does when he posts RCAs on his blog is take the internal RCA, add enough context so it will make sense to you, and post it. Otherwise it's the same thing that we have internally. So please go back and take a look at it. It's interesting.
All kinds of things happened, and in the interest of time, I'm gonna skip it.

Question: So you've got features tracked in VSTS, then you have multiple instances and all these feature flags, and then you have test plans. How do you stitch it all together so you know what you've actually tested?

Good question. The teams have to think through how they need to test things, right? You've gotta test feature flags both ways. We also talked about online upgrade and the fact that the ATs work with the old DB and the new DB. Some of these things we handle centrally. For example, with deployment and being able to test whether or not the binaries actually work with both versions of the DB schema, we have what we call AT/DT compat runs, meaning: do the VMs, ATs, and job agents work properly with a database that's old? So we'll take the new binaries and run them against an old database. Obviously we've tested new binaries with a new database, so that one's kind of a given. We do runs like that, but for feature flags themselves, it's really up to the teams to make sure they've tested the combinations, and they have to keep track of it. It's not done centrally.

All right, so we talked about a failure case here. In those days, we had no notion of any kind of resiliency. Resiliency meant we'll try to fix it fast. We had none. So there's gonna be failure in the cloud. How do we deal with that? There are two things that we primarily rely on so far, and it'll progress as we get further on down the road here. The first one is circuit breakers. This is something that Netflix popularized, and of course everybody's familiar with Chaos Monkey; they're very aggressive about testing in production, which is something we aspire to. Manil is gonna talk a bit about fault injection testing, but we're very early on in what we actually do in production.

The whole goal of a circuit breaker, much like the ones you'd find in an electrical panel, is to stop the failure from cascading. You drop a hairdryer in a bathtub and, okay, the breaker pops and shuts it off, so the damage is limited to that one circuit and not the rest of the electrical system. Circuit breakers help us protect against latency and concurrency. Latency meaning: if something takes too long, it's the same as it failing. If it takes you five minutes to save a work item, we're down. Yeah, technically you can do it, but you can't get your job done, so we're down. So we need protection from latency. And by the way, failing slowly is the most insidious thing. If you fail fast, you can deal with it reasonably. But if it takes you five minutes to fail, oh boy, dealing with that becomes a much more challenging problem. So I always tell people, when they think about failure, think about things that are simply too slow. They're harder to deal with than things that fail fast.

And then there's volume, meaning concurrency. For SQL Azure and other resources that we depend on, there's a limit on how many simultaneous connections you can have, for example, to a SQL Azure database. If you go beyond that, problems start happening. How do we prevent issues due to too many calls at once? We need to be able to shed load quickly. If something bad happens, a lot of times the load will pile up. In the old days, before we had any circuit breakers, a database might have a problem. Let's say we rolled out something that has a bug in it. The database is very unhealthy.
The CPU is pegged. While that's happening, we're queuing up a bunch of calls in ASP.NET. Somebody figures out, oh, here's what I need to do to fix it, so we fix the problem. Then what happens next? Boom, here comes this whole wave of calls. So you made it quote-unquote healthy, and now it's gotta deal with so many simultaneous calls that it goes down again. And this cycle repeats. It's a bit of a death spiral. So the whole point of circuit breakers is to shed load quickly. That way we keep things from queuing up and limit how much stuff can be pending in the system, in order to allow it to recover.

Another key with circuit breakers is: if you're going to fail, do you have a way to fall back and gracefully degrade? For some features that's easy, for some features that's hard. If we can't call AAD, there's no graceful way to do anything about that. You're either signed in and you're good to go, or you haven't signed in and there's nothing we can do about it. But for other things, you can fall back. You could, for example, decide, hey, if we can't get to the extension management system, we'll assume you have access to that extension and let you keep going and not disrupt your work. So some things have choices, some don't.

Here's a quick diagram of what a circuit breaker looks like. How does this work? The key is, as calls come in and go through the circuit breaker, normally the circuit breaker's closed and things are flowing through. And it's looking at the failure rates. When the failure rates exceed some percentage in a given window of time with a certain volume, it's gonna say, oh, something's horribly wrong, I'm gonna open. And when it opens, it just starts failing calls. This, by the way, is a blunt instrument. You may have a problem in the code, and that problem might in fact have been triggered by somebody's behavior, but we're gonna start failing all those calls to save the system. Circuit breakers are all about saving the system: preventing the system from going down, preventing failure from spreading through the rest of it. A more targeted version of controlling these kinds of things is throttling, and we'll talk about that next.

So things are coming through, then things start to fail, and when the breaker realizes it, it opens, and it's gonna start failing all the calls. Now, if I fail all the calls, how do I know when to re-close? How do I know that it's safe? So the circuit breaker actually occasionally lets something through as a test. You may have 1,000 calls a second coming through, let's imagine, and when it's open, you may let 10 go through, because you're trying to figure out: is it healthy or is it not? And if that resource is some dependency of ours, the other thing that's good about the circuit breaker opening is we take pressure off of that system, whatever it is. But we need to feed a little bit through to find out: is it working? Because at some point we need to re-close and go back to normal. That's represented here in the state that's called half-open, where it's letting a little bit flow through.

Now, what does that look like in code? I literally copied this out of the code base. It's a little bit overwhelming on a slide, but it's not nearly as overwhelming as it looks. This is just literally a set of properties on a circuit breaker. There are defaults for all these things, and there are a few that are particularly important. One is request volume.
How much volume do I have to have coming through the circuit breaker for it to start analyzing things? What is my error threshold, meaning what percentage can fail and still stay closed, and at what point should it open? And then what is my time window, the window I wanna analyze this over? It could be seconds, could be minutes, could be any number of things. But you have to think about how the circuit breaker should analyze the calls that are coming through. And then I can go use this in my code. If I go back to the previous slide here, this one was called installed extension settings, and here I'm gonna make use of that. You can see it over here: installed extension, this is fetch installed extensions. And it's going to look at this circuit breaker command, which sets up the circuit breaker. Then I instantiate it over here and I give it a fallback: if the extension mechanism fails, what should we do? This fallback method determines what actually happens when the circuit breaker opens, what responses the callers get. And this made a huge difference for us.

So when we think about things like concurrency: if I get an overwhelming number of requests, based on the settings on the circuit breaker, I can open the circuit breaker and say, something's going horribly wrong, the system's getting slammed, I'm gonna protect the system. The best example, as I mentioned earlier, is SQL Azure. Per database, there's a limit on how many connections you can have open. We wanna protect that. We also wanna make sure that if something is optional versus critical, we can control that too. You can take circuit breakers and create what are called bulkheads, where you say, okay, I'm going to allow only this many calls from this stuff that's optional, so that I always make sure I've got some capacity left for connections from the critical things. Identity would certainly fall in the realm of critical. As I mentioned earlier, I'm gonna invoke the fallback when there are too many requests.

So let's take a look at an example. This was one where we had a slow DB in SPS. Whatever the bug was, I don't even remember at this point, but we hit the concurrency limit of 100 for one DB for two minutes, and the way the circuit breaker was set up, at that point it opened and started failing requests. Now, we don't take this lightly, because as soon as a circuit breaker opens and starts failing requests, somebody's having a bad experience. If you're using the product, you're getting errors and you're going, but why, right? I'm just trying to save work, I'm just trying to do something. You're getting errors because of this circuit breaker, but the system keeps working, and instead of devolving into some incident that affects, let's say, an entire partition DB with 40,000 accounts in it, we affect, let's say, a couple hundred people. It's bad for those couple hundred people, and we need to go figure out why this happened and fix it, but it's also not a giant emergency. It becomes something that can be handled as part of our normal daily work, and that's kind of the point. Take the emergency out of this, keep the system healthy, don't let it go down. That's what circuit breakers do for us.
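To make the shape of all this concrete, here's a minimal sketch of the pattern in C#. This is not the class from our code base; the names and defaults (RequestVolumeThreshold, ErrorPercentThreshold, SamplingWindow, OpenDuration) are stand-ins for the kinds of settings described above.

```csharp
using System;

// A minimal circuit breaker sketch. Names and defaults are illustrative,
// not the real framework class.
public enum BreakerState { Closed, Open, HalfOpen }

public class CircuitBreakerSettings
{
    public int RequestVolumeThreshold { get; set; } = 20;      // min calls in window before analyzing
    public double ErrorPercentThreshold { get; set; } = 50.0;  // % of failures that trips the breaker
    public TimeSpan SamplingWindow { get; set; } = TimeSpan.FromSeconds(10);
    public TimeSpan OpenDuration { get; set; } = TimeSpan.FromSeconds(5); // how long to stay open before probing
}

public class CircuitBreaker
{
    private readonly CircuitBreakerSettings _settings;
    private readonly object _gate = new object();
    private BreakerState _state = BreakerState.Closed;
    private DateTime _windowStart = DateTime.UtcNow;
    private DateTime _openedAt;
    private int _calls, _failures;

    public CircuitBreaker(CircuitBreakerSettings settings) => _settings = settings;

    public T Execute<T>(Func<T> action, Func<T> fallback)
    {
        lock (_gate)
        {
            if (_state == BreakerState.Open)
            {
                if (DateTime.UtcNow - _openedAt < _settings.OpenDuration)
                    return fallback(); // shed load: fail fast without touching the dependency
                _state = BreakerState.HalfOpen; // let a probe call through as a test
            }
        }

        try
        {
            T result = action();
            Record(success: true);
            return result;
        }
        catch (Exception)
        {
            Record(success: false);
            return fallback();
        }
    }

    private void Record(bool success)
    {
        lock (_gate)
        {
            if (_state == BreakerState.HalfOpen)
            {
                // The probe's result decides whether to re-close or stay open.
                if (success) { _state = BreakerState.Closed; Reset(); }
                else { _state = BreakerState.Open; _openedAt = DateTime.UtcNow; }
                return;
            }

            if (DateTime.UtcNow - _windowStart > _settings.SamplingWindow) Reset();
            _calls++;
            if (!success) _failures++;

            // Only analyze once there's enough volume for the rate to mean something.
            if (_calls >= _settings.RequestVolumeThreshold &&
                100.0 * _failures / _calls >= _settings.ErrorPercentThreshold)
            {
                _state = BreakerState.Open;
                _openedAt = DateTime.UtcNow;
            }
        }
    }

    private void Reset() { _calls = 0; _failures = 0; _windowStart = DateTime.UtcNow; }
}
```

Usage would look something like `breaker.Execute(() => FetchInstalledExtensions(id), () => AssumeExtensionInstalled(id))`, where both method names are hypothetical: when the extension service is failing, callers get the fallback instead of piling up on a sick dependency.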
So I'm gonna transition here to, oh wait, sorry, I'm gonna talk about lessons learned before I do my transition to resource utilization. The first line here, tune in production: what does that mean? If you tell me, hey, Buck, I've added a circuit breaker to my code, I'm resilient now, I'm gonna look at you and go, have you tested it in production? If you say no, I don't believe you, because circuit breakers are only good if they work, and they are failure cases, right? Failure cases have to be tested. So we wanna try this out in production. The great thing about having the SU-0 instances is we can go make them fail; we can go open the breakers there. Testing a circuit breaker has sort of two pieces. There's that set of parameters: the volume, the percentage, all that stuff I talked about earlier. There's also just what happens when you open it. Did I contain the failure, or did I take the system down? Because we all know if you get it wrong, you can take the system down. So we'll do things like just go open a circuit breaker on SPS SU-0 or TFS SU-0. We may also intentionally add a bunch of calls that fail: we go hook the code and have it just return failure every time, so that we can see, does it react the way we expect? The other thing that's interesting about circuit breakers is that we also have timeouts. The circuit breaker's looking at some period of time; the timeout might be 30 seconds, might be 100 seconds, whatever. How do these two things interact with each other? How do the other parts of the system react when the circuit breaker opens? None of this stuff is believable to me until you've done it in production. And this is why Netflix has their whole Chaos Monkey, test-in-production mentality. It only matters if you can prove it.

This also allows us to verify the fallback. I put some fallback in there. Does it really help? Does it work? How are you gonna know? Monitoring is a key piece. So I go test a circuit breaker. Remember I said earlier, on a different topic, the absence of failure is not success, right? So when you open that circuit breaker and go test it, maybe nothing quote-unquote bad happens, which means your coworkers aren't complaining that they can't get their job done. That's good, but that's a very low bar. You need to go look: am I getting exceptions that I didn't expect? Do I see a spike in exceptions? If suddenly the foo exception starts going through the roof, okay, everybody's kind of okay, but there's something going wrong. You need to go understand why and root cause that. Because again, you wanna be able to depend on these in an emergency, and an emergency is the wrong time to find out if they work.

Make it easy to understand what caused the circuit breaker to trip. And you say, yeah, Buck, that's obvious, but you'd be surprised. When we started using these, one of the things that took us a long time: they would open, and we'd try to figure out why, and we realized our telemetry wasn't very good for understanding what was causing it. Because it's always gonna be multiple layers. The circuit breaker's open, and then you gotta figure out what triggered it to open, and you gotta keep walking backwards until you find the root cause. But your very first step is: why did it open? And for a while it was hard to figure that out. We've made that much easier to do.
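As a sketch of what that kind of test can look like, here's a tiny harness that reuses the CircuitBreaker sketch from above and hooks the dependency call so it fails every time. The ForceFailures toggle and CallExtensionService are purely illustrative names, not the real hooks our framework uses.

```csharp
using System;

// Fault-injection sketch: flip a hook so the dependency always fails,
// then watch how the breaker and its fallback behave.
public static class FaultInjectionDemo
{
    // Toggle that a test (or an SU-0 experiment) flips to make the dependency fail.
    public static bool ForceFailures = false;

    public static string CallExtensionService(string extensionId)
    {
        if (ForceFailures)
            throw new TimeoutException("injected fault"); // simulate the dependency being down or slow
        return $"settings for {extensionId}";             // normal path
    }

    public static void Main()
    {
        var breaker = new CircuitBreaker(new CircuitBreakerSettings
        {
            RequestVolumeThreshold = 5,
            ErrorPercentThreshold = 50,
            SamplingWindow = TimeSpan.FromSeconds(10),
            OpenDuration = TimeSpan.FromSeconds(5)
        });

        ForceFailures = true; // inject faults: every dependency call now fails

        int fallbacks = 0;
        for (int i = 0; i < 20; i++)
        {
            // With every call failing, the breaker should trip once the volume
            // threshold is hit, and later calls should get the fallback without
            // ever touching the (sick) dependency.
            breaker.Execute(
                () => CallExtensionService("my-extension"),
                () => { fallbacks++; return "assume extension is installed"; });
        }

        Console.WriteLine($"Fallback used {fallbacks} times out of 20 calls.");
        // The real verification is in your telemetry: did anything besides the
        // injected TimeoutException spike while the breaker was open?
    }
}
```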
When we introduced circuit breakers, we'd get people saying, ah, we got a problem, circuit breakers are popping. Yeah? What, we should close them? No, that's good. They're doing what they're supposed to. They're protecting the system. You have to go understand the root cause of why. If circuit breakers are opening, it's always a symptom, never a cause. Unless you have a bug in your circuit breaker code, and pretty soon you'll hammer that out and it won't be a problem. So the mindset shift here is: if a circuit breaker's open, that's a problem you need to go figure out, but it's not the circuit breaker, it's whatever triggered it. And getting people to realize, hey, you've got a problem in your code that you've gotta go figure out. So at this point I'm gonna transition to resource utilization.

Question: So as a developer, when should I think about putting a circuit breaker in my code? Do you have some guidance around, if you're building a service, make sure that you have a circuit breaker, or any guidance for your team?

It's a good question. Some of them, people get, quote, for free. Using the server framework, we have a circuit breaker for protecting SQL. SQL is a major dependency; we have a circuit breaker for that. But the identity team, for example, had to go think through their calls to AAD and how they wanna put a circuit breaker in for that. You're really trying to think through: I'm doing something; if that something slows way down or starts to fail, what's the impact on the rest of the system? Oh, that could spread. Let me put a circuit breaker in there, and then you gotta go tune it. But since circuit breakers aren't free, and this is kinda key to your question, you don't wanna just sprinkle them everywhere. Because then you've got a different problem. You may in fact cause a problem where a problem didn't have to exist.

Question: Do you have a standard circuit breaker that everybody just uses? So it's not like anybody can just come up with their own implementation?

That's right. That dense slide that I had with all the settings: there's a standard circuit breaker class, it's part of the framework, and everybody uses that one. And then you have to think through, and it's a lot like threat modeling for security if you've heard of that, that same sort of mentality: think about how your code could quote-unquote go wrong, in some sense be abused, in production, and think about where you need to put a circuit breaker in order to protect the system. Question?

Question: You recommended LaunchDarkly for feature flags. Is there a reference implementation that you'd like us to use, for the FastTrack customers, for circuit breakers?

So I don't know of a great one to use for .NET. The sort of canonical one is written in Java; it's called Hystrix, and it was put out by Netflix. You can find Netflix's circuit breaker implementation up on GitHub. So certainly if you're talking to customers using Java, you could go reuse that. There's also, by the way, a more complex diagram that explains it. There are presentations by Michael Nygard, whose book Release It! popularized the pattern, that go into great detail on circuit breakers. I've got those in the notes for these slides, so anybody who's interested in really getting into circuit breakers, I highly recommend those. But unfortunately, and I should go look again, certainly a couple years ago when we started this, there wasn't a good .NET circuit breaker implementation that we could recommend.
Question: So, any chance you'll open source what you guys have?

Good question. Maybe. This one is not quite as intertwined with the rest of our system as the feature flags are. So, maybe.

Question: How do you monitor the circuit breakers? If one opens up, how do you get that information and know which one tripped?

Ah, good question. As you might imagine, there's telemetry around these. And since it's a common class, everybody gets circuit breaker telemetry, and you name your circuit breaker so you can tell which one tripped in production. When circuit breakers pop, you can go look in the dashboards and see it. If circuit breakers pop repeatedly, we'll actually send an email alert. It's not an alert that would go wake somebody up out of bed, but we'll send email to whoever's on call, the DRI, the designated responsible individual. Whoever's on call and therefore responsible for live site will get an email that says, hey, this circuit breaker tripped, and it stayed open, let's say, for 10 minutes last night. Okay, I gotta go figure out now why that was. So there are dashboards you can look at, but there are also email notifications if the problem is persistent or recurs, so that we don't overlook these. Because again, if they're popping, there's some reason. Sometimes it's a bug in the code. Sometimes it might be somebody doing something abusive that we're just not sufficiently resilient to. Because again, circuit breakers are great, they protect the system, but they're also a blunt instrument, because somebody's getting a bad experience, just not everybody. So it's important to go follow up.

Question: How do you manually close circuit breakers? Is it through a script or the UI? How do you do that?

Yeah, it's a good question. It's a PowerShell script. You can go close it with a PowerShell script. And by the way, that's rare, because if it opened, it opened for a reason. There are occasions where we have chosen to close them, but it's rare. More likely, something bad happened in production and we weren't sufficiently resilient to it, and as a way to quickly mitigate, there have been times when we go manually open them to mitigate the problem while we go figure out the root cause and get it fixed, and then put everything back to normal. So kind of like feature flags, they do have the advantage of giving you a way to cut certain things off quickly. Question?

Audience: The patterns & practices team, in their cloud design patterns, has put together a circuit breaker implementation in C#.

Okay, great. Thank you for that.

Audience: Similar to that, there's a project called Polly, which just joined the .NET Foundation, at thepollyproject.org. Pretty good implementation of circuit breaker and retry.

Oh, really? That's great. Perfect. A couple of recommendations there.

All right, so I'm gonna move on to resource utilization next. This is also about resiliency, but it's a different form of resiliency: very much targeted resiliency, or targeted controls. This is about limiting the load by an identity. One of the best examples, which we get hit with quite commonly, is somebody's got a build running and it does all sorts of crazy stuff. Or, and this happens a lot internally, somebody, let's say in Windows, writes a new tool that queries the system for some piece of information, and everybody else goes, man, that's cool, I need that as well.
And so they start running this tool. Somebody's got it running on a desktop under their desk, and they start adding all their buddies to it, to query work items or whatever they're doing, and pretty soon this tool is making these queries like mad. So that one identity is consuming an outsized amount of resources, be it CPU or whatever, and we wanna be able to react to that and not let it cause other people problems. As I mentioned, this is also much more fine-grained than a circuit breaker. The goal of resource utilization is to target and limit the offender, not other people. The noisy neighbor problem is kind of the common term for this: if you are able to use way too much of a multi-tenant system, your experience is coming at the expense of somebody else, and that's bad. So we wanna prevent that. We also need to be able to let people know: when are you hitting the limits, when are you approaching the limits, what are the limits, and how do I deal with this as a user? And we need to be flexible enough to respond to a range of issues.

So let's start with an example of what it looks like, and then I'll talk about how it works. Many of you have seen this for one reason or another; some of you work with customers who like to trip this. You can go to this page on your account and see the impact you're having. In this particular test example, if you will, this user here, Edwin, is getting throttled a lot. What's happening, and you can see it in the highlighted red box, is it's delaying. There are sort of two main mechanisms here: delaying and blocking. And the first thing you wanna do is delay, because blocking can be downright cruel, right? If I start blocking you, everything you call just fails, and that's rough. Sometimes it's necessary, but we wanna start with delays and see if that's enough to slow things down. And this is key to helping people understand: if you're getting throttled, why are you getting throttled? You can see in here, it'll show you the calls and give you some detail on what's going on. These are all the same command, "e-reset default client"; I have no idea what it is, somebody just generated this for me so I could get a screenshot. But let's say you were calling a GET endpoint to query push history; you would see that here in that column, so you know what you're doing.

So how does this work? Because we've got multiple things in play. I mentioned earlier, resource utilization is actually quite complex, and it's complex because we have a lot of dependencies. A big dependency, of course, is the database. So for example, we wanna be able to track database CPU time, and I'll show you how we do that; it's actually pretty cool. Then there's some window of time that you wanna look at for how much a given user is consuming in your system. And as mentioned before, there are two pieces: delaying and blocking. To do this, we want to allow you as a caller to understand when you're getting close to the limits and when you're being blocked. You need to know, so that if you're writing a tool, the tool can react to it. And behind all this there's a general concept in resiliency, a notion of back pressure.
If I make a call to the server and the server is struggling, and the server sends back information that lets you know, hey, the server needs you to back off, that's back pressure. It's pressure being pushed back to the client, and an intelligent client reacts to it. And again, we're talking about friendly clients here; we're not talking about abuse, this is not a DDoS or anything like that. If the client is well-intentioned, maybe it's something that you've written, then you can look at the headers and have the code take action: you might back off, you might pause your calls, or whatever.

The other thing with throttling is that it needs to be highly configurable. One of the challenges is that we need to be able to do throttling for work item tracking and version control and release management and code search, and these things all work very differently. What's expensive for work item tracking is not necessarily expensive for version control; they're just completely different things going on. Some of that has to factor into the throttling.

To react on the client, we give you a set of response headers. You make a REST call, and if you're in danger of getting throttled, or you are being throttled, you can look at the headers and see what's going on. The server's gonna tell you, hey, I delayed you this much, and if you go beyond this limit, we're gonna block you, we're gonna flat out throttle you. So: provide the client information, allow the client to react intelligently. And if you're flat out blocked, we're gonna give you a code so that you know that. We're gonna send you a 429. Azure and other systems do the same thing; it's not unique to us. We'll give you a 429 so that you know it failed, and why it failed. It wasn't because the request was wrong or had bad parameters, who knows; I have no idea, I didn't even run it, I just gave you a 429.

As I mentioned before, there are two flavors here, delaying and blocking. Delaying allows us to spread out the load. If I can simply slow it down, that might be enough. Maybe somebody's using a tool you wrote in a way you didn't expect; if we just slow it down, it still succeeds, we just spread out the load. Sometimes that's enough. Sometimes it's not; sometimes enough piles up that we just have to start blocking. It could be a multi-threaded tool making tons of simultaneous calls, could be any number of reasons, but at some point it gets overwhelming and we start blocking.

Now, the really interesting thing here is: how do you do this? That's all well and good, delay, throttle, whatever, but how do you do it? Since so much of our stuff is dependent on SQL, this is really the key piece for us. There are other parts to it, but Extended Events (XEvents) are the big thing. XEvents are a standard feature in SQL Azure, and from them you can determine all sorts of stuff, like keeping track of who's using CPU. Oh, okay, this particular call by this user, this command, turned around and called the proc to query push history, and that's been happening over and over for the last five minutes, and it's now consumed 90% of the CPU on this database. We need to either start inserting delays, or maybe flat out block if it's gotten too bad.
And this ability to accurately attribute that usage to the particular call and the particular identity that's causing it is really key, because it lets us go after the offender and not just hit everybody with a hammer. The other thing that's really cool about XEvents is they're very lightweight. This is a feature in SQL Azure, not something of ours; you can go use it yourselves. Being lightweight is key, because most things that do monitoring for you eat some percentage of your cycles. The great thing about XEvents is they're so lightweight they don't really change anything for you; you don't have to go to the next SKU up or anything like that in SQL. They're also collected asynchronously, so the collection doesn't get in the way of the responsiveness of user calls.

Here's a very quick diagram of roughly how it's done. The key here is that SQL Azure, these databases you see here as the cylinders, are pumping the data into storage. We then take that and pump it into Azure Log Analytics. You will probably hear people refer to Kusto; that's the internal name for Azure Log Analytics. It's the same thing, Kusto is just an easier name to say. By pumping all this back into Kusto, we can then run queries against that store to figure out interesting things about the usage in each account. So there's a delay here: when throttling kicks in, it's after a while, because all of this is going on in the background to grab the XEvents data and other data (it's not the only piece of data), shove it into Kusto, and run queries against it. So it doesn't react instantaneously, certainly not today. But this has been very valuable to us in helping make sure that every user has a good experience in a multi-tenant database.

So what are some lessons learned out of this? It's kind of interesting, because we didn't always have throttling, and then we started rolling out limits. Some of you might have been the ones who called us and went, oh my gosh, you're throttling me, can you stop throttling me? Well, okay, let's talk this through and understand why. When you allow somebody to go do things unlimited and then you start putting limits on, people get unhappy. And it's very reasonable, right? As a customer: I was paying for this, it was unlimited, and now you're telling me I'm paying the same amount of money and it's limited. What's going on? So trying to put this in afterwards has been an interesting challenge. Now, it'd be great to say the thing you should do is implement resource utilization from day one, and of course everything else I've talked about, but that's just not practical. But I do tell people today, since we do have this: as you build new features, think about the limits. Think about how you need to put resource utilization in there from day one, so that you don't have to come back and start negotiating with people about bringing the limits down. One of the biggest challenges with this has been with the Windows account. They've been using work item tracking for years, and then as we started putting in limits to try to change and manage that load, it becomes an interesting conversation with them as our customer: hey, you're doing this, we need you to bring it down.
You have to start talking to people about, again, the example of somebody running a tool under their desk that's just pinging away, running a query all the time, to try to negotiate and get things back into a healthy state. Because a lot of this comes back to COGS, cost of goods sold: how much are we spending, and how do we provide a good service for everybody at a reasonable cost? As I mentioned, delaying is effective if it's a single thread; if somebody's going nuts and calling you with a bunch of calls in parallel, you're gonna have to block. We need to be able to help users understand why. If I block you or delay you and you can't tell why, that's miserable, right? It's a black box to you. Now, I would love to say the experience we have for this is great, but we still need to improve it. I don't know if you've paid attention to the UI, but it has this notion of TSTUs, these made-up units of load that we created. Nobody knows what a TSTU is, so we've got to do things to make it more understandable, but it's at least a start.

You'll also notice a common theme here: tune in production. If you put in limits for resource utilization and you don't try them out in production, you may not have the limits you think. Resource utilization, circuit breakers, timeouts: all these things can interact in interesting ways. If I start delaying you five seconds and your timeout is 30 seconds, you're eventually gonna fall into blocking, because you're gonna sit there and keep retrying, right? These are the sort of interesting interactions that happen in the system that you have to try out in production. Yes?

Question: Do those limits depend on the account size or not? Are you giving the same limit to somebody who's running 1,000 users versus somebody who's running just five users?

It's a great question, and right now the answer is mostly yes, they're the same. Over time, what's going to happen, very much like what you're alluding to, is: you pay more, you get more. So there is a monetary component that will go with it. At some point, no doubt, you'll also be able to buy more capacity, so to speak. There are all these things that we have to work out. We're still relatively early in the journey of resource utilization. It's actually something we've been working on for about two years now, because it's a problem that on the face of it seems so simple: oh, well, if somebody's using this too much, delay or block them. But actually figuring all that out from the underlying system has been surprisingly hard. Over time, though, it's gonna be like you described: you pay more, you get more.

Question: Same question as before about a reference implementation?

No. This one is very much tied into our system, how it works, and quite honestly the characteristics of our system. So this one would be difficult. Probably the best thing we could do at some point is really document how it works, so at least you could decide what ideas make sense for your system, go build something that works for you, and at least steal some ideas.

Audience: Well, the motivation is that with the FastTrack program, us bringing the Microsoft story to the customers, they're gonna ask these questions, and if we're gonna train the people at my company, we've gotta be able to equip them with something.

Yeah, good question. I don't have a great answer for you on that one, unfortunately. I need to give that some thought. Question?
Question: Do you publish the thresholds for your throttling, or is it done more on a case-by-case basis? So it would be a surprise to me when I actually got impacted.

Yeah, it's a good question, and this is again one of our challenges. I'd love to be able to say: if you make more than n calls per second, you'll get throttled. That's so simple, so easy to understand. But it's not that simple, because there are calls that you can make like mad and they won't have much of an effect on us at all, and there are other things that are very expensive, where a few RPS is a significant load, right? There's just this wide variation in the cost of a given call. So at this point the thresholds aren't published. Over time, I need to figure out, we need to figure out, how we give you guidance so that you can start to make sense of it and plan for it. We don't do a good job of that today.
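In the meantime, the practical defense for a friendly client is the back pressure story from earlier: watch for 429s and honor the delay the server asks for. Here's a minimal sketch that relies only on the standard 429 status code and the standard Retry-After header; any service-specific rate-limit headers are deliberately left out of it, since those vary by service.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

// Minimal sketch of a friendly client reacting to back pressure.
// Assumes only standard HTTP semantics: a 429 status and a Retry-After header.
public static class BackoffClient
{
    private static readonly HttpClient Http = new HttpClient();

    public static async Task<HttpResponseMessage> GetWithBackoffAsync(string url, int maxAttempts = 5)
    {
        TimeSpan fallbackDelay = TimeSpan.FromSeconds(1);

        for (int attempt = 1; ; attempt++)
        {
            HttpResponseMessage response = await Http.GetAsync(url);

            if ((int)response.StatusCode != 429 || attempt >= maxAttempts)
                return response;

            // Honor the server's requested delay if present; otherwise back off
            // exponentially. This is the "intelligent client" reaction to back pressure.
            TimeSpan delay = response.Headers.RetryAfter?.Delta ?? fallbackDelay;
            await Task.Delay(delay);
            fallbackDelay = TimeSpan.FromSeconds(fallbackDelay.TotalSeconds * 2);
        }
    }
}
```

A tool that backs off like this just runs slower as it approaches the limits instead of flat-out failing, which is exactly the behavior the delay-before-block design is meant to encourage.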