Good morning. All right, well, today we have not one but three topics to get through, so we will jump right in and get started. A little safe harbor slide, just in case we talk about things that may happen in the future; we might have hinted a little about it yesterday on the big keynote stage. My name is Mark Velker, and I'm the chief architect for the Okta Customer Identity Cloud. I want to start off today by thinking about a question: when does scale actually happen for developers? As developers, we all love a good scale story, right? We all love to hear about the hyperscalers of the world. And we like to think about scale in our own product, because it means we're doing something right. It's a good problem to have, right? But let's think about when it actually happens. Sometimes it's a result of deliberate planning. Other times it's a result of changing business conditions. Sometimes it happens slow and steady, and you can kind of see it coming. And other times it happens really, really fast, overnight even. But rarely, if ever, does it actually happen in isolation, right? We never get told we can just scale and forget about reliability, forget about testing, forget about deployment, forget about security, forget about cost. It doesn't happen, right? Probably never. So with that in mind, let's look at where we were in February of last year. These are the RPS limits for our Customer Identity Private Cloud. If you're not familiar, the Private Cloud is a single-subscriber offering. That means customers can opt into this if they need a little more isolation or a higher RPS rate than our public cloud offering gives them. And they can size it in a couple of tiers, depending on their performance needs: the basic tier at a 100 requests per second limit, the performance tier at 500, and the performance plus tier at 1,500. That's where we were in February.
And then six months later, we started offering a 3,000 requests per second tier, a 6,000 RPS tier, and we had our first customer go live at 10,000 requests per second. What a difference six months makes, huh? In six months, we took that limit not to double, not to triple, but to six and a half times what we had six months earlier. But like I mentioned earlier, it's never just about scale, right? It's never just about one thing. In our case, that scale came at us really fast. We had one customer that you heard about yesterday, OpenAI, that added millions of users very quickly, and we had to scale quickly to keep up with them. That means we didn't have the luxury of taking a lot of time to go look at our architecture and maybe make big changes, right? We had to think about how we could do this relatively quickly. And we also had to think about resilience. We have a four-nines SLA that we offer on these environments that we have to keep in mind. And when it comes to a large customer who's really hitting their stride and has a really popular product, nobody's interested in taking a big, long downtime so you can roll out more capacity, right? That's kind of antithetical to what they want to do. So the rollout needed to be seamless. It had to be carefully orchestrated. And by the way, it wasn't just one rollout. We took them through each of the three new tiers that we talked about on that last slide, so it's incremental adds. When you repeat a process multiple times, you've got to get it right, or else you're going to cause a lot of outages, not just one. And then finally, there's efficiency, and in particular we'll talk about cost efficiency today. I'd wager most of you in the room probably had some kind of year of efficiency with regards to cost in the past year or so. There were a lot of macroeconomic concerns and things floating around, right?
And when you're trying to be cost effective, just throwing more infrastructure at the problem usually doesn't pan out, right? Because database clusters and servers get expensive, especially at those higher tiers in our cloud provider offerings. So we had to think about those things too. There's a lot going on, but I think you can probably relate to these three themes, so we'll go with those for today. Now, keep in mind, we're not designing a system from scratch here either. We don't have the luxury of being in the ivory tower. We're working, as many of you are, on a mature code base that's been around for years and has millions of users on it, right? And that's the hard part. Modern software systems are anything but simple under the hood. There are literally dozens of services talking to each other through message queues, talking to databases, talking to other APIs, talking to third-party services. They're receiving requests through load balancers and firewalls. They're feeding change data capture into CDC streams and shipping logs to data warehouses. And there are circuit breakers waiting to be tripped and alarms waiting to sound if all that observability data shows a blip. And somewhere in all that, you, Madam or Mr. Developer, have the task of making it scale not to double, not to triple, but to six and a half times in six months. So where do you even start? We're going to talk a little bit about our strategy for that today, and we're going to take you through it in a couple of parts. First of all, we're going to start with the premise that modern SaaS is more than the sum of its parts. You can't just look at the individual pieces. You have to think about the intersections and the interactions between all those components under the hood. And the same goes for your priorities.
We as developers have this tendency to think of "hey, I've got to go do scale" as a conflicting priority with "hey, I've got to keep my system reliable, and I have to keep it secure, and I have to keep it cost effective." We're going to try to flip that perspective a little bit and talk about how we actually looked at this as a tool for telling us where to focus our efforts. So when you're thinking about a problem like this, my first piece of advice to you is this: those are not three separate goals, scale, resilience, and efficiency. They're really one goal, right? The best wins you can find are going to be the ones that have some impact on all three of those areas. For that part of the story, we're going to bring up senior architect Tomas Sukup, and he'll walk you through a real-life case study of how we applied that philosophy to our extensibility platform that many of you probably use today. And for the second part, we're going to talk about something else we said earlier, which is that we're not starting from scratch. We don't have the luxury of designing a whole new system. The assignment here is not to build a really scalable system; it's to start with the system you have and get the users you have to that scale point. Achieving a six-and-a-half-times increase in scale doesn't do any good at all if you can't roll it out into production without heavily impacting your users, right? So the assignment isn't just scale, it's getting users to that scale. And that's another powerful tool for evaluating your choices and how you go about doing things: keeping in mind not just the goal, but where you're starting from. So for the second part of this story, we'll bring up director of engineering Dennis Henry, who will walk you through how we actually got all this stuff from the test beds and from our labs into production without impacting our users.
So with those two premises in mind, let's bring up Tomas, and he'll walk you through a case study for us.

It's great to be here. I'll start this part with a well-known picture of the relationship between time, cost, and scope. It basically says that you can't improve one without affecting the others, and also that you can't achieve all of them; it implies that you should pick two and the last one is a given. It seems intuitive to look at scale, resilience, and cost efficiency from the same perspective, as competing priorities. But we at Okta look at these three from a different perspective: they are not necessarily competing, and there are areas and changes where you can improve two without affecting the third, or even improve all of them. So what are these, right? What is the magic here? Over the past years of working on scaling, resilience, and cost, we found a couple of examples, which I'll now show you. First, reducing waste. It's kind of boring, right? Surprisingly, we found it to be one of the most important items on the list, and also the least risky. Do you want to scale more? Look first at what you will not do. Look where you have unnecessary baggage that you are dragging along. For us, one example was logging. Over the years you develop a system, you log a lot of things, and often you are not really using those things. By carefully looking at our logging, reducing things we no longer need, and consolidating other logs together, we immediately got a lot of cost efficiency improvements. And not only that, we also significantly reduced the CPU usage of some of our key components, which increased throughput and again improved our margin and cost efficiency. The second area is generally about improving performance and efficiency in your components. You don't need to redesign the system to achieve scale. Actually, I think it's a mistake to start with that idea.
Often it is enough to first identify the biggest bottleneck in the system, then start improving its performance or efficiency. This way you can iterate, N+1 style, toward a scale you wouldn't think was even possible with your ten-year-old code. An example from our recent history was a deeper look at our CDC processing on the database. We looked closely at it, carefully designed the filtering, and also added compression on the outgoing stream, and with that we significantly reduced the CPU usage and load on the database, which reduced cost and improved efficiency, throughput, and scale. Third, use the right tool for the job, and also understand and master your tools. Some modern tools can be used for a wide variety of use cases, but that doesn't mean they will scale in all of them. In our case, we used a traditional database to store the state of the login transaction while a user is going through the authentication flow. This is high-throughput, relatively short-term, relatively simple data, and by using an in-memory database instead of a traditional database, we again significantly improved the performance of the system and reduced cost without affecting reliability. I'm not saying move everything in-memory here, but use the right tool for the job. And the last point on the list is about simplifying the architecture. Think about your architecture so that it fits your specific business needs and not much more. Simplicity often means scale, resiliency, and efficiency all at once. So to illustrate some of these points, let me walk you through a specific use case. In this example, we'll go down the path of taking an original design and iterating over it while balancing scale, resiliency, and cost efficiency. First, a little bit of context. During the authentication flow, we provide extensibility points.
Those are the purple dots on the slide, where you, developers, can define actions to enrich or extend the authentication with your own JavaScript code. The extensibility engine that powers actions we call Webtask. It runs your code on our platform and needs to ensure that it's secure, scalable, resilient, and also cost efficient. This is a simplified architecture of the Webtask subsystem in our public cloud environment. In a nutshell, it's a bunch of EC2 instances, each running a couple of hundred hardened Docker containers with our sandboxing code. The important piece is that, for security reasons, we draw a strict boundary: each container can never be used by more than one tenant. By tenant I mean, typically, a customer or organization. Each instance also runs a component called the webtask proxy, which manages these containers, and in front of that is a load balancer that distributes load to the instances. Now, this is one of the early iterations of the subsystem, and in this iteration we are using a round-robin algorithm to route requests between the instances. The problem with this design is that as you scale and traffic grows, you start having containers for each tenant randomly across the instances, in the extreme case on all of them, as shown on the slide. As there is a practical limit on how many containers you can efficiently run on one machine, the whole system eventually doesn't scale horizontally but only vertically, so you reach a certain point where you cannot do anything. The other problem is that the duplication there is obviously costly. So in the next iteration, the team added a new component called the webtask router, which, using a combination of cookies and sticky sessions, holds a map of tenants to instances and can route requests from a given tenant to only one instance. In our example, tenant one to instance A, tenant two to instance A, tenant three to instance B, and so on.
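The sticky-routing idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Okta's actual webtask router: the instance names, the least-loaded assignment policy, and the in-process dictionary standing in for the cookie/sticky-session map are all invented for the sketch.

```python
class StickyRouter:
    """Illustrative sketch of a tenant-sticky router: each tenant is pinned
    to exactly one instance, so its container is never duplicated."""

    def __init__(self, instances):
        self.instances = list(instances)
        self.assignments = {}                        # tenant -> instance
        self.load = {i: 0 for i in self.instances}   # container slots per instance

    def route(self, tenant):
        # First request from a tenant: pick the least-loaded instance
        # (hypothetical policy), then stick to it for all later requests.
        if tenant not in self.assignments:
            instance = min(self.instances, key=lambda i: self.load[i])
            self.assignments[tenant] = instance
            self.load[instance] += 1
        return self.assignments[tenant]


router = StickyRouter(["instance-a", "instance-b"])
first = router.route("tenant-1")
assert router.route("tenant-1") == first   # sticky: always the same instance
# One container slot per tenant, so no cross-instance duplication:
assert sum(router.load.values()) == len(router.assignments)
```

The trade-off this sketch glosses over is exactly the one the talk raises next: the map itself is state, and whoever holds it becomes a single point of failure.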
Now we got rid of the duplication, and as we add more tenants and executions, we can scale. That said, we have a resiliency problem here: there is only one replica of the webtask router, and if that goes down, the whole subsystem fails. So we fix it: we use a similar trick as before, add a load balancer in front of the webtask routers, and start scaling them. The problem we face with this design is that each webtask router now has its own view of the world. The green router routes tenant one to instance B, but the orange one to instance A. This again causes duplication, and as we scale and keep adding webtask routers, we will have even more duplicates, one for each router. So this is definitely better than the previous design, we fixed resiliency, but as we scale it will impact our costs. We can solve this problem by adding another component which holds the routing state for the webtask routers, so that they all now have the same view of the world. We have obviously fixed the cost-related problem, but we have some more problems with scaling the routers: when we are scaling quickly and adding new routers, we need to synchronize the state there, and that can cause delays and potentially problems with scaling. But the bigger problem here is that the complexity demon creeps in. The solution starts to be pretty complex. We have many components to handle; it's difficult to manage and hard to reason about. So at this point, I will go back to the design board and challenge the original assumptions. The requirement we were working with from the beginning was that we want requests from one tenant routed to exactly one specific instance and container, regardless of our scaling. Now my updated requirement will be that one tenant is, most of the time, routed to a specific container. I can't use one container for two tenants, but I can, for some period, use two or more containers for a tenant, as long as I don't impact the cost too much.
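One standard way to meet this relaxed requirement is consistent hashing, and it can be sketched briefly. This is a textbook ring with virtual nodes, assuming MD5 as the hash and invented instance names; it is not the production implementation, only an illustration of why only a small share of tenants move when you scale out.

```python
import bisect
import hashlib


class ConsistentHashRing:
    """Minimal consistent-hash ring (illustrative). Each instance owns many
    virtual nodes on the ring; a tenant maps to the first node clockwise
    from its hash. No shared routing state is needed."""

    def __init__(self, instances, vnodes=100):
        self.vnodes = vnodes
        self.ring = []                 # sorted list of (hash, instance)
        for inst in instances:
            self.add(inst)

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, instance):
        for v in range(self.vnodes):
            bisect.insort(self.ring, (self._hash(f"{instance}#{v}"), instance))

    def route(self, tenant):
        # First virtual node at or after the tenant's hash, wrapping around.
        idx = bisect.bisect(self.ring, (self._hash(tenant), "")) % len(self.ring)
        return self.ring[idx][1]


ring = ConsistentHashRing(["inst-a", "inst-b", "inst-c"])
before = {f"tenant-{n}": ring.route(f"tenant-{n}") for n in range(1000)}
ring.add("inst-d")   # scale out by one instance
moved = sum(1 for t, inst in before.items() if ring.route(t) != inst)
# Only a bounded fraction of tenants (~1/4 in expectation) get remapped,
# and every remapped tenant lands on the new instance.
assert 0 < moved < 500
```

Because the mapping is a pure function of the tenant and the instance set, every load balancer computes the same answer independently, which is what lets you delete the routers and the shared-state component.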
Once I make this slight change in my mindset and the requirements, suddenly there are other potential solutions. And one of them is actually very close to that very early design I showed at the beginning. The only difference is that instead of using round robin on the front load balancer, we use consistent hashing. Consistent hashing uses a mathematical function to calculate which tenant, in our case, goes to which instance, so we no longer need to manage the state. It creates some overlap during scale-up and scale-down, as we redistribute some percentage of the load to other instances, but it guarantees that this is only a small portion, so we are still cost efficient. So by slightly changing our requirements, we eventually found a solution that is scalable, resilient, and cost efficient. On top of that, it's simple and elegant. To summarize, in this use case I demonstrated that by starting from the original design, using an iterative thinking process, weighing the different aspects of scale, resilience, and cost efficiency, then going back and challenging the original assumptions, and this is the important part, we eventually came up with a solution that hits the magic in the middle: scalable, resilient, and at the same time cost efficient. What we walked through in detail together was just one example of unlocking scale without sacrificing resilience and efficiency. We applied this methodology to find other improvements, such as how we store sessions. And that all allowed us to grow 6.5 times in less than six months. With that, let me introduce Dennis, our director of engineering, who will walk you through how we are deploying all these iterative changes into our platform without affecting you, our customers. So please welcome Dennis.

Thank you, Tomas. So, hey everyone, I'm Dennis. I use he/him pronouns, and I'm the director of engineering focused mainly on resilience. So how do we put all of this together?
How do we actually execute on this promise of scaling with reliability and efficiency? Remember, the assignment is not just to design for massive scale from scratch. It's to start with your existing system design and your existing users and grow with them to higher scale without interrupting their business. So let's look at how we do this at Okta. Scaling with no downtime is difficult, and doing so requires you to think about your deployment process with resilience in mind. At Okta, that means thinking about how we can update our infrastructure and code safely on a weekly basis, and do so in a cost-effective way. To accomplish this, our team adopted a deployment strategy known as red-black, which allows us to ship changes in a way that fulfills our promise of resilience while keeping cost efficiency in mind. In a red-black deployment, we stand up new load balancers and a new Kubernetes cluster and deploy our stack on top of it. This allows us to validate that the changes we're making, whether they're routine changes or scaling our environment to a larger tier, are tested, validated, and will not cause negative impact to the customer. Once the new cluster and load balancers are up, we can test and ensure that the application is functioning as expected, then gradually cut traffic over to the new cluster. This gradual process allows for a gentle ramp to the new cluster, ensuring requests are not dropped in the process. Once the new cluster is live and the traffic is cut over, we can clean up the existing cluster to ensure cost efficiency is still kept in mind. So how does this actually look in practice? We dug through the data, and this is what we found. We were scaling this customer from 1.5K to 3K RPS, and this is what we saw: 0% error rate. Then we were scaling from 3K to 6K, and what did we see? 0% error rate. And finally, we were scaling from 6K to our top 10K tier, and what did we see? 0% error rate.
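The gradual cutover in a red-black deployment can be sketched as a weighted traffic split with a health gate. This is a toy simulation under assumptions I'm making up for illustration: the ramp percentages, the cluster labels, and the health-check callback are all invented, and a real rollout would shift weights at the load balancer, not in application code.

```python
import random


def weighted_route(weight_new, route_new, route_old):
    """Send roughly `weight_new` (0.0..1.0) of requests to the new cluster
    and the remainder to the existing one."""
    return route_new if random.random() < weight_new else route_old


# Hypothetical ramp schedule: fraction of traffic on the new cluster per step.
RAMP = [0.0, 0.01, 0.05, 0.25, 0.5, 1.0]


def cutover(requests, healthy):
    """Walk the ramp, checking health at each step before sending more
    traffic to the new cluster. Roll back to 0% on any failed check."""
    for weight in RAMP:
        served = [weighted_route(weight, "new", "old") for _ in range(requests)]
        if not healthy(served):
            return 0.0     # roll back: all traffic stays on the old cluster
    return 1.0             # fully cut over; old cluster can be torn down


assert cutover(1000, healthy=lambda served: True) == 1.0
assert cutover(1000, healthy=lambda served: False) == 0.0
```

The point of the shape, ramp plus gate plus cheap rollback, is that a bad change is caught while it is still serving a small slice of traffic, which is what makes a zero-error scaling event achievable in practice.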
And this is an actual screenshot of a Datadog dashboard that shows one of these scaling events. You can see that this was done with over 1,000 RPS flowing at the time we were doing the scale. We used our N+1 > N approach to learn and grow our platform through these three distinct transitions, to reach 6.5 times the scale in six months with zero errors experienced during each. We're incredibly proud of what we accomplished here, and we're so happy to share what we've learned and how we accomplished it with you all today. Now I'm going to bring back Mark and Tomas to the stage with some closing thoughts.

Thank you, Dennis. So we're running toward the end of the session here, but we'll try to wrap things up a little bit. We hope that a look at how we strategize about scale at Okta is useful for you and gets you thinking a little bit about how you can scale your own systems, right? And remember that it's never just about scale; there's always more than one thing on our plate. So when I look back at the keys to success for us here, the first one's perspective. We have to think about having a single goal, not three separate goals, when we're thinking about these three things. And when we can find the overlaps between them, those are the best points for us to go after. When it feels like you've got conflicting priorities, you've got to turn that into a tool, because that's actually a great way to figure out where to focus and where to spend your time. And we know that modern distributed systems are more than the sum of their parts. How all those components under the hood interact with each other is a great place to start looking. After all, if the gears in a motor are all just spinning independently, the car doesn't go anywhere, right? You've got to get those teeth meshing together. So look at the intersections between all those different components, as Tomas walked us through.
Now, once you start scaling a system, it's all about refinement, so don't be afraid to iterate. If you're looking for a six-and-a-half-times increase in your scale, you don't have to do it all in one go. You can start with a smaller step, and a smaller step, and a smaller step. That's the N+1 > N methodology that Dennis talked about; iteration is key. And while you're iterating, don't be afraid to remove and simplify. Sometimes the things you're removing from a system, or the things you're not going to focus on to scale, are just as important as the things you think you might want to add or change. And it's okay to challenge those earlier assumptions. Sometimes what you did five, six, seven years ago, maybe those conditions have changed. If nothing had changed, why would you be messing with the system anyway, right? So it's always okay to challenge those original assumptions. And finally, don't forget about that last mile: start where you are, not with your ideal state. The assignment is never just to build a very scalable system. You have to start from where you are today, get to that place, and bring your users along for the ride. It's never really done until it's actually deployed in production, and that's where the hard work is often found. So we'll be out in the halls and in the developer zones. Thanks for coming.