Okay, so first, it's been a lot of fun helping put this conference together with Laurie and the rest of the program committee. I'm Andrew, and I'm going to talk to you today about my experience building deployment systems during hypergrowth. We'll talk a little about hypergrowth, and then about three problems we saw and some basic solutions, because I don't think there are any great solutions to any of this right now. Hopefully it's food for thought, and a big part of what I hope people take away is where I believe a lot of the industry is going on deployment.

So who am I, and why am I talking to you all? I'm currently CEO and co-founder of a small seed-stage startup working on deployments. Formerly I was CTO of a fintech called Vise, and before that I ran infrastructure at Dropbox. I spent almost a decade there, a little under nine years; I joined at about 100 employees and saw it grow to about 3,000. At YouTube, I also joined at about 100 employees and saw it grow to about 3,000, working on the infrastructure team at both. I started my career at AOL, at what I'd call the end of its hypergrowth phase, in the early 2000s, watching dial-up do what dial-up does. Which is, yes.

Okay, so who's been through hypergrowth before? Anybody here? Show of hands. Okay. I think of hypergrowth like this. Over the long time span of how organizations typically work, you crawl, you start to walk as a toddler, and then you run, and you get to do that over time. Hypergrowth compresses everything into maybe a three-year window at best. You mature from zero to something, and usually it's something large, very, very quickly.

The non-technical story I have about this: I remember going campus recruiting up at the University of Washington, and my mentor at the time was lamenting that I was only going to get to do it once. I said, Kevin, I really wish I got to do this recruiting more than one time. And he said, no, this is hypergrowth. You get to do this once. You learn some principles, you move on with your life, and you teach some people how to do it. They're going to go do the next five schools, and that just repeats all the way down. That is software engineering in hypergrowth as well. You do something once and you move on. There's very little time to perfect processes or make things great. You focus on the why and the how, and you don't really care about the what. You'll probably see a bit of that through this presentation.

So that's hypergrowth. Just to set some context for what we saw: I joined YouTube in 2008 and was there through 2012, and that gives you the ramp of user base we were seeing at the time. We went from, I think, somewhere around 100 to 250 million users a month to almost a billion active in four years, and that growth curve is still going. During that phase, we saw 24 hours of video uploaded every minute; we'd gone from something like 12 hours per minute to 24 within that four-year window. For context, I think it's in the hundreds of hours of video per minute now.
Dropbox, where I went right after YouTube, was a similar growth story. I joined in 2012 and was there through 2020. We went from, I think, about 65 million registered users when I joined to 700 million on the infrastructure when I left. One of the major projects I worked on was a system we call Magic Pocket: we took 700 petabytes of storage off Amazon S3 and moved it to an in-house storage system we built from scratch, on custom hardware, the whole thing, in data centers across the US. We basically moved 700 petabytes out and into our storage system, which, replicated, was probably about two exabytes of data at the time. And we did that in a 24-month period. All the testing infrastructure, all the deployment infrastructure, everything had to be stood up, including the hardware, in that window.

So, to tie this back to the theme of the conference: how did we think about GitOps? These are the questions we were trying to answer; I saw a slide this morning with a very similar set of questions. How do I do this declaratively? How do I do it repeatedly? We built systems to do a bunch of these things. The thing is, we didn't actually have the time to build something that looks like Argo or Flux or anything like that. What we had looked generally like this: SyncDBOps, a script that ran on every single server on a one-minute cron job. It synced down a bunch of configuration data and then put the box in the right state, effectively. This predates Puppet and Chef; we had a little CFEngine, but mostly it all ran through that. Then we deployed code to the edge via HD pool and distributed SSH, and we ran it all under Monit. If you mapped that onto today's world, it would look pretty much the same, just with a little better tooling.

So that's the world we lived in. That scaled for us to 10,000-plus machines, daily deployments, team sizes ranging from 10 to about 40. You don't need a lot to make this all work in very large-scale production environments and make it pretty reliable. You don't have to overthink it, and you can move much faster in a lot of cases.

Next is organizational friction. Show of hands: who feels more intimidated by organizational friction than by technical friction? I'm guessing every single hand in the room is going to go up. Who's a manager here? Okay, it's the inverse of the hands that went up, by the way. But that's the story. What we saw is that time to deploy started going up with this workflow, and the main reason was operational overhead.

So what do I mean by operational overhead? YouTube circa 2008, 2009 looked like this. We had a bunch of sharded databases, a caching layer, and web tiers basically isolated into individual clusters. Isolation is a theme later in this talk, but we would isolate these as units of work, all running the same stack. Then we had some batch jobs and other stuff. Now, what happened was that YouTube got acquired by Google. And Google is really good at some things, like search, like serving lots of small images really, really fast.
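A quick aside to make the SyncDBOps picture concrete. Here's a minimal sketch of what an agent like that could look like, assuming a one-minute cron entry points at it. This is illustrative only: the config endpoint, the JSON shape, and the Monit invocation are my assumptions, not the actual script.

```python
#!/usr/bin/env python3
"""Illustrative SyncDBOps-style agent: run from a one-minute cron,
pull this host's desired state, and converge the box toward it."""
import json
import socket
import subprocess
import urllib.request

CONFIG_URL = "http://config.internal/hosts/{host}.json"  # hypothetical endpoint

def fetch_desired_state(host: str) -> dict:
    # Sync down the configuration data for this specific box.
    with urllib.request.urlopen(CONFIG_URL.format(host=host)) as resp:
        return json.load(resp)

def converge(desired: dict) -> None:
    # Rewrite config files only when their contents actually differ.
    for path, contents in desired.get("files", {}).items():
        try:
            current = open(path).read()
        except FileNotFoundError:
            current = None
        if current != contents:
            with open(path, "w") as f:
                f.write(contents)
    # Hand long-running processes to the supervisor. Starting an
    # already-running service is a no-op, so re-running every minute is safe.
    for service in desired.get("services", []):
        subprocess.run(["monit", "start", service], check=False)

if __name__ == "__main__":
    converge(fetch_desired_state(socket.gethostname()))
```

The property that made this scale is idempotence: the cron job fires every minute, and nothing changes unless the desired state does.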
So the decision became: okay, let's move things to Google, because we're really bad at this and it's actually really hard for us to do. So we ended up with this architecture. What does this look like in today's world? What would you call it if you had to describe it? Multicloud, yes. We had multicloud in 2008. All the same problems we're all faced with today; we were asking, how do we manage this? And remember, we had this stack, and we said, we're going to use this stack to manage everything on top of Google. No. It doesn't work. Because Google looks a lot more like Kubernetes than it does a bunch of physical machines spread across the world. And that caused us a ton of pain.

How many people have multicloud here, show of hands? One, okay. Scale of one to ten, just shout it out: ten being this is super easy, one being this is an effing pain in my butt. Are you closer to five or closer to ten? Two? Okay, so it's super hard. Yes. Because you end up with something that looks like this: silos of tooling, and everyone asking, what am I going to do? No one really understands what's happening, and the cognitive overhead is just crazy. Engineers can't move between systems, teams can't transfer. It becomes really high overhead for everyone to manage. We used to say, if you can't fix it in five minutes, escalate to the whole team. And we would literally have the secondary pager page everybody. We were still carrying SkyTel pagers at the time. We would literally page everybody, because there was no other way to resolve it; you needed someone who understood the tool chain on the other side to fix it.

So we went back to first principles and asked, why is this? Well, we've got these two clouds, and we really want deployment to look the same across them. What we ended up doing was pulling the infrastructure out from the deployment layer. In today's world, my recommendation is: don't commingle Terraform and tools like it with your deployments. They're fundamentally different things. We learned this the hard way. We commingled them, and we tried to use things like HD pool and those types of systems to make it work, but it just caused more friction for the organization over time.

We learned that lesson, and at Dropbox we ended up building a very different system. We had a provisioning system for infrastructure, and we thought of that as creation, the creation of resources. Then we had a consistent interface for service management on top of it: one way to deploy services, one way to manage services, with automatic dashboard creation, all of that. And we said, that's the application level. We're not going to concern ourselves with managing the infrastructure level there, because from our perspective it changes very rarely relative to the number of code changes that are actually happening. So we tried very hard to separate those two things. That was one major lesson we learned, and it's something I try to carry forward in all the infrastructure I think about now.
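Here's a rough sketch of the shape of that separation. The names are invented for illustration; the point is that the two layers have different owners and different change cadences.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class InfraSpec:
    """Provisioning layer: created via Terraform or a provisioning system.
    Changes rarely, so it lives on its own horizon."""
    cluster: str      # e.g. "us-east-prod"
    database: str     # the one blessed database system
    front_door: str   # the one load balancer / entry point

@dataclass
class ServiceSpec:
    """Application layer: the single interface product teams touch.
    Changes many times a day."""
    name: str
    image: str
    replicas: int = 3
    dashboards: bool = True  # auto-created on first deploy

def deploy(service: ServiceSpec, infra: InfraSpec) -> None:
    # The deploy path reads the infrastructure layer but never mutates it.
    print(f"deploy {service.name}:{service.image} x{service.replicas} -> {infra.cluster}")

deploy(ServiceSpec("metaserver", "metaserver:2024-06-01"),
       InfraSpec("us-east-prod", "mini-spanner", "edge-lb"))
```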
The recommendation we make to anybody we work with is that you probably want to separate these onto two different horizons, because they actually have different deployment cadences and different deployment dependencies. I'm overloading the word deployment, but you get what I'm going at here. So that's the first lesson.

Okay, now the second one, which is a little more interesting. Both of these were SaaS products. How many people manage a SaaS product here? Okay. This is probably much more applicable to SaaS; hopefully there are interesting stories and maybe some takeaways for everyone who doesn't manage SaaS, but this is really what we're seeing in the SaaS world. What I mean by isolation is what we're going to talk about now.

The first type of isolation I affectionately call locality-based or segmentation-based. I like to rewind the clock and think back to how we got where we are today. Before the cloud, everyone put things in data centers. Who still runs data centers? Some people, okay. Everything got put in this box, and I control everything in this box, because I own this box or I lease this box. The things that go in this box are mine. I can set the network up the way I want, I can set up the policies the way I want; I have complete control over it, effectively. If I'm the CTO of an organization and I have a data center, it is mine and I'm going to run it the way I want to run it.

The problem with this model is that it's not the model of the cloud or SaaS or IaaS products anymore. Data centers are expensive investments, so people moved to the cloud. And so we tend to see this locality-based or segmentation-based isolation in SaaS now, where you end up with multiple geographies, potentially with the same stack, or generally the same stack with different configuration. And this is a real pain in the butt to manage, because traditional deployment systems don't really know how to reason about this vast number of different geographies. The case here is an easy one that we actually saw: basically a GDPR data-locality case.

You can extend this further. At Dropbox, we would constantly get requests for single-tenant infrastructure, meaning: can you stand up all of Dropbox just for me? And the answer was no, and it's really easy as a 300-pound gorilla to say no, that's not how the product works. But one of the things we also saw was that we would transitively start pushing that same demand down to our own vendors. We'd say, we will use you if you can run an instance just for us. Which is kind of ridiculous when you think about it. What we were doing was basically saying, hey, SMBs and mid-market companies: for us to be confident in this, we want you to have these isolation characteristics. Really what we cared about was security isolation zones and things like that. What we didn't trust was that mid-market companies and SMBs could build isolation the same way that, say, a Google or Amazon IAM system works. That's what we wanted plumbed through all the way.
And so we said: instead of that, why don't you just copy your stuff for us and give us a copy of it? Which is exactly what the larger enterprises were saying to us. The banks would say, just copy this thing, we'll run that, give us our version of it. The federal government does this too, right? They went to Amazon and Google and said, give us GovCloud. So it's a trend; it's there. We saw it, and it was a pain, a cycle that just kept going: no, we're not going to do this, we're not going to do this, we're not going to do this. But we lost sales deals because of it, for sure. So if you run SaaS products, this is something you should really think about.

Okay, the second type of isolation is request isolation. When we deployed the desktop and mobile clients, we had multiple release channels for them. A given client is on a given release channel, alpha, beta, or production, and each channel is attached to a specific backend. Because if you allow the full end-to-end mesh of clients in the wild versus backends, it's really hard for engineers to debug what's happening. They look at it and say: I really just want to reason about my version-two desktop software talking to the current version of the backend, or to the next version of the backend. They don't want to reason about all versions of the backend at once, and I get where they're coming from. It's not perfect, but it gives enough confidence that we can move the product forward at a reasonable pace. So that's pure request isolation to pinned backends.

But then layer on top the next request we'd get at this stage: customers want version pinning on the clients. And now this is becoming silly; trying to manage it becomes crazy. Effectively what happens is this: take the first slide, data locality, and say I'm going to put some storage in Europe. Now I have a company in Europe that wants to store its data in Europe and wants to pin its client at version two, because it has a weekly or monthly or some other release cadence different from ours. And they say, that's what we want.

Okay, so show me a deployment system in the world today that can handle this in any reasonable way that makes it easy for the operator. I don't know of one. We struggled with this, we pushed back on the business requirements, and it was probably one of the hardest problems we had to solve. For the most part we ended up not allowing it, with the exception of government-style deals on the FedRAMP side and things like that, where there's a little more leeway. We wouldn't make guarantees on service levels, though; things can break. But managing this today is a real requirement. I don't think I know a B2B SaaS company that pushes desktop software down that does not have some variation of this. Some good friends at some very big chat companies have similar scenarios they're thinking about how to handle.
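To show why this gets hard, here's a toy model of the routing problem: release channels attached to specific backends, with per-tenant locality and client-version pins stacked on top. Every name and value here is invented.

```python
from dataclasses import dataclass
from typing import Optional

# Each release channel is attached to one specific backend.
CHANNEL_BACKENDS = {
    "alpha": "backend-next",
    "beta": "backend-next",
    "production": "backend-current",
}

@dataclass
class TenantPolicy:
    region: str                   # e.g. "eu" for data locality
    pinned_client: Optional[str]  # e.g. "2.0" when the tenant owns its cadence

def resolve_backend(channel: str, tenant: TenantPolicy) -> str:
    backend = CHANNEL_BACKENDS[channel]
    # A pinned client needs a backend that still speaks its protocol, deployed
    # in the tenant's region. Every pin adds another long-lived target:
    # tenants x services x regions x pinned versions.
    if tenant.pinned_client:
        backend = f"{backend}-compat-{tenant.pinned_client}"
    return f"{tenant.region}.{backend}"

print(resolve_backend("production", TenantPolicy("eu", "2.0")))
# -> eu.backend-current-compat-2.0
```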
Almost everyone we talk to at this point has some variation of this problem: how do I handle isolation across all of this, how do I handle the underlying deployments where there's both a backend and a desktop or mobile component, and the end user has requirements they're layering on top of it. This, to me, is where B2B SaaS is moving. I don't see a world where B2B SaaS does not have this.

So now let's talk about ways you can manage it. What are the scenarios? I don't think I'm going to leave anybody satisfied with the architectures we go through here, but this is really the state of play.

This first one is probably the easiest: segment everything. When I say easiest, I mean it's logically the easiest to reason about. Take everything, put it in verticals, and just copy it. I was talking to an engineer friend the other day who's at Google, and he was telling me that even there, on some of the ML and AI stuff, the easiest pattern they have is to just copy everything. But this becomes hard to manage if these are all tenants and they want some set of exceptions across it. Now take the number of tenants, times the number of services they have, times the number of allowed variations. This very quickly becomes thousands of things to manage across the board. And this is the reality of B2B SaaS at this point: the large enterprises are pushing this down as requirements to whoever's beneath them, and that just transitively keeps going down. So that's probably the easiest approach, and where most people start for GDPR and those types of data-locality requirements.

The next one is actually becoming more prevalent; we're seeing it a lot more. These are scenarios we tried during the hypergrowth phases to get people comfortable with. At one point, I believe we used an architecture like this for Dropbox in Europe: give compute isolation, but keep data at rest in one single place. It's a solution, with a lot of pros and cons. The shared data plane needs a certain type of ACLs and certain guarantees on it to make it actually worthwhile. What you're really guaranteeing is: any mutation of data won't get commingled with anything else, but shared data at rest is okay, effectively. This pattern can work well, and it's pretty prevalent now. I think Boston Consulting Group wrote a paper about this earlier this year saying it's basically the most prevalent architecture data companies are going with at this point. But it still has the same problem: if you want to do version skew on the compute, you're out of luck. You've removed one variable, but you've still got enough of them that the multiplicative nature of this remains a problem to manage.

And I think the last architecture that's pretty prevalent is this one: run things on-prem for people. Somebody gives me a VPC, and I install into that VPC. If you do B2B SaaS, I would highly recommend staying away from this one if you can.
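Back to the segment-everything math for a second: the multiplicative problem is easy to see with a back-of-the-envelope enumeration. The numbers here are made up.

```python
from itertools import product

# Fully segmented verticals: one deployment target per combination.
tenants = [f"tenant-{i}" for i in range(50)]
services = ["web", "api", "sync", "metadata"]          # 4 services
variations = ["us-latest", "eu-latest", "eu-pinned"]   # locality / pin exceptions

targets = list(product(tenants, services, variations))
print(len(targets))  # 600 targets from just 50 tenants; real fleets hit thousands
```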
That on-prem model is probably the hardest of all of these to manage, because you don't know what the customer is going to do to the underlying substrate you're deployed on. That's now a concern. Remember earlier, when we talked about separating deployment from Terraform and those things: if you don't have a stable foundation and you try to put the next layer on top of it, it's probably not going to be a great experience for the end user. We never even explored this one; the data sizes were just too large for it to ever be an option. But for completeness, this is what we reasoned through, the journey we went on.

In the end, for most of what I've worked on, we really just pushed back hard on the business requirements. My sense, though, is that that's not really an option anymore; it's becoming way more prevalent for customers to say, I have to have these types of environments. I see a lot of nods in the room, so I'm guessing people have seen this. I don't think there's a way you can avoid it at this point, and fundamentally it makes deployment much harder. You end up with custom approvals, configuration differences, various versions: all things your deployment engine now has to consider if you're doing anything in the realm of B2B SaaS. And it's not simple to build something like this. I've been trying to figure out how to model this in pipelines, and I don't know if it's possible; fast-forward to the end and we'll get to where I think things are going. These are the two architectures on the SaaS side that, if you had to pick one and you're going to do SaaS, probably make the most sense. Happy to talk about pros and cons afterwards; I wrote a blog post a couple of weeks ago going into a lot more detail on this and the costs.

So I think we'll end up with isolation-aware deployments. Deployment systems need to be aware of tenancy and isolation levels; they need some hints in order to solve this problem. If you don't give them the hints, it all goes back to operator overhead and operator complexity. I know people who have basically said, we'll just hire three people to manage this single-tenant infrastructure, and that's their solution. If the deal size is big enough, it might work. That's kind of the takeaway.

Okay, this one is my favorite: communicating is hard, and it's even harder with distributed systems. Another pop quiz. Put your hand up if you have one system. Nobody? Two systems? Nobody? Five to ten systems, anybody? Okay. Ten-plus systems: that's where you see most of the hands go up. My takeaway is that in today's world, distributed systems are the norm. I don't think there is a single system getting deployed to public cloud at this point that is not a distributed system. We can call it a monolith, we can call it whatever we want, but the fact that it's on public cloud means it's going to be a distributed system, because I'm going to have a database and a web server. That's a distributed system: two components that have to stay in sync in some sort of way. Okay. So that means there are system interdependencies, and we can try to model those.
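For the system-interdependency side, this is the kind of minimal modeling I mean; the schema and version numbers are invented for illustration.

```python
# Even "a web server plus a database" is two components that must stay in sync.
REQUIRES = {
    "web:5.0": {"db-schema": 12},  # web 5.0 needs schema migration 12 applied
    "web:4.9": {"db-schema": 11},
}

def safe_to_deploy(component: str, live: dict) -> bool:
    # A release is safe only if every dependency is at or past the minimum.
    needs = REQUIRES.get(component, {})
    return all(live.get(dep, -1) >= minimum for dep, minimum in needs.items())

print(safe_to_deploy("web:5.0", {"db-schema": 11}))  # False: migrate first
print(safe_to_deploy("web:5.0", {"db-schema": 12}))  # True
```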
I think the more interesting thing in the world of deployments is less about the system interdependencies and more about the human interdependencies that emerge from them. Almost no distributed systems are truly designed well; very few that I've ever worked with actually have good characteristics, and they lead to this full mesh of humans asking each other, what's happening?

My story here: who thinks YouTube is stable and generally works? Yeah, most people, okay. What if I told you that, for a very long time, the way the site was deployed was that we would get 200-plus engineers into an IRC channel and ask everybody: is your thing good? Please tell us that the thing you're releasing is okay, because it's a monolith. I mean, you saw the architecture. Every owner and stakeholder we had, a couple of hundred of them, would literally come to this IRC channel. One of us would push the button and do the release, we'd say, okay, we approve it to production, and then they would all go and check everything. This went on for years, by the way. Similar story at Dropbox: we had the same sort of process, where we'd release everything and then every little product team would run around and try to figure out whether their part of the distributed system was doing what it needed to do. And that's super complicated. We had basically put a convergence algorithm onto humans and said: converge this thing, make sure it's correct.

So we're back to this: lots of humans involved, and time to deploy going up. We went from being able to deploy in 20 minutes at YouTube to it taking three days at one point, because someone would find a bug and then we'd have to go through the cycle of figuring out: is it really a bug? Can we keep going? And that had huge cognitive overhead. I'm going to speed up here so we can make some time for questions.

So that put us back to first principles. What do we do? We said we want autonomy, we want reliability, and, most important of all, I want it to not be surprising when you do things. What we thought through is that we need a much more application-centric, service-centric approach: we need the users to define what the invariants are for their systems. We can't try to do this all together centrally. They know the invariants; they need to actually specify them in some way so that we can reason about this in a much more intelligent manner. That's probably the best way of putting it. And that led us to: we need convergence here in some way. Fast-forward to today, and Dropbox does a little of this at this point in how they think about deployments.

The big advantage is convergence over pipelines. Who here uses Kubernetes? Okay, so you're all using convergence algorithms under the hood there. But Kubernetes only knows about Kubernetes; it can't converge everything else.
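To make both ideas concrete, user-defined invariants plus convergence over pipelines, here's a toy sketch. Everything in it is invented: the in-memory world, the invariant functions, the field names. The shape is what matters: teams declare invariants as little pieces of code, and a loop keeps closing the gap between desired and observed state while those invariants hold.

```python
import time

# Toy stand-in for real state: orchestrators, schemas, queues, dashboards.
WORLD = {"sync": {"version": "4.1", "replicas": "3", "error_rate": 0.002}}

# Service teams declare their own invariants as small pieces of code.
INVARIANTS = {
    "sync": [lambda s: s["error_rate"] < 0.01],  # releases must keep errors low
}

def observe(service: str) -> dict:
    return dict(WORLD[service])

def plan(desired: dict, actual: dict) -> list:
    # The idempotent set of changes needed to close the gap.
    return [(k, v) for k, v in desired.items() if actual.get(k) != v]

def converge(service: str, desired: dict, poll: float = 0.05) -> None:
    # No fixed stages: compare desired vs. observed, close the gap, repeat.
    # Re-running is safe (idempotent), and drift gets healed the same way.
    while (gap := plan(desired, observe(service))):
        state = observe(service)
        if not all(check(state) for check in INVARIANTS.get(service, [])):
            print(f"{service}: invariant violated, holding rollout")
            return  # a real engine would pause, alert, or roll back here
        for key, value in gap:
            WORLD[service][key] = value
        time.sleep(poll)

converge("sync", {"version": "4.2", "replicas": "5"})
print(WORLD["sync"])  # now at version 4.2 with 5 replicas
```

Contrast that with a pipeline: there's no stage list to re-enter halfway, so a failed or drifted deploy just converges again from wherever it is.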
So we started thinking: how do we build this at a higher altitude? How do we think about this at the application level, not just at releasing a single service? Because there are other pieces. Databases have to get pulled into this; you have queues, you have Kafka, you have a million other components that all need to interact with this ecosystem. And these are the main benefits of convergence: it's generally idempotent, the parallelism can be much higher, it has state awareness of what's happening, it's way more flexible than a pipeline, you can especially do things that are declarative with it, and it becomes self-healing. So generally, we've found this a much better way to think about deployment than the traditional pipeline mechanism.

A lot of that comes from looking at the communication patterns of people and how they were working together. We had to solve that problem, because there has to be a technical solution for it: how do we empower people and give them back the ability to write little pieces of code to handle it? And that leads to a very different, outcome-based engine, rather than an engine built on the traditional approach of a bunch of stages in a pipeline that can't really be rolled back easily.

With that, my conclusion, and I have two slides for this. Tenancy, convergence, and workflows, or maybe isolation, convergence, and workflows, depending on how you want to think about it: I think for any modern deployment engine, these are going to be the key things, and this is what hypergrowth taught us. Going all the way back to the beginning, the way I think about it is that hypergrowth gives you this really focused, narrow window to get everything correct, and you grow so quickly that you see all these weird little lessons that the rest of the world may take five or ten years to go through. We saw so many of these random little patterns that are actually coming to fruition now in the industry in general. If you think about building and up-leveling deployments now, it's going to be about convergence and workflows, where workflows includes how you want to separate your infrastructure from your deploys. And then tenancy and isolation are the really big key things.

That's what I've got. We have about two minutes for questions here. I'm happy to tell other stories, or to grab drinks with somebody if they want to hear other stories, but yes, I'll pause there. Questions? I don't know if I'll have satisfying answers, but yeah.

Can you expand a little, at a fundamental level, on the differences between infrastructure and deployment that you're referring to?

The best way to think about infrastructure is that there's a baseline of things that just generally don't change: call it machine management, databases. And I should give one key piece of context here: in both of the scenarios we talked through, in hypergrowth, we believed very strongly in "one of." So there is only one database system for all of Dropbox, for example, and it is basically a mini version of Spanner.
That is what they store all data in. YouTube had a very similar setup: there were two database types, one for user data and one for video, if I remember correctly. And that's it. That's what lets us get away with this; the only thing we're scaling there is the number of them. We're not adding a new database type every week. That gives us the ability to separate these things in a very clean manner. Thank you. The other portion would be load balancers and things like that, too. Also, there's only one front door, so we're never changing that front door.

Other questions? Well, we're exactly at 30 minutes, which is actually two minutes over, so I will end there, and I'm happy to chat more. I'm also supposed to plug that we have a booth and that we're raffling off a Lego bonsai. If anybody wants that, please see Sue and she can help get you into the raffle. So thank you all.