Hey, welcome to Wayfair Same Day Delivery. This is a presentation that outlines a narrative in painful anecdotes about CI at scale. First, let's introduce ourselves.

Hi, I'm Lelia Bramuso. I'm only slightly less of an emoji enthusiast than Gary, but I make sure to keep my skills fresh by serving as an active and sometimes unwilling sounding board for Gary's wild and wacky, memeable clips. When I'm not emotionally supporting my co-worker, I love automating things and turning tedious, repetitive processes into streamlined systems, with the ultimate goal that I'll never have to do that tedious, repetitive thing ever again.

Well said, Lelia. I'm Gary White Jr. We work together in the open source program office, or OSPO for short, where we organize Wayfair's open source projects and the teams that maintain them throughout the organization. We do this with a healthy amount of automation and guides, including a website that you can make use of, wayfair.github.io. Don't worry, we'll put up a link and plug it again at the end.

Lelia and I have worked together for most of the time that I've been at Wayfair, not just on the OSPO. Unending days of bliss, that time has been. But we weren't always on the open source team. In fact, our true bond was forged in the laughter and the flames of our first team at Wayfair. We started our journey together on Test Enablement, 2018 to 2021, rest in peace. We were on a team with the stated mission of allowing Wayfair engineers to deliver value rapidly, repeatedly and reliably. We did this under Jay Farron, who's pictured here as a smiling face in a box. He's left the company, but his Slack emoji legacy lives on through his picture.

One of the biggest issues that we found while we were on that team was that our continuous integration, testing and other automation pipelines needed help. Lelia and I worked together to scale up that offering for a Wayfair that had grown from hundreds of engineers to thousands over the course of our time on Test Enablement. Pour one out for a real one. Many of the processes we used when we were a smaller company stopped working as we grew. We had a lot of lessons that we learned along the way. Instead of just giving you those lessons, though, we figured it might be a little bit more fun, and just a little bit painful, to recount some of the things we used to do, how we learned from them and how we use what we learned. We hope that recounting these tragic tales might bring some well-earned levity, or at least some joy, to our past experience.

All right, let's get started with some stories. As we walk through these examples, keep your eyes and ears open for some guiding principles that made our lives easier over the long term. Whenever we focused on these ideals for our organization, we tended to find a good solution, and we'll use these ideals as a blueprint while we revel in some of our experiences. Oh, dramatic. What can I say? I have a flair for the dramatic. Gary, did you freeze, or are you waiting for a reaction? You know this talk is virtual, right? Let's move on. Okay.

Our first anecdote is that one bad commit can ruin the whole batch. We'll start with a process that almost every Wayfairian used to have to deal with, the integrator, and we'll talk about how one bad commit can ruin dozens of developer hours, and then we'll talk about how we fixed it. Our Wayfair engineers would write code in a PHP monolith. And by monolith, I mean a multi-gigabyte behemoth that would sometimes suck up 40 minutes of your first day setting up a developer environment.
Engineers would commit to their own branches as they sanely developed on this, and we would have a process to integrate as many as 40 or 50 branches together into a single integration branch. The thing that made integration branches was called the integrator. Put simply, it gobbles code and makes a big code out of all of the little codes, instead of a bunch of snick-snack codes going into the big code at once.

We all know that developers literally can't write code that doesn't work, because it's against the law, but sometimes code works differently when it is on your laptop and when it's not on your laptop. So we run tests before the integrator merges, to make sure a change doesn't make its way into a batch if it's failing in CI. Following the same principle, we run tests after the code is merged together, so that a bad batch doesn't get into the main branch and deployed to machines. So, pretty simple pipeline. We would run tests to ensure that the code works before it goes into the integrator and then after it gets merged together.

So what's the problem here? What if I told you that sometimes there were only two nodes running the jobs for the first part of this workflow, serving hundreds of engineers? The lack of throughput that we had would sometimes lead to 45 minutes or more from the time the tests start to when they're added into an integrator batch. The tests after the integrator batch was put together were a little faster, because that's how the team maintaining the integrator would judge themselves. But it's a really ridiculous amount of time for developers to just sit around, because this is about an hour before they're even able to see their change in production or see their change merged into the main branch. Even worse, if your tests fail, then you may have to run the entire testing suite again. That makes the first step of the journey two or three times longer than it was originally. That cycle is something that a report of mine once went through for six hours before they managed to get their merge into production. And they were deploying two lines of code, something that took maybe 20 minutes to investigate and test locally, and they spent literally the rest of the day getting it through our deployment process. Our expectations at Wayfair were much higher than what some companies might have; taking about a day to deploy felt unacceptable. It felt like something we knew that we could do better, because originally we wanted this process to take 10 or 15 minutes.

So how do we fix it? We postulated that our problem might be that our resource-intensive build nodes wouldn't be able to scale up to serve the entire Wayfair organization. When Wayfair was a really small outfit, it was easy to send developer jobs onto a couple of nodes. It would work so fast that the minor inconvenience of waiting a few minutes for somebody else's job to pass didn't matter. It would go to the integrator and we would move on and not think about it. But as we started to hire more developers, the wait time got longer. So it took a lot more time, and it got slower and slower to the point that it was just crawling. And it didn't scale at the pace we needed for the business to run effectively. Developers would wait for longer than they were willing. They would batch commits together, and it would just slide further and further away from the expected speed and timing.
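Just to make the shape of that two-gate workflow concrete before we get to the fix, here's a rough sketch in shell. It's only an illustration of the batching idea, not our actual integrator code, and the helper scripts it calls are hypothetical.

```bash
#!/usr/bin/env bash
# Rough sketch of the integrator flow described above.
# run-tests.sh and build-integration-batch.sh are hypothetical helpers.
set -euo pipefail

branch="$1"

# Gate 1: a branch only becomes a batch candidate if its tests pass in CI.
if ! ./run-tests.sh "$branch"; then
  echo "Tests failed on $branch; not adding it to the integration batch." >&2
  exit 1
fi

# The integrator gobbles candidate branches into one integration branch.
integration_branch="$(./build-integration-batch.sh "$branch")"

# Gate 2: the merged batch is tested again before it can reach main and deploy.
if ! ./run-tests.sh "$integration_branch"; then
  echo "The batch failed; one bad commit ruins the whole batch." >&2
  exit 1
fi

git checkout main
git merge --no-ff "$integration_branch"
```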
Instead of using large machines that were harder to maintain, we figured that it would be faster to use a bunch of smaller machines and distribute tests across many of them. The unit test pipeline is now intelligent enough that it can decide how far it has to split tests and run all of them within about five minutes or so. Most of the time we manage to keep the pipeline testing to five minutes or less, with the total process of running all the checks you need taking 10 minutes or less to get into the integration batch. By shortening that time, we were able to reduce the likelihood that people missed batches and would have to do a rerun. That created an enormous amount of goodwill and trust in a system that had caused so much frustration for our engineers in the past. With some of this goodwill, we figured that it would be easier to break things, but before we decided to break more things, we wanted a little more goodwill. So Lelia, can you tell me what happened when we did break things?

I sure can. Thanks for that flawless segue into our next story. Uh-oh, something broke. Don't ask what. Now, the integrator is definitely in a better state today, but things can still go wrong. How do we let engineers know when something goes wrong with their deployment? After all, sometimes a seemingly benign pipeline failure can cascade into a total failure of the user experience. Let's use a real-life example to illustrate. Oh God, this slide just threw up in my eyes. I have to tell everybody that I had nothing to do with this. This is all Lelia. I'm so sorry, everyone. I can assure you, though, whatever you're feeling from this slide is nothing compared to how our protagonist felt.

Once upon a time, a backend engineer spent 45 minutes waiting for her pipeline to finish, when suddenly she received a dreaded ping on Slack. Something is broken, uh-oh. She knew this meant her change would be rolled back, but had no idea where to go to find more information so she could begin debugging. The engineer spent the next several minutes clicking around on Slack and GitHub looking for a clue, but to no avail. The PRs weren't helpfully annotated, and she kept getting redirected to different Slack channels. Eventually she grew frustrated and had no choice but to forensically examine the entire lifecycle of the pull request to determine where things went wrong. At the end of the day, the root cause and fix wound up being pretty trivial, but the time spent chasing down critical context for the issue was anything but. Although delays in time to resolution could be written off as a mere frustration in isolation, we quickly realized that the collective cost of engineers being unable to diagnose their own pipeline failures had a huge impact on our deployment velocity, not to mention developer leverage.

So we set out to create an incrementally better user experience by partnering with our release and sustainability engineering colleagues. Now, when deployment pipelines fail, and they definitely do, we send targeted Slack notifications to the user with a direct link to the relevant failure in Buildkite, our CI system. Some pipelines even provide detailed failures through PR analyzers and GitHub checks, while others use comments to link to failing line numbers, failing tests, builds, et cetera, all for developer convenience. As we all know, pipeline failures are about as certain as death and taxes. Unless you're mega rich, then you don't pay taxes, am I right? Big true. Yeah.
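If you're wondering what that notification plumbing boils down to, here's a minimal sketch of the kind of script that could run in a Buildkite post-command hook. The Slack webhook variable is a stand-in and the message wording is made up, but the BUILDKITE_* variables are the standard ones the agent exposes.

```bash
#!/usr/bin/env bash
# Minimal sketch: ping the build's author on Slack when a step fails.
# SLACK_WEBHOOK_URL is a hypothetical secret; the BUILDKITE_* variables are standard.
set -euo pipefail

if [[ "${BUILDKITE_COMMAND_EXIT_STATUS:-0}" -ne 0 ]]; then
  message=":rotating_light: ${BUILDKITE_BUILD_CREATOR:-someone}'s build failed in ${BUILDKITE_PIPELINE_SLUG}: ${BUILDKITE_BUILD_URL}"
  curl -fsS -X POST "$SLACK_WEBHOOK_URL" \
    -H 'Content-Type: application/json' \
    -d "{\"text\": \"${message}\"}"
fi
```

The real version is more targeted, routing the message to the specific engineer and linking straight to the relevant failure, but the shape is about that simple.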
But at least now our engineers are empowered to investigate and fix their own pipelines. With a combination of automated tools and smarter context sharing, we drastically shortened the feedback loop between the time a problem is identified, the time the correct engineer is notified about the issue, and the time it takes to gather enough information to submit a fix, all in pursuit of those illustrious green check marks. Wow. Thanks, Lelia.

Now on to a completely different story: snowflake build infrastructure. We'll use this anecdote to talk about how snowflake agents and other dark sorcery have worked their way into Wayfair. But how many of us are familiar with the term snowflake machine here? One, two, three, okay. A bunch of us. Well, for the ones who haven't heard of it, snowflake machines are the kind of machines that were created, without an easy way to recreate them, to serve multiple purposes. Pragmatically created, machines like this usually become accidentally critical to operating basic business functions, and critical for the business as a result. I like to think of these machines as being part of dark magic and dark engineering, because by using this practice of not really writing things down or making them reproducible, we're kind of desanctifying the work that we do, and we make it very difficult or impossible for others to follow us in any cohesive way. But that's the theory. Let's see what this actually looked like.

At some point, we had Git servers running, keeping our source code tidy and serving Git operations as needed. Then an evil dark wizard came along and he installed some software. And the software seemed innocuous at first. It even seemed useful at first, because Git doesn't always use all of the resources given to the machine. So the wizard figured that installing a job runner like Jenkins (it wasn't Jenkins, but we'll pretend) alongside Git operations would work fine. And most of the time it did. But some of the time it didn't: our job runner, which wasn't Jenkins, but we'll say Jenkins, would eat up more resources than we expected. Jenkins would eat so many resources that the Git operations on the machine would fail. And when the machines become this overwhelmed, they can't do Jenkins well and they can't do Git well, so they essentially become useless.

So, thinking again logically through the issues, we would see most traffic on the Jenkins nodes during critical bug fixes, when many operations were being run to try to test and deploy the fix. Runners for our website might also be calling on Git a lot for Git operations and overburdening the Git processes. So Git and Jenkins get in a little fight, and a cyclical relationship forms that would keep us from being able to solve our problems without making problems worse. What comes across as innocuous and pragmatic, and even a good idea, wound up breaking our ability to deploy, test, or even update our code base. Something obviously had to change.

The first and most obvious option was to replicate the runners about the same as when they were co-located with the Git machines, just with their own resources. But we found that the snowflake build infrastructure was unreliable in and of itself, and it was very hard to use. Some of the queues for running jobs at this point were just labeled "run" where you might label "Java" or "Python". The name "run" is just not descriptive.
When we searched to find what these "run" machines did, the answer that we got was: everything. They would try to run everything that you possibly could. And that just didn't seem sustainable in the slightest. So we opted to try something else. We made, again, smaller machines that could do fewer things, and we put them in queues so that they were identifiable by language. Some examples are PHP-specific machines designed to suit monolith jobs, and Java-specific machines designed to deal with the Maven and Gradle builds for the versions of Java that we use at Wayfair. By distributing this workload across a lot of machines and a lot of different sets of infrastructure, we increased our system's reliability. By giving the pools of runners a more reasonable set of names, we increased the likelihood that folks would intuitively understand what the machines were built to do. No more asking around until a dark wizard in the woods tells you what hosts to target for your critical workflows.

There was one more piece of dark wizardry that we untangled in writing our code this way. When we created this new set of build infrastructure, we focused on being able to recreate it, so that if we ever needed to make more we could, and if we ever needed to get rid of some, tearing them down was a simple process. If you're not familiar with Terraform and Puppet, you should take some time to get familiar with them. I can't cover them in this talk because they're a whole thing in and of themselves, but they allowed us to create, replicate and maintain our build infrastructure with significantly less work than dark wizardry, and using Terraform and Puppet to stand up, configure, maintain and tear down machines was pretty much simple. This turned our dark magic into replicable, repeatable, documented, understandable code. Wayfair is better for it, and that's why the dark wizardry is now in code. This sustained us until we hit the next hurdle of maintaining modern enterprise-scale applications.

Sliding right into our next anecdote: containers will solve all of our problems. Before I go any further, I wanna take a moment to soapbox about legacy code and hindsight. Most of the anecdotes here were made by people making pragmatic decisions that eventually hindered our ability to function as well as we could as a business. We're sharing some anecdotes that we caused to emphasize that we are not perfect either. We inherited some things that we changed, and you will likely do some things that somebody else will inherit and want to change. This is not a takedown of those people. This is not a takedown of ourselves. It's just a humorous look at how sometimes we make mistakes, hoping that you can learn along the way how they can be fixed. So anyway, getting back to the talk.

When we spun up a new set of agents to serve these concerns with the new tooling, containers were on the table as a tool. Containerization is an amazing tool for orchestration and usage of different environments, and we made heavy use of it at Wayfair. It made sense to allow for a Buildkite queue that could run containers as a part of CI/CD. We had very little idea what people would wanna do when we opened up the queue, though. A lot of forward-thinking teams at Wayfair had found a way to run their workflows, especially using containers, but we needed a flexible way to accommodate them while bringing in new adopters. This queue was very unrestricted as a result. It was experimental in nature. Experimental, with a technology that many of our engineers were new to, opened up at scale.
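Whether a job needs one of those purpose-named pools or the new container queue, pointing a pipeline step at the right agents is a one-liner on the step. Here's a tiny sketch; the queue names and commands are hypothetical, but the upload mechanism is the standard buildkite-agent pipeline upload.

```bash
#!/usr/bin/env bash
# Sketch: route steps to purpose-named agent queues (names are hypothetical).
cat <<'YAML' | buildkite-agent pipeline upload
steps:
  - label: ":php: monolith unit tests"
    command: "scripts/run-unit-tests.sh"
    agents:
      queue: "php-monolith"

  - label: ":java: maven verify"
    command: "mvn -B verify"
    agents:
      queue: "java-maven"

  - label: ":docker: containerized build"
    command: "docker build --pull ."
    agents:
      queue: "containers-experimental"
YAML
```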
What could possibly go wrong? Let's talk about it. These new agents were purpose-built for container jobs, and they worked great for a long time. They're easy to isolate. The default configuration in Buildkite makes it simple to set up and tear down what had been very complicated, difficult jobs that took longer when they were on VMs. And we continued to iterate on this build infrastructure when issues came up. If we found a particular node was behaving differently than others, we would isolate it, investigate it and put up a fix so that no other node would fall into the same trap. This meant that we spent a lot of toil on the queue: a lot of log combing, metrics digesting, digging into folders and observability reports. It's very labor intensive, but it was okay, because this was an experimental queue. It was an isolated queue.

Okay, well, experimental is a loose term. If the capitalized E in the corner wasn't enough of a hint, and the air quotes weren't enough of a hint, this queue wasn't experimental. Not that anybody told me that. Gary, I literally did tell you that. Everyone told you. Well, I don't remember that. So we eventually started to see more and more people using the experimental queue as a queue that was critical for their application and deployment and testing processes. The more that teams were doing things, and doing different things, on the machines, the more likely it was that machines became unstable. If you're working with one language or workflow or a couple of teams, you can keep things under control, because basically every machine does similar things during the jobs, and you can work around all of the leave-behinds that they might have or any of the issues that might come up unexpectedly. But that breaks down past a critical point when you have as many teams working at the scale that we did at Wayfair.

So let's see an example of a problem that we actually saw. On some of our machines, we noticed that the /var volume was filling up pretty regularly. We had no idea why, because most builds shouldn't have even been able to access that volume from where the Buildkite builds started. We found out that because the Docker service had elevated permissions, folks thought that caching their dependencies in /var/.m2, which is where Maven was keeping its dependencies, would speed up their build. Without asking the build infrastructure team, they just mounted that volume on the host machine. And /var would fill up, crash important services, and then the container daemon for Docker wouldn't even start. We'd have to do surgery on the machine, and the Java team would be upset because now their builds were slower, because we had to clean it out. We needed to be able to solve problems, but we needed to solve them in a way that wouldn't cause downtime or disruptions, while still letting us investigate and triage issues. That's a tall order, Gary. It is, it is, but it's not impossible.

I mean, the solution is a bit simpler than I first thought. The most important part of this was making sure that nodes that were problematic wouldn't be available for very long. If we kept the nodes from reaching a lot of developers, then we could get ahead of the issues before most of our community knew that anything had happened. The solution that we came to is more subtle than the others we're gonna talk about and that we have talked about. So I'm gonna give you a moment to think about it. The answer is on this slide, and it's very subtle. I'm just gonna give you a second to think about it. Nope, not a cue for you.
I'm asking the audience. Okay, maybe a hint. See the names on the slide? What kind of names are those? Are those pet names that you might give to a pet? It's because we were treating our infrastructure like pets. We were standing them up with custom Terraform code, with custom Puppet profiles that would start the Buildkite service and create the container profile. It's repeatable, but we didn't go the extra mile of treating them like cattle. We would try to keep these nodes up. We had cron jobs cleaning them out. We had all kinds of work being done to prevent the machines from going down. And we should have just been resetting them and using our logs and telemetry to figure out why they crashed, rather than trying to keep them up no matter what.

As we kept growing our usage of containers, and the need for machines swelled into the hundreds and up to the thousands now, when we're deploying a lot of code and doing a lot of testing, we decided that we needed to treat the infrastructure more like cattle. The resources that we set up would be named uniquely with a GUID set by GCP automatically, and the autoscalers that we get from GCP, but that you can get from any cloud provider, would kill any agent that failed five jobs in a row. So sometimes that meant that we killed agents that were fine and just had a couple of jobs fail, but we would rather be safe than sorry. And most of the time it led us to catching issues well before we ever got feedback from users. That is a seamless upgrade of the UX of the project. We were able to give developers what they needed to experiment, the same way that we were experimenting. This story was all about scaling up our build infrastructure from a little bit to a lot of it. And now I wanna hand it back to Lelia, who can show when less is actually more.

Less is more indeed. Especially when we're talking about a needle in a 2,000-line YAML haystack. Now, some people like to start their days off with a morning brew and a relaxing read of the local paper. I prefer to start mine off with a steaming cup of nihilism while furiously scrolling through thousands of lines of indecipherable YAML. Okay, just kidding about at least one of those parts. Which part were you kidding about? I'll never tell. Okay. But seriously, is there ever a more heart-sinking feeling than innocuously clicking on a file in GitHub, intending to make just a simple one-line change, only to be told GitHub can't display this file right now because it's literally too big, and you have to load the repository into your IDE and Ctrl-F frantically until you find what you're looking for?

Well, that's pretty much exactly what happened in our next anecdote. This was, of course, in the frosty snowflake era, when the CI agents were liable to flake or corrupt at any time. Many moons ago, a motivated engineer was interested in improving the state of his pipeline by using a more reliable agent. He heads over to the monorepo where he intends to make this change, in search of the CI pipeline config. Now, he finds the YAML easily enough. It's in the repository root, but he's pretty immediately horrified by what he finds. The pipeline looked something like this. Not only was it unwieldy to the point of being nearly indecipherable, it was also the single orchestration layer responsible for building the critical production asset. Spookier still, some of the pipeline code owners are rumored to be members of that dark wizardry collective that we talked about earlier. Dark wizardry.
And they were a little unenthusiastic about approving pipeline changes from just, you know, any Joe Schmo. In retrospect, it's easy to see why they were so cautious. After all, a single typo or misindented line could potentially take down the entire production pipeline in one fell swoop. Best to let sleeping dogs lie, I guess. Or is it? What if we could actually break up that gigantic YAML monstrosity into smaller, digestible chunks with clear separation of domain and ownership? Better yet, what if we could ensure that we don't bother reserving a bunch of CI resources on steps that are pretty much pointless to run after a critical step failed earlier in the process? Boy, do I have good news for you if you're a Buildkite user: it's already possible, thanks to their built-in support for dynamic pipelines. Although Buildkite agents do expect to receive pipeline instructions in YAML, you can supercharge your everyday pipeline.yaml file by converting it to a pipeline.sh script, which in turn can concatenate multiple chunks of YAML based on environment variables, build conditions, or even context from a commit diff. To the unsuspecting end user, this all magically appears in the Buildkite UI as a unified pipeline. Wow, magic, it's magic. Wow.

And magic it was. Using dynamic pipelines, our platform teams were able to define a highly reusable pattern for engineers building moderately complex pipelines. And because dynamic pipelines provide much more flexibility and modularity, they were pretty easy to share across teams, allowing neighboring teams to pick and choose the useful stuff their own team needed and skip the rest. Beyond the ability to programmatically stitch together pipeline steps on the fly, dynamic pipelines also provided a way to really unleash the power of YAML templating and DRY, or don't-repeat-yourself, principles. By combining YAML templates with the notion of a common.yaml file, which gets read in by pipeline.sh before any steps are called, we could store required environment variables and build configurations for multiple steps that needed to extract many of the same values over and over again. It also made it much easier to propagate a single change, like a plugin version, an image tag, or an agent queue specification, across numerous pipeline steps. That ever-elusive one-line change that our engineering friend had once sought in vain was finally a reality. Sorry it took a little while, pal.

Speaking of scripted pipelines, let's harken back to the snowflake era once more, where we finally recall not only the dark agents and their mythical configurations, but also the endless shell scripts cobbled together and closely guarded by each team. This is the story of "she sells shell scripts by the CI shore." You see, before we learned about the magic of dynamic pipelines, when we were left frustrated by the constraints of YAML whenever we needed to do complicated things in our pipelines, we turned to our old friend, the shell script. I'll call her Shelly for short. Oh, Shelly, she's a mess. Hey, be nice. By all means, Shelly is a well-meaning friend to have. She can spin up a quick and dirty solution in no time, and fix lots of narrowly scoped problems in an easy-to-write scripting language. Sometimes Shelly gets involved in some wild stuff, like that time she invited like six of her cousins over, and then they invited two uncles, who in turn brought their dogs, who are now awake and grumpy.
This is, of course, a hacked-together metaphor for the shell scripts that we used to hack together anytime we faced a moderate challenge with our build pipelines. More often than not, these shell scripts would call a secondary script, which in turn called another script, and so on. Yeah, layers of abstraction are best consumed in moderation here. Yeah, just like Shelly, better in moderation. Don't you know it. Not only were these shell scripts pretty difficult to troubleshoot, they were pretty hard to discover and share with others, too. Often undocumented, rarely versioned, these were basically bespoke collections of scripts that would sit around and slowly atrophy in various repositories, never really knowing how many commonalities these scripted workflows had with countless others nearby.

Enter Buildkite plugins. No, plugins are not a new concept in CI, but the way that Buildkite does plugins is pretty nifty. At their core, and somewhat ironically, they're actually really just fancy shell scripts, but they're much more powerful and organized than the ones that Shelly usually hangs out with. No offense, Shelly. Don't apologize to Shelly, she's the worst. You're right. Anyways, since all Buildkite plugins begin with a shell script entry point, there's a pretty near guarantee that your plugin will run on any CI agent it's deployed onto. Of course, if you wish to leverage higher-level languages, you can do that too, as long as the agent you're running it on has the necessary dependencies to support it. Using plugins allows us to centralize and track commonly requested pipeline functionality, version it, and create an easy path for anyone to improve and extend the plugin as needed. Additional parameters are simple to integrate, and they create a documentation trail of how those changes happened. Now, as a member of Wayfair's open source team, I'd be remiss if I didn't mention that there's actually a bunch of awesome Buildkite plugins which have already been open sourced by the broader community, and you should probably go check them all out at buildkite.com/plugins. I mean, at least stick around for the rest of this talk, but you know, at your next earliest convenience. And before you ask, yes, we are in fact in the process of open sourcing a few nascent Buildkite plugins that are currently incubating. No, like really, we have a project incubator for newcomer open source projects. It's actually pretty cute. Maybe I'll share the link at the end of this talk if you suffer through just a couple more anecdotes with us.

Speaking of suffering and anecdotes, we are onto my last one, and then Lelia will do her last one. Mine is called: in my day, we used FTP to deploy code. For this one, I need to take y'all back to the olden times. They called them the 2000s, when FTP was king, everyone used the LAMP stack, and the wheel had just been invented. So old right now. Yeah, I'm kidding. Back in these times, using version control was not as religious as it is now. It was common at Wayfair, consequently, to use basic protocols like FTP and SCP to keep machines up to date with code that one intended to work with or deploy. When the code was ready to be deployed, it was a manual process. We had engineers who managed the code deployments for the organization. They would use Jenkins jobs set up specifically to deploy, and normally they would do what they needed to do.
You could press build with a specified Git commit revision and branch, and Jenkins would download a copy and use FTP to transfer any and all files to the host machines that needed the changes. Jenkins could handle halting services, updating them, and bringing the services back up as needed. This works great in theory, but in practice, things sometimes go wrong. You could have stale pointers to services, file transfers might be interrupted, the bandwidth might be consumed by sending so many files while the machines are still doing business for the site, just to name a couple of things. When it worked, it worked, but if anything about this relatively loose-fitting process went wrong, Jenkins jobs would fail, or some machines would be corrupted because files got overwritten, and we could have serious problems. These issues were normally very difficult to reproduce, and they were very infrequent, but it is really hard, without observability tools like we have today, to identify when or which nodes were misbehaving. And in most cases, it would take about as long as the deploy itself to make sure that the thing you deployed was actually working. None of that is ideal.

We've taken many steps since those days to fix the practices that we had. Going through every single thing that we made better would take a really long time, so I just wanna highlight the things that I think you should take away if you're dealing with the same problem, which are automated deployments and using configuration management. We graduated to deploying with managed configurations, trusting open source products like Puppet to keep state. And the nice thing about Puppet was that we got this nice view of all of the machines and what their state was at any given time. Then if a machine had a problem, we could turn it off until we had a chance to investigate it, using the Puppet master. We had already built some of the goodwill with the infrastructure teams, like we had built with development teams, so we were able to work together to create a continuous deployment model using Kubernetes and Buildkite. That's a really in-depth story that deserves its own anecdote, and maybe its own talk, and it's more tooling than I make it sound like just saying it that way. But let me get to the point. This isn't just a partially funny pain in the butt for us as maintainers and a platform team: business leaders and developers don't like unreliable deployments, and they don't love being roped into deployment failures that might not have much to do with the code that they wrote. If we can make that experience better for operators, developers and business leaders alike, then we improve the state of our business entirely, basically by making our deployments better. That's not the end of the continuous deployment story, though. I wanna hand off to Lelia for our last anecdote.

Thank you so much, Gary. You're so cool. It is indeed time for our final tale: create a new app in just 25 easy steps. Gather around, folks, and let's take a walk back to the somewhat distant past, and up this kind of sketchy staircase. You see, in the early days of Wayfair engineering, we invested heavily in developing a cohesive monolith that could be shared and used by all. This worked really great for a while, since everything was self-contained and no one had to worry about how it all worked under the hood.
So it kind of made sense that in the monolithic world, we didn't really need to have an incredibly well-paved path for building out net-new, fully decoupled applications. That said, if you were one of those pioneering spirits who really did need to start fresh, the end-to-end process for creating a new application from scratch looked something like this.

Step one. All right, well, let's start coding our application up from scratch. Sorry, there's not really templates available or patterns to work off right now, but maybe Stack Overflow can help. All right, well, you know, a couple of thousand lines of code later, but totally crushed that first step, feeling good. All right, so you actually want to run your code now, not just write it. All right, cool, cool, makes sense. Better get started requesting a development VM with all the proper dependencies installed. Woo-hoo, request is granted. That was fast. Now, hold up. Not only are you saying you want to write your code and run it in a developer environment, you also want to deploy your application somewhere? Sheesh, you better spin up a new request for an application VM, stat. Oh, all right, congrats. It took a few days, but now you have your freshly commissioned VMs. But wait, how are you going to automatically manage their configurations in case something happens to them? Do you miss writing Ruby code? I hope so, because you might be about to file some Puppet PRs. But don't worry. Just in case you thought this would be the same process as filing a standard PR, it's actually a little bit different. There's even a separate review tool you get to use. But I mean, how bad could it be? It's only going to get better from here, right? Wrong.

Now that you're a fully indoctrinated puppeteer, welcome. Let's talk about networking. No, not the social kind. I'm talking about that old-school cool networking: routing and allow lists and DNS galore. Hopefully you studied up, because you're going to need to anticipate pretty much precisely what requirements your application has for networking up front. And pray you get it all right on the first ticket, or else you might get stuck in an endless Jira recursion loop. Oh boy, 20 Jira tickets later. Hey, so with all this newfound knowledge you're gaining, what's one more subject area to gain a little expertise in, right? Yeah, that's right: storage. How big is your app? How big do you think your app can get? Does your app require other resources that will impact its storage needs? Well, pick wisely, because you might not be able to resize later. Also, that'll be another ticket, for a different team, with a different SLA than the last two.

Oh, okay. Feels like it's taken almost a month, but I think we're finally out of tickets now. Wait, how do we actually let anyone know about our new app? Do we have to register it somewhere to ensure it gets deployed? I mean, I've heard these vague whispers about some all-powerful Jenkins job that builds artifacts for distribution, but honestly, I'm not even sure which of the many Jenkins instances we have I'd even begin my search in. Oh, the dark wizardry looms again. And so did several sorry days go by, before we eventually and happily stumbled across a somewhat battle-worn colleague who'd recently survived this very process. Dark wizard? Indeed, and in a stunning display of bravery and magic, he helped us finally get that application registered and deployed before, you know, spiraling off into the night. But I hope he's doing well wherever he is. Wow, okay. That was a lot.
And it was a lot, understandably, for our engineers, especially newcomers who didn't yet know the intricacies of where to find this information or how to file these requests. Over time, creating a new application developed a bit of a reputation for being time-consuming, confusing, and opaque. It also possibly incentivized many engineers to stay in the monolithic code base, where they wouldn't have to worry about any of this stuff, rather than step out into the brave new world of microservices. And I can't say I blame them, but we knew there had to be a better way. And thankfully, soon enough there was. What started out as a side project on our Python platform team was this humble CLI program called Mamba. Oh, wait, hold on. I just got that. It's Mamba, and it's a snake. And Python is also a type of snake. Snake. So they're like snakes. That's, it's like a theme, yeah. It's like a theme, it's like a bit. You're so smart. Yep, thank you.

Mamba eventually grew into one of the most widely used pieces of platform software at Wayfair. The beauty of Mamba lies in its simplicity and accessibility. At its core, Mamba is an interactive CLI tool which helps users make informed decisions about their new application, all while abstracting away those garish layers of tedium that we crawled through moments ago. It quite literally reduces a process that used to take days, and maybe a couple of tears, down to mere minutes, and ends on the instant dopamine hit of visiting your shiny new project on GitHub right away. Plus, once you're there, a fully functional and dynamic Buildkite pipeline loaded with plugins awaits. And yes, even those pesky Kubernetes configs have been scaffolded for you. Although Mamba originally only worked for Python apps, it gained so much popularity that other teams began contributing application templates for their own languages and different patterns, and the ecosystem continues to grow each day. Now, these days, it's pretty rare for me to stumble across a newly created service that doesn't use Mamba as its starting point. And honestly, that's not even the real power of developing platforms that reduce friction and centralize complex logic. Turns out, building tools that provide users with a common framework and a shared understanding eventually creates a community of engineers who are infinitely better equipped to help stuck coworkers, collaborate across teams, or even move departments in the highly unlikely event that their team implodes. Highly unlikely. RIP.

All right. That was a big whirlwind of times that we learned from our own mistakes. I'm sure that there were plenty of good lessons and things that you picked up throughout this presentation, but I wanna take a moment to reflect on the themes that helped us improve and the things that we saw throughout. We found that throughout our journey at Wayfair, intentionally focusing on these ideals made us more successful as a platform engineering team, and ultimately it's why Wayfair continues to grow and why we continue to deliver engineering excellence. We care about UX a lot. We care about making sure that as developers work at Wayfair, they have a strong experience, whether that's working on open source or developing an application. That's something that's definitely permeated our culture, and I feel like every team takes a personal stake in it. We care a lot about developer leverage as well, as in making every action by a developer have much more impact relative to their time invested.
Plus, we care a lot about code reusability, so that folks aren't continuously recycling the same effort in the organization again and again, and can more easily share their learnings with other teams. And lastly, we care about infrastructure stability. Specifically, we learned a lot about build infrastructure stability. There were a lot of great infrastructure teams at Wayfair that helped make this possible, so that we had a foundation to create good build infrastructure, and honestly, the better it is, the more users will trust us as they use it.

That's the end of the presentation. I hope you enjoyed it. I wanna emphasize that none of these anecdotes would have been possible without dozens of dedicated and talented Wayfair colleagues working with us to make them happen. Just because we're talking at a conference doesn't mean that we are taking all the credit. We're taking all the glory and notoriety, which isn't the same thing. We want them to have the credit so that if something breaks, then they get blamed for it anyway. Hi, I've been Gary White Jr. Hi, I've been Lelia Bramuso, and don't forget to check out our site featuring the OSPO and projects currently incubating at Wayfair, and be sure to keep in touch with us using the GitHub and email links on the slide. Thank you so much for your time, Open Source Summit. We'll see you in the Q&A.