Oh, it's working. Hey everybody, I'm Antoine Grondin. I'm a software engineer at DigitalOcean. At work, we have a chat bot that serves as a glossary. If you're on our chat, you can type "what is Antoine" to figure out what I am. So this is what I am: a French-Canadian DigitalOcean engineer with the crazy idea to get rid of all his worldly possessions to live on a boat and sail the world. Other things he's passionate about: meetings, meetings, and more meetings. Long term, he'd like to sell his boat, which he doesn't have yet, to start a colony on Mars. But don't worry, I didn't come here to talk about selling all your worldly possessions, living on a boat, and then establishing a colony on Mars.

No, instead, I want to talk about something relevant. I want to talk about failures. I want to help us learn to accept them, accept that we lost the war. The computers have won. We need to give up trying to fight failures. We've already lost. We just need to accept it. We need to celebrate failure, because that's the only way we will survive. We can't defeat failures. So let's talk about a method to embrace failure in a systematic manner when designing software systems. It's called crash-only software.

So what is crash-only software? It's a catchy term that describes a way of coding and organizing your infrastructure. Maybe this will all sound like a crazy idea, but trust me, it's not. In a crash-only system, failure doesn't throw your system into chaos. In a crash-only system, the equilibrium is toward progress and consistency.

But before we go deep into this topic, I need to verify something. I need to make sure that you really gave up and abdicated to our computer masters. Maybe you think you're different, that you haven't lost the war against failure. So let's do a small test. Imagine that, for some crazy reason, you gave me access to all your production systems. And then what I do, as any normal person would, is randomly kill -9 your services every 10 seconds. Is progress still occurring in your system? Are you sure? Assume I'm doing this while you're sleeping in the middle of the night. Will you need to urgently wake up? Do you feel confident that you could be like, "it will fix itself," and still have your job in the morning? Will you lose important data? Will your system remain consistent? If you answered no to any of those questions, then I'm sorry to announce that yes, in fact, you have also lost the war against failure.

So far, you've felt safe because nobody's going to go into your production system and kill -9 your services. Anyway, I'll have to tell you the bad news. If it's not me kill -9'ing your services, it will be something else. It will be a disk failing, or simply corrupting your data. Someone will pull the plug on your server, or your process will segfault. A network route will be blackholed, and then your database will become unreachable. Your logs will fill your disks, and then all your writes will start erroring. And what will happen to your code when all its writes error? I don't need to get into your production system and kill -9 your services for failures to happen. You can't prevent failures. You can try to push failure into corners. You can try to do that by adding a bunch of failure-minimization code, but here's the catch: that code adds complexity, and as you may know, more complexity equals more failures. So the more you try to fight failures, the more failures you end up creating. It's a self-reinforcing loop.
You should give up like the rest of us and embrace our new masters, because in the end, the only way to win is to choose not to fight. Here's what we know: failures will happen. Since they're going to happen, we might as well be pessimistic and assume that they'll happen all the time. But how do you design a pessimistic system like that while minimizing complexity? Well, I'm glad you asked. If code is what creates complexity, and complexity creates failure, then it's easy: just write less code.

So here's where crash-only software comes into play. So far I've only shouted "crash-only software" without really explaining what it is. I guess at this point we've all agreed that failure is inevitable and should be celebrated, and we thank the good computers for being such good captors. Or at least let's pretend we agree on this, otherwise I'll feel stupid here on stage. So what is crash-only software? You write less code by only crashing. This collapses code paths by reducing and combining the possible states. Let me explain a little more what I mean by this.

A typical program that aims to be resilient to failure will usually look something like this: you have a normal boot sequence, start or recover, an arbitrary number of states in your program, and then a large variety of ways to stop. But we code with a purpose, executing business logic. That means not all the paths in your program are treated equally. In particular, the part that receives love is usually the happy path. Of all the ways your program could run, only one sixth of it is really given much attention. And this is normal: we focus on the sixth that matters most, the path that happens all the time. That's all good, except that only a very small number of the possible execution paths are frequently tested. This leaves us with an awfully large number of ways for things to go wrong. All those unexpected, unusual execution paths are lurking in the dark. They're waiting for you to close your eyes and sleep before striking you when and where you're most vulnerable.

So write less code. Reduce the number of states you have. This leaves you with far fewer execution paths. There's one happy path, one way for things to happen, one way for things to go wrong, and it's tested every time you start. Your recovery code is not a secondary path anymore: it's how you start. You live for failures. You embrace and celebrate failures fully. You do this by taking your normal start code and your recovery code, and then ditching the normal start code. Always recover. In the same way, take the many ways your program can stop and collapse them all into crash events. Don't try to gracefully shut down; just crash, core dump.

And when I say don't gracefully shut things down, I mean it. Graceful shutdowns are evil. They make you feel all warm and fuzzy inside because you feel you're being a good citizen. But you're not. You're participating in the collective denial. We lost the war. Computers have won. Failures can't be avoided. Your graceful shutdown code is not helping you. It's a vestige from a past where you didn't understand that resistance was futile. It's from a time when you didn't yet know that all this graceful shutdown code was just contributing to more mysterious ways for your code to fail. Graceful shutdowns give you a false illusion of safety. They're evil because they lie to you. They're counterproductive. They add complexity by adding yet more failure modes.
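To make the shape concrete, here's a minimal sketch of a crash-only component in Go. Everything in it is hypothetical (the journal path, the recoverState helper), but it shows the idea: there's no separate first-boot path and no signal handler for graceful shutdown. Recovery is the only way in, crashing is the only way out.

```go
package main

import "log"

type state struct{}

// recoverState rebuilds in-memory state by replaying a durable journal.
// A fresh install is just the degenerate case: replaying an empty
// journal. The path and journal format here are made up.
func recoverState(journalPath string) (*state, error) {
	// ... replay the write-ahead journal, discarding torn entries ...
	return &state{}, nil
}

// serve processes requests until something goes wrong.
func serve(s *state) error {
	// ... business logic ...
	return nil
}

func main() {
	// The one and only startup path: recover.
	s, err := recoverState("/var/lib/myapp/journal")
	if err != nil {
		log.Fatalf("recovery failed: %v", err) // crash; a supervisor restarts us
	}

	// The one and only shutdown path: crash. No SIGTERM handler, no
	// graceful drain: anything we'd lose by dying right here must
	// already be safe in the journal.
	if err := serve(s); err != nil {
		log.Fatalf("crashing: %v", err)
	}
}
```

Because the recovery path is the startup path, it gets exercised on every single start instead of rotting in a corner.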
So collapsing state paths is one thing you need to do in your components to simplify the failure modes. But what happens when you have many components? What makes a crash-only architecture? A crash-only architecture is made of crash-only components, and those components interact using the following contract. Number one: servers will try to process requests, or crash. Number two: clients will send requests until success or timeout. Number three: most components will be stateless. Number four: non-volatile state is pushed outside of your app into a crash-only data store. Number five: components will distinguish between volatile and non-volatile data. Number six: components will select proper data stores based on volatility needs. And number seven: components will communicate in a specific manner.

Let's talk about that last point some more, because it's one of the most important parts of a crash-only architecture. We talked about the crash-only component: it has only one way to start, recovery, and only one way to stop, crashing. Now, the other important bit is that components with this recover-crash life cycle need to talk with each other in a manner that works given this behavior. Between crash-only components, requests must be self-describing. That means each request will have a time-to-live field. This field says how long the server has to respond before the answer becomes irrelevant. It's a deadline. Also, each request will have an is-idempotent flag. This says: if the operation failed the first time, it's safe to try it again and again.

So a request looks something like this. When a request fails and it has been flagged as idempotent, servers return a retry-after field. This should be an estimate from the server of how long it will take before it's ready to try again. If the server crashes as a result of the failed request, that duration could be something like "I'll have restarted in 500 milliseconds." This retry-after field is used by clients to decide when it's a good time to retry. But it's important that you don't mistake this duration for an absolute value: retries must always use randomized exponential backoff. Otherwise, you'll have clients thundering your servers exactly when your servers are least ready for the load. In fact, even if you're not using a crash-only architecture, retries should always use randomized exponential backoff. And every time you try to acquire something like a lock, it has to be leased. Every resource acquisition needs to be leased, which means it has a deadline.

Here's what I mean more specifically. Say you have a user that wants to pay for something. This user is quite impatient: they'll wait at most 300 milliseconds before they get bored. While in real life 300 milliseconds is quite low for a whole user-perceived response time, this is just a ballpark number. Anyhow, this serves as our deadline. So our server receives a request and starts working. It's a microservice architecture, so it will ask other servers to help it out. In this case, it takes 5 milliseconds to decide what to do next, and then decides that it needs to authenticate the user before starting to process the payment. So it contacts the authentication server and tells it: we have 295 milliseconds left, and the request to authenticate is idempotent. You can retry authenticating multiple times; that doesn't matter. Now, if our authentication server explodes before it can answer, that's alright. Our server can retry this request again.
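As a sketch, here's what such a self-describing request and its client-side retry loop could look like in Go. The field and function names are my own, not a standard; the point is the deadline, the idempotency flag that gates retries, and the randomized exponential backoff layered on top of the server's retry-after hint.

```go
package client

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// Request is self-describing: it says how long the answer stays
// relevant, and whether it's safe to act on it more than once.
type Request struct {
	Deadline   time.Time // the time-to-live: after this, the answer is irrelevant
	Idempotent bool      // the is-idempotent flag
	Payload    []byte
}

// Response can carry the server's retry-after hint, e.g.
// "I'll have restarted in 500 milliseconds."
type Response struct {
	RetryAfter time.Duration
	Body       []byte
}

// send stands in for the actual RPC.
func send(req Request) (Response, error) {
	return Response{}, errors.New("server exploded")
}

// do retries an idempotent request with randomized exponential backoff
// until it succeeds or the deadline expires.
func do(req Request) (Response, error) {
	backoff := 10 * time.Millisecond
	for time.Now().Before(req.Deadline) {
		resp, err := send(req)
		if err == nil {
			return resp, nil
		}
		if !req.Idempotent {
			return Response{}, err // not safe to blindly retry
		}
		// Treat retry-after as a hint, never an absolute value, and
		// add jitter so clients don't thunder back in lockstep.
		wait := backoff + resp.RetryAfter + time.Duration(rand.Int63n(int64(backoff)))
		if time.Now().Add(wait).After(req.Deadline) {
			break // a retry that can't finish in time is worthless
		}
		time.Sleep(wait)
		backoff *= 2
	}
	return Response{}, fmt.Errorf("deadline exceeded")
}
```

With that sketch in mind, back to our impatient user.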
But when it retries, it doesn't have 295 milliseconds left anymore. It has 250 milliseconds left. Across the whole interaction, deadlines are passed from parent request to child request, and they're updated along the way, because what matters is how long the user is willing to wait.

In some other cases, we have requests that are not idempotent. For instance, charging a customer's credit card. So here, our server asks the payment service to charge our user. That payment service then turns around and contacts a payment gateway. Before actually charging the card, it knows that it can ask multiple times for a charge token, so that specific request is idempotent. Meanwhile, it also propagates the deadline down. Each request carries the deadline along and respects it. Now that we have a charge token, we send a request that is not idempotent: we have 150 milliseconds left and we want to charge the token. And we're done charging the card. We respond to our front-end service, which responds to our user, who is happy to have spent money with us.

Now, we've said that there are only two kinds of client requests: the idempotent kind and the non-idempotent kind. Why is that? Arguably, in a perfect world, we would like not to care about whether something is idempotent or not. We'd say: hey, please, just do this action, will you? Or: hey, please, just do it once. In technical terms, we'd call this exactly-once delivery. We want a request to be acted upon exactly one time by exactly one server. As a client, I want to send exactly one message to exactly one server. But here's the problem: that simple idea is impossible. There are only two alternatives: at-least-once delivery, or at-most-once delivery. Like the names suggest, they imply that you'll send a message one-to-many times, or zero-to-one times, but never exactly one time. This has direct consequences on how we can design a client-server interaction, and this is why a crash-only architecture dictates that each request must have an idempotency flag.

Let's see why by looking at at-least-once delivery. In at-least-once delivery, a message is idempotent. When a server fails to act on a request, we retry the request again, maybe against the same server using retry-after, or maybe against another server. We keep doing this until we hit our timeout deadline. The message is idempotent, so we can rest easy: if multiple servers end up acting on our request, the result will still be okay.

In at-most-once delivery, your message is not idempotent. If a server fails to act on our request, the only sane thing to do is to roll back whatever actions we did so far. This works best when you're working with transactional data stores.
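Here's a minimal sketch of that rollback shape, using Go's standard database/sql package. The chargeCard helper and the charges table are made up for illustration; the pattern is just: do your local work inside a transaction, and if any step fails before commit, roll everything back so it's as if the request never happened.

```go
package payments

import (
	"context"
	"database/sql"
)

// chargeOnce attempts an at-most-once charge: either the whole thing
// commits, or we roll back and the caller sees a clean failure with no
// partial state left behind.
func chargeOnce(ctx context.Context, db *sql.DB, userID string, cents int64) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	// Rollback after a successful Commit is a no-op, so deferring it
	// gives us "undo everything" behavior on any early return.
	defer tx.Rollback()

	// Record the charge locally, inside the transaction.
	if _, err := tx.ExecContext(ctx,
		`INSERT INTO charges (user_id, amount_cents) VALUES ($1, $2)`,
		userID, cents); err != nil {
		return err
	}

	// The gateway call itself cannot be rolled back, which is why the
	// idempotent part (getting a charge token) happens before this
	// point in the example above.
	if err := chargeCard(ctx, userID, cents); err != nil {
		return err // the INSERT above is rolled back: zero charges
	}
	return tx.Commit()
}

// chargeCard stands in for the non-idempotent call to the gateway.
func chargeCard(ctx context.Context, userID string, cents int64) error {
	return nil // hypothetical: talk to the payment gateway here
}
```

The one step you can't undo is the external call itself; that narrow window is exactly why you front-load the idempotent parts and keep the irreversible part as small as possible.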
But when we decide that a request is not idempotent, that automatically implies that we're going to pick the at-most-once guarantee. And I'll tell you right off the bat that this path is not a fun one. It means that when things don't work out, most of the time it's a worse experience for the user on the other side. In many, many cases, though, you can invert this path and pick at-least-once by changing a little bit how you build things.

For instance, at DigitalOcean we have this product we call a Droplet. A Droplet is a virtual cloud server backed by SSDs. When a user asks us to provision a Droplet, we want to make exactly one Droplet. However, like I explained earlier, that's not possible. So the alternatives are: we make zero-to-one Droplets, or we make one-to-many Droplets. If we pick zero-to-one and we fail, our user gets no Droplet at all. That's pretty bad in my opinion, because maybe our user is under load, needs to provision extra capacity ASAP, and is losing money because they don't have that capacity. They need that Droplet. If we pick one-to-many Droplets, the story is much better for our users: they get their Droplet as they wanted it. The problem is that we may be provisioning extra Droplets in the background, and Droplets are a costly resource: they chew on precious, precious SSDs. But I think that's a better bargain for our users. We can always come back in the background and clean up the extra Droplets, and meanwhile our user got the best experience. So that's an example of trading off between at-most-once and at-least-once delivery.

There are cases where the trade-off goes the other way. Say, for instance, you're charging a credit card. Likely you don't want to charge your users one-too-many times for a product; that'll make them quite angry. Maybe a better deal is to accept that you'll sometimes lose some money. The trade-off here is purely a business decision: do we want to make all our money and possibly alienate our users, or do we want to have a good relationship with our users at the expense of higher revenues? In this case, I think at-most-once delivery is the right business decision. In the end, the trade-off between at-most-once and at-least-once delivery is often a business decision, but that decision must be made at every step of your architecture, and it translates directly into your is-idempotent flag.

So that covers how communication between crash-only components is done. Requests have a deadline after which they're irrelevant, and they say whether they're idempotent or not. Servers try to process requests, or crash. Clients can handle servers crashing, because they know what can be retried and what cannot, and they know how long to retry for, given the deadline they were provided.

Now let's talk about state. In our world, there are two types of state: volatile state and non-volatile state. Volatile state is what you produce while you're doing your work, saving partial variables or partial results somewhere. It's state that is not authoritative. It can be useful state, like a work queue, but it's state that can be lost and reconstructed. When you write crash-only components, you try to avoid state.
Stateless stuff is great, but that's not always possible. So when it's not possible, the next best thing to have is volatile state, because volatile state can be lost and it doesn't matter. The more important and less fun state is the non-volatile state. To pick another DigitalOcean example, that's like a user's Droplet. We don't want to forget that the user owns a Droplet. We need some authoritative, sacrosanct data store that stores the truth and knows that this user owns this Droplet. Non-volatile state is state that you can't reconstruct. This is the state you would need if you decided to shut everything down, turn it back on, and still do business. Non-volatile state is super important, and it's the state you want to be conservative about. You want to keep it in a few very deliberately chosen places. You don't want to spread the ownership of that state across a bunch of microservices just because everything needs to be in a microservice.

Non-volatile state goes into crash-only data stores. Data stores need to have the property of at least being crash-safe. A crash-safe data store is one that will remain consistent even if it crashes in the middle of a write. The difference between a crash-safe and a crash-only data store is that a crash-only data store will be faster at recovery: it will be designed to recover very often. As two examples: in the RDBMS world, a database built around a write-ahead log can be crash-safe, and in the embedded key-value store world there are data stores that can be made crash-only. I'm not going to go into details about which data stores are crash-only or how to configure data stores to be crash-only. What's important is that you understand the concept: non-volatile state goes into crash-only data stores, and crash-only data stores are those that recover quickly and keep your data safe across a crash. Usually that means the data store is built around some kind of write-ahead log.

So maybe this sounds great, but you're wondering: am I just saying words? Arguably, everything I described here might look great, and hindsight is what it is. Not everything I've ever put my hands on is crash-only, and at a startup like DigitalOcean it's not always possible to do the right thing every single time. Still, whenever we can, my colleagues and I try to apply this principle. I really like the crash-only concept because it keeps me happy and productive. I wrote a lot of critical production services and I like to sleep at night. So far I'm doing okay, because I've been careful to make stuff crash-only, or to encourage friends to do so. And I think it has paid off. So here's the real story of a simple system we built on this principle.

Like I said earlier, DigitalOcean has this virtual server product called Droplets. Droplets have disk images: for instance, Ubuntu 14.04 is a disk image, and if you take a snapshot of your Droplet, that's a disk image. Through the lifetime of the company, we've changed how those disk images are stored. That means we need to be able to migrate from one format to another. So we need a system that converts disk images: a disk image converter. Here were the requirements when we started building this. First, we have millions of Droplets and even more snapshots, and thus we need to convert many millions of images whenever we change format. When we change from one format to another, there is a period of time when both formats exist in the system. That means the code that makes Droplets needs to know about the two formats.
That's quite annoying, because it means we're running legacy code along with our new code. An alternative would be to convert all the images ahead of time and then do an atomic switch in the code. But that would mean we need to store two copies of each disk until this atomic switch is done, and that's quite a lot of wasted space. So realistically, we need to store two formats in production for some time, and we want to make that time as short as possible so that we don't need to run legacy code for too long. Converting millions of disk images takes time, and it's quite repetitive. Also, it's not exactly our core product. It's something we have to do, but we're not in the business of converting disk images; we're in the business of providing scalable infrastructure. So we don't want to have to watch our image converter too closely.

Given those requirements, millions of images that must convert ASAP and must not page us in the night, we came up with a crash-only design. First question: is at-least-once delivery, the easiest path, acceptable for us? Yes. We can live with the fact that sometimes an image may be converted more than once; we don't really care about that. It wastes some time, but it's a lot better than forgetting to convert a customer's image and losing that Droplet's data.

The source of truth in our system is stored in a crash-only data store. This is where we store "this user owns this disk image." We don't really touch that state; we just use it to build our queue of disk images to convert. That volatile state is stored in Redis, which is not a crash-only database. We're just using it because it's easy to work with. And we store this volatile data in a fancy sort of schema. First, we compute all the disk images that need to be converted. We put that in a queue in Redis, and we can always reconstruct this queue if need be. Then, also in Redis, we store disk image jobs in a hash to represent a job lease that has a time-to-live. And then we have a bunch of image converter workers. They look for jobs in the queue that don't have a related lease.

So here's how an image conversion happens. Step one: the worker creates a lease on a job. This lease has a time-to-live, say 10 seconds. Then, while it's doing the conversion, the worker frequently refreshes the lease. When it's done converting, it deletes the lease. What happens if a worker dies? Its lease will expire, and because its lease expired, another worker will be free to pick up the job again later. What if a worker completed a job but failed to delete it from the queue? Then the job will be performed again, and that's okay. What if Redis crashes and loses all our state? That's also okay, because it's all volatile state: we can reconstruct it from the primary data source.

So what was the outcome of building this system? When issues arose, like some conversions failing, we'd get paged by operators, and then we'd tell the operators to just mute the page, everything's going to be alright. The process manager will restart the workers and Redis, and everything will recover on boot. The system is designed to converge toward progress. It assumes that jobs will fail often and that workers will die randomly, and it's designed to handle the recovery of jobs as a normal event, and crashes as normal events too.
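To make the lease dance concrete, here's a sketch of a worker in Go. I've written it against a tiny made-up LeaseStore interface rather than a specific Redis client, but the three operations map directly onto Redis commands (SET with NX and a TTL, EXPIRE, DEL). It only works because conversions are idempotent, which is exactly the at-least-once bet we made above.

```go
package converter

import "time"

// LeaseStore is the handful of Redis-ish operations the workers need.
// Any store with an atomic "set if not exists, with expiry" will do.
type LeaseStore interface {
	SetNX(key string, ttl time.Duration) (bool, error) // acquire
	Expire(key string, ttl time.Duration) error        // refresh
	Del(key string) error                              // release
}

const leaseTTL = 10 * time.Second

// runJob converts one image under a lease. If the worker dies at any
// point, the lease simply expires and another worker picks the job up.
func runJob(store LeaseStore, jobID string, convert func() error) error {
	ok, err := store.SetNX("lease:"+jobID, leaseTTL)
	if err != nil || !ok {
		return err // store hiccuped, or someone else holds the lease
	}

	// Refresh the lease while the conversion runs. If we crash, the
	// refreshing stops and the lease expires on its own.
	done := make(chan struct{})
	defer close(done)
	go func() {
		t := time.NewTicker(leaseTTL / 3)
		defer t.Stop()
		for {
			select {
			case <-done:
				return
			case <-t.C:
				store.Expire("lease:"+jobID, leaseTTL) // best effort
			}
		}
	}()

	if err := convert(); err != nil {
		return err // lease will expire and the job will be retried
	}
	// If we die between convert and Del, the job just runs again:
	// conversions are idempotent, so that's fine.
	return store.Del("lease:" + jobID)
}
```

Notice there's no cleanup path for a dead worker anywhere: the lease expiring is the cleanup.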
In the end, we never had to fix anything inherent to this design. There were sometimes bugs in the image conversion code itself, but this never led to any data loss or to an image never being converted. We'd fix the conversion code, and the converter would just restart, recover, and carry on. And it's still in use, pretty much unchanged. New conversion methods exist, but the architecture itself is unchanged, because it just works and we're happy about it. It's great when you write something that performs well, doesn't wake you up at night, and is just a solved problem.

So here are the key points about this story. We were deliberate about what the non-volatile state was, and we kept that state in a crash-only database. We stored our volatile state in a crash-unsafe DB that we used more like a message bus. All our resources were leased. Our requests were self-describing. And we made the trade-off between at-least-once and at-most-once very deliberately.

So this is crash-only software. Is it a panacea, the solution to the ultimate question of life, the universe, and everything? Obviously not. There are a bunch of other important considerations that go into designing a distributed system that's resilient to failure. Here are a few caveats; let's call them things that are still important.

Circuit breakers are still important. Using a crash-only architecture doesn't mean you should stop using resiliency patterns like circuit breakers or other ones. Fallbacks are still important; they're inherent to the retryable, idempotent request idea. Error recovery is still important. What crash-only helps you do is understand how you will recover from errors, and it classifies two ways to do it: retry, or roll back. Degraded modes are still important. If your circuit breaker opens, or if your request to a back end completely fails, you should still consider whether serving a partial response or skipping some steps is a viable solution.

We're hearing more and more that microservices and cheap virtualization mean that servers should not be treated like pets, but instead like replaceable cattle: herds that you manage at scale without particular attention to each individual. Crash-only software implicitly means that servers are cattle. But crash-only doesn't mean crash and never look at why you crashed. You still want to log your errors. You still want to know what's breaking so you can fix it in a future release. Crash-only software makes it simpler to live with errors. Crash-only doesn't mean you shouldn't debug crashed components, especially in today's world of cheap virtualization. This is a bit more involved and could be a talk in itself, but I'll just say this: if you can take a snapshot of your crash-only component at the step where it's failing, and then send that somewhere for later investigation, you should do it. Restart and recover. A good way to do this is to crash by core dumping, and then ship your core dumps to a central location. Crash-only goes along with error reporting.

Some people think that cattle-versus-pets and crash-only software mean that we don't care about failures. We still care. Crash-only software goes hand in hand with all those strategies. Crash-only software is, in my opinion, an overarching philosophy that helps guide your opinions on smaller-scale patterns. When you look at more tactical patterns like circuit breakers or graceful shutdowns, you can judge them on their merits: circuit breakers don't go against the grain of crash-only, but graceful shutdowns do.
So while crash-only software is not a free lunch, you still have to take care of a bunch of other concerns, it's a good step toward the free lunch.

So, in conclusion: crash-only software is great. You don't need to wake up for every little thing anymore; stuff will just keep making progress. You will be able to stare at failures, look them in the eye, and be like: this is fine. You'll feel good about it. And when things go wrong, you'll be able to remain confident that nothing is irreversible. Your system is not becoming all corrupted; things are falling and failing, but they're all in a recoverable state. You can fix this. Your code will be simpler, because you'll care about only two kinds of startup and shutdown: recovery and crash. Oh yeah, and you don't need to wake up for every little error anymore. I don't know if I said that already.

If you'd like to explore more about this topic, there's a bit of literature around. Specifically, George Candea and Armando Fox did research on dependable systems in the early 2000s, and that's when they coined the terms "recursive restartability," which I didn't cover here, and "crash-only software." More recently, LWN published an overview of the topic, and there are a few blog posts around on it. I find it interesting that the topic was researched by Candea and Fox, who then seem to have moved on to entirely different areas, and when you research the topic 15 years later, you find some people here and there blogging about it sparsely, discovering that yes, this pattern they had in mind but didn't know what to call actually has a name. I think this idea is not on people's minds enough, and that's why I talk about it. I think it's an important tenet for anyone working with more than one computer today.

So that brings an end to my talk. I hope this helped introduce you to the idea of crash-only software. You can't avoid failures. We've lost the war against failures, and computers have won. You can only bend to their will and accept that failures are here to stay. We can't prevent failures, but we can learn to manage them and live in harmony. Thank you. I don't know if I still have time for questions. I do. Are there any questions? Yeah, do you have a mic?

Hope I am audible. Most of the software written in the last 15 years, or maybe more, assumes that the system is 100% available. What if it's 99% available?

I don't think it assumes that systems are 100% available. Take a cluster of 100 nodes: if one node is not available, it crashes, right?

And the node which is unavailable is unavailable for an extended period of time...

I'm sorry, I can't hear your question.

Out of 100 nodes, one node is unavailable for a long period of time. Do you get that?

Can you speak into the mic? I can't understand.

Is it better? So, when we say crash-only software, there won't be a shutdown or a startup script; it just goes, it runs, it goes away. So out of 100 nodes, one node is not available. In the last 15 years I've been working in systems, most people say all your systems should be 100% available, everything should be working all the time, right? 99 is a good number, I guess. So my question is: is 99 a good number?

It depends on what your business needs are. I don't know if you're aware, but Google just published a book, the Google SRE book, and they explain a lot about the trade-off around how many nines you want to have versus how much it's going to cost you, and about the error budget that you choose to match. And up to a point, even if your systems are 99.9999%
reliable, it doesn't matter, because your users are behind a phone that is going to lose one in every 100 requests. You have nine nines of reliability, and your user can only see one of them, because their internet is bad, and you're wasting your money.

Thanks. Not a question.

Hey, hi. I just wanted to ask: crash, as an attribute or characteristic of a piece of software, is it more dependent on deployment, the way you deploy it in production or in environments, or do you still consider that the way you write code would make it crash-resilient? Over the last 15 years, as the software industry has developed, crashes have been handled by creating high-availability architectures. So how do you differentiate a high-availability architecture from a crash-only architecture?

So your question is: how to design a high-availability architecture using crash-only?

No, my question is: what is the exact difference you see in designing a highly available architecture versus a crash-only architecture? Isn't crash, as a characteristic of a piece of software, majorly dependent upon the way you deploy the software in production?

No, it's not dependent on how you deploy software. It's about how you organize components to talk together, and it's meant to achieve availability. That's what you want when you use this pattern. So it's not in contradiction with high availability; it's not against it. It's a technique you can use to achieve it. Does that answer your question?

Partly. So let's go back to an era when MySQL or Postgres were not highly available. For a database, or a component that is distributed in nature and serves as a middle ground for other components to talk to each other, I presume this architecture makes sense: say you're writing a database, a message queue, a message bus, or a broker. But for a solution which is an end service to the customer, which is leveraging those components, do you really think that specific piece of software should have these characteristics?

Yeah, I think so, because you can't do without failures. You're going to have failures at some point. You want to serve a request to your customer, and somehow something is going to happen, and as much as possible you need to have strategies to handle those failures.

So, I agree with that. In the industry I've worked in, take for example an application talking to a database, and the database is not available. The high availability, or the crash-resilience of that piece, is being handled by the way your application talks to the DB: you have floating IPs which roll over from master to slave, so the application doesn't see an impact.

Yeah, but if you're in the middle of sending a query, and the database is in the middle of answering it when it crashes, and then you fail over, you move your floating IP to a new database, then your client, what does it do? It just stops there. You need to choose what it's going to do. Is it going to retry? How do you know if you can retry? You need to know if this request is idempotent from the beginning. It's when it fails that you need to decide what you're going to do, because you cannot say that your database is never going to crash in the middle of a write. Because if I go in, I pull the plug. Anyway, we can talk more about it after. Are there other questions? Okay, one more in the back.

Sorry, the mic is not working... Hello? Audible? Can you talk about what tools you specifically use to monitor? Like, how do you
write the apps, what monitors the apps, what monitors that? If you use containers, what monitors the containers, and the thing that monitors that, and what monitors that, and what monitors that?

So, I alluded to this a little bit at the end, when I talked about the research by Armando Fox and George Candea on dependable systems. This notion of recursive restartability explains how you have groups of components that form some sort of tree that recursively restarts. I didn't explain that here, but if you want to find out more, there is actual research on this topic. Otherwise, there are things like Erlang supervisor trees. Specifically talking about what we use at DO: pretty much all of us need one tool like this, and some are using something like supervisord, some are using something else.

So I was actually looking for the specific tools DigitalOcean uses for this job, if you could add more to that.

We're in flux. We're moving from some weird thing that's just a bunch of not-so-clean systems, and we're migrating a lot of our stuff to Kubernetes.

All right, thank you.