The session we're about to have is called Embracing Collaborative Chaos, and I'm delighted to welcome Linsep River on stage.

Great, thanks for the introduction, Krim. Hello, my name is Linse. For the last few years I've been leading a group of platform teams that develop and operate a very large platform as a service for a very large public sector organisation based here in the United Kingdom. In this talk I'm going to go through how we've used chaos days as a really useful approach for improving the resilience of our platform, learning more about how it handles failures, and improving our approach to failures so that they can be handled more gracefully. Right, let's get my slides up.

Let's start off by thinking about what we mean by chaos engineering and why it's an important thing to spend time on. Chaos engineering is really important when you're building IT services, particularly services that are really important for people to be able to access and that take a lot of traffic. That's further compounded when you're trying to do all this at speed. Here in the United Kingdom our government has been providing various services to try and help people that have been impacted by coronavirus, by COVID-19. The organisation I've been working for has been part of building these services; it's been a very exciting thing to do and we're all privileged to be part of it. These services were built in only about four weeks. They were delivered 10 days early, and one of the services processed 140,000 claims on its launch day. It did all that without any major incidents. I think one of the reasons we were able to achieve this is that, as we've evolved the platform this service is built on, we've paid a lot of attention to understanding how to make it more resilient, to understanding how it fails, and to using chaos engineering to help drive those learnings and improve that resilience.

So this government scheme is just one example of a service that has really benefited from chaos engineering. Let's look at another good example of why it's important to think about failure. When you're building something like Nest, for instance, it's used around the world and it relies on services that are interconnected around the world. It's very easy now, through cloud platforms and through things like agile and lean development techniques, to build something that's planet-scale really, really fast, and when you're building it you're dealing with components that hide a lot of the complexity from you. But underneath, those components have a large number of moving parts and a large number of things that are connected, as this diagram shows. You might think that if you're building something using Google's technology, they've probably got resilience covered, so what could possibly go wrong? Well, I'm not sure if people saw this just over a year ago: there was an issue with one of Google's networks, and that had a ripple effect. It combined with various other problems and meant that people using Nest couldn't use some of their smart home automation. People were tweeting that they couldn't turn their air conditioning on because their Nest wasn't working, because it wasn't communicating with the particular Google service it relied on. They couldn't get into their homes because they were using a smart lock.
This is a great example of how several failures can combine in a way that's catastrophic. When I see these incidents happening around the world, I always find it really interesting to read the incident report that really good organisations like Google tend to make publicly available soon afterwards. What fascinated me about this particular incident report was how it typified the importance of chaos engineering: it wasn't one single issue that caused this problem, it was actually two normally benign issues and a specific bug. They came together and caused a significant outage that had a significant impact.

The systems that each of us build, unless they're incredibly trivial, will generally fail in some kind of way. Failures are often happening in small ways the whole time in our systems; it's just that they're not combining in such a way that they cause catastrophic failure. I'd like you to spend a few moments thinking through a system that you're working on currently, or have recently worked on, and work out in your head: what are the different component parts that you have control over? How are those component parts connected? Are they all running on the same processor? Are they spread across network connections? Are they spread across the world? How reliable is each of those components? How reliable are the connections between them? What would happen if one component failed, or two components failed? What would happen if a component partly failed and there was latency in the connections between the various components?

What I'm hoping is that as you think through those things, you start to realise that the types of systems we're developing these days have a high number of connected parts, which makes them very complex and very difficult to predict exactly what will happen when things fail. This is where chaos engineering can really help, unless you're dealing with a simple system that has just a very small number of components and a small number of connections; I don't think many of us are working in that space these days. Most of us are dealing with large systems that have multiple components, multiple connections between them and many subcomponents, and often those subcomponents are hidden from us. Chaos engineering helps us understand these failure modes by deliberately injecting failure into the parts we want to understand more about, then observing the impact of that failure and reflecting on what it's teaching us about the way the system behaves under failure, about the way we can observe that failure, and about how our teams operate when those failures happen. Chaos engineering then helps us take those lessons and improve the resilience of the systems we're working on. It helps us build resilience into what we're doing. It also helps us get used to thinking about production the whole time, particularly about what could go wrong in production, because it's not a question of if something is going to fail; it's a question of when a failure is going to occur.
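To make that "deliberately injecting failure and observing the impact" idea concrete, here is a minimal, hypothetical sketch (not something from the talk) of injecting latency and errors into a call to a dependency; the `flaky` decorator and the failure rates are assumptions chosen purely for illustration:

```python
import functools
import random
import time

def flaky(latency_s=2.0, latency_rate=0.1, error_rate=0.05):
    """Chaos wrapper: randomly adds latency or raises an error,
    so you can observe how callers cope with a misbehaving dependency."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            roll = random.random()
            if roll < error_rate:
                raise ConnectionError(f"chaos: injected failure in {fn.__name__}")
            if roll < error_rate + latency_rate:
                time.sleep(latency_s)  # simulate a slow network hop
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@flaky(latency_s=1.5, latency_rate=0.2, error_rate=0.1)
def fetch_account(account_id: str) -> dict:
    # Stand-in for a real downstream call (HTTP, database, queue...).
    return {"id": account_id, "status": "active"}
```

Watching how your timeouts, retries and alerts behave around a wrapper like this is the small-scale version of what a chaos day does to the whole system.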
Resilience is a word I'm going to use quite a lot in this presentation, so let's just agree on a definition; I took this one from the online Oxford Dictionary. By resilience, we're focusing on systems that can recover well. We're not talking about systems that are indestructible, systems that will never fail, because the complexity of our systems these days is so high that failures are happening the whole time, and it's almost impossible to achieve systems that will never, ever fail. It's more important that we consider how we can make our systems more resilient, and how we can ensure that when failures happen, and when they combine in different ways, we can bring normal operation back as quickly as possible. We're trying to design systems that are elastic.

There are four different ways I've seen of using chaos engineering to help us achieve this. I'm going to step through a brief overview of them in a minute, and then we'll focus on the chaos day approach. The fourth one is not a very popular one to talk about, so I'll get to that in a minute.

Manual chaos covers things like chaos days, AWS game days, or change-specific chaos, where the engineers working on a system deliberately take time out to interfere with it, inject chaos in some kind of way, and then observe the response, learn from it and adapt the system as necessary. This can be done on your own system, through your own chaos days, or you can learn more about the concepts of chaos engineering and resilience engineering by going to one of the AWS-run game days, where you get to work as an engineer on an artificial system that they've constructed and that they start injecting failures into; that's a really fun thing to do if you've not done it before. Change-specific chaos is when teams look at each significant change they're making to the system and, as part of engineering that change, test out different failure modes.

By automated chaos I'm referring to systems and software that you have running, either continuously or frequently, against your system, and that cause things to fail in an automated way: things like Netflix's Chaos Monkey. I'd also count the use of cloud technologies such as AWS spot instances and GCP preemptible VMs. These are compute instances that don't have a guaranteed permanent lifetime: they will be terminated at some point, and the services running on them will receive notice. By using these types of compute instances, you're having to design the services running on them to handle the fact that they will be terminated at some point, and failing frequently like that is one way of making your services more resilient for when they run on other compute instances that fail unexpectedly. Another example of automated chaos is the picture shown at the bottom. We ran a hackathon day at this organisation, and one of the teams came up with a great concept: they took a Nintendo emulator and wired it up so that when you're playing Super Mario, if Mario happened to die in the game, a random Kubernetes pod would be killed. That's what you can see in the background there. So that's quite a fun way of semi-automating some of the chaos.
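The heart of that Mario rig is simply "pick a random pod and delete it". Here's a minimal sketch of that pod-roulette idea, not the team's actual code, using the official Kubernetes Python client; the `staging` namespace is an assumption:

```python
import random

from kubernetes import client, config

def kill_random_pod(namespace: str = "staging") -> str:
    """Delete one randomly chosen pod; the deployment's replica set
    should recreate it, which is exactly the recovery we want to observe."""
    config.load_kube_config()  # or load_incluster_config() when running in-cluster
    v1 = client.CoreV1Api()
    pods = v1.list_namespaced_pod(namespace).items
    if not pods:
        raise RuntimeError(f"no pods found in namespace {namespace!r}")
    victim = random.choice(pods)
    v1.delete_namespaced_pod(name=victim.metadata.name, namespace=namespace)
    return victim.metadata.name

if __name__ == "__main__":
    print(f"chaos victim: {kill_random_pod()}")
```

In the hackathon version, a function like this would presumably be triggered by the emulator's death event rather than run from the command line.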
In-process chaos engineering is what I think teams need to try and evolve to. Just as we helped achieve continuous delivery in many organisations by thinking about quality, shifting quality left and building quality into what we're doing, the same shift can be achieved with chaos engineering. If each person on the team, not just the DevOps engineers or platform engineers but the product owner, devs, QAs and business analysts, is thinking about failure, and about how the system should react to different failure modes, throughout the whole engineering process, then that will improve the way the system performs in production and improve its resilience.

The fourth way of using chaos, which we generally don't talk about that much, is unplanned chaos. I'm sure for some people watching this talk, your systems are rock solid, they never fail, and you never have production incidents. Well, production incidents unfortunately are a fact of life for most of us. They can be viewed very negatively and they can cause a lot of stress, but I think it's really important that each time you have a production incident, you see it as an excellent opportunity to learn. You can improve the way you learn from production incidents by structuring your team's reaction to them, and over time you can also use things called post-incident reviews; there are various templates and literature about how to run post-incident reviews well. This is a growing field, by the way. It's something that we as an IT industry don't do very well yet, so it's still something we need to learn a lot more about. But if you have a production incident, then getting everyone that was involved in that incident together afterwards and just stepping through what happened (what did people observe, how did they know to dig into particular areas?) is a really great way of learning more about the system you're working with, and the knowledge the engineers and product owners and QAs gain from it can really help improve the resilience of the overall system. There are various ways of trying to automate how incidents are managed and the data that's captured through incidents, and all sorts of tools are starting to emerge in the marketplace. Those tools are fantastic, but I would encourage you to start simple and look at simple ways you can improve how your teams respond to incidents: look at things like your communication channels, how people become aware of an incident, and how people create a timeline and step through it afterwards.

Chaos engineering isn't just about trying to build things that are more elastic, more resilient. I think it's primarily about education. It's helping us as an industry, as a group of engineers and product owners and testers, to understand more about how our actual systems work. It's a bit like we're a medical team and there's an alien in the operating room that represents our system: we haven't got a clue what's wrong with it or what happens when we do something to it, and real IT systems are so complex that they're like that. Chaos engineering helps us build up that knowledge of how our system behaves, what normal looks like for our systems, and what happens when things go wrong. It also helps train people's behaviour, so that when problems do occur they have more of a reflex action to know what to look at, who to talk to and what steps they can take to remediate the problem.
Next, chaos engineering can help with the process. It can ensure that teams have a much clearer set of steps to go through when an incident occurs. It can also improve the breadth and depth of the lessons we take from incidents, by helping us do incident reviews well. Finally, it can improve our engineering process: through using chaos engineering, as we're specifying a change, speccing it out and then actually doing the engineering on it, we can start thinking more about failure, about how we can make that change more resilient, and about improving the overall resilience of our system.

Having addressed people and process, only then do we start thinking about how chaos engineering can be applied to the product, but it can help in many ways. Chaos engineering can help us develop systems that are simpler to reason about; if you have an overly complex system, then when an incident is happening it's extremely difficult to work out which problem is triggering other problems, and what the different signals or logs you're seeing really mean. Chaos engineering can help you improve the observability of the system by ensuring you have the maximum signal-to-noise ratio. It can also help with your documentation: if an incident happens and the engineer that's on call and responding is not so familiar with the system, they'd ideally want a run book that can really help guide them in how to respond, so chaos engineering can improve how we write our run books and the kind of information we put in them.

Okay, we're now going to move on to the mechanics of running a chaos day: firstly, how do you decide when to run one, and what are the steps you go through? A bit of context for the organisation we've been running chaos days in: it's a very large public sector organisation with over 100 million customers. The organisation has grown significantly over time; it's about six years old. We started off with just two delivery teams, and they were building the platform as they went along. We didn't design this layout from day one; it's how it's evolved. As it grew, we've now reached about 60 teams, and those 60 teams are supported by six platform teams that provide the underlying capabilities: build and deployment services, observability (so logging and monitoring), alerting, auditing and persistence. Those are the platform capabilities. It's a very, very busy platform in terms of the number of requests that go through it. The numbers on the left-hand side were from our busiest day last January, and that was pre-COVID, so I think you could probably double those numbers now that we have new services helping large numbers of people in the United Kingdom receive benefits related to COVID; we get a lot more traffic now.

We chose a very constrained set of technologies when we started this platform, and that really helped onboard new teams and ensure that teams could land really quickly on the platform and develop services very fast. Our focus when we started was on getting the platform right for growth: optimising it to get teams on and delivering public-facing services really quickly. We were also thinking about resilience, and we helped improve it by going from one cloud provider (which wasn't Amazon, because Amazon wasn't available in the UK at that time) to two cloud providers, just to ensure we had a spread across cloud platforms. I wouldn't recommend doing this now, by the way. With platforms such as AWS and GCP (I don't have any experience with Azure, so I can't comment on that), certainly I know that just using one of those cloud providers, their resilience is high enough that I'd be happy investing in just one of them as a cloud platform. But we didn't have that luxury, so we chose to go multi-active.
Now, through this period of time we were having a lot of production incidents. They were a fact of life, and they made the lives of the people that were on call fairly miserable. Because of that, we knew we weren't ready to start doing chaos engineering: we had so much free chaos already, in the form of unplanned production incidents. Then finally AWS came along in the UK and we migrated to it; we moved our entire platform across to AWS. We also achieved quite a significant change in our process, in that the organisation gave us permission to allow every team that worked on the platform to deploy to any environment, including production, whenever they wanted. This was quite a big achievement, and it gave the platform teams, who at that point were doing all production deployments for all services, a lot more capacity. The combination of those two things made us think: okay, we're on AWS now, we know we've got a cloud infrastructure that is much more resilient than what we had previously, and we have more capacity to think about how we can improve and measure our resilience. Let's do a chaos day. So that's the timeline that took us to our first chaos day.

If you're wanting to run your own chaos day: who do you involve, where do you run it, and how do you step through it? The approach I'd recommend is that the people on the chaos day who are actually going to be breaking things (we call them the agents of chaos) are a closed, secret team. People knew who was in the team, but they didn't know what they were planning. For our first chaos day we took one person from each of our six platform teams, and we asked each team to volunteer their most experienced person: the person who's been on the platform the longest, the person you always go to when there's a production incident. What we wanted to simulate through doing this is what's called the bus factor: imagine that person had been hit by a bus and then your platform goes down; how are you going to cope? So for the chaos day itself, these people were unavailable to their teams. Not only were they planning the experiments to run (they had good knowledge of how to plan these things and which key areas we wanted to explore), they also weren't going to be available to the teams that were responding.

Having identified who these agents of chaos are, the next step is to map out the system you're going to be experimenting on. The best way is just to gather around a whiteboard, and when you're doing this you can involve the whole team; you don't need to constrain it to just the agents of chaos, because at this point we're just wanting to understand what types of things we want to learn from this chaos day and to build a big list of possible experiments. We'll let the agents of chaos pick the ones they're going to run with and design them in a way that's secret. Having the whole team involved in looking at the system, mapping it out, understanding what normal looks like in terms of its behaviour, understanding which components we wouldn't feel confident about if they were to fail, and which components we have no control over, is really valuable. For instance, if an AWS availability zone went down, you can't do much about that particular kind of problem, but you should hope that the way you've configured AWS means your instances are spread across multiple availability zones, so you would cope with it.
Doing this in front of the whiteboard, gathering lots of ideas about the things people are worried about (the things where, if they went wrong, people wouldn't know how to respond), is a helpful approach. I found that as a facilitator for this session, my job was to say as little as possible apart from setting the constraints, giving people a clear goal, and then leaving them to get on with it. It's a bit like what happened with Apollo 13, when the engineers had to get together and figure out how to deal with the problem of the rising carbon dioxide levels in the Apollo 13 module.

A bit of structure helps, and the structure I found useful to provide is a template. We used Trello, so we had a template Trello card for each experiment or failure we wanted to run. Once we had brainstormed a big list of failures, the agents of chaos did the detailed thinking on each one. What are we actually simulating here; what's the real failure we want to learn more about? What do we think will happen; what's our hypothesis on how the system and the team will respond? How do we think things are going to recover: will there be any self-healing, or do we think an alert will fire and people will notice and fix it somehow? And how will the failure be rolled back? That last one is a hugely important thing to cover, which we forgot about, or didn't think about, on our first chaos day. When we're running chaos experiments, we found it useful to timebox how long we run an experiment for. We'll introduce the failure and give it maybe 20 minutes or half an hour; if the team haven't noticed, we might give them a nudge, and if the team notice but are struggling, we'll probably give them another half hour. It's quite demoralising if you've got a production incident and you're just completely stuck, so, to get through as many experiments as possible and get learning from a diverse range of experiments, we found it useful to have a rollback step, which the agents of chaos invoke if the team is really stuck: we just magically put everything back to normal, and normal service resumes.
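If it helps to see that card structure written down, here is a sketch of the template's fields as a data structure. This is an illustration of the card just described, not our actual Trello setup, and the field names and example values are my own:

```python
from dataclasses import dataclass, field

@dataclass
class ChaosExperiment:
    """One card on the agents-of-chaos board."""
    real_failure_simulated: str  # what real-world failure this stands in for
    method: str                  # how the agents of chaos will inject it
    hypothesis: str              # how we expect the system and team to respond
    expected_recovery: str       # self-healing? alert fires? manual fix?
    rollback: str                # how to magically put everything back
    timebox_minutes: int = 30    # nudge the team when this expires
    observations: list[str] = field(default_factory=list)  # filled in as it runs

experiment = ChaosExperiment(
    real_failure_simulated="loss of the primary database node",
    method="stop the database process on node 1",
    hypothesis="failover completes within a minute and on-call gets paged",
    expected_recovery="automatic failover to the replica",
    rollback="restart the process and rejoin the node to the cluster",
)
```

Forcing every experiment through the same fields is what makes the rollback step, the one we forgot on our first chaos day, hard to skip.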
Because we're generally running several different experiments in our chaos day, you need to think about how those experiments might interweave. Sometimes it's useful to have multiple experiments running at the same time, just to simulate what can really happen in production when multiple failures occur together. Sometimes you'll want to serialise your experiments. For instance, taking out your alerting or your monitoring is a useful experiment to do, but you don't want to run lots of other experiments at the same time, because the teams responding will be pretty blind without that monitoring and alerting stack, and you're not going to get as much learning from doing those two things at once. Quite a different type of chaos experiment you should also consider running on a chaos day is having any security engineers, or people interested in security on your team, run some security attacks. This is a great opportunity: whilst people are busy trying to fight production incidents, will they notice, amidst all of that chaos and confusion, if someone is trying to get into the system, or steal data, or take someone's credentials? If you've got people that can add some security experiments, it's a good idea to have those as well.

Having decided who's going to be doing it and what the experiments are, the next thing to consider is which environment you're going to run it in. Netflix are quite famous for running their chaos experiments in production, and I really strongly emphasise that it doesn't have to be production; you learn an awful lot from doing things in a pre-production environment. What I think is really important is that whichever environment you run it in is as close to production as possible, so that you're learning what would happen if it were production. By close to production, I mean you need to be able to simulate traffic going through that environment in the way you would see it in production: enough traffic so that when a failure occurs it has an impact, and the failure causes alerts to fire, for instance. That means you need production-like alerting, telemetry, logging and monitoring. We're really fortunate that on this platform we have cookie-cutter environments, with the same logging, alerting and monitoring in each of the production and pre-production environments; for a chaos day we just need to tweak the alert thresholds so that, at that environment's level of traffic, a problem would cause an alert to fire. Once you know which environment you're going to work in, you also need to consider the blast radius. For instance, if you've got other organisations that depend on your pre-production environments, or wherever you're running your chaos, do you want them to be impacted by the chaos you're running, or do you want them protected? It's fine either way, just so long as, if they're going to be affected, they're aware of what you're doing. Finally, you all need to agree how you're going to simulate a production incident from the perspective of communications: what Slack channel or Teams channel, what way of communicating with the people that use that environment and the people coordinating the response to an incident; where are you going to have those discussions? Again, you want this to be as close as possible to what you do in production. I'll show you an example of that in a second.

Okay, so you know who's going to do it, you know what they're going to do, and you know where they're going to do it; you then need to think about the actual execution. Each time we run a chaos day, we ask ourselves whether it's useful to give people warning that this is going to happen. We've also moved from having just one chaos day to running a week of chaos, and this allows us to strike a balance. We found it useful, and perfectly respectful, to warn people that use our platform that we're running a chaos day, but we've increased the element of surprise by changing it to a chaos week: we give people notice that in this week things might go wrong, so be prepared for that and don't plan any super-critical work in that week, because you may be delayed. When you're picking these dates, it's important to know whether there are any key organisational events or deliveries that need to happen, or that will use that platform or environment in that period, because you probably want to avoid those; chaos days can be quite disruptive. It's also important to agree when you're going to stop wreaking havoc. When we did our first chaos day, I think it took us two days afterwards to put the environment back together.
Following that, we all agreed we'd stop wreaking havoc at about half past three or half past four in the afternoon and roll everything back; people are much happier if you do that.

The agents of chaos need their own private way of communicating, so if you can't colocate them, which is pretty difficult these days, ensure they've got a private Slack or Teams channel. Ensure they have a Trello board with those experiment cards prioritised, so they can see who's running which experiment, whether people are responding to it, and whether it needs to be rolled back. It helps to have someone facilitate this, just so the team don't get obsessed with one experiment and run it for too long, and can actually move across several experiments. It's super important that you self-document what you're doing, and we find it really useful to do this with a mixture of comments on the Trello card and threads in something like Slack: each time you run an experiment, post in Slack that you've started it, and then add comments about what you're seeing. Documenting chaos in this way makes it really easy to draw out valuable lessons when you later go back and review what happened, because Slack is effectively creating an automatic timeline for you of what happened when.
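As a concrete illustration of that self-documenting Slack timeline, here is a hypothetical sketch using the slack_sdk Python package; the channel name, token variable and messages are assumptions, and this isn't our actual tooling:

```python
import os

from slack_sdk import WebClient

client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])
CHANNEL = "#chaos-day-agents"  # the agents' private channel (assumed name)

def start_experiment(title: str) -> str:
    """Announce an experiment and return its thread timestamp."""
    resp = client.chat_postMessage(channel=CHANNEL, text=f":boom: Started: {title}")
    return resp["ts"]

def log_observation(thread_ts: str, note: str) -> None:
    """Append an observation to the experiment's thread, building the timeline."""
    client.chat_postMessage(channel=CHANNEL, thread_ts=thread_ts, text=note)

ts = start_experiment("stop database process on node 1")
log_observation(ts, "14:05 replica promoted automatically")
log_observation(ts, "14:12 on-call paged; team investigating elevated 500s")
```

Whether you script it or just type the messages by hand, the point is the same: one thread per experiment gives you a ready-made timeline for the review afterwards.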
For everyone else working on the platform that's going to be impacted by the chaos, it's very important that they work like it's just another ordinary day at the office, or an ordinary day working from home. They should be busy doing whatever they normally do, because you don't know when a production incident is going to happen, do you? They come when you least expect them; they come when you're busy doing other things, and that's how people should be on the chaos day. However, when something does go wrong, even if it's in a pre-production environment, they need to treat it as though it was production itself that was on fire. In terms of communication channels, on our platform we have a public issues channel per environment: one for staging issues, one for QA issues, and, for our production environment, one for live issues. We agreed that if we were going to use staging, then we would use the staging issues channel as the public communication channel, where the platform teams responding to the pretend production incident would communicate with the rest of the platform. They then followed their own processes, using other channels, to communicate on the details of their response.

Having run the chaos, which gives you loads of learning by the way (just by stepping through that day, people learn a lot about how their systems work), we then need to dig in a bit more, draw out more lessons and share those lessons more widely. How do we do that? When you're running chaos days on large platforms (we typically had six-plus teams, and about 10 teams involved in our chaos day), getting 10 teams of 10 together to do a retrospective is really difficult. I would recommend that you divide up your retrospectives, or your post-incident reviews, into your component teams. Ask each team to go away and run a post-incident review or a retro about the chaos day, treating each failure they dealt with as a real post-incident review and stepping through that process; that's a way of improving the process itself. They should focus primarily on the lessons about resilience they can draw out from the chaos day: what can we use to improve how people understand resilience, what can we improve about our process, and finally, what can we improve about the product? Then, secondly, consider the actual mechanics of the chaos day itself, because it's a good idea to run these things regularly: what would we do differently to make the next one even better and improve the lessons we get from it? Having had each team do that themselves, regroup: bring someone from each of those teams into a team-of-teams retro, where they can put forward the key lessons from their team retros. Have this all documented as well, and shared across the organisation as widely as you can; there's so much good information in these post-incident reviews and retrospectives that it's important to distribute it, so other engineers that didn't participate can learn from it.

Having done this quite a few times now, in a few different organisations, we've learnt that the first time you do a chaos day, start small: start with just one team and a few services, because there's a lot you learn about running chaos days themselves through doing this. If you try to scale up to multiple teams on your first chaos day, it can create a lot of pain that isn't very helpful. Talking of pain: don't break too much on your first chaos day, or even on subsequent ones. You want to run maybe between five and ten experiments, and that's enough to get a whole load of learnings, because it's a bit like five or ten production incidents all happening on the same day. You don't need to use a production environment; you can use something that's as close to it as possible, and that should make you consider how production-like your pre-production environments are. It's really important that they are production-like; if they're not, you've probably got a bunch of engineering you need to do there first. It'll be very tempting, as you're running through chaos days, to come up with a big list of improvements: all the new alerts we need to put in, the run books and dashboards, the circuit breakers we need here. While some of those improvements can be useful, hold back on coming up with that list. Focus first on what lessons you can draw out about your system and about resilience from the chaos day, and as you're going through those lessons in the retrospective, note down somewhere separate the things you might consider as improvements to the product. Then, a few days later, go back and review that list, because you don't want knee-jerk reactions of circuit breakers and alerts; the chances are the alerts you put in may never fire again, given the special nature of some of the experiments you ran. Finally, these days can be a lot of fun, so use the chaos day as a way to help people that perhaps don't normally work together learn more about each other, and enjoy being in control of your own chaos.

Having listened to this, there are some things you might want to take away to your organisation. I'd encourage you to think about how established the systems you're working on are: are they ready for applying some kind of chaos engineering? What's the smallest thing you could do to improve your systems' resilience? This is something we help other organisations out with, so feel free to get in touch; there'll be links on the slides that I'll share after this session.
We've also taken the things we've learned from running chaos days and put them into publicly available open-source playbooks, not just on chaos days but also on things like secure delivery and working remotely, and I'll share links to those as well. Right, okay, thank you so much for listening, and we can now move on to some questions.

The first question is from Sambhavi: what has changed in chaos engineering practices over the years; could you share a bit more about where we are in the maturity of this practice? Certainly. I think chaos engineering is currently a bit like where continuous delivery was maybe 10 or 15 years ago. I remember seeing a talk by John Osborne, who helped introduce continuous delivery at Netflix; he did a presentation about it at an O'Reilly conference about 10 years ago, and after he talked about why continuous delivery was a useful thing to do, some people came up to him and said, "You're irresponsible, trying to encourage organisations to deliver in this way." I think the same response can be made to chaos engineering: some people, when you tell them you should try to break things in your pre-production or production environment and see what happens, might think that's irresponsible. That was probably a few years ago. I think now people are realising that failure is such a certainty that it's useful to be deliberately invoking it, and our systems are so complex now that deliberately invoking failure is the only way to learn how our systems respond. So I think we now recognise that this is a useful thing to do, and there's increasing tool support, particularly things that work with different clouds or that automate chaos. It's not yet as established an engineering practice as, say, test-driven development or pair programming, so it's not something that many teams have at the very start of their process yet; I think we're probably another 10 years off that. I hope that answers your question, and shall we move on to the next one?

Do you measure the cost of running chaos days? We don't measure and report on it. We know how much it costs in terms of how many people are involved: the cost is the agents of chaos, their time planning things and then actually running the experiments. For the rest of the teams, bear in mind that we're encouraging them to work as on a normal working day, so if your organisation is set up really, really well, then most of each team will be unaffected, and it will just be whoever is on support that day having to deal with things. But if you're trying to break things big style and your team is poorly set up to handle those failures, then often the whole team might get involved. We don't find any benefit in measuring that engineering time and reporting on it, because we've asked ourselves a question, and you can ask your organisation the same question: would you rather spend time injecting controlled chaos and learning how to respond to it, or would you rather wait for a critical business event to happen and for things to go wrong at that point? Because the cost you incur when it's a key business event that takes your platform out is the reputational risk and the reputational cost, and that is orders of magnitude more than the engineering cost. So I don't think measuring the cost of chaos days is a helpful thing to do, because the benefits to me are clear.
The next question is: does the Pareto principle work in chaos testing as well? Pareto is the 80-20 rule. When the team is starting off, we brainstorm the kinds of experiments they want to run, and we typically get the whole team involved in doing this, because through getting the whole team involved, they learn more about the system. One person will say, "I know that when this component fails, it will ripple out in this kind of way," and another engineer may not have known that; or one engineer might say, "I'm really worried about what will happen if this component fails; I can't predict what will happen," and there might be another person that has the answer to that. When we run this, we typically get 30 possible experiments that we want to run, and I encourage people to prioritise them by risk, or worry: what are the things that, if they happened tomorrow in production on your busiest day, you would be most worried about? We go for those. And you get a lot of learning, I think, sometimes regardless of which particular experiment you pick, because when a production incident happens, some things are common across all production incidents, such as how your engineers diagnose a problem, what kind of alerting and logging they get from it, and how they communicate; those things transcend different experiments. So I think just by choosing a small number of experiments, say between five and ten, you will learn a lot, and prioritising by risk, or worry, is very effective; you don't have to cover massive amounts of ground.