Hello, everyone. Welcome to this session, where we're going to discuss some ideas that will, hopefully, enable us to build things and increase our velocity while keeping the security and safety of our users' data in mind. We all know that the demand for speed and innovation is everywhere. Users nowadays no longer want to wait weeks to get a bug fixed, a new feature shipped, or the UX improved. The faster you can ship features to your users, the better the retention and engagement, and that means better numbers for your business, obviously. However, that enthusiasm to move fast can create issues, and sometimes bugs. It doesn't sit well with security, especially traditional security, which favors the status quo. The push to go fast can create challenges from a security perspective: processes get skipped in the name of efficiency, and so on. And we all know of companies that bragged about moving fast and then hit a roadblock because they forgot about security. Once security is breached, the trust between you and your users is lost. You can spend a lot of time trying to rebuild that trust, but unfortunately it's gone. So DevOps is definitely desired: it enables us to move fast, increase velocity, ship things faster, and make our users, our managers, and everyone happy. And security is definitely required: we want to keep the data of our users and our customers safe. That's what DevOps brings to the table. What DevOps basically tries to do is save a seat at the table for everyone. It gives developers the freedom to ship things faster, add new features, fix bugs, and deliver business logic as fast as possible. It enables operations, once the developers have generated the artifact, to push it to production as soon as possible.
But it also saves a seat at the table for the security people, to make sure that the software we release is actually safe and secure. So you've probably figured out by now what this talk is going to be about: it's basically about Formula One. How many of you have watched Formula One: Drive to Survive? Well, quite a crowd. Cool. Very briefly, Drive to Survive is a Netflix documentary series that shows you behind the scenes of Formula One in a dramatic, very engaging way that keeps the suspense high. It's really cool. Now, motorsports in general are very dangerous. If you have been to a race before, you know it says "motorsport is dangerous" on the tickets, on the badges, and on the t-shirts everyone wears in the pit lane. Making Formula One safe is a multi-layered, multi-generational task that the FIA, the governing body that oversees Formula One, keeps working on. Back to the documentary: while watching Drive to Survive, I saw this accident from the 2020 season. Romain Grosjean, a driver from the Haas team, tried to overtake and lost control of the car. The car instantly hit the barrier, turned into a fireball, and split in two. Now, by any measure, this looks like a death sentence, a fatal accident that probably nobody would survive. But what happened was actually a miracle. Not only did he survive, he managed to get out of the car on his own. He stayed in there for a couple of seconds, but then he climbed out by himself, safe and sound. He had a couple of scratches, but nothing serious. And that got me thinking: the Formula One engineers must be doing a lot of things right to keep the driver safe, even in such dramatic and tragic accidents.
And that led me to the question: what would happen if we started to design our systems, our architectures, our software with the same level of devotion that the Formula One engineers, and everyone else involved in the motorsport industry, put into designing the cars, the races, the pit lanes, and everything else to make sure the safety of the driver is the number one priority? Now, what I noticed throughout the documentary is that they don't actually mention security much. They talk about safety instead. And that got me thinking, because I'm not a native English speaker: what's the difference between the two? When I looked in the dictionary, I found that security means the state of being free from danger or threat. Now, in the software industry, we know that once we put an application on the internet, it's no longer free from danger or threat. There are millions of malicious actors who will try to attack your system, break it, steal your users' data, or simply use it as a staging point to attack other systems. Safety, however, means the condition of being protected from danger, risk, or injury. And that is what we're actually trying to achieve. We know that once our system is on the internet, some malicious users will try to break it. But we try to minimize the damage: we try to keep the application up and running, and even when an incident happens, we try to limit its impact. So if we want to design good architecture, we should enable everyone in IT, developers, security people, operators, to move as fast as possible, but in a safe manner. Now, rest assured, I'm not going to propose DevSafeOps, or whatever the name would be.
But I did embark on a journey to check what safety measures the Formula One people have introduced to ensure the safety of the drivers, and whether there are lessons and best practices we can adopt in the software industry as well. In this talk, we're going to look at ten measures. Five of them are pre-crash measures, the things they ensure before the driver starts racing or before a crash happens. And five of them are post-crash measures, the things they do once there is a crash. And checking the data, it actually works. The first graph shows the number of deaths compared against the introduction of new measures. You can see that as the FIA introduced more safety measures, the number of deaths actually decreased; the last fatal accident was in 2014, at the Japanese Grand Prix. The number of days since the last fatal incident has also grown as more safety measures were introduced. Now, you could probably assume that the Formula One people, who do everything to ensure the safety of the driver, would have an IT department inspired by their own car engineers. It turns out, not really. Even people who excel in safety and security from a race-car-engineering point of view struggle to make their IT systems secure. But nevertheless, moving on. As I mentioned, the first five are what we call pre-crash measures, the things the Formula One people do to ensure the safety of the driver before a crash. The first one: the seat belt is a six-point harness which can be released by the driver with a single hand movement. But the harness can't be fastened or tightened by the driver alone; he needs help from the pit-lane crew.
And it's uniquely tailored to each driver, to fit his shape and ensure a little comfort and a lot of safety during the race. But during the race, if anything happens, the driver can click a single button and the harness is released. This is the equivalent of push-to-release, or push-to-deploy, in our industry, and it can be achieved through automation. Automation is really important in the software industry. Things like checking for vulnerabilities bring great value when they are automated, because they free up the security team to focus on other things. We can automate manual tasks, giving our developers and ops people more time to bring more value and business logic to the product. Moving on. The second thing they do is run stringent dynamic, static, and load tests to ensure the safety of the drivers. That means they test the car in the different conditions it could face on race day: when it's windy, when it's super hot, when it's raining. They collect the data and check the safety of the car. The equivalent for us in the IT sector is to have a trusted, repeatable, and, most importantly, adversarial CI/CD pipeline. That means we test our application in the same conditions it's going to be deployed to and released in for our users. And "adversarial" is really important here: it means listening to everyone and making everyone's voice heard during the discussion, the developers, the ops, and also the security people, not only after development but throughout the whole process of releasing the application. Another thing we can do is canary deployment. With canary deployment, instead of releasing a new version to the whole user base, you select a subset of users.
It can be 1%, 2%, whatever you prefer. You then release the new version to only that subset of users. What happens is that a fraction of the traffic is directed to the new release, while the majority of the traffic still points to the old one. That lets you effectively test the behavior of your new release in production, with metrics that matter to you, error rate, latency, whatever measure is important, and if those degrade, you roll back before it impacts more users. The third thing they have is that the car is built around the cockpit, the survival cell, which is a formidable crash-protection structure. This is how it looks. They build the entire car around this structure, whose sole purpose is to protect the driver during the race and during an incident. This is the equivalent of designing for failure. We, especially developers, should no longer design our applications only for functionality; we should also design them to be fault tolerant. Whatever architecture you are using, especially in distributed systems, design with fault tolerance in mind: failures in the database, in the network, whatever could break your application. Another thing is mutual TLS: making sure the traffic circulating inside your cluster is actually encrypted. There are many ways to do that, for example a sidecar proxy like Envoy, or other open-source solutions that have probably been discussed at this conference. It is really important, especially in distributed systems, where both the client and the server are services.
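As a small illustration, not tied to any particular service mesh, the mutual-TLS requirement can be expressed with Python's standard `ssl` module; the function name and the optional certificate-path parameters are my own assumptions for this sketch:

```python
import ssl

def make_mtls_server_context(certfile=None, keyfile=None, cafile=None):
    """Build a server-side TLS context that REQUIRES client certificates.

    With verify_mode = CERT_REQUIRED, the handshake fails unless the client
    presents a certificate signed by a CA in `cafile` -- the essence of
    mutual TLS: both sides must prove their identity.
    """
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    ctx.verify_mode = ssl.CERT_REQUIRED   # reject unauthenticated clients
    if certfile:                          # the server's own identity
        ctx.load_cert_chain(certfile, keyfile)
    if cafile:                            # CAs trusted to sign client certs
        ctx.load_verify_locations(cafile)
    return ctx
```

In a real deployment this configuration usually lives in the sidecar proxy rather than in application code, so developers never touch it.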
So mutual TLS ensures that no connection is permitted unless both parties are verified, which is really important from a security point of view in a distributed architecture. A third practice: we can adopt a micro-segmented architecture. That means we put the services responsible for accessing the data as far away from the internet as possible, and put façades and proxy services in front, facing the internet. So if an attacker manages to get access to one of the internet-facing services, our architecture essentially deforms itself to protect the data of our users, in the same manner as the car deforms itself around the survival cell, and that keeps our data safe. The fourth measure: before they race, drivers must demonstrate that they can get out of the car within five seconds. And they actually test that. If you can't get out of the car in under five seconds before the race, goodbye, you don't race that day. This is not just designing for the worst case; this is testing for the worst case. Imagine an accident happens and the driver is still conscious: he should be able to get out of the car as fast as he can. So they test this scenario before any accident actually happens. The equivalent for us is what we call chaos engineering. In our systems, there are things we know, and things we know we don't know well. But what scares us the most are the things we don't know we don't know about our systems, and those are where the major incidents come from. Chaos engineering is a practice that enables us to test in production, because that's where the fun happens: that's where your users get affected and where most of your issues live. Bit by bit, you can uncover the unknown unknowns, reduce them, and check how your system behaves when issues occur.
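To make that concrete, here is a toy, self-contained chaos experiment: a stand-in dependency, the smallest interesting injected failure (a bit of extra latency), and a measurement against a stated steady-state hypothesis. The function names and thresholds are illustrative assumptions, not any real chaos tool:

```python
import time

def call_service(inject_latency_ms=0):
    """Stand-in for a real dependency; optionally injects artificial delay."""
    time.sleep(inject_latency_ms / 1000)
    return "ok"

def run_experiment(hypothesis_p99_ms, inject_latency_ms, requests=30):
    """Inject the smallest failure we care about (extra latency) and check
    whether the steady-state hypothesis still holds."""
    latencies = []
    for _ in range(requests):
        start = time.perf_counter()
        call_service(inject_latency_ms)
        latencies.append((time.perf_counter() - start) * 1000)
    latencies.sort()
    p99 = latencies[int(len(latencies) * 0.99) - 1]  # approximate p99
    return {"p99_ms": p99, "hypothesis_holds": p99 <= hypothesis_p99_ms}

# Hypothesis: even with 20 ms of injected delay, p99 stays under 150 ms.
result = run_experiment(hypothesis_p99_ms=150, inject_latency_ms=20)
```

If the hypothesis fails, you have learned something about your system before a real incident teaches it to you the hard way.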
You can start small and grow gradually. You start by forming a hypothesis about how your system should behave in case of failure: it could be a latency increase, errors, a database outage, or a network issue. Then you design the smallest possible failure to inject, define what you expect your system to do, and take measurements during the experiment. Once that's done, you will have increased the resiliency and reliability of your system, because you know how it behaves. As I mentioned, you can start small, introducing delays or errors, and then you can go wild, removing a whole cluster; you can do whatever you want. That lets you understand your system and its resiliency. And the fifth measure is constant monitoring and replacement of the tires. Check out this quick GIF: this is the world record for a tire change in a Formula One race, 1.82 seconds. And notice how happy and proud they are: they managed to change the tires in under two seconds. Compare that to our attitude towards the way we deploy our applications. We sometimes brag that our application has been alive for months, or weeks, or days. What would happen if we reduced the lifetime of our applications and constantly replaced them? Now, we are all familiar with the pets-versus-cattle analogy. With traditional servers, be it bare metal or VMs, we used to keep our applications alive as long as possible. We built personal relationships with our applications and our servers: we fix them, we patch them, we keep them up and running, and we even go beyond that and give them names, and so on.
However, with the rise of the cloud, we changed the way we treat our infrastructure to be more like cattle, which means that resources are disposable: we spin them up at will and we kill them at will. Now, if we accept this zoomorphic pets-versus-cattle analogy, what would happen if we pushed it a little further, to a chickens analogy? The time to reach maturity for a chicken is much shorter than for cattle: days or weeks compared to months. And that brings us to some other metrics that also matter when it comes to monitoring our services and their health. The first one is reverse uptime, and let's take an example. Imagine you have a cluster running mission-critical applications, and an attacker manages to get inside a node of your cluster. That is bad by itself. But say he manages to stay hidden and use this node as a base to attack other systems, internal or external. Now, what would happen if your node, or your application, constantly killed and restarted itself? It means the attacker would need to repeat the same process, breaking back in, every X amount of time, say, one day. Once the node, the container, whatever, is removed, he would need to start all over again. And the thing is, from a security point of view, you can't backdoor a system that is constantly being torn down, rebuilt, and replaced. So from a security point of view, it is actually great. And it is a great thing to combine with base image freshness, the freshness of the base image you use to deploy your applications. From a security point of view, consider what happens once you update the base image with an important patch, say for a Linux kernel zero-day security issue.
Then you know that the longest time your application stays vulnerable is exactly the reverse uptime. The shorter your reverse uptime, the faster the patched base image propagates through your cluster, and that enables you to fix the vulnerability fast. So those were the five pre-crash measures. Now, moving on to the post-crash measures: what happens once the car has crashed and is actually burning. The first thing: the driver can be extricated from the car by lifting out the entire seat. The modular design of the car means that if a crash happens, the driver has passed out, and they can't pull him out, they lift the entire seat out of the car. Compare that to a normal car, where, under stress, we would have to cut the car open and lose precious seconds or minutes of the driver's life. And modularity is really important in the way we architect our applications too. Think of how fast you can pull the plug on a key that has accidentally been pushed to a public repository, a certificate that has been leaked, a lot of things. The more modular your system is, the faster you can react in case of incidents. The second thing: the drivers wear a HANS device, which stands for Head And Neck Support, a system that absorbs and redistributes forces that would otherwise hit the driver's skull and strain his neck muscles. This is what a HANS device looks like. Basically, if a crash happens, instead of the forces going into the driver's head, which would probably break his skull or neck, the HANS device takes those forces and redistributes them across the driver's whole body. From a software engineering point of view, this is the equivalent of having an elastic architecture. And there are a couple of things we can adopt to achieve that, starting with load balancers.
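In its simplest form, that first building block, spreading requests evenly across nodes, reduces to round-robin selection. A minimal sketch, with placeholder node names:

```python
import itertools

class RoundRobinBalancer:
    """Minimal load balancer: spread requests evenly across the nodes."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self._cycle = itertools.cycle(self.nodes)  # endless even rotation

    def pick(self):
        """Return the node that should handle the next request."""
        return next(self._cycle)

lb = RoundRobinBalancer(["node-a", "node-b", "node-c"])
picks = [lb.pick() for _ in range(6)]
# Each node receives an equal share of the six requests.
```

Real load balancers add health checks and weighting on top, but the even-distribution idea is the same.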
If the load increases, load balancers evenly distribute the traffic across our nodes, and auto-scaling can cope with the increase in traffic by creating additional nodes to handle it. If some services start to misbehave, we should have request timeouts that cut the connection to save resources such as CPU and threads. In some cases, it's also good to allow for degraded performance; it obviously depends on the business, but in some scenarios a degraded service is far better than a global outage. And finally, we can adopt anti-overload patterns such as circuit breaking and exponential back-off, which help when some services start to behave erratically. The third measure: drivers wear suits that are fire-resistant and that keep the driver's body under 41 degrees Celsius even in extreme heat. This is how Romain Grosjean, the driver we started this talk with, managed to get out of the car: even though he was surrounded by extreme heat, his body stayed under 41 degrees Celsius. This is the equivalent of containing the attackers. Even if they manage to get access to our system, we can adopt practices that contain them and keep the data of our users and customers safe. Starting with the principle of least privilege: we should give our applications, containers, clusters, whatever, the minimum set of access rights and resources that lets them perform their function.
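Coming back to the anti-overload patterns mentioned a moment ago, here is a minimal sketch that combines a circuit breaker with exponential back-off; the class shape, parameters, and injectable clock are illustrative assumptions, not a real library:

```python
import time

class CircuitBreaker:
    """Open the circuit after `max_failures` consecutive errors, then wait
    an exponentially growing cool-down before letting calls through again."""

    def __init__(self, max_failures=3, base_delay=1.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.base_delay = base_delay
        self.clock = clock          # injectable for testing
        self.failures = 0
        self.opened_at = None

    def _cooldown(self):
        # exponential back-off: base, 2x base, 4x base, ... per extra failure
        return self.base_delay * 2 ** (self.failures - self.max_failures)

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self._cooldown():
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None   # half-open: allow one trial call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0           # a success closes the circuit
        return result
```

Failing fast while the downstream service recovers is exactly the degraded-but-alive behavior described above: you sacrifice some requests to avoid a global outage.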
Then there are things like defense in depth: building layers of security, kind of like an onion architecture, so that once an attacker manages to sneak past one of the layers, he finds an additional layer that makes his life a little harder and gives you more time to react and fix the vulnerability. And zero trust, especially during communication between distributed services: no implicit trust of any request, making sure a request is what it claims to be, and that it's valid, before processing it. Another thing we can adopt is hardware security modules. Imagine you have an application that stores user information, and you are hashing the passwords with whatever hashing algorithm you are using. Now, imagine an attacker manages to get access to your database and steal the data without you knowing it. He can brute-force the passwords offline, and you won't notice until the data appears a couple of days later on the black market. Hardware security modules give us keys with which we additionally encrypt those password hashes. Those keys are extremely hard to extract, practically unbreakable; an attacker would spend more than a lifetime trying to break them. That means the attacker can no longer simply steal the data and walk away: he has to stay inside your cluster to at least get at the keys to decrypt the passwords. By keeping him in, you gain additional time to, hopefully, become aware that an attack is happening on your cluster and react properly. The fourth measure: they have a fire-suppression system that can be activated by the driver, or externally by the race marshal. And what strikes me here is the distribution of roles.
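Returning to the HSM idea for a second: keyed hashing illustrates why stolen hashes become useless without the key. In this sketch the key sits in a variable purely for demonstration, and all names are assumptions; a real HSM would keep the key inside the hardware and only perform the keyed operation on request:

```python
import hashlib
import hmac
import os

# In production this key would live inside the HSM and never be exported;
# holding it in memory here is purely for illustration.
HSM_KEY = b"demo-key-never-do-this-in-production"

def hash_password(password: str, salt: bytes) -> bytes:
    # Slow, salted hash first (defends against rainbow tables) ...
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    # ... then a keyed MAC on top: without HSM_KEY, offline brute force of
    # a stolen database goes nowhere, so the attacker must stay inside.
    return hmac.new(HSM_KEY, digest, hashlib.sha256).digest()

def verify_password(password: str, salt: bytes, stored: bytes) -> bool:
    # constant-time comparison to avoid timing side channels
    return hmac.compare_digest(hash_password(password, salt), stored)

salt = os.urandom(16)
stored = hash_password("s3cret", salt)
```

The design choice is that verification requires a round-trip to the key holder, which is exactly the containment the suit analogy describes.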
If a crash happens and the driver is still conscious, he can activate the system himself. If the driver has passed out and the rescue team can get to the car, they can activate it themselves. Now, imagine a case like our friend Romain's, where the car breaks in two and there's a fireball: if the driver is unconscious and the safety team cannot get there in time, the race marshal can activate it remotely. In software engineering, especially in the way we define policies between our systems, we need to define communication policies and make sure that no application can call another application unless it is allowed to. Having access control and role-based access helps give access only to the people who need the resource. And finally, they have data recorders. The Formula One people are crazy about collecting data; they collect data about everything. They have data recorders that capture the speed and the deceleration forces, so the doctors know the severity of an impact. If an accident happens, the doctors already know how severe it was, what injuries to expect, and how to react. Likewise, no system should go to production without monitoring in place. We should have logging that records events, especially security ones, and monitoring and observability tools that give us an understanding of how our system is behaving and can trigger alerts in case of malicious behavior. So those were the ten safety measures and their counterparts in the software industry. But it's all about finding the right balance: it's a trade-off between cost, what you are trying to achieve, and security. And that is a philosophy the Formula One people adopt as well. Here are two examples. The first one is refueling.
Refueling worked like this: because the more fuel you have in the car, the slower it is in the first laps, teams would start the race with less fuel and keep refueling the car during pit stops. It was banned after the 2009 season because of the risks: as you can see in the picture, drivers were actually getting burned. And there was the cost: the BBC reported that maintaining the refueling rigs and the crews that handled them cost teams over a million dollars per year. So it was removed both from a safety perspective and from a cost perspective. The second example is how the drivers sit in a Formula One car. It's not comfortable at all, and they feel every bump. But it's a trade-off to gain speed: it's not comfortable, but it lets them design the car aerodynamically, to be as fast as it can possibly be. I want to end this presentation with some numbers from the Cost of a Data Breach report by the IBM Security folks. In 2021, they found that the average cost of a data breach is more than four million dollars. But the number that scares me the most is that it takes, on average, 287 days to identify and contain a breach. That's a long time. So I hope that, through this presentation, you now have a good understanding of how we can move fast while keeping our systems resilient and the data of our customers and users safe. Here are some resources that helped me build this presentation. And with that, I want to thank you. If you have any questions, I'm happy to take them.