So, yeah, like I was saying, show of hands: who doesn't know what eBay is? Okay, we've got that covered. So basically we are under the umbrella of eBay, in the eBay Classifieds Group, which focuses on local markets: websites where you can buy and sell all kinds of goods, mostly within your city or your own country. We have ten brands in multiple countries, like Marktplaats in the Netherlands, Mobile.de in Germany for selling cars, and eBay Kleinanzeigen here in Germany. Our data centers are mostly in the Netherlands and Germany, so our cloud consists of two regions in those data centers, one in Amsterdam and one in Düsseldorf.

These are some of the stats. As you can see, I'm lying a little bit: it's not quite 80 racks, but almost. We have around 40 racks in each region, Amsterdam and Düsseldorf, with 10K instances in Amsterdam, our biggest region, and around 6K instances in Düsseldorf. We provide volumes, block storage, object storage. I'll hand it over to Adrian.

So as you know, at the start of this year we were confronted with a really serious pair of security bugs, the Spectre and Meltdown vulnerabilities, and as you can imagine, our security stance needed to be covered. You saw that we have a lot of reputable brands: Mobile.de, eBay Kleinanzeigen, Marktplaats. So we decided to patch against these vulnerabilities as soon as possible. A quick reminder for those of you who don't know what Spectre and Meltdown are: Meltdown leaks kernel data to user-mode programs, while Spectre primes the CPU's branch predictor to influence which branch is speculatively executed next. Spectre is harder to exploit than Meltdown, but it's also harder to mitigate. I think there was even an article on Ars Technica this morning where they found seven more variants, so it's the gift that keeps on giving. That's the short description of the vulnerabilities; in the next slide my colleague is going to talk about investigating them.

I will talk about our timeline and the phases of this project. The project basically started in January this year and ended in July. We started with an assessment phase, where we investigated what packages we needed to patch and what we needed to do to be covered. Then, as soon as the project started, we began the development phase, where we developed all the scripts and automation we needed to apply these patches. Düsseldorf, the first cloud region we patched, took one month; Amsterdam took more than that, around four months. We took more time in Amsterdam because in between we were confronted with some patches we needed to apply to our own infrastructure, and I will talk in more depth about what we did in those different phases. My colleague will now talk about the assessment phase.

So we were not one of the companies that knew about the exploits up front; we got informed along with the rest of the world. We started to understand what mitigating the bugs would mean. For us, the path we chose was to update the Linux kernel, update QEMU/KVM, and then update the BIOS. There were other paths; we saw that colleagues from CERN took a different one, but we decided we wanted to offer maximum stability to our cloud and didn't need to rush into patching with the microcode immediately.
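As background for that assessment: on kernels with the patches (or backports) applied, the kernel itself reports mitigation status in sysfs, which gives a quick first check. Here is a minimal Ansible sketch of such a check; the task names and the fail condition are our illustration, not part of the tooling described in the talk:

```yaml
# Minimal sketch: read the kernel-reported mitigation status.
# Linux 4.15+ (and distro kernels with backports) expose these sysfs files.
- name: Read kernel-reported Meltdown/Spectre status
  command: >
    grep -H . /sys/devices/system/cpu/vulnerabilities/meltdown
    /sys/devices/system/cpu/vulnerabilities/spectre_v1
    /sys/devices/system/cpu/vulnerabilities/spectre_v2
  register: vuln_status
  changed_when: false

- name: Fail if any variant is still reported as vulnerable
  fail:
    msg: "{{ vuln_status.stdout }}"
  when: "'Vulnerable' in vuln_status.stdout"
```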
Our CentOS distribution was already a bit behind, so while we took the decision to go the BIOS route and wait for our vendor, Dell, to provide a stable BIOS, we also had time to work on updating CentOS from an older version, skipping one release, up to CentOS 7.4. The first mitigations we actually shipped were at the guest level: as fast as the distributions released mitigations to the public, we built our custom images, worked with our tenants to test them and see if they had any performance impact (they didn't), and then published the Glance images internally for consumption. We were pleasantly surprised that CentOS and Red Hat were among the first to have patches for this, so we offered those immediately, with Ubuntu and Debian soon to follow through several iterations of the images. While doing this, we also developed our strategy for patching the cloud, and my colleague is going to tell you what we built and why.

So it was clear that we needed to automate the big task of restarting 1,000 hypervisors. We decided to use Ansible as our main tool, and we leaned heavily on Ansible roles, which are a way to organize your tasks. As examples, we have OpenStack roles, hardware roles, update roles, and one of the most important ones, the checker role, which verifies the Meltdown/Spectre mitigation status. Things like resetting the iDRAC, starting the compute service, or updating the operating system and the BIOS are all roles that we could reuse and organize.

Let me deep-dive on the Meltdown/Spectre checker role; a minimal sketch of such a role follows at the end of this section. It's not rocket science. Basically, we check that the BIOS on the hypervisor is the patched version we need; we check that the correct kernel version is installed; we check that we have the QEMU version we need, since it exposes the new CPU feature bits that guests need for the mitigation; and at the end we run a checker on the host, an open source script that tests the variants we care about. You can even find it on GitHub; it's a very nice script that covers everything. Here is a very simple example of an Ansible playbook run where everything went okay: at the end you can see the versions we were testing and the CVEs that were mitigated, so while it's running it's pretty easy to tell that everything is all right.

Now, going deeper into the playbook that actually patches the hypervisor, I divided it into three main phases. First the pre-tasks, where you prepare the hypervisor for patching: you disable Puppet (in our case), unmount the ZFS file system, maybe check the current BIOS, and obviously stop the instances running on the hypervisor. Then the actual update happens: you upgrade the BIOS and update the operating system; we actually upgrade the BIOS from within the operating system, which is why it's done before the reboot. Then we move on to the post-tasks, where we reboot the compute node and run the checker I just showed, and if everything is all right we put the hypervisor back in its previous state: mounting the file system, starting the services, starting the canaries (which I'll talk about later), starting the VMs, and enabling the compute node. As you saw in this patching playbook, there's something that may need clarification: some services were restarted.
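Before getting to those services, here is the promised sketch of a checker role like the one described above. It is illustrative only: the `expected_*` variables and the script path are assumptions, and the final check parses the JSON batch output of the open source speed47/spectre-meltdown-checker script from GitHub.

```yaml
# roles/meltdown_spectre_checker/tasks/main.yml — a minimal sketch,
# not the talk's exact code. expected_* variables are assumptions.
- name: Read the running BIOS version
  command: dmidecode -s bios-version
  register: bios_version
  become: true
  changed_when: false

- name: Assert the BIOS is at the patched version
  assert:
    that: bios_version.stdout | trim == expected_bios_version

- name: Assert the patched kernel is running
  assert:
    that: ansible_kernel is version(expected_kernel, '>=')

- name: Check the installed qemu-kvm version (exposes new CPU bits to guests)
  command: rpm -q qemu-kvm
  register: qemu_pkg
  changed_when: false
  failed_when: expected_qemu_version not in qemu_pkg.stdout

- name: Run the open source spectre-meltdown-checker
  command: /usr/local/bin/spectre-meltdown-checker.sh --batch json
  register: smc
  become: true
  changed_when: false
  failed_when: false   # its exit code encodes vulnerability; we check the output below

- name: Fail if any checked variant is still vulnerable
  fail:
    msg: "{{ smc.stdout }}"
  when: smc.stdout is search('"VULN":true')
```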
A very important service we need to restart is the vRouter agent: in eBay Classifieds Group we use Contrail for our SDN, so we need to restart the vRouter agents, the Contrail component that forwards packets from the VMs. We also have a canary running on every hypervisor that does a bit of monitoring and testing of the hypervisor itself, and we were also unmounting and remounting the ZFS file system that we use to host the actual virtual machines.

Now a bit more about the other tasks. We are fortunate enough to mainly have cattle in our cloud, not pets, so one of the things we were able to do was simply disable the hypervisor and completely stop all of its VMs. Our tenants then had the choice to either spawn their workload in a different zone or take the downtime with us, after which we would just bring their VMs back up. We didn't use any live migration; we thought live migration might put the maintenance at risk, so we wanted to keep it as simple as possible and just make sure the VMs came up again. We had to develop a way to keep the list of VMs running on each host, since we didn't do a nova evacuate. And then something very important for us, because we don't live in a perfect world: we had broken hypervisors that were not suitable for production but were in some pre-maintenance state, waiting for a disk change or similar, and we wanted to preserve their disabled reasons, so that whenever you run nova service-list you can see why that hypervisor is out of production. We wanted to save that reason and restore it afterwards; a small sketch of that pre-task bookkeeping follows below.

The BIOS upgrade, again, was not perfect. What we actually did was download the Linux update package provided by Dell, execute it locally using Ansible, parse the output of the package's script, and then decide whether the operating system was safe to reboot; after that we rebooted the hypervisor. There were cases where the iDRAC got stuck. Our maintenance didn't include the task of updating iDRACs and making sure they were fully functional; as a precaution we would just reboot the iDRAC and retry the patching, and in some cases, lessons learned, we had to do the patching manually. Bruno is going to talk about hardware failures.

With all of these reboots across our infrastructure, quite often a machine would not come back up, or a BIOS was corrupted and we needed to go directly to the machine to reinstall it, or we would see network, CPU, or memory errors, because booting is exactly the kind of intensive task that exposes failures if the hardware is not in a good state. So there is always a risk that a reboot will decrease your capacity.

I will move on to our testing phase. A milestone was when I wrote in our team chat in Slack: "we have Meltdown-fixed computes", which meant we had a playbook that ran all the steps, with all the right versions, and verified the host was fully patched. We selected a group of users for this testing, so we didn't patch everything at once; we patched some infrastructure with those fixes and watched how it behaved.
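As promised, here is a minimal sketch of that pre-task bookkeeping, using the OpenStack CLI from Ansible. The exact commands and the disable reason text are our illustration (the CLI calls need admin credentials), not the production playbook:

```yaml
# Sketch: remember which VMs run on the host and preserve any existing
# disabled reason, so both can be restored after patching.
- name: Record the VMs currently running on this hypervisor
  command: >
    openstack server list --all-projects
    --host {{ inventory_hostname }} --status ACTIVE -f value -c ID
  register: running_vms
  delegate_to: localhost
  changed_when: false

- name: Save the current nova service state, including any disabled reason
  command: >
    openstack compute service list --service nova-compute
    --host {{ inventory_hostname }} -f json
  register: svc_state
  delegate_to: localhost
  changed_when: false

- name: Disable the compute service with a maintenance reason
  command: >
    openstack compute service set --disable
    --disable-reason "Spectre/Meltdown patching"
    {{ inventory_hostname }} nova-compute
  delegate_to: localhost

- name: Stop each VM before the OS/BIOS update
  command: openstack server stop {{ item }}
  loop: "{{ running_vms.stdout_lines }}"
  delegate_to: localhost
```

The post-tasks would then do the reverse: `openstack server start` over the saved list, and re-enable the service, restoring the original disabled reason from `svc_state` if the host was already out of production.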
At the same time, we kept an eye on the community to see if there were consequences to patching the hypervisors, for example whether the load would increase immensely, because that was one of the things people were afraid could happen.

I will also talk a little about our load balancers. While doing all this, we were restarting our whole infrastructure, and we use the Avi Networks load balancer, so we needed some automation for it. The way the Avi load balancer works is that there are service engines, basically VMs spawned in each tenant, and those VMs do the actual load balancing; they are small load balancers. We needed to migrate all of the service engines off the zone or rack we were upgrading, because the version we had at that point was not aware of availability zones, so the service engines were not spread across different availability zones even when we upgraded just one. We automated this with the Avi SDK and some Python scripts, so there are tools to do this as well, and it worked very well. Adrian will now talk about the campaign in Düsseldorf.

Yeah. So Düsseldorf, as we said in the beginning, was the first data center we decided to patch. We went the full-cloud way: we told our users that we were going to take down one availability zone per week, and they were fine with that. We had never done maintenance at such scale, so it was a learning curve for us as well. On a specific day, I think it was a Thursday, we started by disabling the complete zone four. Again, our users could either take the downtime with us or move their workloads to the other three availability zones. We took the zone completely down, patched everything, brought it back up, and restored the VMs. We did the same for zone three, and for the last two zones we had built up some momentum and did them both in the same week. The lessons we learned were that, yes, we can do it very fast, we can upgrade an entire data center in a few weeks, but it was not optimal. We got feedback from our users that their Elasticsearch clusters were re-syncing shards a lot: if we weren't careful when bringing entire racks back up, they had problems rebalancing the Elasticsearch clusters. So for the next campaign, in Amsterdam, we took that feedback and changed our strategy.

Yeah, so the Amsterdam data center took more time. We started with the same approach, one zone per week. During the first upgrade we hit an issue: we needed to patch our SDN, and I will talk about that in more detail. The same happened while we did the second zone, so we had two patches in between; that's why it stretched beyond our target of doing it in about a month. We started with one zone per week and finished with one rack per day, and I will tell you why (see the sketch of a rack-scoped run after this section).

So there were two patches we needed, which we detected while rebooting. One was related to our Contrail SDN, the other to the Avi load-balancer-as-a-service. Contrail has a protocol that distributes configuration from the configuration nodes to the control nodes; we needed a patch that basically caught an exception, and it was delivered and backported by Juniper for Contrail. The same happened with the Avi load balancer as a service, where a service engine was having trouble setting up a cluster interface, and we also got a fix from Avi so that both old and newly created service engines worked.
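To give a rough picture of the rack-per-day scoping mentioned above, a run can be limited to one inventory group at a time. This is a sketch under the assumption of per-rack inventory groups and the three roles described earlier; all names here are illustrative:

```yaml
# patch_rack.yml — sketch of a rack-scoped maintenance run.
# Per-rack inventory groups (e.g. rack_a01) are an assumed convention.
- hosts: "{{ target_rack }}"
  serial: 1              # patch one hypervisor at a time within the rack
  become: true
  roles:
    - pre_tasks          # disable Puppet, record VMs, stop instances...
    - update_os_bios     # OS update plus the Dell BIOS package
    - post_tasks         # reboot, run the checker, restore state
```

A day's maintenance would then be something like `ansible-playbook patch_rack.yml -e target_rack=rack_a01`, keeping the blast radius to one rack.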
So yeah, it took some trouble to do this. Now Adrian will talk a little about performance. Düsseldorf was our first test case. Our German colleagues will recognize that this is the Elasticsearch cluster of a reputable website here in Germany, eBay Kleinanzeigen. The lower graph is the hypervisor load before, during, and after we executed the maintenance, and above it is more detailed CPU instruction and usage data, again before, during, and after. For this specific use case, which is one of the most important ones, imagine everyone searching for the things they want to buy on eBay Kleinanzeigen, we saw practically no impact. Everything stayed more or less within the same limits before and after the Spectre/Meltdown patching. So for this example at least, we were not hit by the 10 to 15 percent degradation the community was talking about. It's the same story for Amsterdam: this platform is a dual-data-center platform, and the observation is the same. Before, during, and after our maintenance, we cannot clearly see any pattern where performance was impacted. Just to clarify, because this can be a little confusing and there is a lot of noise, at least in the first graph: the conclusion is simply that there is no change before and after; it all stays the same. Even where you can see a small increase in load, that is just people creating more instances and the hypervisor getting more loaded. So there was no difference attributable to the patches.

I will talk a little about maintenance strategies. We started with one zone per week, but ended up with a rack per day, because that looked like a good compromise between velocity and impact. As my colleague said, some platforms, which is what we call a group of users, were afraid the maintenance would impact their workload too much; taking down a whole zone would be too much for them. So we did one rack per day, and we notified the owners of the VMs affected on that rack, because people need to know what's going on. This is something we need to automate better. We also communicated all the steps to our users: we used Slack and gave them updates on what was happening, when we were stopping their VMs during the update, and when everything was finished, because users usually want to know what is happening at what time and when they will get their VMs back; a minimal notification sketch follows below.

So, what have we learned? Ansible is a great tool for infrastructure automation, and we encourage everyone to use it, with roles or whatever other modules work for you. Do not rush into updating as soon as a vulnerability is discovered; try to hold off until you are sure the patches won't have consequences for your infrastructure. That actually happened here: a BIOS update was released and then had to be pulled because it was rebooting servers, and the same happened with the microcode, which was also pulled. So make sure that what you are patching is really ready for prime time. And, as I said about the patches we hit: it is actually good that we restarted our whole infrastructure, because you catch hardware failures, and you catch bugs that only appear when you stress the actual infrastructure. Finally, scope the maintenance as well as you can to reduce impact, as we did by going from a whole zone down to a single rack. That's it. Questions?
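As an aside on the notification step just mentioned: per-step user updates can be sent from the same playbooks. A minimal sketch using Ansible's Slack module; the token, channel, and message text are placeholders, not the talk's actual setup:

```yaml
# Sketch: notify users from within the maintenance playbook.
# (The module is `slack` in older Ansible releases.)
- name: Tell users their rack maintenance is starting
  community.general.slack:
    token: "{{ slack_token }}"        # placeholder credential
    channel: "#cloud-maintenance"     # illustrative channel name
    msg: "Patching {{ inventory_hostname }}: stopping VMs now."
  delegate_to: localhost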
Yeah, I have a question. Regarding the BIOS update: you said you did it via the Linux command line, and then you have to reboot the server. How do you deal with the VMs on the hypervisor? Are they cloud-ready, or did you have to live-migrate those VMs before rebooting? So, as my colleague said, we just stop them and leave them there; we upgrade and then we start them again. They are not migrated. So those applications are not actually affected by the rebooting? Yeah, we basically instruct our users to replicate their applications across all availability zones so they can tolerate failures. Their apps should be aware of those failures and cope with them. Thank you.

Can I just... sorry, I've got the mic, I'm louder. Just a question: you mentioned in your last slide, about lessons learned, restarting your whole infrastructure often. Is that something you currently do, or is it just a lesson learned? Are you implementing something like chaos monkeys as a result of this, to try to tackle that? Yeah, good question. We don't do that currently; right now we do it only when it's needed, like for a new feature or a new upgrade, but we intend to do it every three months. So we are thinking of coming up with a plan next year to automate all of this. Yeah. Thanks very much. Cheers.

The latter one. Is there any specific reason for not using massive live migration? Because you said you did not live-migrate, except the Avi machines. Yeah. So we did a test with our users and saw that some of their workloads were negatively affected by live migration. So we decided together with them that they should just take the downtime, because, maybe I didn't mention it clearly, we have four zones in each data center, so they still have enough capacity in the other three zones that stay up. What we also noticed is that, depending on the workload, maybe 80 percent of live migrations pass and maybe 20 percent are not correctly migrated. And we didn't want to stretch out the maintenance; we didn't want it to be very long. Ideally we wanted to do a zone in one day, and we have pretty beefy nodes, so live-migrating everything was going to take some time, even on 10G. But do you use live migration on a regular basis? It depends. We tend to encourage our users not to treat their VMs as pets, so most often we tell them: either destroy your VM, or we will take it down, something like that. You're lucky enough to have cloud-compliant clients. Yes. Yes, that's true. Yeah.

Hey, I saw all the steps; there were seven steps, including disabling the compute node, stopping services, starting services, patching, and then doing the BIOS update. Are all of these done in a single, simple playbook, or do you have several playbooks to achieve this? So yeah, it started like that, with one big playbook, which was easier for our testing. But then we noticed we needed to split it up, into things we need to do before, things we need to do only for the update, and so on. Splitting it actually worked better, also because we had different team members involved; with split playbooks it's easier for each to run specific tasks.
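For reference, split playbooks can still be chained back into one end-to-end run when that is wanted, which is what the next exchange touches on. A minimal sketch, with the individual playbook names as assumptions:

```yaml
# patch_node.yml — sketch of stitching the split playbooks back together.
- import_playbook: pre_tasks.yml    # prepare: disable Puppet, stop VMs, unmount ZFS
- import_playbook: update.yml       # OS update plus BIOS upgrade
- import_playbook: post_tasks.yml   # reboot, checker, restore state
```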
So this means there is no continuous end-to-end patching for a node, right? You have to wait until one playbook finishes and then start the next one; that's a manual intervention. Yes, that's true, but it also lets us control better what's happening. We had both possibilities: we had the whole playbook that runs everything, but then, since we were doing a whole zone at a time, that was a lot of hypervisors to patch, which is why we split it. When we moved to a rack-based maintenance, we could actually run the whole playbook from top to bottom. Okay. Yeah, we've been using the same kind of solution, and some of the playbooks for our patching too, but we have now stitched all our playbooks together through our CI/CD deployments, so it's just one trigger, and it notifies you if something goes wrong during the run. Yeah, that would be the next step towards the optimal; we want to do that as well. Yeah. Just a suggestion. Yeah.

One more question. For your BIOS update, you mentioned you experienced corruption; did you figure out what was causing that corruption? Oh, yeah. So basically, for some reason, some of the iDRACs were performing poorly, and nothing would help. We did iDRAC resets, basically just rebooting the iDRACs; we did everything to get, let's say, a stable iDRAC, but none of our solutions gave us 100 percent stability of the iDRAC operating system. The next step would have been to actually update the iDRAC firmware first and then update the BIOS, but that would have introduced a new dependency into our patch cycle. So you're saying you just updated the BIOS independently and left the other components alone? Correct. Correct. We have solutions to upgrade every firmware on the compute node, but that was, again, out of scope; we wanted to keep it very, very targeted on fixing this vulnerability fast. So are you downloading those BIOS updates locally to the server, or where do you apply the update? Right. So in the playbook we basically do a wget, place the file on the hypervisor, execute the Linux binary, and parse the output; based on the output, we decide whether the compute node is safe to reboot. In most cases it was; I think we had just a few cases where corruption occurred. So when you downloaded those BIOS images, did you verify the images were good? Correct. Correct. Yeah, the image, the Linux binary, was 100 percent correct, and then we applied it. We verified that. Okay. Thanks.

How diverse is the hardware you have? Is it 100 percent Dell, or do you have HPE and Lenovo too? And did you have to write any custom code, like Ansible modules, to deal with some of the BMCs? So we have homogeneous hardware from one vendor, Dell. And no, we didn't have to write anything for the BMC. Later on, while we were doing this project, we saw that there is Ansible support for Redfish, which could have done what we wanted, but our iDRACs were not all at the same level, which is why we avoided the Redfish route. As a later project, after we did the patching, we upgraded all of our iDRACs, so now we can use Redfish. What we did write was a way, in an Ansible playbook, to reset the iDRAC, so if it fails we run it again. That often fixes things, because sometimes there are stuck jobs, and after resetting them we could apply the update. But in the end there were some corner cases we had to handle manually.

I think if there are no more questions, that's it for us. Thank you. Thank you.