Hi everyone, thank you for joining. I'm Guillaume Allard, I work for Société Générale, one of the biggest and oldest banks in France. I'm a Cloud Principal Engineer and I'm also the Product Owner of the OpenStack compute services of our cloud platform at Société Générale. Today, together with Florian, I will share the challenges we have faced while scaling our OpenStack deployments over the last two years.

Société Générale has a multi-cloud strategy that allows applications to be hosted on both public cloud and private cloud, depending on the confidentiality of the data and the policies we have. On the private cloud side we expose the HG Cloud Platform, and this is the console of the private cloud platform that applications can use internally. The HG Cloud Platform enables applications to consume infrastructure through APIs, and OpenStack aims to host the cloud-native applications. We also offer VMware services for applications that have not been transformed. So OpenStack runs the compute side, the block storage with Cinder, and also the security groups; as we are a bank, we rely a lot on security groups, with a zero trust approach. And many other services, the blue ones on the slide, rely on OpenStack to deploy their own services on top of us, and they also expose an API so consumers can get managed services like PostgreSQL as a service or RabbitMQ.

The HG Cloud Platform is available in four regions in the world. The services are exposed through an API, each service exposes a Swagger definition, and we have an internal Terraform plugin to ease consumption for the applications. The platform is spread across four regions and eight AZs, and you can see the deployments on the world map here. Our biggest region is Paris, where we have 350 compute nodes today. We are running an upstream version of Ussuri, deployed with Kolla-Ansible. For storage we use Ceph, also upstream. We have around 18,000 VMs deployed worldwide and 450 compute nodes in total. We are a member of the foundation and we also contribute, so far with around 4,000 lines of code upstream in OpenStack.

So the first challenge we had scaling our deployments was the people, and this is still a challenge for us. Last year we were four in the team, and now we are nine: one product owner and eight DevOps engineers, five in France and three in India, so we work across two time zones. And if you want to join, you have the link here to apply.

Hello, everyone. I'm Florian Le Duc, part of the feature team Guillaume mentioned before. The second challenge we faced with OpenStack was related to the constraints the bank applies around strong network isolation between applications, entities, and activities, because we run both financial and retail activities, and all those applications cannot share the same network. So we make massive use of routed provider networks with VLANs. Many projects have been onboarded, 5,000 now in Paris, and the challenges we face are mainly in Paris: 120 networks, which now represent 160K Neutron RBAC rules, because all those networks have to be shared between tenants, but not every network should be visible and usable by every entity and every project in OpenStack. So we ran into problems with the RBAC model used in Neutron, and we built a first solution with internal patches that are not yet upstream.
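To make that RBAC problem concrete, here is a minimal sketch of the stock Neutron mechanism in question (not the internal patches), assuming the openstacksdk Python client; the cloud name, network name and project IDs are hypothetical. A provider network that must stay invisible to most tenants is shared project by project with access_as_shared RBAC policies, and one entry per network and project pair is what adds up to figures like the 160K rules mentioned above.

```python
import openstack

# Connect using a clouds.yaml entry; "hg-cloud-paris" is a hypothetical name.
conn = openstack.connect(cloud="hg-cloud-paris")

# A provider VLAN network that only some projects may use,
# so it is NOT created with shared=True.
network = conn.network.find_network("entity-a-vlan-2016")

# Each project that may attach to the network needs its own RBAC policy.
# With ~120 networks and thousands of projects, these per-pair entries
# are what make the RBAC table explode.
for project_id in ("PROJECT_ID_1", "PROJECT_ID_2"):
    conn.network.create_rbac_policy(
        object_type="network",
        object_id=network.id,
        action="access_as_shared",
        target_project_id=project_id,
    )

# Counting the policies gives an idea of how many rows Neutron has to evaluate.
count = sum(1 for _ in conn.network.rbac_policies(object_type="network"))
print(f"network RBAC entries: {count}")
```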
But this year we will work on something more robust that can be shared with the community for those problems.

The second challenge on the network side, with the routed provider network feature that we use with segments, is that every time you register a new node or restart the Neutron services, an internal component triggers calls to register the node inside a segment and then in an aggregate. The more networks we have, the more segments and aggregates we have, and the more compute nodes we added, the more problems we had restarting our services, the OVS agent or the Neutron services. We have merged a patch to avoid that, which was really a relief because we could operate our control plane properly again.

The third challenge on the network side is the limitation of the segments plugin for routed provider networks, which does not allow more than one segment per network per host. We have a patch internally that we are testing in our labs, and we have already pushed it to the community; it is being reviewed, so it is work in progress. That limitation forbids us from adding new subnets to our networks per segment per host, and as we are growing, the subnets are now full. So we have to create new networks with new subnets instead of adding subnets to the existing networks, and that was not the experience we wanted to give the users: they want to keep the same network name, and whatever subnets sit behind it, they don't care.

Challenge number three was the control plane, because the more nodes we add, the more RabbitMQ queues and fanout queues we have to handle, and the more calls go to the Neutron API. Adoption was also very high during the last year, as Guillaume showed, with 1,500 instances created and destroyed per day, which creates a lot of load on the control plane and even on the resource plane. That high adoption was an issue because we did not know yet what the next bottleneck would be or which component we would have to scale next, so we need better monitoring and we are working on that. Even so, to avoid new incidents, we do capacity reviews and we follow the Large Scale SIG recommendations from the community. We started with one AZ in Paris and three control nodes. Then we opened a new AZ in Paris and spread the control nodes, so we had four control nodes across two AZs. In December last year we had eight nodes across both AZs, and now it is 12 nodes on the control plane just for Paris. I think that is all for the control plane part.

And as we are a bank, security is important, and the policy is that we use a zero trust model: by default, no traffic should be allowed on any port in OpenStack. We faced an issue with this zero trust model because, together with the RBAC rules, we were sharing one security group across some projects, and those security groups contained more than 100 rules. That was an issue because every time they were updated, we needed to update all the ports of all the instances in OVS. Later we suffered an incident where we were not able to restart instances properly, because deleting the OpenFlow rules took too much time: by the time the rules came back, the VMs had already restarted and the DHCP OpenFlow rules were not applied yet, so the instances were not able to get their DHCP responses.
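As a rough way to see how exposed one hypervisor is to that problem, here is a small sketch, our illustration rather than the team's actual tooling, assuming the openstacksdk Python client; the cloud and host names are hypothetical. It sums the security group rules attached to every port bound to one compute node, since each of those rules has to be turned into OpenFlow entries by the neutron-openvswitch-agent when it reprograms the port, which is the work that took too long during the incident.

```python
import openstack
from collections import defaultdict

conn = openstack.connect(cloud="hg-cloud-paris")   # hypothetical cloud name
host = "compute-paris-042"                          # hypothetical hypervisor

rule_count_by_sg = {}              # security group id -> number of rules
rules_per_port = defaultdict(int)  # port id -> total rules to program

for port in conn.network.ports():
    # Keep only the ports bound to the hypervisor we are interested in.
    if port.binding_host_id != host:
        continue
    for sg_id in port.security_group_ids or []:
        if sg_id not in rule_count_by_sg:
            sg = conn.network.get_security_group(sg_id)
            rule_count_by_sg[sg_id] = len(sg.security_group_rules)
        rules_per_port[port.id] += rule_count_by_sg[sg_id]

# Ports carrying a shared 100+ rule security group are the ones the OVS agent
# struggles to reprogram quickly after an update or a restart.
for port_id, total in sorted(rules_per_port.items(), key=lambda kv: -kv[1])[:10]:
    print(f"{port_id}: {total} security group rules")
```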
So that was really an issue lately. What we have done is to reduce the number of rules, and we will work on a contribution to the OVS agent to avoid that in the future. But apparently someone, I think I have seen him in the room, told us that he has found an issue too, because he is also using the Open vSwitch firewall driver. So maybe we will test that in our lab to make sure it fits well and fixes the issue.

Even with all those challenges, we are pretty proud of where we are now. Adoption keeps growing, and we plan to add many more nodes and onboard many new projects in Société Générale from all entities. So the growth is very high, the adoption as well, and we also plan to bring new features to our users. It has been a good journey working with OpenStack. I think that is pretty much all. And you have the QR code, the same as for the OpenStack position that is on the website. So now, if you have any questions for us? Yes. Either you ask the question here or you go to the microphone, whatever.

Some of them, yes. For the number of compute nodes, the patch is merged. For the segments, it is already pushed upstream but not yet merged, because it is being reviewed by the community. The internal patches for RBAC are not upstream yet because we need to prioritize. But some of them, yes; most of them, I would say. In the past we have also made contributions to Kolla-Ansible, not only to Neutron, Nova or Keystone. Most of them are upstream because every time we want to upgrade, we want to avoid having to backport our patches to the next release.

Any other question? The question is how we managed to reduce the number of rules to avoid the issues we had when sharing security groups across projects. Mainly, the security groups are managed by another team, so we reviewed all the rules with them and made sure that all egress traffic is allowed, which was acceptable for the security policy, and we kept only the bastion IP address ranges and the Puppet Masters, because we have Puppet Masters that configure the instances for authentication and everything. I think that is all.

Yes, but this is still an issue. On top of that, each application is able to add its own security rules, so there is still a lot of work here to support more rules per node. Maybe we also hit this issue because we sometimes have big nodes, with 96 cores and 2 terabytes of memory and a lot of VMs, so at the Open vSwitch level there is a huge number of flows. And as the OVS agent is single-threaded, all the actions are pretty much sequential: when it fetches a message to do something, you have to wait until it has finished before it can process whatever is next in the message queue, like creating a new port. So yes, this is a topic.

Oh, that is a good question. We know they maintain those rules in a Git repository, but I think after something like 60 or 70 rules we started having problems while restarting instances. They could not boot up properly because they could not get their DHCP responses, because the OpenFlow rules that the OVS agent should have installed were not there yet, since it was busy doing other things. So around 60 rules, yes, with the Open vSwitch firewall driver, we started to have problems.
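Just to illustrate that answer, here is a minimal sketch of what such a reduced baseline could look like with the openstacksdk Python client; the names and CIDRs are hypothetical, and in reality the rules are managed by the security team rather than created like this. Since Neutron attaches allow-all egress rules to a newly created security group by default, only the few ingress sources, the bastion ranges and the Puppet Masters, need explicit rules.

```python
import openstack

conn = openstack.connect(cloud="hg-cloud-paris")  # hypothetical cloud name

# Hypothetical source ranges kept after the review with the security team.
BASTION_RANGE = "10.10.0.0/24"
PUPPET_MASTERS = "10.20.5.0/28"

sg = conn.network.create_security_group(
    name="baseline-zero-trust",
    description="Bastion SSH and Puppet Master traffic in; all egress allowed",
)

# SSH into the instances from the bastion hosts only.
conn.network.create_security_group_rule(
    security_group_id=sg.id,
    direction="ingress",
    protocol="tcp",
    port_range_min=22,
    port_range_max=22,
    remote_ip_prefix=BASTION_RANGE,
)

# TCP from the Puppet Masters, which configure the instances.
conn.network.create_security_group_rule(
    security_group_id=sg.id,
    direction="ingress",
    protocol="tcp",
    remote_ip_prefix=PUPPET_MASTERS,
)
```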
And also, the zero trust model has only been enforced since July last year, for newly onboarded projects. There were already a lot of projects without this policy that had been running fine so far, and then security enforced the policy progressively. So it took almost one year to reach the point where the majority of projects needed to have this policy, and that is when we ended up with too many rules per node.

The question is when this contribution will be tested and pushed. Everything will be tested, because the problem is really recent, maybe one month old. The contribution will be pushed whenever we are ready. As for the testing, I think we have not run enough tests yet to be sure that we will not break anything or change the behavior of the OVS agent, which is something we do not want. Any other question? Thank you.