Thank you, everybody, for attending this session. I have a very interesting title today: OpenStack HA or no HA, that is the question. The reason I decided to talk about this is that at Workday we have a very interesting use case around how complex OpenStack actually is, and on top of that we have to run HA for multiple components. So I want to describe, as a user story, what our experience has been, and maybe have the audience ask me why we took these decisions. Let's start with my introduction. My name is Edgar Magana, Cloud Operations Architect at Workday. I'm also a member of the User Committee of the OpenStack Foundation. I was formerly a core reviewer on the Neutron team, so yes, part of the blame for Neutron is mine. I'm also involved in the OpenContrail user group and part of its advisory team. And I'm doing a lot of open source with a wonderful team at Workday. It's been a very, very busy year for us, so we're very glad to be here. As you know, we were Superuser Award finalists. We didn't win, but just being there was an amazing experience, and this talk is about what we did to get there. So let's start with very simple things. I'm going to walk you through what we did to understand OpenStack, from scratch all the way to our production systems. I'm not going to waste your time on the agenda; just know it's going to be long and painful to be here. Everybody in the room knows this diagram, and I can say it in advance: it's boring to look at. You all know these components. Actually, this is probably only half of the components that a typical production cloud runs; as you remember from the user survey, on average most production clouds deploy nine to ten projects. So this is just the minimum, the core projects. And I like to start here because this looks very easy, right? You look at it and say, it's not a big deal. And then I tell my team, OK, let's go a little deeper; let's understand what these boxes are. You have seen this next picture as well. Nothing fancy, though it does start becoming a little scary. All these components have their own database schema, but you can put all those schemas in the same database server, so it's still not a big deal. You actually have recipes for each one of these components, and you can run the same process on top of all of them, so you can unify all of this. So you say, OK, let's start reading the documentation, let's get more familiar. That's good for the research part, good for a POC, to understand what is going on on your servers before you run off and start building. Now let's think about what the community commonly deploys. You don't want to be the weird guy in the neighborhood; you want to follow others. This is why we are a community: we exchange user stories, use cases, et cetera, so you don't feel alone in this world called OpenStack. So we went to the basic documentation and found this picture, a very nice one, and I said: not bad, this looks good. Basically, you need a bunch of projects, each with its Git repository. You have packages from the different distro companies: Red Hat, SUSE, et cetera. Then you have options for the backend database: MySQL, MariaDB. You also need a message bus to let all these components communicate with each other, as shown in the previous picture, so the message bus is all around. Not a big deal: we have RabbitMQ, we have Qpid.
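Just to make the "message bus" point concrete: the services themselves go through OpenStack's oslo.messaging layer on top of the broker, but the underlying idea is simply producers and consumers on RabbitMQ. Here is a minimal, illustrative sketch with the generic pika client; the hostname, queue name, and message body are made up.

```python
# Minimal sketch: publishing a message onto RabbitMQ with the pika library.
# OpenStack services actually go through oslo.messaging on top of the broker;
# this only illustrates the "message bus" idea. Host and queue names are
# placeholders for the example.
import pika

connection = pika.BlockingConnection(
    pika.ConnectionParameters(host="rabbitmq.example.internal")
)
channel = connection.channel()

# Declare a durable queue and drop a task message onto it.
channel.queue_declare(queue="demo_tasks", durable=True)
channel.basic_publish(
    exchange="",
    routing_key="demo_tasks",
    body=b'{"action": "boot_instance", "flavor": "m1.small"}',
    properties=pika.BasicProperties(delivery_mode=2),  # persist the message
)
connection.close()
```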
So we have all these things. Then you have a bunch of compute nodes, and you should be able to add as many as you want, right? As many as the system supports, until it breaks, because you don't know when it's going to break until it breaks. So it's not that huge a deal. So, what happened next? Let's do it, right? Let's build all these things; it shouldn't be that painful. So we started talking to our internal customers. By the way, this is an internal private cloud, so our customers are internal people, our own coworkers. And we realized that one simple cluster would not be enough, so we needed to create a recipe, the typical cookie cutter, to deploy multiple clusters. So, number one: automation, and that's one of the most important things. Idempotent, because we don't want clusters to be different from each other; especially in the configuration, we want them to be very, very homogeneous. However, there are certain configuration parameters that we do want to be different. I'll give you an example: CPU overcommit. You're always going to have your dev environments before your production environments. In our case, we designed the system so that the overcommit ratio is configurable at a high level and is not the same in all clusters. Basically, in our dev environments the overcommit is something we can change, one-to-two, one-to-four; we can go a little crazy because it's still a comfort area. In our production system we want to be more conservative, especially on the first rollout of the cloud; we don't want to go too wild, so we make it one-to-one. So even though we have the same process to deploy multiple clusters in all our data centers, certain parameters are abstracted out of the common configuration and tuned per cluster. Now, scalable, right? You don't want to run three servers. Our data centers hold hundreds of servers, so we want to be able to add as many compute nodes as we want without breaking the system. The most important one, the one that actually drove me crazy all this time: security. SSL everywhere. On the APIs, obviously; I agree 100% with that one. For the people who run public clouds, having SSL on the public URLs, and even on the private or internal URLs, makes sense. In our case, I fought a lot with security, and guess what, I lost. So we enabled SSL everywhere, sorry, even on RabbitMQ. I will have to admit they gave us a lot of help. Something I encourage for you and your teams: if you have a security team and security is a big concern, engage them from the very beginning of the project. They're going to help you, and it's a lot of fun building this. In our case, our security team invested a lot of time and resources in this project. iptables at the host level: we're running iptables on every host, and we had a few issues with that part that I'll share with you in a bit. Obviously it needs to be stable, right? And stable means two things: control plane and data plane. I don't want to jump to the end of the presentation, but we're going to focus on these two planes in a little bit. And finally, production readiness.
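Going back to the overcommit requirement for a second: at Workday this kind of per-cluster knob lives in Chef data bags layered over a common recipe; the sketch below only illustrates the idea in plain Python, and the cluster names and ratios are placeholders, not our real values.

```python
# Sketch of the "same recipe, different knobs per cluster" idea. In practice
# this lives in Chef data bags; here it's just a dict rendered into nova.conf
# lines. Cluster names and ratios are illustrative only.
COMMON = {"cpu_allocation_ratio": 1.0, "ram_allocation_ratio": 1.0}

PER_CLUSTER = {
    "dev-az1": {"cpu_allocation_ratio": 4.0},   # dev can overcommit 1:4
    "prod-az1": {},                             # prod stays conservative, 1:1
}

def nova_conf_overrides(cluster: str) -> str:
    """Merge the common settings with the per-cluster overrides."""
    values = {**COMMON, **PER_CLUSTER.get(cluster, {})}
    lines = ["[DEFAULT]"]
    lines += [f"{key} = {value}" for key, value in sorted(values.items())]
    return "\n".join(lines)

print(nova_conf_overrides("dev-az1"))
```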
So, production readiness. I'm talking to my operations team, saying: we're going to hand this baby over to you and you're going to run it. And they say: hold your horses. I need to see what's going on. I need dashboards, fancy graphs, alerts, logging, all the cool stuff I need for a production system to be sure it's going to run 24/7, 365. OK, got it, we're going to do it. Also, our data center configuration is very particular and we cannot change it, and I like to put this in the slides all the time: bonded physical interfaces per server. Because when you start running things like Fuel or other distros, and I have nothing for or against any of them, they assume you have the ability to change your data center infrastructure the way they want it, and that's not right. Our data center has a specific configuration for HA that we couldn't change: all the servers have two NICs bonded, going up to the uplinks, so if we lose one of the NICs the other one still carries the traffic. So HA was already something we had to have in different places. Multi-tenant, obviously. And the last one, the critical point of all this: high availability. Everything before sounds interesting, and we spiked all those areas, but high availability was the really interesting one. So again I went back to the books and said, let me review the architecture again, let me see all the components. OK, I need all these boxes, and if I want to provide HA, I need to repeat all of them in a typical three-node cluster, five-node cluster, et cetera. OK, this is becoming a bit of a complication. So let me go to the typical providers who know all about this and try to get it from them. I found this picture, which is really good, from tcp cloud: their reference architecture for HA. I have the link at the bottom of the slide. I'm not going to spend a lot of time on it, but I want to highlight something: it is not easy to deploy. It is not. So I said, OK, it's a wonderful architecture, I love it, as an architect I love it. But do I want to drive my team, which is sometimes struggling and still learning OpenStack, into that monster? Let me find something else. So I went to Mirantis, just as a reference, and again I'm not favoring any of them. I found something a little simpler, but what I realized is that it's only simpler on paper, not in implementation. So I had this face after that: what the heck? How can I convince these guys, who are still learning OpenStack, to actually deploy those things? So I went back and said: OK, let's plan this a little better, hold your horses, let's go more simplistic. So I said, forget about HA, and I could actually use another word, but this is a very polite meeting. This is the Workday architecture nowadays. You don't see HA in it, do you? No, you don't. I'm not saying we don't have HA, and that's the tricky part of the conversation we're having today: it is at the WPC level. The bottom line here is that there are ways you can still provide HA to your production clusters when you don't want to go to the monster I showed before. We're going there; I'm not saying we're not going there, but let's take baby steps, right? Step by step. So this is a very simple architecture; let me walk you through it. And yes, it is working. The first box is the OpenStack controller, the typical one. As I told you before, we have bonded interfaces.
So from the logical point of view there is just one uplink. The top-of-rack switches: it doesn't really matter what kind of switch, because with SDN technologies they just become a thick wire for the hypervisors. On that first box we have the typical things. We have Apache as a front end, because Keystone, in the version we're running, was still using the Eventlet libraries, which were a known pain for performance, so we got rid of that and put Apache in front of Keystone, and we started moving other services like Glance behind it as well. Then we have all the other services: Nova, Horizon, and Glance. As you can see, I don't have Ceilometer, I don't have Heat, I don't have all that good stuff, because we were not ready for them. Why should I start implementing all those projects if we weren't ready? By the way, I forgot to mention: when you're planning all this, what do you do first? Either you go to the documentation, find the monster diagrams, get scared, and make that what-the-heck face I had at the time; or you decide to do it little by little, piece by piece, and keep the architecture as flexible as possible so you can add and remove components as your cloud grows. And on top of that, your use cases, right? Ask yourselves what is actually going to run on it. Maybe I just want VMs, just these kinds of things. So maybe this architecture is good for the next year or two of your production clouds; I don't know. The next box is the SDN part. We're using OpenContrail, the 2.21 version. As you can see, it uses two boxes. The first box is where we have all the configuration and control components, and the next one is analytics. This is actually becoming a best-practice deployment, because the data that analytics collects from all the vRouters sitting in your hypervisors is massively injected into the Cassandra database. What we want to avoid is the configuration and control data getting starved, because Cassandra is kept so busy reading and writing the analytics data. And if you don't really have a use case for analytics, guess what, you will never use it. So this is the best practice. What I like about this model is that if you end up losing the analytics box, nothing happens. Yes, some things in the UI will not work. Who cares? Who uses the UI to actually make configuration changes? We do it sometimes just for demos, to be fancy. But everything is automated; everything goes through the APIs. That's the way we do it. None of the configuration changes are injected from the web UI, and they're not even raw REST API calls made directly to the OpenStack controller or the Contrail controller; everything is Python code that we developed, which calls the APIs in a very nice, automated way. And at the bottom, one of the key changes we made: we noticed that if we lost one of these big boxes, we could rebuild the box, because we are using Chef, and since JJ is around he will be very happy that I said the word Chef. We use Chef for configuration management, so it's very easy for us to reproduce these boxes at any time. What is very hard is to recover all your data instantaneously. So we added a very simple trick to this architecture: an active replication to another box for the one place where you really have state, MySQL. So all the data is there. And you will say: yes, but it's not active-active, so you will need some manual steps. Yes, and we don't care, and you will see why we don't care in a little bit.
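On that "nothing is changed by hand, everything is Python calling the APIs" point: our actual tooling is internal and also drives the Contrail API, so the snippet below is only a generic sketch of the idea using openstacksdk. The cloud name, project name, and quota numbers are invented for illustration.

```python
# Sketch of the "nothing is changed by hand in the UI" workflow, using the
# generic openstacksdk client. The real tooling is internal Python that also
# talks to the Contrail API; cloud, project, and quota values here are made up.
import openstack

conn = openstack.connect(cloud="dev-az1")  # credentials come from clouds.yaml

# Onboard a tenant: project, quotas, and a network it can use.
project = conn.identity.create_project(
    name="billing-team", description="Onboarded via code review, not the UI"
)
conn.set_compute_quotas(project.id, cores=64, instances=32, ram=131072)
conn.network.create_network(name="billing-net", project_id=project.id)
```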
And then it's just a matter of adding a lot of compute nodes. There are interesting projects for providing HA at the compute-node level. From my personal point of view, and you can disagree with me, that's the beauty of this kind of conference, it's still a bit much: projects where you run yet another user-space agent that talks to ZooKeeper, reports failures, and relaunches or restarts agents to auto-heal the service-specific agents. It's cool, but it's a little bit of overkill, right? Maybe the day after tomorrow; I'm not saying never. But don't go crazy, that's my advice. We are here to share stories, and my story is: don't go crazy, start with these simple steps. So that's where we are. Then I decided to go back to the slide I had before. I know everybody wants to go to the party, I want to go to the party too, so I'm not going to spend a lot of time. But we were actually able to fulfill all these requirements. On automation, we started using Jenkins and Chef. We have a very strong CI/CD system; it's one of the key success factors for Workday having a production system. We can repeatably create these clouds. We call it the overcloud, because, and I want to clarify this, we use our own concept of OpenStack on OpenStack. Once we have our own build of OpenStack, we build small clusters, and that small cluster is our dev environment. We start using it, giving tenants, all our internal developers, the ability to build overclouds. With simple scripts they can create the same full architecture I showed before and then actually try changes. For instance, one of the key changes we're going to make to the previous architecture is to move the MySQL database on the controller out of that box. So we can easily try all these changes and test them; it's a disposable, repeatable environment, which is very, very helpful. If somebody is interested in that CI/CD, just search for the OpenStack Summit CI/CD Workday talk and you will find the whole presentation we did about it. So, reviewing all these again: for idempotency, Chef. Scalable: we're able to run, right now, 8,000 cores per cluster. We have all the security pieces, and that's something that actually helped us become a Superuser finalist, because we did a lot of upstream contribution to the Chef cookbooks for SSL support, which was weak in that area; our team did all those contributions. We also fixed a couple of bugs we found in Keystone around scalability. We integrated a bunch of things for monitoring: we have a total of probably over 200 different Nagios checks, somewhere around that number, from the basics like CPU and memory all the way to every single process we run, to check its health. And we have alarm thresholds that trigger for each one of them. So when something happens, the NOC team, who have these huge screens with the Nagios checks, see something go red, a ticket gets opened, and the whole process kicks off. You're all familiar with that. The last part, the bonded interfaces: we didn't do anything special, we just made sure everything worked with them, especially the SDN part, and multi-tenancy.
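To give a feel for those 200-plus Nagios checks: most of them boil down to a small script that probes a process or an endpoint and returns the standard Nagios exit codes. A minimal, hypothetical example follows, with a made-up Keystone URL and thresholds.

```python
#!/usr/bin/env python
# Minimal Nagios-style check: is the Keystone API answering, and how fast?
# Exit codes follow the Nagios convention: 0 OK, 1 WARNING, 2 CRITICAL.
# The endpoint URL and thresholds are placeholders.
import sys
import time
import requests

URL = "https://keystone.example.internal:5000/v3"
WARN_SECONDS, CRIT_SECONDS = 1.0, 3.0

try:
    start = time.time()
    response = requests.get(URL, timeout=CRIT_SECONDS)
    elapsed = time.time() - start
    response.raise_for_status()
except requests.RequestException as exc:
    print(f"CRITICAL: keystone unreachable: {exc}")
    sys.exit(2)

if elapsed > WARN_SECONDS:
    print(f"WARNING: keystone responded in {elapsed:.2f}s")
    sys.exit(1)

print(f"OK: keystone responded in {elapsed:.2f}s")
sys.exit(0)
```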
And then high availability, right? Is that fulfilled? Some people would say no, it's not. But there are a few things I would like to share today about how we actually do fulfill HA. Before jumping to the next slide, I'd like to review the basic concepts of HA; probably everybody is familiar with them. Stateless versus stateful: we have both in OpenStack. I just put up the Nova processes; most of the API processes are stateless, versus the persistent data and the message bus, which are stateful. Another concept I want to quickly review is active-active versus active-passive. Basically, in active-passive you maintain a backup; as you can see in our architecture, we have active-passive for our MySQL and nothing active-active in the other processes at this level. The main difference is that in active-active you have another system replicated automatically, and there is some kind of load balancer distributing the calls, so if one of them is down, all the calls go to the other one. That's the key of active-active: you don't promote anything, you don't touch the system. You build a higher-level layer on top of the APIs to distribute the calls to more than one API endpoint. What everybody does there is, instead of one box, create three boxes, put HAProxy on top with Keepalived, and it distributes across those boxes. Can you do that at different levels? Yes, you can; hold that thought. Then we have control plane and data plane. You need to provide HA at both levels, right? The control plane is the ability to change the configuration of your cloud: creating tenants, changing quotas, obviously creating VMs or whatever virtualization technology you ended up with, changing your network configuration, adding security groups, and so on. The data plane is the communication from your users or end clients all the way to the VMs and back. You want both. In my case, I prefer to have a gap in my control plane rather than in my data plane. I don't want my customer running an application in production and suddenly, what happened, timeouts, it's gone. It's a private cloud, so I have some ability to control the number of VMs and the concurrency of the VMs I deploy. But on the data plane I have zero control: my users' code starts sending and downloading traffic, and I have visibility, but I'm not going to tell them, you'd rather not download that file because I think it's going to be a little big. You can't do that, right? Finally, quorums versus clusters; I'm sorry, quorums and clusters, I've been talking the whole day. A quorum is the ability to have a certain number of entities make a decision; you don't want just one entity making the decision. And a cluster is just the ability to replicate the same architecture at the same level multiple times. So, high availability. Let's go back to that, right? Two key factors: control plane and data plane. Control plane: what did we do to actually satisfy high availability in the control plane? We created the concept of Workday availability zones. Forget about OpenStack availability zones; this is a Workday availability zone.
It's our network infrastructure at the core level, which we built to have HA for all the services running in our infrastructure from the top of the rack to the bottom, or, for the network architects, from the top of the rack to the top, because they draw things upside down; I don't know why, but anyway. A quick description: the availability zone is a fully redundant network all the way from the edge routers, firewalls, load balancers, core network, et cetera, down to the top of the rack. This is why we have the bonded interfaces, and this is why we couldn't change them. And this is why, when somebody comes to you and says, to use this technology you have to have three interfaces, one for your data traffic, one for your management traffic, one for your API traffic, plus your out-of-band traffic, you have to ask: how many racks, how many top-of-rack switches do you need to satisfy all those physical ports? No, we just have the one bond. And out-of-band is out of scope here; we obviously do have out-of-band access to our servers for things like BIOS changes, provisioning, and all that good stuff from the infra team. These are very specific details about our architecture, so instead of giving you the full speech, which you can watch later from the recording and the slides, I would like to show you what it looks like in our data center. This is the availability zone architectural design. It looks a little complex, but it's not really. It also looks a little weird because I intentionally had to block out a few parts that are confidential information I could not share with you, unfortunately. But the most important parts are here. From the edge router down, everything is in pairs, as you can see, and that's the important part. We have the edge firewall, the edge router, the core network; these are the MX routers for the SDN gateway. Obviously all our top-of-rack switches go in pairs, like little friends, all the time. And this is where WPC is running. So how can I leverage this architecture to satisfy my HA without changing my current OpenStack architecture? Well, it's very, very simple: you build two of these, and you build one cluster in each of them. You simply replicate the architecture you have at this level on two sides, and you have the ability to steer your users' API calls to one cluster or the other. Yes, this is killing a fly with a cannon. On the other hand, with the size of our data centers, this is actually one of the best decisions we could have taken. It's a big investment, no doubt about it, but OpenStack is just one of the components we run in our data centers; we have a bunch more stuff that needs the same level of HA, persistence, et cetera. So we wanted to build a very formal, very robust network architecture, and that is what we call the availability zones for our services; WPC just leverages it. So this is the user story, this is our path to production, and this is where we are: one cluster per data center, per availability zone. And because that amazing team at Workday did a wonderful job, we have the ability, through Chef, with simple changes in data bags, to create multiple clusters, so you can quickly go from this to adding more clusters to a single availability zone. Even in the worst-case scenario, where you lose a complete availability zone, you can create another cluster very, very quickly and redirect, through load balancer changes on the public APIs or Keystone, to a different cluster, really simply.
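A toy illustration of that last idea, one cluster per availability zone with clients pointed at whichever one is healthy; in reality this is a load balancer or DNS change rather than client-side code, and the endpoints below are made up.

```python
# Toy version of the failover idea: one OpenStack cluster per availability
# zone, and clients get pointed at whichever one answers. In practice this is
# a load balancer / DNS change, not client-side code; endpoints are made up.
import requests

CLUSTERS = {
    "az1": "https://keystone.az1.example.internal:5000/v3",
    "az2": "https://keystone.az2.example.internal:5000/v3",
}

def pick_healthy_endpoint() -> str:
    for zone, url in CLUSTERS.items():
        try:
            requests.get(url, timeout=2).raise_for_status()
            return url
        except requests.RequestException:
            continue  # that AZ is down, try the next one
    raise RuntimeError("no availability zone is answering")

print(pick_healthy_endpoint())
```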
So that's covered, and I actually believed that would be one of the most complicated things; it wasn't, because we have a very good partnership with our network engineering team, and that's how we got to that point. So that was actually great. Moving on, so we have time for questions: let's talk about the data plane. The data plane is fully, or mostly, controlled by the SDN part. As I said before, we're using OpenContrail. This is the typical OpenContrail architecture; you can get it from the documentation, there is nothing magic here. If you want to learn more about OpenContrail, tomorrow from 6 to 9 the OpenContrail user group is getting together, especially the advisory group, to understand what the gaps in the project are that we can fill. In terms of the architecture, what I can highlight is the vRouter part. The vRouter is the piece that actually keeps the traffic flowing between the VMs in the same cluster, and it can even do that across different clusters as long as you have connectivity to your MX gateways. In our case, and this is a big, big point I forgot to mention: we have what we call a fully routable Layer 3 network. So we don't use, and this is going to be hard to explain, we don't use private networks, and yet everything is a private network. We have one very large subnet that we use as the SDN network, so everything is routable, everything is reachable. You will see very fancy OpenStack demos, especially from the SDN vendors, creating a public network and a private network and a router and a load balancer and a lot of connections in between. It's very cool, and you can do it with this. But guess what, our use cases don't really require all that. Everything is a provider network, which is the concept more commonly known, especially in the networking community here in OpenStack. Everything is a provider network, which means it's routable. And the way we protect the communication between them is with a very strict set of security groups and rules. We end up having, and this is a key point, in some cases around 80 different security group rules per port, per subnet, or per virtual network in our system. Why? Because by default, what we do with OpenContrail is that every time we create a virtual network, we close all the doors: everything is deny-all. And then you start opening the doors your security team has approved. We don't do it by touching the system; we do it through the most wonderful thing I have found in engineering, which is source code. We have a project we call WPC environments, that's just a name, but it's basically the set of policy rules. And the policies are not written in Contrail-specific Python code; no, that would be a terrible mistake. Everything is in YAML files, which means networking people can write new security policies without knowing anything about this architecture. That YAML file can also be sent to another team in infra or networking and be translated into iptables rules or into configuration for other parts of our availability zone networking; they don't need to know that any of it is tied to Contrail at that point.
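To show the deny-all-then-allowlist pattern in code: our real policies feed OpenContrail through that internal WPC-environments tooling, so the sketch below is only an approximation using plain Neutron security groups via openstacksdk, and the YAML layout, group name, and CIDRs are invented.

```python
# Sketch of the deny-all-then-allowlist pattern driven from a YAML policy.
# The real policies feed OpenContrail through internal tooling; this uses
# plain Neutron security groups via openstacksdk, with an invented YAML
# layout, group name, and CIDRs.
import openstack
import yaml

POLICY_YAML = """
name: billing-web
ingress:
  - {protocol: tcp, port: 443, cidr: 10.0.0.0/8}
  - {protocol: tcp, port: 22,  cidr: 10.20.0.0/16}
"""

policy = yaml.safe_load(POLICY_YAML)
conn = openstack.connect(cloud="dev-az1")

# A fresh security group allows nothing inbound, so "deny all" is the default;
# we only add the rules the security team approved.
group = conn.network.create_security_group(
    name=policy["name"], description="generated from reviewed YAML policy"
)
for rule in policy["ingress"]:
    conn.network.create_security_group_rule(
        security_group_id=group.id,
        direction="ingress",
        ethertype="IPv4",
        protocol=rule["protocol"],
        port_range_min=rule["port"],
        port_range_max=rule["port"],
        remote_ip_prefix=rule["cidr"],
    )
```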
Focusing on the data plane, the most important piece is the vRouter, which handles that communication and has the ability to run in something called headless mode. In the worst-case scenario, where we lose all the controllers, the vRouter will maintain, and this is something you can configure at installation time, the routing configuration for the current state of the network, so it isn't flushed. Yes, you lose your control plane at that point. But you don't have your customers creating tickets because they lost data in their VMs, or they cannot connect to their VMs, or they were SSHed into their VMs and the connection dropped and you have no idea what happened. It's better to have a tenant trying to create a VM while the network service is not available, something like that, where you have information and can react, than to have a customer saying: hey, what happened, I was SSHed in, I cannot connect, I don't know what's going on, I'm retrying the SSH and things are really, really bad. You do lose certain things. For the people who are very familiar with Contrail and want to catch me on an interesting point, you will say: well, Contrail has certain management functions in the control plane that you may lose. The key one is DNS, and I will tell you: yes, you're right, and you're very smart. But I'm disclosing this myself in advance. One of the things we're doing to solve that problem is that, instead of configuring the virtual DNS as the only entry on the VMs for connectivity from inside to the external world, we configure a second DNS entry on the clients pointing to our physical DNS servers, which are highly redundant in our data center. So that should give us the results we're expecting, and we shouldn't be losing the data plane at all. This is not a session about OpenContrail, but we can have an extended session on OpenContrail if you want one. So yes, are we happy? Well, not extremely happy, but we're good, right? Yes, we want to build a more complex architecture for each of our clusters, but now we know exactly what we need to prove. I don't need to drag in somebody else's diagram, somebody else's HA model; we are building our own model, so we just start extending the things we know. We know that we need HAProxy; we know that we need Keepalived. Maybe we can virtualize those parts; maybe I can consume them as a platform service, because there is really nothing OpenStack-specific in those two boxes that you cannot run from another service that is already provided in HA. You could even move the whole thing to an F5 load balancer, to mention just one vendor, or to something else you already have in the data center. Use what you have; do not reinvent the wheel. On the controllers, we have the ability to run these pieces. It's very straightforward: you just need to change your cookbooks, in our case Chef, to run RabbitMQ with HA, mirrored, queues. On Galera, somebody will say: hey, you're wrong, because you need three for quorum. Yeah, you're right. So it's just another box; it's not a big deal.
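For the RabbitMQ piece, "HA queues" essentially means a mirrored-queue policy on the broker. We set it through the Chef cookbooks; the sketch below just shows the equivalent policy being set directly against the standard RabbitMQ management API, with placeholder host and credentials.

```python
# What the classic "HA queues" setting amounts to: a mirrored-queue policy in
# RabbitMQ. We drive this through Chef cookbooks; this sketch sets the same
# policy directly via the RabbitMQ management API. Host and credentials are
# placeholders.
import requests

RABBIT_MGMT = "https://rabbitmq.example.internal:15672"
VHOST = "%2F"  # the default "/" vhost, URL-encoded

policy = {
    "pattern": "^(?!amq\\.).*",        # mirror everything except amq.* queues
    "definition": {"ha-mode": "all"},   # keep a replica on every cluster node
    "apply-to": "queues",
}
response = requests.put(
    f"{RABBIT_MGMT}/api/policies/{VHOST}/ha-all",
    json=policy,
    auth=("openstack", "secret"),
)
response.raise_for_status()
```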
The configuration for Cassandra is exactly the same. So you can keep extending your HA architecture, and you may end up with the same scary picture I showed at the beginning. But now it's your scary picture: you know what is behind it, you control it, you can make changes. And that's where I really want to bring you. So, key takeaways. This is my personal advice, because I really like what we have done at Workday: do it by yourself. I know you love the distros, I know you love all these companies out there, but try it by yourself. It may look painful and long, et cetera, but that's because most operators and users are trying to deploy very complex HA architectures. So don't go crazy; take baby steps and you will get there. Focus on your use cases. Don't try to solve everything. Maybe Layer 3 is good enough; maybe you don't need private networks; maybe you don't need Load Balancer as a Service, Firewall as a Service, service chaining, and God knows what is going to come in the next release. Focus on your use cases and satisfy those. And since use cases change a lot, make your architecture flexible enough to adapt and evolve as your use cases evolve. Obviously, be brave and use the latest releases. It's going to take some time, and remember there is a release every six months. You say, I'm on this stable version, I don't know if I should move or not, I'm going to start with Liberty. So you start with Liberty and start running POCs, and then Newton is out the door and Ocata is being planned, and you end up, a year and a half later, with people asking: are you still running Liberty? That's just crazy. Build your own; sorry, what I mean is there are many ways to provide HA, and that was the goal of the session, just to give you an alternative way to provide HA for your control plane and your data plane. Build your own CI/CD; that's critical, that's basic. As I said before, we have a very long presentation on that part if you're interested in how we built it at Workday. Try not to get stuck in your architecture; make an architecture that can be flexible. One of the most interesting conversations I have with the smart team I have is that they keep breaking my architecture apart, and I go: why do you need to change things? But that's healthy, and that's the way it has to be done. So be flexible, understand that the architecture needs to evolve as the project evolves, be ready for change, and obviously have fun. So thank you so much. I'm ready for any questions you have; please use any of the mics, because I can't see anything with these lights. So, a question from the floor: when we go back to your slide about what you have for HA, it's actually still not HA, because you don't have any fencing. I don't have what? Fencing. If something goes wrong on a node, the node stays alive, while with HA you actually kill that node so that you're sure it's not disrupting anything. A good example: if Keepalived is not working, and HAProxy is not working on a node, but the virtual IP is still up, then things go wrong; you need fencing to fix that. So for me it's still not HA; it's on the right way, I would say, but it's not HA. So I would say that your requirement about HA is not met. So that was not a question, that was an observation.
So I can reply to your observation in the following way: if this is not HA for you, then obviously the other architecture is not HA either, so I disagree with you. I disagree because the goal of this conversation is to provide HA at different levels, right? And I could agree with you on the technical part; this is not about arguing who's right and who's not, and you're certainly right. My point here is to take baby steps. You are jumping ahead to a more complex architecture. We may end up there in six months, but I don't want to start driving the group of engineers I have crazy with an architecture they're not going to be able to fully grasp yet. I agree with you; I just want to correct the fact that this is not HA, because people might think it is HA, and it's not yet HA. The second point I had was about the big diagram of how it looks with the full HA architecture, and how it's difficult to deploy and so on. That is true, but it's true if you do it by yourself; if you use a tool that does it for you, then it's not difficult. Sure. So I think it's a matter of whether you have the requirement to do it by yourself, which was actually not listed, but I think you had it. Sure, thanks. But if you don't need to do it by yourself, then it's actually not difficult. Sure. Hi, I had a question about your Neutron implementation. What HA solution did you choose for Neutron? I know you have Contrail; I'm not familiar with it. Sure. So Contrail replaces Neutron 100%. The only thing we deploy from Neutron is the neutron-server for the API; we don't even use the Neutron schema. Contrail replaces the whole thing. All of the control plane is managed by the components I showed in the Contrail architecture, and the data plane is handled by the vRouters. We have the same problem: if you lose the vRouter, you lose connectivity to that VM. But it's exactly the same problem you have with Neutron when the OVS or Linux bridge agent goes down. So is it basically a drop-in replacement bridge? Do you set up your bonds, then build your bridges, and just let it be a drop-in replacement bridge on that interface? It's very similar, but not exactly that way. The neutron-server uses the OpenContrail plugin, which is basically a proxy for the API calls, and then there is a whole process happening on the compute machine. There is a vhost interface that is automatically created by the Contrail agent, and that's where the Contrail vRouter makes all its changes. Nova will create the VM, obviously, and it will create the tap interfaces, and the vRouter makes the connection between the tap interfaces and the VIF interfaces inside the Contrail vRouter. That's the way it works, at a very high level. So there are no Neutron components at all except the neutron-server, the API. The last question I had is about how you do monitoring and logging. Do you use Ceilometer or anything like that, or do you aggregate all your logs and look at them in one central place? Because I know you had the box for Contrail logging, or analytics. Yes, that's a good question. For logging, we send all the logs to syslog, and we actually have an internal project, we call it Sol, just an internal name, that filters all the logs and sends them to something similar to an ELK stack, to aggregate the logs of all these components.
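The log side of that answer is, at its core, each service handing its logs to a central collector before they land in the ELK-style stack. The internal "Sol" project does the filtering; the snippet below is only the generic stdlib handoff to a syslog endpoint, with a made-up hostname.

```python
# Generic illustration of shipping service logs to a central collector. The
# real pipeline is an internal project that filters logs and feeds an
# ELK-style stack; this is only the stdlib syslog handoff, with a made-up host.
import logging
from logging.handlers import SysLogHandler

logger = logging.getLogger("nova.compute")
logger.setLevel(logging.INFO)

handler = SysLogHandler(address=("logs.example.internal", 514))  # UDP syslog
handler.setFormatter(logging.Formatter("%(name)s %(levelname)s %(message)s"))
logger.addHandler(handler)

logger.info("instance 1234 started on hypervisor az1-compute-07")
```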
On the monitoring part, there are Nagios checks running on each one of the servers. Obviously, for the controller boxes there are many checks; for the compute nodes, far fewer. They all report to the Nagios server. And at Workday we work with a company, I don't know if it's a startup anymore, called Wavefront, and they produce a very cool dashboard. We have a kind of plug-in connected to the Nagios server, so we pull the information out of Nagios, and we have a fancy dashboard that shows you everything you want. For instance, we recently had a requirement: I can see the CPU usage, I can see the memory, but I have a PM, who is actually in the room, asking me: I don't know anything about capacity. I don't know how many VMs you're running, how many cores are being used, how many are left, how I can allocate more if needed. So we also use that mechanism to collect information out of the cluster. Sometimes we don't just collect the raw information from the hypervisor; we actually pull information from the Nagios checks and make API calls to the system for capacity or even for performance, for example how long an API call to the system takes. Thank you. Next question: so you're deploying multiple independent OpenStack installations into different data centers and then spreading the workload among them. Do you use anything to centrally target these installations, or do you just individually talk to the different ones? That's a good question, and yes, that's exactly what we do. As I said before, this is all Chef-driven. And we don't have a single Chef server; well, let me correct that: yes, we do have Chef servers for the data centers, let's put it that way, because things are changing, but that's just for scalability. All the artifacts on the Chef servers come from the same place. And we are so brave that we actually do patches every Friday night. What a patch practically means is an update of the artifacts on the Chef server, already very well tested, and they get spread across all the data centers. The Chef clients on all these clusters run every 15 minutes, so the moment you push new artifacts to the Chef server, the Chef clients pull all the new changes the next time they check in. If you have an upgrade, for instance: we had a very good experience running Ceph upgrades this way, where you change the packages, or add the new packages and bump the repos, and then you change your code to say you're not going to use the Hammer version anymore, you're going to use the Infernalis version, and it just happens automatically. It was very cool. We haven't tried that for OpenStack, because, indeed, if somebody asks me about upgrades: I'm not planning to do upgrades. I'm planning to kill one of the clusters I showed before, build the other one with a new version of OpenContrail and OpenStack, and then move all the load onto it.
Once I'm very happy with the new one, I will move all the VMs off the old cluster and onto the new one, using Ceph for the persistent ones, and once all the VMs have been evacuated from that cluster, I will kill it and reinstall it with the new version of OpenStack, leveraging the same approach. Another thing you may want to know is how we keep these clusters equal, not just from the configuration management point of view but also in the onboarding process. We use Python code there too, the same way we create the policies through Python code. All the tenant creation, quota configuration, flavors, networks, et cetera, everything goes through Gerrit. Basically the workflow is: somebody files a ticket in ServiceNow, we push a piece of code to Gerrit, it gets approved by security, and it gets deployed to the clusters everywhere. Maybe one more question, because of time. I have two questions. The first one is related to this HA architecture: if you're forcing users to put their application in two different clouds, there will be sacrifices, for example performance issues related to networking, because traffic has to travel from one cloud to another. And the second one is related to orchestration: you cannot have a single stack that is deployed across multiple clouds. For the first part: if we were letting the customers, our users, deploy VMs that belong to the same multi-tier application across clusters, you would be right, but we don't allow that. We have something in the middle, between the users and the OpenStack APIs, to avoid exactly that. It's an internal project we call Gourmet, but it's basically something like Cloud Foundry: a very easy, application-aware platform as a service that distributes the VMs so that there is VM affinity within the same host and the same cluster, to avoid that problem. For the second question, could you repeat it? I missed part of it. OK, the second question is related to the MX gateway: does the architecture work without the MX gateway portion? Because that is a very expensive piece of hardware. Good, good point. Theoretically it does, because OpenContrail provides a software gateway, as other companies do. But with the amount of traffic we would be sending, I don't think you would be happy with the performance of pushing all the traffic through a software gateway. For this kind of architecture, it would not make sense to do it in software. It's expensive, but we need it to keep the performance of the data center. OK, thank you. Thank you. Thank you, everybody, for attending. Have a good summit.