4,100. When you walked into this room, I think you may have wondered why I am showing this number, and you may have some guesses. Why? For us at OVHcloud, this number describes the scale of the infrastructure we are running today: an infrastructure spread across more than 30 regions, in 10 locations, in 8 countries. And why am I showing it? I want you to understand at what scale we run our OpenStack cloud. This scale gives us a unique view of the challenges and problems we face every day, and especially of the challenges and problems we have faced over the last three years. The last three years were quite a journey, up to the point where we integrated L3 services, that is, distributed virtual routers, floating IPs and external gateways, into one of the largest OpenStack clouds. My name is David, and together with my colleagues Mohan and Michal, and with our combined 25+ years of experience in cloud computing, we would like to share all of this with you. I hope you are ready. To begin, let's go back a few years, to the point when at OVHcloud we had no OpenStack at all. At that time our main service was renting out physical servers, and for those servers we offered our customers two types of networking. The first type was the public network. That network has one special feature which was later hard to reproduce in OpenStack: the failover IP. When a customer orders such an IP and assigns it in the manager to one of their servers, then when they SSH into that server they also see this IP on one of its interfaces. Besides public networking we also introduced private networking. For private networking we developed a technology called vRack, short for "virtual rack". This technology provides networking between two physical servers, no matter where they are.
It can be the same rack, it can be two racks, it can be two datacenters far away from each other. From the servers' point of view, inside a vRack all of them look as if they were plugged into the same rack. And I would really like you to keep these two servers in mind, because when we started thinking that we had to introduce virtual machines, an OpenStack cloud, there was one requirement: we wanted our customers to have the same experience with virtual machines as they have with physical servers. So the public networking had to present the IP directly on the virtual machine, and for private networking we had to have something like a virtual rack, so that it appears that all the VMs are plugged into the same rack. And this is what we built. For the public network, which we call the BGP network, whenever our customers start an instance, we also create a namespace on the host, one namespace per instance. In that namespace we have one route for the public IP and another route towards the instance, and all of these IPs are then announced from the host. I hope you are not bored with diagrams yet, because here is another one, this time for private networking. This is how we plug instances into the vRack network. The vRack network, if you look at it on a very basic conceptual level, is just VLAN in VLAN. So we took this VLAN networking concept and added our own translation on top of it. From the instance it works like this: on the integration bridge we have some kind of local VLAN, but then, before the packet leaves the node, we have to wrap it in the customer VLAN and then in the vRack VLAN. It goes up to the top-of-rack switch, and from there, in the OVHcloud network, everything is connected by VXLANs, which are transparent to the customer.
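The tag push/pop just described can be modelled as a small sketch. This is purely illustrative: the dataclass, function names and VLAN IDs are invented for the example and are not OVH's real implementation or values.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Frame:
    payload: str
    tags: List[int] = field(default_factory=list)  # inner tag first, outer tag last

def leave_node(frame: Frame, local_vlan: int, customer_vlan: int, vrack_vlan: int) -> Frame:
    """Before the frame leaves the compute node: translate the local VLAN
    to the customer VLAN, then wrap everything in the outer vRack VLAN."""
    assert frame.tags == [local_vlan], "frame must carry the node-local VLAN"
    frame.tags = [customer_vlan, vrack_vlan]
    return frame

def enter_node(frame: Frame, vrack_vlan: int, customer_vlan: int, local_vlan: int) -> Frame:
    """On the destination node: strip the vRack VLAN, then translate the
    customer VLAN back to that node's own local VLAN."""
    assert frame.tags == [customer_vlan, vrack_vlan], "unexpected tag stack"
    frame.tags = [local_vlan]
    return frame

# A frame travelling from VM A on one node to a VM on another node:
f = Frame("packet from VM A", tags=[5])
f = leave_node(f, local_vlan=5, customer_vlan=100, vrack_vlan=2000)
f = enter_node(f, vrack_vlan=2000, customer_vlan=100, local_vlan=7)
print(f.tags)  # [7]
```

Note that the local VLAN IDs on the two nodes (5 and 7 here) are independent; only the customer VLAN and the vRack VLAN have to match on the wire.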
Then, when the packet reaches the other node, we have to unwrap the vRack VLAN; the customer VLAN is translated to the local VLAN, and the packet goes back to the instance. We have to wrap the packet with the customer VLAN inside even when it is not really needed, because the vRack network connects not only instances with instances, but also instances with physical servers. And on physical servers this customer VLAN is visible on the interface, so we have to tag all packets with this VLAN as well. This is what we had in production when we decided to build our OpenStack cloud. And now I would like to ask Michal to show you what we had to change, and what solutions we came up with along the way. Thank you, David. I am Michal. Let's go back to 2019, when our infrastructure was running on OpenStack Newton. Newton was at its end of life, so we had to upgrade, and we decided to go with Stein. The solution that David described was implemented as multiple separate agents, with code tightly coupled to the Newton code base. That made it really painful for us to upgrade to Stein. And the last concern we had was that we wanted to introduce L3 services for our customers. So, how did we approach it? The first step was to merge those separate agents into a single agent composed of several Neutron plug-ins. The first one is the BGP agent, which is entirely our own solution; for the IP announcements it uses Quagga, a routing software. The second one is the vRack agent. This is the plug-in that takes care of L2 connectivity and implements the VLAN-in-VLAN solution that David described. And of course, we had to add the upstream L3 agent to provide the L3 services for our customers. The next step was then to upgrade the whole stack to Stein and bring the agents up to date, and this unblocked the introduction of L3 services.
And now we are in a much better position, because we built our agent from standard Neutron plug-ins, so it will be really, really easy to upgrade to the next releases. To introduce L3 itself, the overall design did not change, but we had to add some features to the BGP agent so that it can announce external gateways and floating IPs. Here is the diagram. So how does this look in a deployment? Of course, we have to run the Neutron control plane. Typically, as I said, we have two or three Neutron servers, that is, the Neutron RPC workers and the Neutron API. The exact numbers depend on the region; in our biggest region, with around 2,000 nodes, we have 7 Neutron servers. What did we do to introduce L3? The first thing was to teach the BGP agents and vRack agents about it. The BGP agent can now handle L3 ports, so it announces them: the floating IPs and the external gateway ports. And the vRack agent also plays a small part when we want to use floating IPs, because it configures the L2 connectivity for us; so, some tuning of that agent. Another thing is a problem that the default behavior of the Linux kernel was causing for us. When we add a new IP to some interface, the kernel by default creates a new connected route for its subnet. If some customer chooses the same IP range that we use for our internal connectivity to the top-of-rack switch, our networking can be broken. This is not an issue in upstream OpenStack, because there is no vRack in OpenStack, but it is at OVHcloud. So we needed to tune the L3 agent to remove that route.
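The kernel-route clash can be illustrated with the standard library `ipaddress` module. The addresses here are made up for the example; they are not OVH's real internal ranges.

```python
import ipaddress

# A customer attaches 10.0.0.5/24 to an interface; the kernel then installs
# a connected route for the whole 10.0.0.0/24 subnet on that interface.
customer_ip = ipaddress.ip_interface("10.0.0.5/24")
connected_route = customer_ip.network

# Suppose the node reaches its top-of-rack switch via an address that
# happens to fall into the same range:
tor_address = ipaddress.ip_address("10.0.0.254")

# The new connected route now also matches the ToR address, so packets to
# the switch would leave through the customer-facing interface instead of
# the internal one. This is the route the tuned L3 agent removes.
print(tor_address in connected_route)  # True
```

The same check, run against the ranges actually used for ToR connectivity, is essentially what decides whether a customer-chosen subnet would break the node's own networking.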
And the last thing, of course, was to plug the L3 namespaces, so the floating IP namespaces, the SNAT namespaces and the qrouter namespaces, into our stack. So there was a lot of work related to creating the proper routes with the L3 agent. This is how our stack looks if we are talking about floating IPs. We have an instance, and from that instance we want to go outside, with traffic towards the external world. From the instance, through the integration bridge, we need to go to the qrouter namespace, because this is private networking: we are plugged into some private network, so into a router. And from there we need to go to the floating IP namespace. This is the part that the L3 agent configures for us. Then we have the small part where the vRack agent configures routes for us and untags the VLAN. And the routes already prepared by the BGP agent, together with the announcement, allow us to go outside of our rack, to the internet. A few years earlier we had actually implemented something like floating IPs ourselves. It is our own solution, implemented as a Neutron service: IP failover is the marketing name, IP alias the internal technical name. The idea was also to allow our customers to attach an IP to private servers and bare metal instances, so generally also for the private cloud. Now we are going with OpenStack floating IPs, and we decided to keep both solutions, so we support both of them. The biggest difference is that with IP aliases the customer needs to add the IP manually on the interface, while with floating IPs everything is done by the proper routing that we configure; it is transparent for our customers, who just need to attach the IP in the panel. So what happens if there is no floating IP attached, but we still want connectivity with the external world? We can configure an external gateway. How does it work at OVHcloud? We have an instance, and the traffic wants to go outside, to the internet. As you can see, through our vRack network we go to a different node: our network node, the SNAT node.
So by vRack networking we go to the integration bridge and then to the SNAT namespace, where we have prepared the routes needed to allow this traffic to go out. Then there is the vRack part, where we do the untagging, and the BGP announcement done by our BGP agent, so we can go out with our traffic. At this point everything looks fine from the architectural point of view. Everything should work, but then a new challenge appeared: new customer needs. We wanted to introduce Octavia load balancing, and we had two problems here. The first problem is that the VIP port of an Octavia load balancer is down and unbound all the time; this is how it is designed in OpenStack. So we needed to teach our BGP agent how to handle this kind of port, because previously it did not care at all about ports in the down state. The second problem is that the VIP of the Octavia load balancer is present on an interface in the SNAT namespace where the router is active, so somewhere on a SNAT node. We needed to introduce a solution that allows this traffic to go outside. This is also done at the BGP agent level, by giving customers the possibility to attach floating IPs to these VIP load-balancer ports. So now, from the architectural point of view, everything is done. But of course we introduced some bugs and problems. Mohan, can you explain what happened next? Thank you, Michal, for the wonderful details. At the scale we are operating, introducing a change into our infrastructure is not an easy job. With the help of our dedicated SRE and dev teams we were able to face it. In the interest of time, I would like to share three recent challenges that we have faced in our infrastructure. These are the challenges I am going to talk about. The first challenge is duplicate packets with a multi-region router. Before jumping to the challenge, I would like to highlight a few details about our infrastructure. We have a vRack network; it is basically a private network.
And we are trying to reach a router that sits in a different region from a VM in the local region, which means there is no local router to answer the request. So how does it look? Here you can see that in region one we have a router and a VM. The VM is trying to reach the router's gateway; everything is local and we see a single request, so everything looks fine. But then we see a weird issue: when we try to reach a router in a different region, we see multiple duplicate packets. That means there is something wrong with the infrastructure and we need to fix it. When we analyzed the issue, we figured out that the private vRack network basically tries to distribute the request to all the routers that sit in the different regions. Here you can see that in region B we have node one, with a VM on it and no local router. If this VM tries to talk to the router that sits on node two, the vRack network, rather than sending the request directly to node two, actually distributes it to all the routers. Hence all the routers reply, and we get the duplicate packets. We are still trying to solve this issue, because it is really a multi-region issue, and we are trying to work out the best way to address it. The second challenge we saw is that we were not able to reach the metadata service. How does the metadata service normally work upstream? We have an instance, and when that instance is hosted on a network that is connected to a router, then the router is the one serving the metadata service requests. In our case we have the L3 services, and the L3 service takes some time to set up all the routes and gateways and to start the metadata proxy. So when the instance is booting up, the cloud-init script tries to reach the metadata service through the router's namespace, and at that time it fails, because the router setup is still taking time.
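The workaround described next relies on a standard upstream option of the Neutron DHCP agent. As a sketch, the relevant fragment of `dhcp_agent.ini` looks like this:

```ini
[DEFAULT]
# Serve metadata from the DHCP namespace even on router-connected
# networks: the agent pushes a static route for 169.254.169.254
# towards its own port, so instances no longer depend on the router
# being fully set up.
force_metadata = True
```

With this in place, metadata availability is decoupled from how long the L3 agent takes to bring the router up.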
The metadata agent is supposed to set up a metadata proxy, and it is waiting for the agents to complete all these actions. After some time, cloud-init gives up because it is not able to find the metadata service. We solved it using the workaround that upstream provides: rather than depending on the router, we take DHCP support for the metadata service. By setting the option force_metadata to true, we tell the DHCP agent to push a static route for the metadata IP to the VM. So when the VM sends the metadata request, it goes to the DHCP port rather than to the router's gateway. The third and final challenge I want to describe is the flapping router. We have a BGP agent and an L2 agent in place. The BGP agent depends on the L2 agent sending the "port binding successful" message in order to complete its further actions. This port binding is actually a complex process; it has to go through a lot of actions, with a lot of queues and listeners involved, to complete the binding successfully. After the "binding successful" message arrives, the BGP agent sends the route advertisement to Quagga. In the meanwhile, the L3 agent is setting up the router, and a health-check script verifies that everything is healthy. At this point everything looks fine and it works, because we have the route advertisement in place, Quagga does its part, and the health-check script reaches the internet to confirm that everything is healthy. The issue is that sometimes the route advertisement takes more time than expected, and then the ping fails. When the health-check script fails, the router goes to the failed state. In the failed state it cleans up all the routes and then goes to backup mode. In backup mode the router does not send any VRRP packets. So the same story continues: the router on the other node tries to become active.
It also has to go through all the actions happening here, and if the route advertisement is not done on time on that node either, it again goes to backup mode: it fails, first reaching the failed state and cleaning up all the routes, then going to backup mode. Then the VRRP packets are again not in place, so the first node tries to go back to active mode. Basically, this loop continues, which meant we needed to improve the health-check script. So what did we do? We basically improved the health-check script so that it waits for the BGP agent to complete all its actions. Only when the BGP agent has completed all the route advertisements and route settings do we run the checks against the external side. We know this may not fit all use cases, because the condition the health-check script verifies depends on what exactly the SRE is trying to solve; sometimes we need to plug in external check points, depending on what exactly "failure" should mean to the health-check script. So we did it with a templating format, where the SRE can define what output they expect from the health-check script in order to declare the failure state. The long-term solution we are looking at is syncing up the port status: at any given point in time the L2 agent and the L3 agent should know all the port statuses, so that they can make the internal decisions that depend on each other. By doing so, however, we also introduced a bug. How does it look? In the typical behavior we have one active router and one backup router on two different SNAT nodes, and the SG interface, the SNAT gateway interface, is bound to the active one. Everything looks fine. Then, when some package upgrade or some installation happens, we see some disturbance in the traffic.
It means the VXLAN traffic is disturbed, and because of that the VRRP packets that are supposed to go from active to backup either do not arrive in time or get dropped somewhere in the middle. Because of this, the backup has to make the decision to become active, and in doing so it sends a request to bind the SNAT gateway interface. That request is successful: it binds the SG interface and the router becomes active. And here again we have a weird issue, because having two active routers at a given point in time is not the intended or expected behavior. We only see it for a certain amount of time, and it recovers once the traffic is restored. So the traffic is restored, and the new active router sees that there is already an active one sending VRRP packets, so it knows it is supposed to go to backup. It is able to make that transition, but the SG interface stays bound to the backup node, because when the transition happens from active to backup there is no such request as "unbind SG interface"; no such event exists. Because of this, the SNAT gateway interface remains bound to the backup router, and we need some manual intervention, or some kind of external action, to actually unbind it from the backup node. With that, I have shared a few recent challenges, and I now ask David back to the stage to share our future roadmap. Thank you, Mohan. During this presentation you learned a little bit about the historical reasons why and how we designed our network, what we then changed to introduce L3 into all of this, and a few challenges, just a few of the dozens we have seen during the alpha and beta stages. But you may be wondering what we are going to do next. First of all, the stack that we just described is now in beta, and we would like to go to general availability.
The second thing we would like to introduce next is load balancing as a service with Octavia, which we mentioned before during the presentation. As a more long-term goal, we are also thinking about changing how our physical network is built, so that we will have better and more reliable networking. To finish things off: if you are already using OVHcloud and you are our customer, I invite you to choose the region Grand 9; in that region we have already enabled, in beta, everything we just spoke about. And if you are not our client, or maybe you just have not used us for a while, we have a promo code, I believe for 50 euro, that you can use during this month to check things out and see whether you like OVHcloud or not. That being said, thank you very much for coming, I wish you a pleasant rest of the summit, and thank you.

I think we have two minutes, so if you have any questions, there is a microphone.

I have a question; I could not find the microphone, but maybe you will hear me. If I get it right, does the vRack that realizes your private network run over your own infrastructure, or does it go through the internet?

No. The question was how the vRack works between data centers, whether it uses an internet connection or an internal network. The answer is that it relies on our internal network: we have tunnels between data centers, so everything goes via those, and it also goes via our own infrastructure in terms of networking and in terms of the cables themselves.

Hey, thank you very much for the talk. If I may ask, what are your thoughts on Octavia? What kind of implementation are you planning to use?
I am not on the team working on Octavia right now, so I do not have that many details. We are just testing it, seeing how it works, and we are basically planning to bring it as a value-add and see what our customers are going to do with it.

Nice talk. I was going to ask: if you are advertising the floating IPs using the gateway port agent as a next-hop, why not use the neutron-dynamic-routing project in the first place?

I am not sure I fully understood the question.

So, are you advertising routes to the floating IP using the gateway port agent IP as the next-hop?

No, we are advertising directly to the node.

A directly connected route?

Yes, we are advertising each IP that lives on a given host, announcing that it is on this host, and then the internal routing goes to the namespace or instance.

Okay, but still: did you evaluate neutron-dynamic-routing, and did it not work for some reason?

I believe we did, and I do not remember the details right now; we have seen some issues with it. I can check, and we can get in touch and discuss this.

Thank you. He had the same question too, so we are going to follow up later. Thank you. Thank you very much.
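As a sketch of the per-host announcement described in this answer, the bgpd configuration on a compute node would carry one /32 host route per IP living on that node, roughly like this (the AS numbers, neighbor address and prefixes are invented for illustration):

```
router bgp 65001
 neighbor 192.0.2.1 remote-as 65000
 ! one host route per floating IP / failover IP living on this node
 network 203.0.113.10/32
 network 203.0.113.11/32
```

The fabric then routes each public IP to the node that announced it, and the node's own routing tables take the packet the rest of the way into the namespace or instance.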