Okay, so let's start. Welcome everybody to this new session, where we are going to talk about Kubernetes at the edge. My name is Adrien Lebre. I am a professor at IMT Atlantique, a French engineering school, and I am the leader of the STACK team. And I am Karim Manaouil, a PhD student within the STACK team.

So what we are doing at the STACK research group: mainly, we are working on the next generation of digital infrastructures, namely cloud computing, fog, edge, and beyond. To give you an idea of the infrastructures we are working on, you can have a look at this figure. At the top, you can find the large data centers that host computing resources. Then, spread across the Internet, you can find medium, micro, and nano data centers that host computational and storage resources. And you can also extend these micro and nano data centers to the extreme edge, for example by deploying such a small data center in a public transport system such as a train or an aircraft. Basically, we try to solve two questions: how can we operate such geo-distributed infrastructures, and how can we use them?

So why are we here? During the last couple of years, we were deeply involved in the OpenStack community, where we tried to answer those questions within the OpenStack framework. We did two kinds of studies. The first one is related to OpenStack WAN-wide: the idea is that you deploy all control-plane services of OpenStack in a cloud data center, and then on each edge site you deploy a remote compute node, basically the Nova and Neutron agents. To evaluate how OpenStack behaves in such a scenario, we developed an open-source tool called EnOS that allowed us to conduct an experimental campaign on OpenStack under such conditions. We evaluated the scalability of OpenStack, the performance of OpenStack, some alternatives in terms of communication bus, and so on. If you are interested in that, I invite you to have a look at the FEMDC SIG wiki page on the OpenStack website.

At the end of this first series of studies, we identified some troubles that appear when you face network disconnections. You may lose some remote compute nodes, but in the most critical case, when you face a disconnection between the control plane and the remote compute nodes, you can lose the whole infrastructure. What this means is that if you really want to deploy an edge infrastructure, autonomy matters: each edge site should be able to satisfy local requests, whatever happens at the network level.

To this end, we investigated a second axis, where we deploy multiple instances of OpenStack, so that each edge site is fully independent, and we extended OpenStack to allow the different instances to collaborate. I'm not going to dive into the details; once again, if you're interested, you can have a look at the presentation referenced at the bottom of the slide. The main idea is to explicitly define which service of which site you are going to use to satisfy one request. For example, if you want to start a VM in Berlin using the Glance from Denver, you just have to specify that inside the request. The same if you want to get the list of VMs that run in Berlin and Denver: you can also use a combinator.

So, what we present today: since about six months ago, we have been running similar studies, but on Kubernetes. Our idea is to see, in a similar way as we did for OpenStack, how Kubernetes behaves in such a context.
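To make the idea of scoped requests a bit more concrete, here is a minimal toy sketch in Go. It is illustrative only: the `scope` type, the `startVM` helper, and the site names are hypothetical stand-ins, not the actual OpenStack API or CLI syntax.

```go
package main

import "fmt"

// scope explicitly maps each service to the site that should serve it,
// so a single request can combine services from different sites.
type scope map[string]string // service -> site

// startVM stands in for submitting a "boot VM" request whose scope
// names the compute site and the image-service site.
func startVM(name string, s scope) {
	fmt.Printf("boot %s: compute at %s, image service at %s\n",
		name, s["compute"], s["image"])
}

func main() {
	// "Start a VM in Berlin using the Glance from Denver."
	startVM("vm1", scope{"compute": "Berlin", "image": "Denver"})
}
```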
So today, we are going to present the preliminary results we observed when running Kubernetes WAN-wide, and then we will also discuss possible alternatives that are available nowadays in the Kubernetes ecosystem.

For the ones that are not familiar with Kubernetes: basically, it is a system for running and coordinating containerized applications, namely the pods. It is a REST-based master/worker architecture, and it has been designed to deploy the control plane and the workers in the same data center. What this means is that all control-plane services and the workers that host the pods are connected through a high-performance network switch: low latency, negligible packet loss, no network disconnections. These are conditions that are not available in the edge scenario.

So the goal of this experimental campaign was to evaluate the impact of WAN links on Kubernetes. To this end, we consider the following scenario: we keep all control-plane services of Kubernetes on one master site, in this example Paris, and we deploy several worker nodes remotely on different edge locations, in this example Madrid, London, and Berlin. The questions we try to answer are: how does this latency impact the creation of pods? Is there any issue related to the consistency of the cluster state? Are there other services that might be affected by the latency? These are the questions we are going to address in this presentation.

To conduct our experiments, we deployed our evaluation using a similar tool we developed for Kubernetes, which is also open source. The scenario we consider is a deployment composed of 100 nodes: one single master and 99 workers. And we increase the latency between the master and the workers from 1 millisecond up to 400 milliseconds. For the benchmark, we use ClusterLoader, which enables us to stress the infrastructure by creating pods, namespaces, and so on.

Regarding the metrics we collect: Kubernetes comes with its own monitoring framework, leveraging Prometheus. By default, you have several metrics, such as the API request duration. But unfortunately, there is no metric that enables us to really capture the impact of the latency. To this end, we had to revise a bit the Go client to capture the request duration for every component. Basically, each time one component inside Kubernetes performs a request, we capture the timestamp; then, when we receive the answer, we just take the difference, and we have the duration of the request.
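As an illustration of this instrumentation idea, here is a minimal sketch in Go of a wrapper around an HTTP transport that records the duration of every request. This is not the actual patch applied to the Kubernetes Go client, and the endpoint is a hypothetical stand-in for the API server.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// timedTransport wraps an http.RoundTripper and records how long
// every request takes, in the spirit of the instrumentation added
// to the Kubernetes Go client.
type timedTransport struct {
	next http.RoundTripper
}

func (t *timedTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	start := time.Now() // timestamp when the request is issued
	resp, err := t.next.RoundTrip(req)
	// The difference between the two timestamps is the request duration.
	fmt.Printf("%s %s took %v\n", req.Method, req.URL.Path, time.Since(start))
	return resp, err
}

func main() {
	// Hypothetical endpoint standing in for the Kubernetes API server.
	client := &http.Client{Transport: &timedTransport{next: http.DefaultTransport}}
	if _, err := client.Get("https://example.com/healthz"); err != nil {
		fmt.Println("request failed:", err)
	}
}
```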
With this experimental protocol in hand, we can start the evaluation of Kubernetes, beginning with pod startup latency. Let me briefly give an overview of how pod startup works in Kubernetes. We divided it into three phases so that we can have better observability: if there are any problems, we can know where they are originating from. The three phases are create-to-schedule, schedule-to-run, and run-to-watch. Create-to-schedule starts with the user submitting the pod object to the API server. The API server persists the object in etcd, and at that moment the scheduler will pick up the object, decide where it can run, and send it back to the API server to be persisted again. At this moment, we can say that the pod is scheduled, and we can move to the second phase, where the kubelet gets that same object and starts the pod by preparing the environment and running its containers. Then the kubelet will eventually send back a status to the API server, reporting that the pod is now in the Running state, which represents the last phase. So, theoretically, if there is a problem, it should be in the two last phases, because that is where we have communications that are affected by the latency.

So those are the results: we ran the benchmark, and then we gathered the pod startup latency. We can see that the create-to-schedule phase is not affected at all, which is expected, because all the communications happen only between components on the master site. However, for the two other phases, we can clearly see that they are affected by the latency. Run-to-watch is a little bit less affected, because it only encompasses the status update and its delivery: there is one round of communication with the API server. However, the schedule-to-run phase is more affected, because it encapsulates a more complex logic for preparing the pods and then starting them, communicating with the container runtime. It might also involve multiple rounds of communication with the API server, for example to fetch the ConfigMaps or Secrets that are needed by the containers.

So basically, this is what we were expecting. But the real question is: is this delay in processing the requests only caused by the latency? Are there any other major errors, and what is the amount of degradation of the requests? For this reason, we measured the API request latency for the cluster, and we got those two graphs. The upper graph shows the latency of the requests issued by the master components, and it shows that they are not affected by any kind of behavior that may relate to the latency, which is what we expected. For the workers, the requests take at least a round-trip time, but not much more than that: they are not degrading in a horrible way. This is what we were expecting; no errors or request retries were observed, and that is actually a great result for Kubernetes. Basically, we can see that it is doing great on pod startup latency. Now let's move to service discovery and see what is going to happen.

I will start by giving a brief overview of how service discovery works in Kubernetes with DNS. We have pods with their IP addresses, but we want to expose them under a service name. For this reason, we use a Service object in Kubernetes: we give a name to our pods, and Kubernetes will assign the service a virtual IP address, as shown in the example. Once the Service object is created, CoreDNS, the DNS server of the Kubernetes cluster, will observe that, fetch the object, and create the corresponding DNS records, so that it can eventually resolve this name to its virtual IP address. At the same time, the kube-proxies will also fetch that Service object, and they will inject Linux kernel netfilter rules that forward the traffic to the backend pods of the service whenever requests are sent to the virtual IP address. So now, the pods running on the cluster can communicate with those pods using their name, in this case a name ending in svc.cluster.local. Basically, the client will ask CoreDNS to translate the name, it will get back the virtual IP of the service, and then it will start communicating with that virtual address, and the requests will all be forwarded to the backend pods.
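To make the resolution path concrete, here is a minimal Go sketch of what a client pod does, conceptually. The service name is hypothetical, and the snippet assumes it runs inside a pod whose resolver points at CoreDNS.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
)

func main() {
	// Hypothetical service name; inside a pod, /etc/resolv.conf points
	// to the CoreDNS service, so this lookup goes through CoreDNS.
	name := "my-service.default.svc.cluster.local"

	// Step 1: CoreDNS resolves the service name to its virtual IP.
	addrs, err := net.LookupHost(name)
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	fmt.Println("virtual IP:", addrs[0])

	// Step 2: packets sent to the virtual IP are redirected by the
	// kube-proxy rules to one of the backend pods of the service.
	resp, err := http.Get("http://" + name)
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	fmt.Println("status:", resp.Status)
}
```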
Now to our experiment: we deployed an NGINX server and a client on the same site, and we measured the latency of HTTP requests. The hypothesis is that the communication between services located on the same site should not be affected by the WAN latency; this is actually the main reason why we decided to put them on the same site, to reduce the communication latency. But the result was paradoxical: we observed two distinct collections of latencies in the WAN-wide case, where we had 50 milliseconds of latency between the master and the workers. A few requests took 1.5 milliseconds, but the other requests took about 101 milliseconds. It is not supposed to behave like this: the pods are on the same site, so they should not be affected by the latency.

So we wanted to understand the reason why, and there is a set of facts that eventually leads to it. Basically, CoreDNS is deployed as a ReplicaSet on the cluster, and the deployment tolerates the master taint, so CoreDNS can be scheduled on the master. In our case, one replica was running on the master and one replica on an edge site. CoreDNS itself is exposed as a service with a virtual IP address, so the DNS requests issued by the pods get load-balanced to one of the replicas. With this in mind, and since there is no DNS record caching within the pods, we can understand why we observed those two collections of points. A few requests reach the replica on the edge site, but the others get forwarded to the replica on the master, which is 50 milliseconds away: with the round trip, we get 100 milliseconds, plus 1 millisecond for the processing of the request.

With this, we can draw some lessons. From the first experiment, we can see that Kubernetes can manage pods WAN-wide without any critical issue, but only as long as the connectivity to the master can be maintained, because otherwise we might have single-point-of-failure issues with the single master. From the second experiment, we can say that Kubernetes might be okay to deploy in an edge infrastructure, but it has to be done with care, because some unexpected behaviors, such as the DNS one, can show up, and maybe in other services that we haven't measured as well. So the conclusion is that the centralized control plane seems to be a good solution for some use cases. But now the real question is: are there any alternatives, in particular to satisfy the expected autonomy of edge sites? By autonomy, we mean that the edge site is self-contained: the master and the workers are on the same site, so we don't have partition and disconnection problems between the master and the workers.

For the alternatives, we start with KubeFed, which facilitates multi-cluster federation. It is implemented as a centralized server that distributes and propagates objects, and it is exposed as an API extension. Basically, this is how you do a deployment in KubeFed: you create a federated deployment, and KubeFed will create two deployment objects and submit each one to each cluster independently. Maybe the most interesting feature in KubeFed is autonomy. But it still has some problems: there is no communication or cooperation between the clusters for dynamically improving resource management and resource sharing, and there is also the downside of functionalities, such as scheduling, being re-implemented in the federation control plane.
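As a rough illustration of this propagation pattern, here is a toy sketch in Go. It is not the real KubeFed API: the types and the `submit` helper are hypothetical stand-ins for calls to each cluster's own API server.

```go
package main

import "fmt"

// A toy model of the federation pattern: a federated object is split
// into per-cluster copies, and each copy is submitted to its target
// cluster independently.

type deployment struct {
	Name     string
	Replicas int
}

type cluster struct{ Name string }

// submit stands in for a call to the cluster's own API server.
func (c cluster) submit(d deployment) {
	fmt.Printf("cluster %s: applied %s with %d replicas\n", c.Name, d.Name, d.Replicas)
}

func main() {
	federated := deployment{Name: "nginx", Replicas: 3}
	clusters := []cluster{{"paris"}, {"berlin"}}

	// No east-west cooperation: each cluster receives its copy and
	// manages it on its own, which is what gives the clusters autonomy.
	for _, c := range clusters {
		c.submit(federated)
	}
}
```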
KubeEdge is also an interesting solution. Maybe its most interesting feature is device management for IoT, but this is out of our scope: we are interested in geo-distribution and in mitigating its effects. In this regard, KubeEdge equips the nodes with a local SQLite database for caching objects, so we don't have to communicate with the master and fetch objects each time we need them; they can be served from the cache. But now we need to synchronize the local cache with the central etcd. For this reason, we need an asynchronous, bidirectional communication channel, so that changes on etcd can be pushed dynamically to the SQLite database (we sketch this pattern below, after the overview of the alternatives). KubeEdge also supports lightweight communication based on the QUIC protocol. But conceptually, KubeEdge is the same as a centralized Kubernetes: it has the same limitations, namely the single point of failure and unexpected behaviors such as the DNS one.

Submariner is another solution. It might not directly relate to the edge, but it is an interesting step towards peer-to-peer resource sharing between clusters, which at the edge enables better resource management. Basically, the pods and services of each cluster can be directly reached by other clusters through VPN connections. It has a central broker which stores all the information required to set up the inter-cluster connectivity. But it is still limited to networking, and the broker might be a scalability issue, and also a single point of failure. So those are the main interesting initiatives.
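Here is the promised toy sketch of the KubeEdge-style synchronization pattern, in Go. The channel stands in for the cloud-to-edge tunnel and an in-memory map stands in for the SQLite cache, so this is a conceptual model under those assumptions, not KubeEdge code.

```go
package main

import (
	"fmt"
	"sync"
)

// The cloud side pushes object updates over a channel (standing in for
// the websocket/QUIC tunnel), and the edge side mirrors them into a
// local cache (standing in for the SQLite database), so that local
// reads keep working even when the tunnel is down.

type update struct {
	key, value string
}

type edgeCache struct {
	mu    sync.RWMutex
	store map[string]string
}

func (c *edgeCache) apply(u update) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.store[u.key] = u.value
}

func (c *edgeCache) get(key string) (string, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	v, ok := c.store[key]
	return v, ok
}

func main() {
	tunnel := make(chan update)
	cache := &edgeCache{store: map[string]string{}}
	done := make(chan struct{})

	// Edge side: asynchronously apply every pushed change.
	go func() {
		for u := range tunnel {
			cache.apply(u)
		}
		close(done)
	}()

	// Cloud side: push a change, as an etcd watch event would.
	tunnel <- update{key: "pods/nginx", value: "Running"}
	close(tunnel)
	<-done

	// The edge can now answer this read even if disconnected.
	if v, ok := cache.get("pods/nginx"); ok {
		fmt.Println("local read:", v)
	}
}
```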
Now I give the floor back to Adrien. So, what is the takeaway of this presentation today? Basically, we first addressed the question of evaluating Kubernetes at WAN-wide scale. As Karim said, at first sight it runs quite well, but it is really important to keep in mind that you may have side effects, such as the DNS one, and maybe some other services can also be impacted. If we look back a bit at what we did with OpenStack, we actually discovered similar issues there, in particular with Neutron and the DVR feature. So the general idea is that, for some use cases, using Kubernetes as it is, and just configuring the different parameters and the different services in the correct way, can be enough.

There are two other points that we did not address today, but that probably make sense to study. The first one is related to the management of container images: what does it mean to deploy many container images on different edge sites, and how can we take the network characteristics into account? The second point is related to the single-point-of-failure issue of the centralized control plane. Maybe here we can also find some countermeasures, such as using replication strategies, but from our point of view, we are a bit skeptical about such approaches, in particular due to the limitations in terms of scalability and the issues related to network partitions.

Based on those conclusions, we started to investigate a few alternatives, and we highlighted the major ones today. The first one was KubeFed. At first sight, it looks quite okay, but unfortunately there are important limitations: in addition to re-implementing different mechanisms of the control plane in the federation layer, there is no collaboration between the entities. What this means is that if you deploy a workload across two edge sites, the components that are deployed on one edge site are not aware of the components that are deployed on the other edge site. This is an important issue. Regarding KubeEdge, it is probably a promising solution, in particular because, as I am going to illustrate in the next slide, we believe that the right solution will basically be a mix between some Kubernetes WAN-wide deployments and some Kubernetes instances that are completely independent. And in that sense, we introduced Submariner today, which is quite interesting from the collaboration viewpoint, because it is the first project that actually provides some east-west communication between the different control planes in order to share some information.

So, in that direction, what are the next steps for our research group? Basically, we want to investigate more decentralized models, as we did with OpenStackoïd. As I said, we believe it will be a mix: some Kubernetes WAN-wide deployments, for example to manage really lightweight devices where you can run containers but where it does not make sense, or there are not enough resources, to run the full control plane of Kubernetes; and some independent Kubernetes instances that will be in charge of managing the different edge sites. Our main idea is to leverage the OpenStackoïd proposal we presented one year ago at the Denver Summit: to offer the abstractions and the right mechanisms to allow DevOps to deploy workloads across multiple edge sites, but also to create cross-Kubernetes objects such as namespaces, services, and all the fundamental elements of Kubernetes.

So with that, this is the end of our presentation. Thanks for your attention, and if you have questions, please feel free; we will be happy to answer them.