Welcome to building an edge computing platform for network services using cloud-native technologies. With a title that's such a mouthful, I guess we'll just get right to it. So this is the agenda. We'll first look at the problem we set out to solve, then the framework we came up with to address that problem, then the demo, which is really also a walkthrough of the framework, and finally what we plan to do after that.

So who are we? We are a team of OPNFV contributors from Huawei USA, actually Futurewei. What is OPNFV? OPNFV is an open source project specializing in integrating a number of upstream projects to address various NFV use cases. What's NFV? NFV is network function virtualization. It's basically the idea that the majority of network functions currently deployed as vertically integrated network appliances can instead be provided by software running on commodity servers. Each of these network functions, a firewall for example, is called a virtual network function, or VNF. That's a term I'll use throughout, which is why I put it out there.

For the last several months, a few of us on that OPNFV team at Huawei have been focusing on whether we can build VNFs as cloud-native applications. The benefits are obvious; given the venue, I won't insult you by explaining why you'd want to build cloud-native applications. And if you are able to build a cloud-native VNF, we think it's very well suited to edge network use cases. The elasticity of cloud-native applications lets you utilize resources optimally, which is especially good for the edge, because the resource constraints are much tighter there. Edge networks in the telecommunication industry come in all shapes and sizes, from something like tens or even hundreds of servers all the way down to maybe the home gateway your Comcast gives you.

Needless to say, the further you go toward the edge, the higher the maintenance cost. So the resiliency and fault tolerance of cloud-native applications are a great benefit for the edge case. Usually images are pushed to edge nodes by a central management system, and I read somewhere that, depending on the edge node, a pushed patch can see anywhere from 5% to maybe even 30% failure rates. With cloud-native apps, you can create something like a sandbox: after your CI runs in the cloud, you can still run validation on the box itself to see how it actually behaves, and if you don't like the result, the edge node can have the intelligence to roll back. On the other hand, if it works well on one site, you can replicate and repeat it across other sites going forward.

So with all these great benefits to building VNFs as cloud-native applications, why haven't we? Well, the big reason is that there is a significant number of gaps to get there. Some would take a very long time for a lot of people to close: security issues, performance requirements, latency requirements, throughput requirements, compliance requirements. All of those take a lot of effort from a lot of people to get going.
This team in particular is looking at two functional gaps that are actually somewhat obvious. Most common cloud-native applications, which are mostly web service backends, expose an endpoint, let clients send requests to that endpoint, and go through the application flow from there. But there is a very large subset of VNFs that don't expose any endpoint to the clients. We call them transparent proxies: they intercept packets and start processing them. This is commonly referred to as the service insertion problem, and we need a mechanism to redirect traffic, but that's only one part of it. You also need to let these functions listen to traffic, and once they've listened and analyzed for a while, you want to give them a programmable interface so they can say "deny that traffic flow" or "rate-limit that traffic flow." And finally, and most challengingly, you have to allow VNFs that are ported to cloud-native applications to bring their own data path. I'll talk about that later.

The second obvious gap we're looking at is what we call VNFFG. It's an NFV term; FG stands for forwarding graph. It's an orchestration function where the operator specifies a chain of VNFs that traffic will traverse, and the realization of that intent is carried out by the data path. If you convert a VNF into a cloud-native VNF, most likely, I would imagine, I would hope anyway, you don't do a lot of data path stitching; instead you create a service graph by having microservices call each other's APIs. Now, the really big difference here is that instead of the application defining which services it subscribes to, the graph is operator-defined. Those are the two gaps we're going to address, and the framework is meant to address both.

Before we go into the framework, here is the terminology I'll use to describe it. Cloud-native network functions are basically what I've been calling cloud-native VNFs. But someone told me that if you're a networking guy and your acronym goes beyond four letters, you panic, so it gets shortened to three: just CNF. A CNF is defined by the operator and is made up of a list of microservices. So what is an operator? The operator is the one who defines and deploys those CNFs for clients to subscribe to, willingly or unwillingly. In the enterprise case that might be your IT administrators; in the mobile case, maybe AT&T or Verizon; it could also be a cloud provider in some scenarios. Microservices are just microservices, nothing special: something that consumes compute resources and exposes a set of APIs. In a CNF, we look at them as the Lego pieces that construct the CNF. And the clients are simply the people who initiate the traffic that goes through the edge node.

So this is the architecture diagram. The operator goes into the central management system and defines the CNFs. That becomes a REST call that goes down to the edge; the central management system supposedly manages many of these edge nodes. The REST call lands at the edge agent.
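To make that flow concrete, here's a minimal sketch of what an operator-facing CNF definition might look like as it travels from the central management system to an edge agent. The field names, endpoint, and payload shape are hypothetical illustrations, not the actual API:

```python
import requests

# Hypothetical CNF definition as the central management system might
# send it to an edge agent; all names here are illustrative only.
cnf_definition = {
    "name": "edge-security",
    "microservices": ["proxy", "policy-manager", "content-inspect"],
    # Operator-supplied handler code for the events each microservice exposes.
    "event_handlers": {
        "deploy": "handlers/deploy.py",
        "response": "handlers/response.py",
    },
}

# The edge agent receives this and translates it into Kubernetes
# Deployment objects via the API server.
resp = requests.post("http://edge-agent.example:8080/cnfs", json=cnf_definition)
resp.raise_for_status()
```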
The edge agent then requests the Kubernetes API server to create deployments. So let's say in this case you define CNF 1 as being made up of microservices 1, 2, 3, and 4; the agent creates those pods. The diagram also expands what's inside a pod, which I'll describe later, but keep this in mind: each pod contains at least two containers. One is the microservice; the other is an event handler. And to route traffic to MS1, which with all the arrows is where the traffic enters, there's a datapath manager that listens to the Kubernetes API server and programs the datapath. That really good-looking little unicorn icon there is actually eBPF. For those who don't know it, I'm not going to describe eBPF in depth; I'll just tell you later why we chose eBPF as the datapath engine.

So this is a completely event-driven model. Basically, we run an event handler container as a sidecar in every pod alongside the microservice. The microservices expose a set of events, and the operators on the cloud side implement the handlers for those events. They don't have to, because there's a default handler that does nothing. But if they do, the framework gives them a set of SDKs, and the SDK calls are the steps. Internally, those are translated into gRPC calls, and that's how you actually start chaining things together.

On the programmable datapath front, look at number two. A rule looks something like this: an optional IP, port numbers, and an action, redirect or copy, applied to a list of labels. As I said before, we utilize eBPF, more specifically IOVisor BCC, to load the BPF code. There are two major reasons for that. The first is that, as I said, we want to allow people to bring their own datapath. Quite literally: if you know how to write BPF code, not the bytecode, the C code, you can write C code that meets the BPF requirements, and I can load it on your pod's ingress and egress. That allows the VNFs, the microservices, to bring their own datapath, because I obviously cannot design one network datapath that satisfies all VNFs. That's just impossible.

The best way to describe all this is to go through the demo. In this demo, the operator uses the portal to create a CNF that I'll call Edge Security. Even when you just build a prototype, you realize that an edge security system protects the clients more than the applications and servers, so you'll see us doing much more processing on the HTTP response than on the request. You define the CNF by including multiple microservices, in this case a proxy, a policy manager, and a content inspection engine. Each of these microservices has a set of events, and the operators can choose to implement those events and customize them through the SDKs we provide; in this case, our SDKs are all in Python. There's also a datapath manager (DPM) API for you to start routing the initial traffic to the HTTP proxy. Then you deploy it on the edge node, and those pods get spawned on the node.
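To give a flavor of what "bring your own datapath" means with BCC, here's a minimal sketch. I'm assuming an XDP attach point for simplicity (the framework as described hooks pod ingress and egress, which would more likely go through tc); the C program and interface name are illustrative, not the framework's actual datapath:

```python
from bcc import BPF

# Minimal BPF C program; a real per-pod datapath would match the
# DPM rules (IP/port -> redirect/copy) instead of passing everything.
prog = r"""
#include <uapi/linux/bpf.h>
int ingress_handler(struct xdp_md *ctx) {
    // Inspect or redirect packets here; XDP_PASS hands them up the stack.
    return XDP_PASS;
}
"""

b = BPF(text=prog)
fn = b.load_func("ingress_handler", BPF.XDP)
# Attach to the pod's interface; "eth0" is illustrative.
b.attach_xdp("eth0", fn, 0)
```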
Then clients send an HTTP request, and you see it actually go through the call graph the operator defined. At some point during the demo, we're going to add another microservice to the CNF: antivirus. That's actually a very interesting use case for the edge, because if you think about IoT use cases, the end-user devices may not have enough power to run endpoint security, so you may want to run endpoint security directly on the edge node. We add antivirus and chain it in with an antivirus-specific call, extending the CNF, adding a service to the CNF basically, and we'll see that the traffic actually goes through antivirus. Then, in the third step, the operator wants to narrow down what needs to be scanned, because it's actually kind of silly to scan plain HTML files. Instead, they may want to scan only when there's an executable embedded inside. So you add logic to check whether there's an embedded executable, and you only run antivirus when that happens.

So here's the demo. This is video recorded, not running live; no tempting the demo gods here. Initially, you create a CNF, and we have a really nice UI here. One of my teammates actually built this UI, thanks to him. From there, you pick the set of events. The first event you handle is the deploy event. For time reasons, we already populated the code with the defaults, but we'll change it as we go. The important thing is here: you specify the set of microservices that you're going to add into this CNF, and each of them is basically one call. Here: you do proxy.start, policy_manager.start, and content_inspect.start, which creates three pods for this CNF. Then, using a DPM call, you specify the rule for which traffic gets rerouted to this particular CNF: the action is redirect, label to label, and the port number is 28000, which is just a made-up port. And then you have to tell the deployment agent that the deployment is done.

Later on, as I said earlier, we're doing user protection more than server protection, so the request handler does nothing; there is a request event, but it does nothing. The response is the beginning of the chain. After the response is received from the server side, the first thing it goes through is the policy manager for a policy check, and then we chain that to content inspection. Content inspection is the end of the chain, so there we do nothing. And then we deploy it.

This is a crude demo, so the way we show you that it works is by showing you trace logs. Hopefully you trust that I'm not just sending a bunch of printfs to a web page; this is actually something happening underneath. So now you see everything get deployed; those three microservices get deployed. And you can see here there's a deploy agent pod that we dynamically load; when you signal done, it kills itself. It's the one responsible for executing the deploy handler code. Now all the pods are up, so we can generate traffic. We generate traffic using curl, not because we like crude demos, but because curl actually works better than a browser here: curl does one HTTP request and gets one response back.
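Putting the pieces of that deploy flow together, here's a rough sketch of what the operator's handler code might look like. The SDK module and call names (proxy.start, dpm.add_rule, and so on) are reconstructed from the talk for illustration, not the verbatim API:

```python
# Sketch of the operator-written event handlers; module and function
# names are illustrative, reconstructed from the demo description.

def on_deploy(sdk):
    # Spin up the three microservices that make up the Edge Security CNF.
    sdk.proxy.start()
    sdk.policy_manager.start()
    sdk.content_inspect.start()

    # Ask the datapath manager to redirect matching traffic to the proxy.
    sdk.dpm.add_rule(action="redirect", label="proxy", port=28000)

    # Tell the deployment agent that deployment is complete.
    sdk.done()

def on_request(sdk, request):
    # This CNF protects clients, so requests pass through untouched.
    pass

def on_response(sdk, response):
    # The response is the start of the service chain:
    # proxy -> policy manager -> content inspection.
    sdk.policy_manager.policy_check(response)

def on_policy_checked(sdk, response):
    sdk.content_inspect.inspect(response)  # end of the chain, does nothing more
```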
That makes the trace much easier to read than with a browser, which keeps fetching continuously. So in this case, you see us going to google.com, and you see it go through the HTTP proxy. When the response happens, it goes through the policy manager, and then content inspection on line nine. And then the response comes back to the proxy, which sends it back to the client.

So now let's say we want to add antivirus. You go back to the same CNF and modify your deploy handler by adding this SDK call. This is a good simulation of how you write something based on the SDK; you do it on the web UI. You want antivirus to start, and you want to chain it at the end of the current service chain, which is content inspection. So in content inspection's handler, you make an SDK call to antivirus.scan, basically. And voilà. You clear the trace first, and then you deploy it. From there, we actually do a complete update of the deployments: the pods whose images changed get terminated and respun. Now we see the pods coming up, and antivirus is running. So we generate traffic again, sending to google.com again, and now you can see that it goes through antivirus, chained after content inspection and the policy manager, all the way down to sending the response back.

Now let's say you figure you don't want to indiscriminately scan all the files; you only want to scan when there's JavaScript, an executable, embedded inside the HTML. You can change that logic by first removing the antivirus scan call from content inspection and, let's say, adding it into the policy manager, so you can wait for content inspection to tell you what the content actually contains. So in this case, if the JavaScript count is greater than zero, only then do we do an antivirus scan. (I typed this live, which is very slow, as you can see.) And then deploy it. So now, with the new code logic, all the pods are restarted through the new cycle: they're built, tagged, and pushed to the registry, and then Kubernetes fetches them. If you go to google.com, which actually contains no embedded JavaScript, zero on line 10 there, it doesn't go through antivirus; as you can see, content inspection and the policy manager respond right away. And we were fortunate to find that there's still some website with embedded JavaScript instead of dynamically fetched JavaScript, and that would be cnn.com. On line 13 we fetch cnn.com, and as you can see at line 18, we're doing the antivirus scan. So the logic gets executed.

These are the call flows of what happens in the three scenarios I showed you. I won't go into them too much, but the important thing is on the left-hand side: you can see the cascade of calls, and that's what the operators define. When they define those calls, we allow them to chain things, so in essence we allow people to define the service graph. I'll upload the slides after the presentation, so if you want to read them, you should be able to find the PDF on the schedule site.

So this is what we have right now, and moving forward there are a couple of things we want to do as well. One of these is the data collection aspect. An edge platform should definitely allow operators to collect a rich set of data, and that's actually the second major reason we chose eBPF: eBPF is a tracing tool by nature.
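As a rough sketch of that third step, using the same illustrative SDK names as before (not the actual handler code from the demo), the narrowed scanning logic might look like this:

```python
# Sketch of the conditional scanning logic, moved into the policy
# manager's handler; names are illustrative, as in the earlier sketch.

def on_content_inspected(sdk, response, inspection):
    # Content inspection reports what the payload contains; only invoke
    # the antivirus scan when embedded JavaScript was actually found.
    if inspection.get("javascript", 0) > 0:
        sdk.antivirus.scan(response)
    # Otherwise the chain ends here and the response returns to the proxy.
```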
In particular, IOVisor BCC comes with a bunch of sample BPF code that you can use for kernel tracing. Because we're building this for generic edge cases, we're not assuming that analytics engines will run on the edge, but at the very minimum we want data correlation to happen on the edge. So we use a cookie field, which in the current case is the 5-tuple: source and destination IP, source and destination port, and the protocol ID. It's something both the datapath and the user space can generate, and they all use it as one of the major keys on their tables. We want to fetch the trace data as it becomes available, and we have Redis working, which also uses that cookie as a key; it's a string in the containers and a binary in eBPF. So you can use the key to fetch all the corresponding data and send it to the cloud. Data collection is something we have considered.

Some of you have probably realized that on the first slide we kind of violated our own policies by making the event handlers CNF-specific. The event handler loads the CNF-specific code defined by the operators and runs as a sidecar next to a microservice, so in essence your microservice cannot be reused, which is a pretty bad thing. Moving forward, we're thinking of a multiple-sidecar architecture: each sidecar would run the event handlers for one CNF, and deciding which sidecar to route to would again be based on the cookie. How we map that, most likely, I would imagine, is by the client: you identify the client, the client maps to a particular CNF, and then you map the sessions back to the client. So now the HTTP request handler actually has to do something, not implemented by the operators but by us, inside the framework, and that information gets propagated through all the event handlers in all the pods. We may make that programmable too; I'm not sure what the use case is yet.

For the VNF cases, it would be bad if we couldn't support legacy VNFs. The biggest reason is that there are many, many network functions out there that aren't even VNFs: network services that are battle-tested, fully loaded, and deployed for a very long time. You can't just tell people to rewrite everything because cloud native is the new thing. So, thinking about how to support a legacy VNF: what the framework demands of a VNF microservice is two things. It asks it to generate events when something happens, and it asks for a gRPC request-handling interface. Obviously a legacy VNF has neither. But luckily for us, Kubernetes has a really well-known set of design patterns for container/pod relationships that tries to solve exactly this problem: the adapter pattern.

I actually did that for the antivirus microservice. For antivirus, I use ClamAV. The ClamAV daemon runs in one container, and pyClamd runs in the adapter as the interface to the ClamAV daemon. The adapter itself is a gRPC server, and for event handling it runs a ZeroMQ event loop; sorry, I didn't mention that earlier. I hope to turn that into a framework, so that as long as your legacy VNF runs as one or more Linux processes, you can pretty much use the same pattern to port the legacy VNF into this particular cloud-native framework.
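Here's a minimal sketch of that adapter idea, assuming pyClamd for the clamd interface and a ZeroMQ reply socket standing in for the event loop; the real adapter also fronts a gRPC server, which I've omitted here for brevity, and the endpoint is illustrative:

```python
import pyclamd
import zmq

# Adapter container: bridges the framework's events to the legacy
# ClamAV daemon running in the neighboring container of the same pod.
clamd = pyclamd.ClamdUnixSocket()  # assumes a shared socket volume

ctx = zmq.Context()
sock = ctx.socket(zmq.REP)
sock.bind("tcp://*:5555")  # illustrative endpoint for scan events

while True:
    payload = sock.recv()              # body handed over by the event handler
    result = clamd.scan_stream(payload)
    # clamd returns None for clean content, or a dict describing the hit.
    sock.send_json({"infected": result is not None, "detail": result})
```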
And finally, if you did all this, it's still not 100% cloud-native; it's certainly not stateless, there's no streaming of logs, and all that stuff. But it is good enough to take advantage of a lot of the cloud-native computing ecosystem's tool sets. One of these: if you want to scale out, you can use a service mesh, obviously, because the entire interface between all the microservices is now gRPC. So it naturally works with Envoy, NGINX, or Linkerd, and obviously with Istio. I particularly love the service graph tool in Istio: you run traffic for a while, and it generates a graphical view of what the service graph looks like. Given that ours is an operator-defined service graph, it's quite good to use as a visibility and validation tool.

As you saw from the demo, I basically showed you the results through tracing, so tracing is important. Right now our tracing is pretty ad hoc; we kind of rushed it for today's demo. In the future, we obviously want to use the OpenTracing APIs, probably with Jaeger as the tracer. And finally, if your applications are cloud-native on Kubernetes, you'll be able to leverage all the CI/CD solutions like Spinnaker; there are tons of them, really.

For more information, if you like what I just said, here's the shameless plug: we have a project called Clover in OPNFV. The project was approved about a month and a half ago, and its charter is specifically to investigate how to build VNFs as cloud-native applications. What I showed is one way of filling the gaps; more gaps will be found, and more prototypes will be built as part of this project. You can join the Slack channel, and you can also email me, that's the last resort. Thank you. That's it. Let me see, we have about 5-10 minutes. Any questions? No? Thank you.