All right. Welcome, everyone. Thanks for joining me today. This session is about bringing visibility into your OpenStack network. Just a little bit about myself: my name is Valentina Laria, and I work for PLUMgrid. PLUMgrid is an SDN solution for OpenStack. Our booth is right there, so if you have more questions on what we do, you're welcome to stop by later. Today, I want to concentrate primarily on the challenge of operationalizing the networking layer in OpenStack. The one thing I want to start from is some of the transformational changes that we see from an infrastructure perspective and how they've impacted the operational side of the house. Obviously, you all see the adoption of new technology, with cloud being the driver of this transformation. But all these new products and components that come into the picture bring some complexity, especially from an operational perspective. The first complexity we're dealing with is that we are now operating a large-scale environment with a very large number of components, what we usually call a distributed system. Operating a distributed system requires learning and understanding the different pieces of the puzzle before you can actually go in there and operationalize it from a complete lifecycle perspective. The other aspect is that the operational team, the first line of defense, is not necessarily always super familiar with OpenStack and SDN and all the different pieces of the puzzle that come together. So how do we help this group become effective from day zero of deploying a cloud, especially from a networking perspective? And at the network level, I'll talk a little bit more about it, but there's obviously a big transformation from delivering networking services in the physical model we're used to, to more of a software-driven model. But it's not that the physical network just goes away, right?
You still have your physical switches and physical routers. They need to be operated, and you need to make sure that traffic is actually flowing through these devices. So you are now left with a physical infrastructure and a virtual infrastructure, and often you lack tools that help you correlate these two layers back and forth. So what we're looking at is really a model where you have this physical environment underneath. That's the traditional data center infrastructure that you're used to. On top of that, you have your virtual layer. The top level is what is driven by OpenStack, by Neutron, for example. And your physical infrastructure is what is traditionally driven by CLI configuration. Those are your physical routers and physical switches. Now, the operational team, especially in the network operations center, is very much used to operating the bottom layer. The challenge that we're seeing, and I spend a lot of time working with my customers, helping them operationalize the solutions that we bring to them, is how to take what they know about the physical layer and translate it up into this virtual layer. What is different is that at the bottom, you have individual components you can log into and troubleshoot, look at ports, and figure out if the traffic is flowing back and forth. At the top, you have something that is more of a concept and a structure. You have a logical router. You have a logical switch. But those devices don't actually live physically in one spot. They can be distributed across maybe tens or hundreds of components. So you immediately see how different it is to operate a single device at the bottom versus a much larger number of devices at the top. On top of this, the other piece of the puzzle is that very often in OpenStack, to achieve this virtual network infrastructure construct, we leverage the concept of overlay networks.
And when you have an overlay network, a couple of things happen. The first one is that you insert a software component inside each compute node. These are your OpenStack Nova compute nodes. You usually insert a piece of software that runs inside all of these compute nodes. The PLUMgrid solution is an overlay solution, and we do insert a piece of software inside each of these servers. For us, this component is a kernel component. It's a component that we refer to as the IO Visor. This IO Visor piece is what runs all your networking functions: your switching, your routing, your NAT, your security policies, and so on and so forth. And then you have the ability to create VXLAN tunnels that help you decouple your virtual environment from your physical environment. So you see there are a couple of things that happen here. The first one is that your virtual network starts being fully distributed. You have this VXLAN infrastructure that starts decoupling your virtual layer from your physical layer. And these networking entities and security entities, while you have an abstraction of a single one of each, start living across the entire environment. PLUMgrid has been in the SDN business for a number of years, and we work with a variety of customers that have asked us to help them operationalize this level of the stack, this piece of the puzzle. So we started looking at the existing tools out there. A lot of what's out there in the market, especially around SDN, tends to be very table-heavy. You have all these rows of information about your infrastructure. So you probably have a lot of data there, but it's not necessarily easy for you as a user, especially if you're an operator and not necessarily someone with a PhD in OpenStack, a PhD in SDN, and a PhD in distributed systems, to go figure out what is actually happening in there.
Again, the information might all be there, but it's not necessarily easily consumable by the user. So we wanted to take a pretty revolutionary approach to the problem of monitoring and visualization. And we wanted to combine that with the fact that we have this kernel component that runs inside each compute node. So we have visibility into the entire OpenStack environment, and we wanted to build something that could be very intuitive and easy to consume for anyone from an operational perspective. And so together with our SDN solution, we recently introduced a new product called PLUMgrid CloudApex. I'm going to show you a demo in just a sec. The idea of CloudApex is to build a cloud visualization platform that helps bring understanding of the health and condition of the overall distributed system. We wanted to make it very simple. The goal for us was to achieve zero-day operations, to really enable the operational team to jump in there and be proficient from the get-go. Again, based on my experience, I spend a lot of time with the cloud engineering groups at all our customer sites, and they all tell me, you know, help me get the operations guys involved, because I want to be able to be out of the loop. I don't want them to call me every time there's a problem. How do we enable them to be proficient? So the goal of CloudApex for us was really to enable these zero-day operations: to build something very intuitive, very simple to consume, something that anyone could just jump into and figure out exactly what is happening. The other big piece of the puzzle for us was that, again, there are all these layers that come together. You have your physical infrastructure. You have your virtual infrastructure on top. And sometimes the problem you're troubleshooting can be as simple as, oh, my two VMs cannot ping.
But even that very simple troubleshooting problem involves traversing all the layers of the stack and touching both the compute layer and the networking layer. So my two VMs cannot ping. Well, first of all, let's figure out how those two VMs are connected from a virtual infrastructure perspective. Are they part of the same tenant? Are they part of two separate tenants? Are they interconnected through maybe a shared provider network that has been created on top of that? Now let's look at the physical infrastructure. Are those two VMs sitting on the same physical server? Are they sitting on two servers in the same physical rack? Are they sitting on two servers across two separate racks? So you immediately see that the possibilities and combinations are endless. We wanted to provide a way for someone to jump into this CloudApex visualization and be able to immediately find where components are, correlate across the different layers of the stack, and get a feel for the health of the environment. The other thing that is obviously important here is that, as you know, we are a community. That's why we're here at this summit. And there are a lot of moving parts that make up a solution for a customer's environment. So what we wanted to do from the get-go was to build something extensible, so that we can bring in not just the PLUMgrid SDN components, but also start pulling in some of the OpenStack logs, for example, as well as some of the physical infrastructure information that is relevant for this environment. And last but not least is the central point that I have there: we wanted to build something where it would be very easy to just log into the UI and pinpoint a problem. Again, instead of having to process the heavy text-based user interface that you usually deal with, you have something very intuitive and easy to consume. So I'm going to show you all of this in the demo.
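As a rough illustration of the "two VMs cannot ping" questions above, here is a minimal Python sketch that classifies how two VMs relate on the virtual layer (tenant) and on the physical layer (server and rack). The `VM` record, its field names, and the sample values are hypothetical and only serve the illustration; they are not taken from CloudApex or any OpenStack API, and the shared provider network case is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VM:
    """Hypothetical inventory record for one virtual machine."""
    name: str
    tenant: str
    server: str
    rack: str


def virtual_relation(a: VM, b: VM) -> str:
    """Virtual layer: do the two VMs belong to the same tenant?"""
    return "same tenant" if a.tenant == b.tenant else "different tenants"


def physical_relation(a: VM, b: VM) -> str:
    """Physical layer: same server, same rack, or separate racks?"""
    if (a.rack, a.server) == (b.rack, b.server):
        return "same server"
    if a.rack == b.rack:
        return "same rack, different servers"
    return "different racks"


def placement(a: VM, b: VM) -> tuple:
    """Answer both questions at once as (virtual, physical)."""
    return virtual_relation(a, b), physical_relation(a, b)


web = VM("web-1", "tenant-a", "srv-1", "rack-1")
db = VM("db-1", "tenant-a", "srv-2", "rack-1")
other = VM("web-2", "tenant-b", "srv-9", "rack-2")

print(placement(web, db))     # ('same tenant', 'same rack, different servers')
print(placement(web, other))  # ('different tenants', 'different racks')
```

Even this toy version shows why the combinations multiply quickly: every pair of VMs gets one answer per layer, and a real deployment adds shared networks, security policies, and the overlay on top.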
But a couple of areas that are important here, that I wanted to highlight before I get into the environment, start with the functionality that we call the affinity-based GUI. This is what you see listed here at the top. The affinity-based GUI is the ability to select any resource, whether it's on the virtual or the physical layer, and to see how it correlates with the other layer of the stack. So you have this ability to click on a virtual entity and see how it maps to the physical, and vice versa from the physical to the virtual. We also have the ability to turn on all sorts of real-time heat maps. I'll show you all of that in action. Based on the identity and the type of the VM, whether it's a web VM or a database VM, or based on the identity of a physical server, you can look at metrics and have this real-time display functionality help you navigate the environment. As you can see, it's quite unique in terms of design. You don't see a lot of text there. But you have all the resources represented in a very compact way, something that could be running in your NOC, your operations center. You can just jump in there and see if there's any red blinking light. That's probably something that's not quite right, right? So I'll show you a demo. Just to give an idea of the environment, we have what we call the PLUMgrid CloudApex middleware, which is a number of components that collect all the distributed logs that come from what runs inside each of your servers, each of your compute nodes. It's pulling from all the PLUMgrid components. So again, it's collecting traffic data as VM traffic flows through the virtual network infrastructure. It's collecting all sorts of traffic data. The middleware aggregates this information in real time and presents it through the CloudApex UI. Obviously, all of this can be exposed through an API as well. So let me quickly jump here. All right. So we're logging into CloudApex here.
And this is what we call the resource view. At the very top of the screen, you have all your virtual resources. In this example, those are the virtual machines and the projects, what we call virtual domains. At the bottom, you have your servers, and the servers are organized by rack. We run LLDP, so we know how the servers are connected into the environment. You can see it's very simple to have a global view of the environment. You can get detailed information about each of these elements by mousing over it. Or, as I'll show you, you can actually search for any type of string in this environment. So you have your entire distributed system, which can be hundreds of servers, and you can very easily find resources anywhere in the environment. On the right, we have what we call the detail panel. The detail panel is context-sensitive. Let me just pause here for a sec so I can explain a couple more things. Whatever you select on the left, whether it's the overall deployment or an individual entity, we'll show you detailed information on the right. So, for example, for a physical server, it's going to show you how many virtual machines are there. For a project, a virtual domain, it's again going to show you how many virtual machines are there. For the overall environment, it shows how many virtual domains are deployed. So it's going to help you navigate information for the entire system. And then at the bottom, you have your detailed real-time logs. These logs tell you if something is happening in the system, for example, if there are interfaces going up and down, or if there are crashes of key components in the environment. It's going to help you monitor all of that. So you can see here, for example, it's showing me the list of all the virtual domains, all the projects that I have in that environment.
It's also going to show me all the virtual machines there. So now we're going to start clicking around and looking at the affinity-based functionality. You can see here that I'm selecting an entity, for example, a physical entity at the bottom, and you can see that it maps at the top to a number of virtual entities. So here I have, for example, a server, and this server has six virtual machines deployed on it. Now, what this is going to help me figure out very simply is: how are these virtual machines mapped to tenant environments? You can see that in this example, they're pretty equally spread across multiple tenants. There are two that belong to the same tenant, but the rest are equally spread. So you can immediately see if there is a hotspot or a misconfiguration in terms of deploying VMs. Also, if there is a problem that is affecting a cluster of entities, whether it's VMs or servers, this can help you spot it very easily. You can do the same from the top. So you can select a virtual resource and see how it maps to the physical environment. And you can do that for a tenant as well. In this example here, I'm selecting a whole tenant, and I'm going to see how it maps to my physical infrastructure. Again, this is very powerful for figuring out how the two levels are correlated. You can see that this tenant is again pretty equally distributed. And for this tenant, on the right side, you can see all the VMs that are there. You can see the servers it's mapped onto. You can also see aggregated information about the virtual domain itself, about the project itself, and all the logs that are present in the system. In this example, there were some interface up and down events. There were some CPU-related events for this environment. And you can see that you can filter logs. You can also filter logs by severity.
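The affinity lookups and the severity filter described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the flat `INVENTORY` list, the log dictionaries, the severity names, and every function name here are hypothetical and do not reflect how CloudApex stores or exposes its data.

```python
from collections import Counter

# Hypothetical inventory rows: (vm_name, tenant, server).
INVENTORY = [
    ("web-1", "tenant-a", "srv-1"),
    ("web-2", "tenant-a", "srv-1"),
    ("db-1", "tenant-b", "srv-1"),
    ("db-2", "tenant-b", "srv-2"),
]

# Assumed severity scale, from least to most severe.
SEVERITY_ORDER = ["info", "warning", "critical"]


def tenants_on_server(inventory, server):
    """Physical -> virtual: count the VMs per tenant on one server."""
    return Counter(tenant for _, tenant, srv in inventory if srv == server)


def servers_for_tenant(inventory, tenant):
    """Virtual -> physical: list the servers hosting a tenant's VMs."""
    return sorted({srv for _, t, srv in inventory if t == tenant})


def filter_logs(logs, min_severity):
    """Keep only log entries at or above the given severity."""
    floor = SEVERITY_ORDER.index(min_severity)
    return [e for e in logs if SEVERITY_ORDER.index(e["severity"]) >= floor]


print(tenants_on_server(INVENTORY, "srv-1"))
print(servers_for_tenant(INVENTORY, "tenant-b"))  # ['srv-1', 'srv-2']
```

A lopsided count from `tenants_on_server` is the kind of hotspot the affinity view makes visible at a glance; the same two lookups run in opposite directions are what "physical to virtual and vice versa" amounts to.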
So you can look at just the critical alerts or the warnings. This is also displayed at the very top of the screen there. And as an SDN solution, we have the concept of control plane and management plane elements that we call the directors. Those are very critical because they are the brain of the system, so we show them at the very top as well, and we can have a quick lens on the health of those elements. Now what I'm going to show you next is how we turn on heat maps on top of the resource view that I was showing you earlier. What we have here is the ability to look at different metrics in the environment, things like packets sent and received, bytes sent and received, CPU utilization, and memory utilization. And I have some graphs on the right side that help me look at the normal trend of the environment and set thresholds properly. So once I've learned what the normal behavior is, I can start setting thresholds. These thresholds can be set at the VM level, at the physical server level, or at the project level. So if an entire tenant is affected, I can see that problem easily. And you can see that as you select these metrics, you can set whatever thresholds suit your environment. We also have the ability to customize these metrics based on the type of workload. Again, I was mentioning earlier the example of a DB VM versus a web VM. You probably want to set very different thresholds for these two categories, right? So you have the ability to do that. And as I said, you can look at virtual machine metrics, project metrics, and physical server metrics. You can also order them by severity. So again, the goal for this was really to make it simple and easy for a cloud operator to jump in there and see, oh, there's a problem. There's one VM up there. I can get all the information. I can see how it maps to my physical infrastructure and whether it is experiencing some misconfiguration.
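To make the idea of per-workload thresholds and ordering by severity concrete, here is a small Python sketch. The threshold numbers, the metric name `cpu_pct`, the three heat levels, and the function names are all assumptions chosen for illustration; the talk does not specify how CloudApex represents thresholds internally.

```python
# Hypothetical (warning, critical) thresholds per workload type and metric.
THRESHOLDS = {
    "web": {"cpu_pct": (70.0, 90.0)},
    "db": {"cpu_pct": (50.0, 80.0)},
}

# Assumed heat levels, ranked from least to most severe.
LEVEL_RANK = {"ok": 0, "warning": 1, "critical": 2}


def heat_level(workload, metric, value):
    """Map a metric sample to a heat level using the workload's thresholds."""
    warn, crit = THRESHOLDS[workload][metric]
    if value >= crit:
        return "critical"
    if value >= warn:
        return "warning"
    return "ok"


def rank_by_severity(samples):
    """Order (name, workload, metric, value) samples, most severe first."""
    scored = [(name, heat_level(wl, metric, value))
              for name, wl, metric, value in samples]
    return sorted(scored, key=lambda item: (-LEVEL_RANK[item[1]], item[0]))


samples = [
    ("web-1", "web", "cpu_pct", 60.0),
    ("db-1", "db", "cpu_pct", 85.0),
    ("db-2", "db", "cpu_pct", 60.0),
]
print(rank_by_severity(samples))
# [('db-1', 'critical'), ('db-2', 'warning'), ('web-1', 'ok')]
```

Note how the same 60% CPU reading is fine for the web VM but a warning for the database VM, which is the point of keeping separate thresholds per workload category.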
Now, obviously, if you have a large environment, you want to be able to quickly find something. So you can see here, I'm just searching based on a random pattern. I can look for a VM name. I can look for an IP address. I can look for a MAC address. I can look for any type of information. And it's going to start filtering things dynamically for me so that I can quickly find anything in there that's relevant for what I'm doing. And you can see that I can do that for physical information as well. So I can look for IP addresses of my servers and VMs and all of that. Now, this environment, the one I recorded the demo with, was relatively small. It had a handful of virtual domains and a handful of servers and virtual machines. But this is to give an idea of how it can scale to a much larger environment. So here, I have a much larger number of VMs and a much larger number of projects, and it's still pretty simple to find out if there is a problem there. As I said, you can turn on these metrics for the entire project environment and easily spot if there is a problem and a correlation between the two layers. So this is what I wanted to cover today. I know I have about a minute left. So if there is any question or comment that anyone wants to make, go ahead. Otherwise, thank you so much for joining me. I hope it was helpful.