Good morning, everybody. I'm Deep, he's Kiran, and today we're talking about tapping into the OpenStack notification system. We'll cover OpenStack notifications and our experiences dealing with them, wrestling with them. We've also written some specific code and open sourced it; Kiran is going to go into the details of that, and then we'll run through some demos of things that we've done and are working on. So think of this as a work in progress. Now let's see if this works.

So as I said, the agenda: we first introduce notifications, then how we hook into the notification system, the use cases, and then we go through a live demo. And a shout-out to two of our team members who contributed a lot to the notification system that we've built at ZeroStack. They could not come here for personal reasons, otherwise they would have been here.

Okay. So first off, this may be obvious to a lot of you, but we just wanted to go through the very basics of why you need a notification system. Obviously, you need one to monitor and troubleshoot a private cloud, or for that matter any management system. You need one to measure consumption: as things change in the cloud, you want to capture those moments and string them together to measure consumption of the system, and to measure performance. You need one to discover security and compliance violations: a VM comes up and certain ports are open and certain ports are closed; somebody uploads an image, and is that image a good one or a bad one? You can use notifications as a starting point for some of those things. The other nice thing notifications can be used for is a call-out mechanism: something gets notified, and then an external process or an application does some other work.
And then when it's done, the private cloud goes on to the next step, a nice call-out mechanism. That's another very useful use case for a notification system.

So this is a poor man's view of the OpenStack services stacked together. You have a set of stateless services on the right side: Keystone is a common one, and you have the other dedicated services, Neutron, Cinder, everything. Then you have MySQL as the state-saving system. And on the left, you see the AMQP messaging system. You could be using RabbitMQ or, what's the other one, I forget, Qpid, or ZeroMQ as the underlying messaging system. And then Oslo is a common set of libraries that have been written so that there is much better code reuse within OpenStack. Specifically with respect to notifications, it's the AMQP messaging system and, within Oslo, the oslo.messaging layer that deal with all of the notifications, the publish/subscribe, the RPCs and things like that.

So the oslo.messaging API is used for RPCs and notifications over the underlying transport, whether that be RabbitMQ, Qpid, or ZeroMQ. And the AMQP service is the middleware that enables OpenStack services that run on multiple servers to talk to one another. It supports three implementations today: RabbitMQ, Qpid, and ZeroMQ.

So in terms of the messaging workflow, let's start with a very simple example: creating a VM. A Nova VM create request, whether you make an API call or a CLI call, goes to the Nova API service layer. From there, a request is queued into the AMQP service saying that there's a VM create request. The Nova scheduler is listening on that queue for VM create requests.
And once it hears one, it decides on which host to schedule this VM. Then it places a request to launch this VM on this host. The Nova compute service on the host where the actual VM is going to live is listening for requests that ask it to launch a VM on its host, and whenever it gets a request, it goes and launches a VM.

Now, in all of these steps, there are notifications that are also coming into the system. Whenever the Nova scheduler is making a decision about placement of VMs, it sends a beginning notification, scheduler.select_destinations.start. When that decision has been made, another notification is sent out, scheduler.select_destinations.end. When the compute instance is about to be created, you get a compute.instance.create.start notification, and when the compute instance has been created, you get a compute.instance.create.end notification. We'll see some of these in action in a demo down the line. So now I'll pass it on to Kiran. He'll talk about some of the code and things that we've written.

One thing we did was open source a library which walks through some of the examples that you see here. Also, most of the ZeroStack system itself is written in Golang, not Python. There are some basic reasons for that, and if you've looked at Golang before, they're very common reasons. It's great for writing services: network services, web services, message processing services. Concurrency and communication are really part of the language. You get multi-threading for free, almost free. And it lets us scale to build really large systems where we're processing a lot of notifications.
And a lot of the built-in libraries help in parsing some of this notification data: how you unmarshal it, look at it, and then pass it on as needed to the rest of the system. One of the biggest advantages we found with Go is that almost everyone who joined ZeroStack came from a very different background, Java, Python, C, C++, and all of these people took to Golang like a fish takes to water. So it really was the language of choice for us to build whatever systems we are building.

Given that, I want to dive a little deeper into the library that we've open sourced, walk through some of the code examples here on screen, and show you how to build a system yourself using either the library that we have or your own code in a different language. To hook into notifications, one thing to remember first is that notifications are a subset of the things the messaging system is used for in OpenStack. The basic workload functionality itself, like RPCs and everything, also runs through the messaging system. You have to be careful that you're looking at the notifications rather than the rest of the messaging system. I'll talk about one of the gotchas a little bit later.

So first thing: if you search on the internet for "enable notifications OpenStack", you'll find 10 different ways to enable notifications. We'll show you the configuration that actually works for us in Kilo. Then, how do you connect to the RabbitMQ system? How do you listen for these notifications? What type of data do you get in notifications? And how do you parse and process it?

So this is a configuration for Nova, Neutron and Keystone. We'll probably upload a readme.md into our open source library with the rest of the configuration. There are various ways to enable this in different versions, and if you search, you'll see how to enable notifications all the way from maybe Essex till now.
This is the one that works in Kilo.

Now this is an example of the type of information that you need to be able to talk to the notification system. The information you need to provide is the exchange, the exchange type, the queue, the binding key and the consumer tag. These are important because when you want to talk to the notification system, you give the exchange, openstack, the topic, notifications.info, and the consumer tag, which is any tag that you want to give. And this is where I was saying you have to watch out for gotchas, because if you end up hooking into the wrong exchange or the wrong topic, you end up breaking your OpenStack installation: instead of the Nova compute agent, you get the message, and the VM never gets created. So be very careful that you are monitoring the notifications, not the main RPC mechanisms, because you can actually take over the Nova compute agent's functionality if you really care to. All of this code is in the open source module, so you can look at it offline afterwards.

So once you have this basic information on how to connect, how do you initialize the client? This is again a full Golang example, but basically you connect to AMQP, you declare the channel that you want to talk on, then the binding key and the exchange, and then you start consuming messages from it. The messages actually come out on a Golang channel. We're using an AMQP library here, a pretty standard AMQP library in Golang. That's what the flow looks like. All the previous information with respect to exchange, exchange type, queue, binding key and consumer tag, this is how it's used to talk to the queues.

Once you have a channel on which you're receiving these messages from AMQP, from RabbitMQ, this is how you parse them. You read from the message channel that you just created. You get all these messages, and all these messages have a very standard format.
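A minimal sketch of that parsing step, assuming the message body arrives as the raw bytes of an AMQP delivery. The field names follow the oslo.messaging notification format (event_type, publisher_id, priority, timestamp, payload), and the optional oslo.version / oslo.message wrapper is the version 2.0 envelope. The type and function names (Notification, decodeNotification) and the sample values are illustrative and are not taken from the ZeroStack library. The AMQP connection itself is left out so that the example runs with only the standard library.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Notification holds the common fields of an OpenStack notification.
// Payload is kept raw so it can be decoded later based on EventType.
type Notification struct {
	MessageID   string          `json:"message_id"`
	PublisherID string          `json:"publisher_id"`
	EventType   string          `json:"event_type"`
	Priority    string          `json:"priority"`
	Timestamp   string          `json:"timestamp"`
	Payload     json.RawMessage `json:"payload"`
}

// envelope is the optional oslo.messaging 2.0 wrapper, in which the real
// message is carried as a JSON-encoded string.
type envelope struct {
	Version string `json:"oslo.version"`
	Message string `json:"oslo.message"`
}

// decodeNotification accepts either an enveloped or a bare notification body.
func decodeNotification(body []byte) (*Notification, error) {
	var env envelope
	if err := json.Unmarshal(body, &env); err != nil {
		return nil, fmt.Errorf("decode envelope: %w", err)
	}
	raw := body
	if env.Message != "" {
		raw = []byte(env.Message) // unwrap the inner JSON document
	}
	var n Notification
	if err := json.Unmarshal(raw, &n); err != nil {
		return nil, fmt.Errorf("decode notification: %w", err)
	}
	if n.EventType == "" {
		return nil, fmt.Errorf("notification has no event_type")
	}
	return &n, nil
}

func main() {
	// Hypothetical sample body; a real one would come from the consumer channel.
	body := []byte(`{"oslo.version":"2.0","oslo.message":"{\"event_type\":\"compute.instance.create.end\",\"publisher_id\":\"compute.host1\",\"priority\":\"INFO\",\"payload\":{}}"}`)
	n, err := decodeNotification(body)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(n.EventType, "from", n.PublisherID)
}
```

Keeping the payload as json.RawMessage defers the per-service decoding, so the read loop stays cheap and the type-specific work can happen elsewhere.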
We'll walk through some examples of what those formats look like. They have a basic format and a payload. Here's a simple example of the unmarshaled data. Again, this part is open source. What the OS notification data looks like, we'll go through it. You handle this message.

The types of notifications that you get: here's a small list. If you look at the open source code, we've declared constants and parsed a lot of these notifications in the library. You get these notifications with the type, which is a string. As an example, like we just talked about, there is compute.instance.create.start, sent when Nova compute has decided that it is going to start creating a VM. If you're deleting a VM, you get the delete start. If you're live migrating, as an example, you get the instance live migration start. There are a lot of these events to look through. Instead of you having to parse through all of this, we've collected a lot of the useful ones and put them in our open source library. There are more and more of these from all the different services, and there's an exhaustive list that you'll find online.

As an example, a compute instance payload: what does the payload of the message that you get in the notification look like? I apologize if it's too small to see in the back, but we couldn't fit it in a bigger font. It gives you a lot of information about the compute instance. You get the UUID, you get the name, you get details about when it was launched and created, what type of instance it is, and the image metadata, that is, what image has been used to create the VM. You get what state it is in right now. One of the interesting things is that when the VM is going through power state changes, you can also enable notifications for that. I don't think I have an example for that online, but we'll add one to the open source library. You get the rest of the information with respect to how many vCPUs and how much memory it's configured to start with.
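As a sketch of what a typed view of that compute payload could look like in Go: the JSON keys below are assumptions based on Nova's legacy compute.instance.* notifications and should be checked against the payloads your own deployment emits. The constant, type and function names, and the sample values, are hypothetical and not the ones used in the ZeroStack library.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Event types are plain strings; constants avoid typos when dispatching.
const (
	ComputeInstanceCreateStart = "compute.instance.create.start"
	ComputeInstanceCreateEnd   = "compute.instance.create.end"
	ComputeInstanceDeleteStart = "compute.instance.delete.start"
)

// ComputeInstancePayload covers the fields mentioned in the talk:
// identity, timestamps, flavor, image metadata, state and sizing.
type ComputeInstancePayload struct {
	InstanceID   string                 `json:"instance_id"`
	DisplayName  string                 `json:"display_name"`
	State        string                 `json:"state"`
	InstanceType string                 `json:"instance_type"`
	VCPUs        int                    `json:"vcpus"`
	MemoryMB     int                    `json:"memory_mb"`
	CreatedAt    string                 `json:"created_at"`
	LaunchedAt   string                 `json:"launched_at"`
	ImageMeta    map[string]interface{} `json:"image_meta"`
}

// parseComputePayload decodes the payload section of a compute notification.
func parseComputePayload(raw []byte) (*ComputeInstancePayload, error) {
	var p ComputeInstancePayload
	if err := json.Unmarshal(raw, &p); err != nil {
		return nil, fmt.Errorf("decode compute payload: %w", err)
	}
	return &p, nil
}

func main() {
	// Made-up sample values for illustration only.
	raw := []byte(`{
		"instance_id": "00000000-0000-0000-0000-000000000001",
		"display_name": "demo-vm",
		"state": "active",
		"instance_type": "m1.small",
		"vcpus": 1,
		"memory_mb": 2048,
		"created_at": "2015-05-20 10:00:00",
		"launched_at": "2015-05-20 10:00:07",
		"image_meta": {"base_image_ref": "example-image"}
	}`)
	p, err := parseComputePayload(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%s (%s): %d vCPU, %d MB, state=%s\n",
		p.DisplayName, p.InstanceID, p.VCPUs, p.MemoryMB, p.State)
}
```

Timestamps are left as strings here because their format can differ between services and releases; parse them only once you have confirmed the layout you receive.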
When you have this, it looks pretty rich, and Nova is actually putting a lot of this information in the payload. One other thing Nova sends is a token, outside of the compute instance payload, which you can use to query for more information back from OpenStack, from the rest of the services.

If you look at a volume notification, similarly, you have a rich payload which tells you the ID, the type, the display name, when it was launched, what availability zone it was created in, and the status. And not just for these, but for some of these entities you get multiple notifications, based on when they were started, when they were processed, and when the state changes. So there are a lot of different ways to enable the types of notifications you want. You'll see some of those if you look in the open source code.

A networking example, part of which we'll actually go through in a live demo: you create a network, and you get information about the network type, whether it's shared or not, and the segmentation ID for that network. Same thing with respect to Keystone. And there's an intentional choice in the order of the things we've shown. As you can see, Nova is the richest one, and it gets leaner and leaner until you reach Keystone. As a simple example here, the payload just has a project ID. When you create a project, it doesn't even have the name. This is one of the gotchas in this notification system that we'll talk about later. Same thing with images.

So the examples we've shown cover all the different types of entities that you have in the system and the information in the different notifications that you get. We've collected all of these payloads, written the Golang descriptions for them, open sourced them and put them in the library.
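To make the contrast between a richer and a leaner payload concrete, here is a small sketch that decodes a network payload and a Keystone project payload. It assumes the event types network.create.end and identity.project.created, the Neutron keys provider:network_type and provider:segmentation_id, and Keystone's resource_info key; verify these against your own release. The struct and function names are made up for this example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// NetworkPayload mirrors the fields mentioned in the talk: network type,
// shared flag and segmentation ID, nested under a "network" key.
type NetworkPayload struct {
	Network struct {
		ID             string `json:"id"`
		Name           string `json:"name"`
		Shared         bool   `json:"shared"`
		NetworkType    string `json:"provider:network_type"`
		SegmentationID int    `json:"provider:segmentation_id"`
	} `json:"network"`
}

// ProjectPayload is the lean Keystone case: only an ID, no project name.
type ProjectPayload struct {
	ResourceInfo string `json:"resource_info"`
}

// describe turns a payload into a one-line summary based on its event type.
func describe(eventType string, payload []byte) (string, error) {
	switch eventType {
	case "network.create.end":
		var p NetworkPayload
		if err := json.Unmarshal(payload, &p); err != nil {
			return "", err
		}
		return fmt.Sprintf("network %s type=%s shared=%t segment=%d",
			p.Network.Name, p.Network.NetworkType, p.Network.Shared,
			p.Network.SegmentationID), nil
	case "identity.project.created":
		var p ProjectPayload
		if err := json.Unmarshal(payload, &p); err != nil {
			return "", err
		}
		return fmt.Sprintf("project %s created (no name in payload)",
			p.ResourceInfo), nil
	default:
		return "", fmt.Errorf("unhandled event type %q", eventType)
	}
}

func main() {
	// Made-up sample payloads for illustration only.
	samples := map[string]string{
		"network.create.end":       `{"network":{"name":"net1","shared":false,"provider:network_type":"vxlan","provider:segmentation_id":42}}`,
		"identity.project.created": `{"resource_info":"abc123"}`,
	}
	for eventType, payload := range samples {
		summary, err := describe(eventType, []byte(payload))
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(summary)
	}
}
```

With only an ID in the project payload, anything else (the name, for instance) has to come from a follow-up API call made with suitable credentials.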
But you could go back to the source, and for some of these we had to go back to the source as opposed to just observing the events to see what gets generated. And with that part of the code, you can pretty much do the same thing in Python, or in some other language. There's nothing which prevents it.

So we'll talk more now about using this open source library that we've put together: once you configure it and hook into it, you can start parsing these notifications. The library that we've open sourced doesn't actually do any work in each of the actions after you get the event. We'll show an example of some things that we do, but the bare-bones library that we've put out is more for you to figure out what sort of auditing, security, and other things you want to do. We'll show a few examples here, but that's just a starting point.

Thanks again. So let's see, we'll look at a few examples. One of them is when you create a VM. It's all fine while the VM is connected to the private network; then you decide to give the VM a floating IP, and all of a sudden that VM is accessible to everybody on your corporate intranet. At that point, for the security and compliance people, all kinds of red flags go up in their heads, and they probably want to run some kind of test to make sure that you don't have the wrong ports open and things like that. That's something we could leverage the notification system for.

So let's go through a demo and pray to the demo gods that things work. Today we'll use the ZeroStack UI, which is essentially a replacement for Horizon. And I have a bad build, so throughout the whole demo you will see "internal server error". Please disregard it. The product person doesn't stay up to date with the latest code base. Yeah. OK. So over here I have a log where you'll see some of the notifications. It's as crude as it can get.
In the logs, you'll see the messages, and we'll do some things there. We want to highlight the fact that the library we've open sourced is the exact library we use in-house to parse these notifications. Like I said, we've put the skeleton out. What we do on each of the notifications is some code that we demo here. So the log messages that you'll see are what you'll actually get from the open source library.

Yeah. So we're creating a new VM. It's a CirrOS VM, so it should get created very quickly. Let's give it a network. I don't want to assign a floating IP. Let's give it all the security groups. I think there's no cloud-init. So I'll create the VM. Let's see what's happening on the other side. This is what the Intel, Susha was saying: you want to create VMs in two seconds so all the demos get faster. The volume creation is happening, and then the VM creation will start. Looks like the demo gods have not heard my prayers. It's going to take a little bit of time. I think it's filling up a lot. Let's hit another window. You'll see that. Ah, OK, these are the notifications. This is the JSON body of all the notifications that we have received.

And Deep showed, like, four notifications that happen as part of the VM create. There are a few more, actually. You will see them in the log. For example, there's a port create start and a port create end, and a bunch of other notifications. Just for brevity, we talked about only four. There's a longer list of notifications that happen with each sequence of VM creation and network creation, which we'll see in the log.

So you can see scheduler.select_destinations.start. That's the notification sent while the scheduler is choosing the host where this VM is supposed to be created. Then you have scheduler.select_destinations.end. And then when the actual compute instance creation starts, you see compute.instance.create.start.
And then when the compute instance creation has completed, you should be able to see that too, if I can pick it up with my device. Yeah, there you go. So they did show up in the order in which we described them.

Now, let's go back to the VM. This VM has only an internal IP. Let me just try to separate it out a little bit. Let me give it a floating IP. Until now this VM has been internal; once you give it a floating IP, it's actually accessible from outside. And that's when, like Deep was saying, flags go up and antennas start perking up: now that this VM is accessible over the network, you really start worrying about what to do. With this approach, it doesn't matter which user has created a VM in your cloud. As opposed to the user having to initiate something, the system can automatically look for notifications and, based on those notifications, do more vulnerability assessment, instead of every user having to schedule something manually.

Right, and so here we got the floating IP of 172.161. This is where the, should be.

One thing to remember with respect to the notifications is that it does take some time for the VM to get created, for the floating IP to be assigned, and for the network to come up. We realized that if you just try to scan the VM right away, as soon as you get the notifications, the VM is actually not up. The network is not up. We had to wait for some number of seconds before we could really start scanning it. So in this one example, what we did was, on getting the VM's floating IP assignment notification, which is the floating IP create end, we went and ran a port scanner on it. The port scanner shows which ports are open and closed on this VM. Similarly, we have more network scanning tools that we use whenever a VM comes up; we collect the report and show it in the UI.
That is a basic way to drive some of the views that we have enabled in the ZeroStack environment, where the admin can view what's going on with the different VM creations and projects in one place, as opposed to, like I said, each user having to manually verify it. This gives a better vulnerability assessment in a central place for the whole cloud.

Right. So go back to the, can we do all the questions after this? I think this will finish very quickly, and then we're happy to take your questions. One thing to mention here is that the notification we are listening for provides the actual IP address that the VM got, and we use that to do the scanning and all of that.

So the next place where we leverage the notification system is project approvals. Everybody wants to create a project and create their VMs, but as a cloud admin, you want to keep some tabs on it, some control over it. When a project is created, a project create notification gets generated, and that gives some details of the project. Until a cloud admin approves that project, we hide it. The notification is used to send some information to the cloud admin so that he can decide whether to approve or reject the creation of that project. Until he approves, that project is not visible in the system. That's another place where we use the notification system.

A third one, and this is still a work in progress for us, is verifying uploaded images. Project members, project admins, business units, application teams can all upload their own images. Oftentimes these images may not be the best images around, depending on where they found them. So as a corporate policy, you might want to figure out whether these images are blessed, that they don't have viruses and things like that.
So our system listens for image upload notifications, and based on that it kick-starts a verification process. Linux images can be scanned with something like OpenSCAP. Windows images can be scanned using live NTFS. And Symantec has its tools for scanning, say, VMDK images on Windows machines and things like that. So there are various tools out there that can be leveraged. The basic idea is that the framework you use to do all this is the offline notification mechanism that you're getting from OpenStack, and you can build your own workflows on top of it, for security, for compliance and for other things.

Last but not least, one of the things we use quite a bit in ZeroStack is the notification system to create a timeline view of the world. Whenever a new VM or a new project is created, you see what's been happening to it right from its birth for as long as it's alive. For auditing and compliance purposes, or for troubleshooting purposes, that comes in very handy. So I'll show you some examples of what we do. For instance, here you see that this VM got created, and these events are stored in the timeline view. Over time, as this VM ages, a lot more data shows up here, and we merge this with VM metrics. So I could go ahead and select any number of metrics of that VM and the system would show all the graphs. There's only data on the right edge, if you see. Yeah. There's no graph on it yet. This VM just got created, right? Over time, the VM will have a lot of red dots, where something may have gone wrong with the VM and things like that. So you could select a particular red event and it would go all the way down through all the metrics.
So a very useful scenario is when somebody files a bug saying his VM is not behaving, and a cloud admin or whoever is troubleshooting can go back in time and see what might have happened to the VM in the past. So these are some of the things that we think are useful. And this is something we really felt a lot of need for, because customers say, "Oh, my user says his VM had performance issues last night. How do you troubleshoot that?" Right. So we built the ZeroStack system to collect a lot of information, both events, from notifications and other sources, and time series data, and to overlay all this information and show it to users. That's one of the benefits of running part of the ZeroStack system as a SaaS platform in the cloud: we're able to run the big data analytics and all of that and show some of these things. And here at a project level, you can see across multiple VMs everything that was ever done to that project. You see a timeline view, and you can move this back and forth and do some of these things. So that comes in very handy for our customers. And I think let's move on. So Kiran, you want to talk about the issues?

Yeah. I mean, we said, hey, it's so great, we can hook into all these things and get all this information. But as I was alluding to in the middle, not every notification system has been built with the same level of detail in mind. Nova comes with some of the richest information, and it actually gives you a token so you can go back and query for more information if you need it. Keystone, I think, is about the most bare-bones one with respect to notification information. For example, in the UI, I don't know if anyone spotted it, it said something got created, and there's no name next to it.
So that's one of the reasons. The Keystone project creation notification doesn't actually come with a name, just a UUID. Now, if you want to go back and figure out more information about this event, you need authorization credentials, which are difficult to manage in an automated system because different projects have different authorization levels and roles. This is why using more API calls to fill in this information becomes important. It would be nice if the notification system itself were improved, because when we looked at the code, the richer information is there and could be posted in the notification. We are also working on some of that code to upstream it and make richer information available. That eliminates some of the back-and-forth API calls that we had to make to get this information.

All the code that we ran in the live demo is actually available in the open source library. And like I said, it's more of: here's how you get the notification; what you do with it is all left as shell code that you go fill in. As we build richer functionality with respect to port scanning or scanning images, we'll open source more of that too. It's just not in a stable state yet. And even if you're not a Go shop, I think some of the work we've done in creating the structs around the information that flows back and forth can be very handy when writing in any other language.

So I think we're open for questions. Please step to the mic if possible.

Yeah. It looks great, guys. One of the problems I've seen with notifications is if you are doing some action or some API request that goes across services, how do you link all of the notifications back to a single API request?

So that is a very, very good question. How do you link one request that happens?
For example, like we said, a VM create request, with all these notifications flying around the system, right? This is where we had to do a lot of work in the code to track these things over time. Once you get the basic information for a VM create, the initial request, we had to go track all of the requests that happen with respect to the scheduler, the post events and everything. We track it in our system by keeping temporary in-memory tracking of each of these things. That part is not in the open source library, but like I said, it really depends on what you want to do with it. For example, one basic thing we do is, when we get the project event, we just store it in the database and we don't fetch any more information. When someone logs in, we can use that person's credentials to look up the information about that project from the database and from real-time API calls to construct the view and show it, because then you know that the person has authorization to access the information about that project. I don't know if that answers your question. We have to do a lot of heavy lifting to make sense of all of this together, yeah.

I have two questions. First one: do you see the notifications being lossy? Do you miss some notifications? I mean, in our experience that depends on the load of the system, and sometimes you're just mysteriously asking, okay, how come this information is missing? It turned out that the sender was just not sending it. The second one is on the port scan. That's a great application. Do you have to muck around with the security groups so that they allow you to scan all the ports? Otherwise everything would be dropped except what's permitted by the security group.

So for the first question, do you miss notifications, are they lossy? There's two things there.
First thing is you should attend the RabbitMQ talk, or rather look at the video of the RabbitMQ talk that was given yesterday by the Pivotal guy. You really have to tune it a lot so that RabbitMQ stays up and running. What we found is that RabbitMQ is the nervous system of OpenStack. Lots of bad things happen if RabbitMQ goes bad. So part of it is that, as long as you keep it healthy, we don't miss notifications. The second thing is that, because of the RabbitMQ notification system, if you have problems in your own consumer, and you crash and come back up, the notifications are sitting there waiting for you, as long as the queue hasn't overflowed over a long time. So we basically solved that by, like I said, trying to keep RabbitMQ healthy, because otherwise the rest of the functionality doesn't work either. The second part of it is, yeah, you really have to make sure that in your own processing you capture it quickly: read all messages as fast as you can, and then do more of the lazy processing offline in an asynchronous manner. Because if you synchronously go back and call the APIs to get more information, we've found that that increases the backlog a lot. So we do that asynchronously, which is why Go is great: spinning off a goroutine for handling every notification works great for us.

That was the second question. Yes, I think if you noticed in the UI, we actually had to enable the basic security group for that VM as well, which is what lets you do the port scan, yeah.

This is more of a discussion point than a question, but you talked about the lack of rich notifications being a problem. I would caution you not to look at each event that doesn't have enough information as a problem, because that is essentially a scalability feature of complex systems. If getting all those little pieces is expensive, then it decreases your scalability.

Yes, I mean it's a valid point.
Trying to collect a lot of information in the system to send out as a notification is definitely not the right thing to do, because, like you said, putting that information together may be too expensive. What I was referring to is that we've seen places where I think it is possible to send more information that already exists in those modules, as opposed to going back and doing API calls to get it. Whatever information is available, posting that would help, because there's a different reason. And like I said, the original title of this presentation was actually "Notifications for fun and profit". It's really about observing everything, and for profit also, in terms of what you can do with it.

So you're tapping the RabbitMQ bus, the notification bus. Do you guys use Ceilometer alerts? Because I know they're also tapped into the same bus, and those messages are taken out of the queue, so they're not available for consumption. How do you deal with that, or do you just turn that off?

So we actually did not use Ceilometer. When we started on this about a year and a half ago, it wasn't in a state that we could use in our products. So we have a lot of our own internal monitoring and notification mechanisms for the rest of the infrastructure that Ceilometer applies to. And at the beginning, when we tried this, we tried to do the fanout thing in RabbitMQ, which wasn't very stable. So we avoided going down that route.

Can you go into more detail about what your experiences with fanout were? Because we're talking about doing the same thing.

I think we can go over a lot of it offline. We tried it about a year ago and we didn't go too far down the rabbit hole of trying to enable fanout. We can sync up offline about that one. But I mean, just as a high-level thing.
So one key thing, like I said, one gotcha, is: don't tap into the wrong command channels and topics, because then you're breaking your OpenStack. The same thing applies with respect to notifications.info: there can be only one reader of it. One option to enable multiple consumers is to enable the fanout feature, where multiple consumers can read off this channel. So Ceilometer and our system, or your library, whatever, can each get the same message. We avoided going that route, and only our code reads it. We don't enable Ceilometer in our system. If you enable Ceilometer, I think we'll have to talk more about the fanout part of it. But at a high level, if you look at ZeroStack, like this little bit of UI we showed, we realized that a lot of the information we wanted to collect didn't exist. We collect a lot more real-time information about what's going on in the cloud, whereas Ceilometer was heavily notification based, as opposed to the statistics and meta information that we collect. So we use some of these notifications overlaid with the rest of the information we collect. For that, we basically wrote a lot of the notification capture and processing on our own.

No more questions? Then thank you very much. Thanks everyone for coming. Thanks for coming, I really appreciate it.