Hello, everybody. Welcome to the session. Today we are going to present a session on the scalable Heat engine. This is my friend, Kiran Anand, and I'm Kanak. Here is the agenda: we'll give a brief overview of Heat, then introduce the new architecture, convergence, and then talk about the comparison between the current architecture and the new architecture. Just to give you a feel for Heat, this slide shows the need for the Heat service. If you take any cloud application today, you will have the front end, either mobile or desktop; then in the back end you will have the load balancer followed by the API, then the application server or the DB server, then the database. When the load goes up, the number of servers usually increases, and when the load goes down, it decreases; all of this happens automatically in a cloud application. On top of that, any application will have a configuration. So in general, a cloud application needs a load balancer at the front, it needs scaling up and down, and it needs one service to deploy the complete application, covering the infrastructure, the servers, storage, and network, along with the application and its configuration. These are the different needs of cloud applications, and that's where Heat was introduced. It is basically an orchestration service to create and manage the lifecycle of cloud applications. So now we know a cloud application means infrastructure plus software, with the required configuration and a load balancer in place. Any cloud application can be modeled in Heat by means of a template, called a heat orchestration template.
It's essentially the equivalent of AWS CloudFormation (CFN). A cloud application, as we saw, is made of different resources, and to model those resources Heat has resource plugins. Once you model the cloud application as a template, you can create as many cloud applications out of it as you like, by means of stacks. Once you deploy a stack, when the load goes up or down you want to scale up or scale down, and Heat gives you auto scaling for that. On top of the infrastructure, you want to deploy the application software, and for that Heat gives software deployment. Once deployed, you want to configure it, and for that it gives software configurations. While all of this is happening, you want to track how things are going, and for that Heat gives events. Let's see a glimpse of what a template is. Take a simple example: say you want to create an instance, create a volume, and attach them. Each of these elements can be modeled: the instance as one resource in the template, the volume as another resource, and the attachment between them as a third resource. All of these together constitute the heat template. So now we know what a cloud application is and how Heat helps to model and deploy it. Heat has the heat engine. The legacy engine had some problems; we worked on them and created the new architecture called convergence, and my friend will talk more about that. Thanks, Kanagraj. I'll go briefly through the convergence architecture, the phases we planned, the comparison between the legacy and convergence architectures, and then some data from benchmarking tests that we ran. So first, the motivation for doing convergence.
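The instance-plus-volume example described above might look roughly like this as a heat orchestration template. This is an illustrative sketch, not a template from the talk; the resource names are made up, and the image and flavor values are placeholders you would replace for your cloud:

```yaml
heat_template_version: 2015-10-15

description: Illustrative stack - a server with an attached volume

resources:
  my_instance:
    type: OS::Nova::Server
    properties:
      image: cirros          # placeholder image name
      flavor: m1.small       # placeholder flavor

  my_volume:
    type: OS::Cinder::Volume
    properties:
      size: 1                # size in GB

  my_attachment:
    type: OS::Cinder::VolumeAttachment
    properties:
      instance_uuid: { get_resource: my_instance }
      volume_id: { get_resource: my_volume }
```

The `get_resource` references are what give Heat the dependency graph: the attachment depends on both the server and the volume, so Heat knows to create those two first.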
Robustness was lacking in the legacy heat engine, in the sense that the engine doesn't really know what happens to a resource after a stack is deployed. Because clouds are noisy, hardware or software failures, or user actions such as the deletion of a VM or server without Heat's knowledge, can end up causing a stack failure when the user does a heat stack-update again. There are also issues with heat engine restarts: when a heat engine restarts, it marks any stack that was in progress as failed. This is pretty common when the system is upgraded; the heat engines reboot, and the stacks being provisioned are moved to the failed state. We wanted to ensure that when such things happen, whether the heat engines reboot, or the provisioned resources out in the physical world disappear or go down for some reason, Heat automatically brings them back to match the desired state declared in the template. Another issue was scalability. In the current heat engine, a stack is provisioned within a single heat engine process: if you request a stack, the whole request is taken by one heat engine, which provisions it over time, and the whole stack lives in that one process. So it may happen that all the large stacks end up on one particular heat engine, exceeding its capacity. We also wanted Heat to scale up to the limits of the external components, such as the DB or the messaging queue, rather than being the limiting factor itself. If the DB supports 1,000 transactions per second, Heat should take that up; if it supports 10,000, it should take that up too. I'm just giving an example; Heat shouldn't restrict things by itself. Then there is availability and usability.
Heat should be available in the sense that it should take the right action without requiring user intervention. For example, if you have a stack and a network port is deleted, or your VM is powered off, Heat should bring it back and keep it in sync with what is required. Also, users should be able to update a stack at any time. As of now, for stacks that are in progress and take a long time, the user cannot do a stack update; they have to wait for the stack to move to either a failed or a complete state before they can update it. One more requirement was backward compatibility: when we move to the convergence heat engine, stacks created in the legacy heat engine should continue to work fine after the convergence engine is configured. All of these efforts were addressed under the name of convergence, and it all started during the Juno time frame. After a lot of discussion and emails, we decided to do convergence in phases, because it required a lot of changes in the heat engine. Phase one was mostly targeted at design changes in the heat engine itself: we decided to persist the dependency graph in the DB, and to persist progress information as resources finish, so that when a heat engine restarts it can pick up from where it left off rather than starting again. We also decided to have workers in each heat engine that take up work per resource, instead of the stack being provisioned in one heat engine process; all heat engine processes should be able to take up resource tasks from any of the stacks. Then resources should propagate on their own; they don't need to go through the entire process again, because the resources are now distributed among multiple heat engines. In phase two, we thought we would do the observer.
I'll briefly talk about what the observer is, and then maybe the fault-tolerant heat engine. Fault-tolerant in the sense that if one of the heat engine processes goes down, or the systems on which the heat engine processes are running reboot, the remaining engines should be able to take the stacks forward, so users don't really see that these things have happened in the back end. In phase three, we thought we would do the continuous observer, which is still under discussion. The continuous observer continuously monitors the resources you have provisioned using Heat: it either polls the resources or listens to notifications, then decides what needs to be done, takes action, and brings the stack back to the desired state declared in the template. So let me talk about the design evolution, starting with how the legacy heat engine works. These are the components and how they are connected: you have the heat API; templates are given to the heat API, and the request goes to RabbitMQ, or whichever messaging system is configured; then the request ends up at a heat engine, which has a database in the back end. A user requests a stack. Stack one here actually has four resources; I am showing them stacked on each other to convey that resources A and B are needed for resource C, and resource D needs all three of them to be done. That is how the user wrote the template; I have just modeled it like that. When the request is issued, heat engine one takes up the stack request and starts provisioning it. Similarly, stack two from another user is taken up by another heat engine, and stack three by yet another, and they start provisioning these stacks. So stacks are distributed among the available heat engines.
Then, when stacks one and two are done on heat engines one and two, stack three, as you can see, is still running on heat engine three. One issue here is that stack three, which is a very big stack, is handled only by heat engine three, even though after stacks one and two finish, heat engines one and two are free to take up the load for stack three. That is how it works in the legacy case. Let's go to the convergence heat engine. All the components remain the same; we have not introduced any new component. What changed is how the heat engines interact with each other, plus changes related to the dependency graph of the template. The first thing in convergence is that the heat engine processes register themselves as workers with the AMQP queue, which means each one tells RabbitMQ that it is available to provision a particular resource from any of the stacks. Let's go through the same use case. A user issues a request to provision stack one, and it ends up at, say, heat engine three. The stack is provisioned in the sense that the template is parsed and validated, and the resources are created in the database. As you can see, A and B are the first set of resources that need to be provisioned; C and D cannot be done yet because they depend on A and B. So the engine schedules A and B for provisioning, and as you can see, they are distributed among the available heat engines; they probably go to heat engines one and two. A similar thing happens with stack two, and with stack three: stack three is loaded and parsed, the dependency graph is created, and the first set of nodes, A, B, C, D, E, F, are distributed among the available heat engines for provisioning. Then, how does it propagate? I want to show how the resources really propagate.
Let's say from stack one, resource A is done. When resource A is done, the worker marks resource A as done in the DB and then updates the data for C: whatever data C needs for provisioning is updated in the DB. Resource C could be, say, a volume attachment, which needs a server: say A is a server, B is a volume, and C is the volume attachment. The volume attachment needs the server ID and the volume ID so that it can do the attachment. In this case, when the server is done, it updates C's data saying "this is the server." But it cannot trigger C yet, because B is not done. When B is done, it again updates C's data, saying "I am done and this is the volume ID." Then it sees that A is also done, and that is when it schedules C for provisioning. So here you can see the first level of nodes are all done for stacks one, two, and three. I just want to show you the distribution of work; this is how it happens in the case of the convergence heat engine: all the heat engines are provisioning resources from all of these stacks. And this is how it propagates: stacks one and two finish, then the second level of nodes in stack three is done and the remaining nodes are triggered, and finally the last resource from stack three is provisioned and the stack is marked as done. Now, a brief design comparison between legacy and convergence. In the legacy heat engine, there was a stack-wide lock: when you issue a request to create, update, or delete a stack, the heat engine acquires a lock on that stack and then starts provisioning, deleting, or updating it. This in a way restricted users: they were not able to update a big stack that was still being provisioned. This is not the case in convergence. In convergence, the locks are more granular; they are actually on the individual resources being worked on.
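The propagation mechanism just described, where each finished resource writes its data into its dependents' records and the last arrival triggers scheduling, can be sketched roughly like this. This is an illustrative model only, not Heat's actual code; the names and data structures are made up, and real Heat keeps this state in the database rather than in dicts:

```python
# Dependency graph for the example in the talk:
# C needs A and B, D needs C.
requires = {"A": [], "B": [], "C": ["A", "B"], "D": ["C"]}
dependents = {"A": ["C"], "B": ["C"], "C": ["D"], "D": []}

sync_points = {name: {} for name in requires}  # data received so far
scheduled = []                                 # resources ready to provision

def resource_done(name, output):
    """Called by a worker when `name` finishes provisioning."""
    for dep in dependents[name]:
        sync_points[dep][name] = output        # e.g. server/volume IDs
        # Schedule the dependent only once ALL its requirements reported in.
        if set(sync_points[dep]) == set(requires[dep]):
            scheduled.append(dep)

# A and B complete, in any order, possibly on different heat engines:
resource_done("A", {"server_id": "srv-123"})
resource_done("B", {"volume_id": "vol-456"})
print(scheduled)  # ['C'] -- C is triggered only after both A and B are done
```

The point of the sketch is that no single process has to "own" the whole stack: whichever worker finishes last for a given dependent has everything needed to schedule it.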
So if a stack is in progress, you can still issue an update on that stack. What happens is that only those resources which are locked will be waited upon, and when they are done, the new update takes over; all other resources are provisioned right away, depending on the dependencies, of course. Then there is the graph and progress information, that is, how far the stack has gotten and what will be provisioned next. In legacy, the whole stack is held in memory, and the engine continuously provisions it. In convergence, all of this is put in the database, in the form of the dependency graph and the sync points. Sync points are nothing but the resource data I talked about: when resource C needs to be provisioned, it needs data from resource A and resource B. This actually forms the foundation for a fault-tolerant, highly available heat engine: when heat engines go down and come back up, they can simply look at the sync points and proceed from there; they don't have to go figure out where to start. As for load distribution, in legacy, as you have seen, stacks are distributed among the available heat engines. In convergence it's more granular: the individual resources are distributed among the available heat engines, so resource utilization is much higher. And because of the stack-wide lock I just mentioned, concurrent update, that is, issuing an update while a stack is in progress, is not available in the legacy heat engine, but it is available in convergence. Let me now talk briefly about the observer. Why was it needed? When a stack is provisioned, the resources are out there in the physical world, and Heat doesn't really know what is happening to them. It needs to stay in sync with physical reality, and for that we needed an observer.
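The resource-level locking contrast described above can be sketched like this. This is purely illustrative: Heat's real locks are database-backed rows shared across engine processes, not in-process threading locks, but the scheduling consequence is the same, an update only waits on the specific resources currently being worked:

```python
import threading

# One lock per resource instead of one lock per stack (illustrative only).
resource_locks = {name: threading.Lock() for name in ("A", "B", "C")}
completed = []

def provision(name):
    with resource_locks[name]:    # lock only this one resource
        completed.append(name)    # ... real provisioning work would go here ...

# Simulate resource "A" being mid-provision on another engine:
resource_locks["A"].acquire()
provision("C")                    # an update touching C is NOT blocked by A
resource_locks["A"].release()
provision("A")
print(completed)  # ['C', 'A'] -- C proceeded while A was still locked
```

With a single stack-wide lock, the `provision("C")` call above would have had to wait for the whole stack to finish; per-resource locks are what make concurrent updates possible.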
The primary motivation was that there was no mechanism to know the current state of a stack, and no mechanism to tell the user what's happening in it, whether their server is down or the network port is up, all that information. Also, if a resource is not available, because it was deleted, the server crashed, or for whatever other reason, then your next update is going to fail; any such update on that stack can fail. I'll show you briefly. Coming back to the heat engine: here, as you can see, the stack is already done, A, B, C, D. But resource C is down, let's say; it's not available, say a server that was powered off for some reason, or deleted. Now the user issues another update that has a new resource, N, which depends on C, for example a new volume attachment or a new network port to be attached to the server. In the legacy heat engine, without the observer, the heat engine takes this request, checks resource C's status in the DB, and compares to see whether it needs to be updated. Because resource C has not changed in the template, it won't update it, it won't do anything, and it goes ahead with resource N. And when it tries to do that, it will fail, because in the physical world the server is actually down or not even there. So the creation of resource N fails in this case, because it depends on C and C is not available. With the observer, when a stack is updated and a new resource like N is there, Heat goes through each resource and tries to sync up with reality: it polls Nova for the state of the server, or polls Neutron to get the state of the network port.
If the reality doesn't match what is in the template, Heat tries to sync it up: if the server is down, it will bring it up, or perhaps create a new one. After doing that, it provisions resource N, and that will succeed. That was the primary reason to bring in the observer, not really a separate component, but a feature in the heat engine. Talking about progress, it's been a long way. Near the end of Juno, the initial convergence blueprints were filed. In Kilo, we did a lot of POCs and evaluations, and then decided to do the work in phases. Phase one blueprints were submitted in the Kilo time frame, and two of them were implemented: the DB changes and the convergence message bus. In Liberty, we did a lot of phase one blueprints: the lightweight stack, the convergence graph computation and storing it in the database for reliability, the resource-level locking mechanism, the concurrent update feature, and the convergence-related scenario tests, which were very important. We also had an experimental gate job running every time people submitted patches, because we wanted to test whether things were going fine with convergence. In Mitaka, we did a lot of bug fixing. There were intermittent issues with the functional tests, sometimes failing, sometimes passing, so we had to fix all of them. The experimental convergence gate job that we introduced in Liberty was made mandatory in Mitaka, and there were a lot of patches around the observer in the Mitaka time frame. In Newton, in phase one, we plan to make convergence the default heat engine. We also have plans to implement convergence phase three, which I talked about, where a continuous observer continuously looks at the state of resources in the physical world and keeps syncing them up.
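The observe-and-converge idea described above can be sketched as follows. This is an illustrative model only: the `observe` function is a stand-in for polling Nova or Neutron, not a real OpenStack SDK call, and the state names are simplified:

```python
# Desired state as declared in the template (simplified to one string each).
desired = {"server-1": "ACTIVE", "port-1": "UP"}

def observe(resource_id, reality):
    """Stand-in for polling Nova/Neutron for the real state.

    Returns None if the resource no longer exists at all.
    """
    return reality.get(resource_id)

def converge(reality):
    """Compare observed state against desired state and decide on actions."""
    actions = []
    for res, want in desired.items():
        got = observe(res, reality)
        if got is None:
            actions.append(("recreate", res))  # vanished: create it anew
        elif got != want:
            actions.append(("update", res))    # drifted: bring it in sync
    return actions

# Example: the server was powered off and the port was deleted outside Heat.
print(converge({"server-1": "SHUTOFF"}))
# [('update', 'server-1'), ('recreate', 'port-1')]
```

The one-shot observer runs a pass like this during a stack update; the planned continuous observer would run the same comparison whenever polling or a notification reports a change.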
We also wanted to bring active-active HA capability into the heat engine, so we'll discuss that at this summit and decide what needs to be done there. That's mostly the progress of convergence over the past few releases. Next, we ran Rally tests, and this is the data collected from them. We took the templates and the scenario for the Rally tests from Angus, so thanks to him. We ran them on an HP server with 48 CPUs, 256 GB of RAM, and a 1 TB hard disk. We had multiple runs with different numbers of heat engine workers, because we wanted to see what happens as the number of workers goes up. This run is with 32 heat engine processes, creating stacks of various sizes; you can see the stack size growing horizontally, 50, 100, 150. And you can see that in the case of the convergence heat engine, the throughput is higher, meaning it takes less time to provision them. We carried out the tests with code from the master branch plus a few patches which are still in review. Here is the comparison across different numbers of heat engine processes. We wanted to see whether it really scales out: if we have a convergence heat engine running with eight processes and then add more processes, does the throughput actually increase? That is what is shown here for the convergence engine: the faded green line is with eight heat engine processes, and the time taken is much higher than with the orange line, which is 32 heat engine processes. When we compare the same thing on the legacy heat engine, there is not much improvement, because it's not designed to scale out: one heat engine process takes one stack and provisions it.
So obviously there's not going to be much difference there. This slide is about the contributions that were made, the blueprints and the folks involved. I thank the whole Heat developer community for giving a lot of reviews, helping us with POCs and evaluations, filing blueprints, and taking this forward. Thank you all. I hope we will take the observer and continuous observer forward, so please get involved with Heat: there is a lot of work to be done on the observer, a lot of functional tests to be written, and a lot of scope in Heat in general. Q&A, that's all from me.

You showed a graph of the time it takes as you add more to your heat stack. I was curious about that sort of graph, but against the memory footprint, and I was wondering if you tested that. In other words, do you have a similar graph, or was such a test taken?

These ones? In the case of convergence, we compared eight heat engine processes with 32.

The previous one, but it's the same idea. I see you've improved the time taken. Did the memory footprint change at all, or is it about the same?

I think the memory footprint went up in the case of convergence, but I have not collected that data; I just wanted to see whether the throughput goes up or not. I would say it went up marginally, but we are still in the process of getting these tests done, so we should be able to share that data in the following weeks, when convergence is actually getting enabled by default.

That was one of my questions. So I have two questions. One: if I wanted to run a convergence heat engine, how would I do that? And two: can I run a convergence heat engine against a heat API that is on an older version?
What's the compatibility between the two?

So how do we run the convergence heat engine, that's the first question, right? If you take the Mitaka release, for example, you just have to enable it using a configuration flag; it's disabled, marked as false, by default.

And the other question: can you use a convergence-enabled heat engine against a legacy heat API, or are they tied together?

From the API perspective it would still be the same. There are no changes in the API; the changes are in the heat engine, so the heat API part is not affected.

Right, so my question is: could I potentially run an older version of the heat API against a newer heat engine?

Yes. From an end-user perspective, you can use the same old API and the same old templates against a convergence-enabled heat engine.

Do you plan on implementing cancellation, the heat cancel-update kind of thing?

Yes, there is already a patch in review for stack cancel-update in the case of the convergence heat engine. We are planning to do that.

Next question: what is the unit on the horizontal scale for stack size? Is that the number of resources?

Yes, that is the count in a resource group, basically, and that includes the dependencies as well; everything is a resource at that point. Basically, we had a template with a resource group, and there was a way to change its count; we incremented that count by 50, as you can see there. The resource group itself is a resource, and the actual resource inside it is another stack, which emulates a server, a volume, a volume attachment, and a floating IP, but it's emulated by another resource called TestResource.
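As a pointer for the configuration flag mentioned in the answer above: to my recollection this is the `convergence_engine` option in `heat.conf`, so treat the exact option name and defaults as an assumption to verify against your release's documentation:

```ini
# /etc/heat/heat.conf
[DEFAULT]
# Use the convergence engine instead of the legacy engine.
# Disabled (false) by default in Mitaka; planned to default to true in Newton.
convergence_engine = true
```

A restart of the `heat-engine` processes would presumably be needed for the change to take effect.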
So we have actually not run it against a real cloud or physical resources; we just tested with TestResource.

You also said you can turn on migration in the Newton release; convergence will be enabled by default, right?

We are planning to enable it in Newton phase one, yes.

So if I'm already using the legacy engine, do I have to migrate my stacks, or will there be a migration process? If there is one, would it require that the engine be brought down, or will it be a no-downtime upgrade?

We are in the process of writing scripts to migrate the stacks that were already created over to the convergence heat engine.

Okay, thank you. Hello. My question would be: how is the action "check" related to resource observation? Is it implemented through that, or is it something completely different?

In the case of resource observation, we plan to look at all the properties and then see if we can update that resource, to bring it back to the desired state given in the template.

So you mean that you actually have to implement the comparison of resources for every resource, one by one?

Yes. For every resource it will poll the external system, Nova or Neutron, gather the real state, and then try to update or replace the resource.

So my question is also along the same lines. There were plans to address failure policies at the resource level, right? Like, after a stack create is done, if one of the resources fails, heat convergence will take care of applying failure policies for the resources. Is that in convergence?

I think that would fall in the bucket of the observer, the continuous observer, which is planned for discussion at this summit.

Okay, it's still being planned.
To clarify: you're talking about when a resource fails, Heat would automatically know that the resource has failed and take corrective action to restore it, so that it's back in sync with the stack, say somebody deletes a VM that's in a stack, right?

Yes. That is not handled today in convergence. That is the part we plan to handle in the continuous observer, which we will discuss at the summit.

So is that planned for the Newton cycle?

We are planning to discuss it, yes.

Okay, thanks. Hi. Earlier today there was another talk on a project I also just learned about, called Senlin, and in the Q&A session of that, I believe heat convergence was mentioned as well, and somehow these overlap or are related. I was just wondering if you could comment on the relationship between heat convergence and that project, if you're aware of it.

I think that was with respect to auto scaling. What Senlin does is auto scale: it will see that there are, say, four resources being overused, add a fifth or sixth resource, and likewise scale down. But say you have scaled up from four servers to five, and then for some reason two or three of those servers go down. Not in the auto-scaling case, but say in a resource group that is not an auto-scaling group: in that case, if a server goes down in that group, the observer or continuous observer will bring the group back to the declared number of resources. The observer is particularly tricky to implement for auto scaling, so as of now we don't plan to bring in the observer for the auto-scaling resource group, because there the scaling happens based on feedback from Ceilometer or Monasca, and it's a bit tricky to do the same thing there.
And just to add to that: with Heat, if you integrate auto scaling with Monasca or Ceilometer, that is where Heat gets its information on whether to auto scale or not. I'm not sure how Senlin would do it. That was one part we didn't get much clarity on in that session; probably during cross-project discussions we will have to see which specific use cases Senlin and Heat would each be addressing.

In the closed-loop corrective system you're creating with your observer model, the correction happens immediately, right? Because in your example, you said that when a request comes for N, it sees the dependency C, and if C is not there, then it creates it. But in a closed-loop corrective system, the correction happens as soon as an error is detected; it does not have to wait for a request to come, right?

No, that's the continuous observer part. When the continuous observer is done, it will either poll or depend on notifications from Nova; say Nova notifies that the server is down, then it will immediately bring it back.

Okay, and how do you determine whether the action happened intentionally, by a user, or due to an error? The server could have been brought down for administrative reasons, or it could have crashed. How do you distinguish these two cases? In the first case you really don't want to restart it automatically, because it's a user action.

We probably cannot determine whether it was the user who deleted or powered it off. That's one of the reasons why, in the continuous observer, there will be options to specify whether a particular resource has to be recovered or not, in the sense of: should this resource be continuously monitored, or should it be kept out of monitoring?

Okay. And even in this case, I think the particular use case you described, where an admin has actually brought a particular resource down for maintenance.
This has to be considered in the continuous observer use case, so that it doesn't automatically interfere with an admin-initiated operation.

What we plan to do is give a flag at the stack level and the resource level where you can say whether you want to observe continuously or not. Suppose you plan to do a maintenance activity: you can switch that flag off, continue with your activity, and then switch the flag on again, and it will start monitoring again. Thank you.

I think my question connects to the previous one, so I forgot it, I'm sorry. But I have another question. We see these graphs, and you say that the number of processes and threads running the Heat tasks scales quite linearly. Currently, as I saw in the Heat code, the Nova tasks are implemented in a quite naive way. For example, when you boot up a server, Heat will wait until it reaches the running state, but this is done very naively, going down to the Nova list API and getting the state of the server. So with lots of concurrent threads, this can easily overload the Nova API, I guess. Have you been thinking about this?
So we have not tested with actual Nova, but I'm pretty sure it will load Nova, so Nova also has to scale to the users' needs. When a user wants to deploy a template which has, let's say, 50 or 100 VMs, in the case of convergence you can imagine all the requests going out at the same time. In the case of legacy they also go out, but there is a bit of switching happening in eventlet, so convergence is a bit more parallel. It may happen that the Nova API returns saying that a request cannot be done; that's possible, but the ball is in Nova's court to address those issues. If I understand the question correctly, the concern is that Heat is actually overloading these APIs because it's doing a lot of polling? Yes, okay, that's a valid concern.

And do you consider the order of the events you observe? For example, when a port disappears from the API, do you immediately want to recreate it, or do you also correlate it to something else, like the server having been deleted?

What is being planned is that when you see that a certain thing is not there, you probably create it, so that the stack is in sync with what is desired. Again, I would say you would actually look at the hierarchy, the dependent components, and then decide whether to create it or not. These are finer aspects. Thank you.