Okay, so I guess we can go ahead and start. Good morning, everyone. My name is Ton Ngo; I'm with the IBM Silicon Valley Lab in California. With me is my colleague Fabio Oliveira, a research staff member at the IBM Watson Lab in New York, and Winnie Tsang, also at the IBM Silicon Valley Lab in California. Our talk today is on debugging full-stack deployment with Heat. By "full stack" we mean that it covers both the resources and the software. Debugging is a pretty broad topic, and in the next 40 minutes we're not going to cover everything. But the Heat community has pretty broad knowledge in this area, so if you have a difficult problem to debug, the Heat IRC channel is a pretty good place to get some help. What we'll actually do today is outline our vision for debugging Heat templates, describe what you can do today with the tools available, show how things are being improved, and show you some demos.

All right, so I think by now we are all becoming pretty familiar with Heat, but let me spend a few seconds to quickly review what Heat is. From a user's perspective, what you get from Heat is a domain-specific language. With that, you write a template: you describe the resources you want to create, the software you want to deploy, and the dependencies between them. Heat will take that and drive the orchestration to create all the resources and deploy your software. Here we highlight some of the key features of Heat. This is not comprehensive by any means, but we want to make a few points. Besides creating the resources, Heat also drives the automation on your VMs: it can run your scripts, it can interface with configuration management tools like Chef and Puppet, and it can be extended to other technologies as well. The second point here is that Heat updates stacks in place, which means it drives the full lifecycle of your stack.
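To make this concrete, here is a minimal sketch of what such a template looks like; the resource name, parameters, and script are illustrative, not taken from the talk:

```yaml
heat_template_version: 2013-05-23
description: Minimal sketch of a HOT template (names are illustrative)

parameters:
  image_id:            # parameterized so the template ports across clouds
    type: string
  flavor:
    type: string
    default: m1.small

resources:
  app_server:
    type: OS::Nova::Server
    properties:
      image: { get_param: image_id }
      flavor: { get_param: flavor }
      user_data: |     # software automation driven by Heat on first boot
        #!/bin/bash
        echo "installing the application..."

outputs:
  server_ip:
    value: { get_attr: [app_server, first_address] }
```

Heat validates this description, creates the resources in dependency order, and runs the embedded automation on the VM.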
Heat does more than just the initial provisioning, and this is a key point that is sometimes missed. And then there's the template itself. You can make your template portable by parameterizing the parts that would differ between clouds. That way you can take your template, deploy it to a different cloud, and customize just the cloud-specific parts: things like network IDs, image IDs, and so on. So you can see that Heat does a lot of work for you. It does a lot of automation, and this is all very nice. The problem is that when something goes wrong with all that hidden automation, it can be pretty difficult to get in and debug.

So what can go wrong? Well, you can simply have bugs in your template. Heat does a really good job of validating your template, but you can still make errors that make your template behave in very mysterious ways. You could have wrong dependencies in your template; these show up as timing errors, and they can fail intermittently or in some non-deterministic way, which makes them difficult to debug. You could have bugs in your scripts: bugs in your bash scripts, your Chef and Puppet automation, and so on. These tend to show up when you move from image to image. You could be using a cloud resource in a way that doesn't work, either because you misunderstood the spec for the resource or because there isn't enough documentation for that particular resource. And the cloud itself can introduce errors: you could have issues with capacity, the cloud could be misconfigured if it's a small cloud, or there could just be transient errors in your environment.

Okay, so you've got a failed stack. What do you do? Here we outline a sequence of steps a user could go through to debug a failed stack. You can start by looking at Horizon to see what the state of your stack is.
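Alongside Horizon, the same investigation can be driven from the heat command line; a typical sequence on a Juno-era cloud might look like the following (the stack name, SSH user, and log locations are illustrative defaults, not from the talk):

```shell
# 1. Stack-level status: what failed, and why Heat thinks it failed
heat stack-list
heat resource-list my-stack      # per-resource status
heat event-list my-stack         # chronological events, incl. error strings

# 2. On the VM itself (needs login access to the instance)
ssh ec2-user@<vm-ip>
less /var/log/cloud-init.log            # boot-time initialization
less /var/log/cloud-init-output.log     # output of your boot scripts

# 3. On the OpenStack hosts (needs admin access)
less /var/log/heat/heat-engine.log
less /var/log/nova/nova-api.log

# 4. After fixing the problem, continue from where you are
heat stack-update my-stack -f fixed-template.yaml
```

Each step down this list requires more access and more expertise, which is exactly the spectrum discussed next.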
In Horizon you should be able to see which resources were deployed successfully and which are in error, so that should give you a pretty good view of where you are. Sometimes you get a useful error message, and that helps. If that's not enough, you can go to the heat command line and get more information about your resources. Then you can log in to your VM and look at the logs; there are a lot of logs you can use. If you have a problem initializing your VM, you can look at the cloud-init log. If the automation on your VM is not running correctly, you can look at the logs for those tools, like Chef and Puppet, and Heat also keeps its own log of the software-configuration activity on the VM, which can be useful too. Finally, you can get onto the OpenStack servers and look at the logs of the different services to see how the OpenStack services are responding to the requests from Heat to create your resources. That would be the last thing to look at. And once you figure out your problem, you can recreate your stack, or you can run a stack update to continue from where you are, and that fixes the problem.

There are two observations we want to make here. First, the sequence of this flow roughly corresponds to the level of complexity that a user has to deal with, and that in turn corresponds to the level of expertise you need to be able to use the information and debug. Second, the flow also corresponds to the level of authorization you need to get to the information. For instance, to get to the logs on the VM you probably need root privileges, and the cloud provider might not give you those privileges. To get to the OpenStack logs, you certainly need admin-level access. Based on those two observations, we can see two groups of users. Toward the top you have the typical cloud end user, and they would tend to stay at the top of that spectrum.
And then at the bottom of that spectrum you have your cloud admin, your DevOps people, someone who has full knowledge of the stack. The group at the top, the cloud end users, is a much broader, bigger base of users. So the question we ask is: how can we make debugging easier for that big group of users?

To answer that question, we want to outline our vision for building debugging tools. We start at the bottom. At the Heat engine level, we need a set of low-level supports to help with debugging. First, you need pretty robust error handling, with hooks for tools to tap into; if you look at any processor architecture, this is pretty basic stuff that is provided. Then you need a way to control the deployment of your stack: starting it, stopping where you need to, and continuing from where you stopped. And you need access to all the resources in your stack, and access to the logs, in some kind of controlled way. Once you have this kind of low-level support, you can envision building a pretty basic command-line debugger, something like GDB. What you get there is a context on your failed stack, and with that it becomes nice and convenient to use simple commands to navigate around, manipulate your stack, start and stop and single-step, and so on; pretty basic stuff. Going from there, you can envision building a pretty advanced tool, probably graphical, that integrates everything end to end for your development and your operations. For this, Eclipse could be a good framework to build the tool on, and something like the ELK stack could be used to interface with all the logs in your OpenStack cloud. Okay. So the next question is: who is going to build this? Well, at the Heat engine level, all the low-level support certainly belongs in the Heat engine.
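Purely as an illustration of the GDB-like idea, a session with such a stack debugger might look like this. No such tool exists today; every command name here is invented:

```
(heatdb) attach my-stack          # open a debugging context on a failed stack
(heatdb) break db_server          # stop before this resource is created
(heatdb) run                      # start (or resume) stack creation
Breakpoint: about to create resource 'db_server' (OS::Nova::Server)
(heatdb) print db_server.properties
(heatdb) step                     # create just this one resource
(heatdb) continue                 # run to completion or to the next breakpoint
```

The point is the workflow, not the syntax: a held context on the stack, plus start, stop, single-step, and inspection.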
So that low-level support, I think, has to be a community effort. The basic debugger could also be open source; it could be a contributed tool that someone writes and contributes to OpenStack and to Heat. And for the advanced tool, we think this is where vendors can step in and provide a value-add offering on top of OpenStack, so this is a good opportunity for people to get in. Right, so this is a pretty nice picture. And now, where are we? What I'd like to do next is show the different elements of this picture that are emerging in OpenStack Heat, so that maybe one day we can fill out the full picture.

Here I'm showing a number of pieces of low-level support that help with debugging, in particular in the Juno release. The first is a new feature: you can now update a failed stack. Before Juno, if your stack failed, there was nothing you could do but delete the stack and recreate it. With this feature, if you don't roll back your stack, this is more or less equivalent to resuming a failed stack, which is one of the low-level supports we mentioned earlier. There are many scenarios where this is really useful. You could have a very large stack that takes a long time to recreate every time. You could have a long-running stack, maybe something in production, where you do an update and something fails; in that case it's not nice to have to kill your stack and recreate it. Another little feature that helps with debugging: when you debug, you tend to redeploy the same stack over and over again, and you tend to reuse the same parameters. This feature is a new command-line option and API that lets you reuse the parameters from your previous run by pulling them out of the database. And finally, you can now cancel a stack update in the middle of the update. If you run an update and it looks like it's in error, you can just cancel it.
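In terms of the Juno-era heat client, these three features map roughly onto commands like the following (stack, template, and parameter names are illustrative):

```shell
# Resume a failed stack: update it in place instead of delete/recreate
heat stack-update my-stack -f template.yaml -P "key_name=my-key"

# Reuse the parameters from the previous run instead of retyping them
heat stack-update my-stack -f template.yaml -x

# Override just one parameter and keep the rest from the previous run
heat stack-update my-stack -f template.yaml -x -P "volume_size=6"

# Cancel an update that appears to be stuck or failing
heat stack-cancel-update my-stack
```

The `-x` (`--existing`) flag is the parameter-reuse feature; `stack-cancel-update` is the mid-update cancel.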
Again, if you don't roll back your stack, canceling lets you get in, do some debugging, and potentially resume from there. This chart shows a number of new blueprints. These are not implemented yet, but they are potential new features that would provide the kind of low-level support we mentioned earlier. I won't go through all of them, but I want to make the point that this is very much a community-wide effort, and I think the community does recognize the need for supporting debugging of templates. So now that we have laid out the emerging elements in OpenStack Heat for debugging, let me pass on to my colleague Winnie. Winnie will take you through a couple of debugging scenarios. Thank you, Winnie.

Hi. Thanks, Ton. I will show some typical techniques most people would use to debug Heat templates on the OpenStack Juno release. Let's take a closer look at my first failure scenario. This scenario is very simple: all it does is create a Nova server and then use Heat's SoftwareConfig and SoftwareDeployment resources to configure that server. Let's look at the status of this stack. What I'm showing here is checking the stack from Horizon. You can easily get to this page by going to Project, then Orchestration, then Stacks in the left navigation. On the right side it shows all the stacks belonging to you in this project, and you can simply click on the name of a stack to get its detailed information. What I'm showing is the Resources tab, the detailed resource list of that stack. If you look closely, you will see that the stack failed. The server was created, and the configuration was also set up; however, when it was deployed, you see there's an error. The message just says the deployment exited with an error status. What does that mean? Not sure yet. All I know is that when it tried to deploy this configuration, it failed.
So the most logical place to look is probably to log in to the VM and look at the system log file; hopefully we can find some error message there to help us figure out what's wrong. Let's look at the system log file for the error message. In there I find this error: it says that when it tried to run the configuration script I sent in, it failed. And if you look a little bit further up, it's complaining about the very first line of my shell script. So I open my script file, and I see it's very simple; nothing looks wrong. So I'm thinking maybe I should try to just manually rerun it here and see what's going on. And I get the exact same error message that I saw in the log file, which tells me there's really some problem with my script file. So I open it again, look at it again, stare at it a little, and finally I see what my problem is: the file is in DOS format, and I'm trying to run it on a Linux machine. That's why it failed. The reason I got here is that I created this file on a Windows machine, SCP'd it over to a Linux machine, and forgot to convert the format.

After fixing the file format, I run a stack update from the command line, and here you see the update is successful. Here I'd like to point out that I used two new features added in the Juno release that Ton mentioned before. The first is that I was able to run a stack update against a stack that is currently in a failed state. In the previous release, Icehouse, if your stack failed, all you could do was delete the whole stack and start over; in this release, you can update it. The second feature I used is this little -x option, which allows you to reuse all the existing parameters you defined in your previous run. So you can reuse them; you don't need to retype everything again.
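The line-ending bug in this scenario is easy to check for programmatically. Here is a small sketch (not part of the talk's tooling; function names are mine) that detects and converts DOS-style CRLF line endings before a script is shipped to a Linux VM:

```python
def has_dos_line_endings(data: bytes) -> bool:
    """Return True if the file content uses CRLF (DOS) line endings."""
    return b"\r\n" in data


def to_unix(data: bytes) -> bytes:
    """Convert CRLF (and any stray CR) line endings to LF."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


if __name__ == "__main__":
    # A script saved on Windows, as in the scenario above
    script = b"#!/bin/bash\r\necho hello\r\n"
    assert has_dos_line_endings(script)
    fixed = to_unix(script)
    assert fixed == b"#!/bin/bash\necho hello\n"
```

Running a check like this against every script embedded in a template would have caught the failure before the stack was ever created.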
Or, if you want to change just some of the parameters, all you need is the -x option plus the parameters you want to change; then you change just those few.

Let's take a look at my second scenario. This is also a simple case: all it does is create a Nova instance and a Cinder volume, then attach the Cinder volume to that server. It failed again. Here you see the Nova server was created successfully, but when it tried to create the Cinder volume, we have a problem. And all it says is "unknown error," another error message that is not descriptive; I don't know exactly what it is. Since it failed when trying to create the volume, it's logical to look at the Cinder log files, which may give me some information about the failure. Before I get into the Cinder log files, I'd like to point out that most regular users cannot access the OpenStack hosts, so they would not be able to look at any of those Cinder log files. It would be a little hard for a regular user to debug this problem. I know it's a simple case, I understand; but just think about a more complex case. Okay. When I look at the Cinder logs, there are three log files: one for the API, one for the scheduler, and one for the volume service. I find some error and warning messages in the scheduler log file: one error and two warning messages. The error message says no valid host was found, no weighed hosts available. So it sounds like it cannot find a host to put my volume on. When I look a little higher up, the warning message clearly tells me what's going on: insufficient free space for the volume. I'm requesting 50 gigabytes while only 7 gigabytes are available. So this is very simple to fix. I have three solutions here.
One is to identify some unused volume on the system and delete it to free up space. The second is to simply reduce the size of the volume to fit the available space and attach the smaller volume to the server. The third is to scale out with another Cinder node so I have more storage space. Here I pick the simple way: just reduce the size of the volume and attach that to the server. And here you see it runs successfully.

So I have shown you two scenarios where I used Horizon and the command line together to get more information about the failure of the stack, and then used my own experience to guide me to the log files with the most potential to tell me more about the failure. A lot of the time this is hard to do. These cases were very simple; everybody knows where to go. With a complex stack, it's harder to figure out where all those log files are, and when the error messages are not descriptive, it becomes harder still. Also, some log files are not available to all users. So this creates an opportunity for vendors to step in and build tools that help ease this process. Now I'm going to turn it over to Fabio to talk about a high-level, end-to-end advanced tool that simplifies this debug process a bit.

All right. In this part of the presentation, we're going to show you how an end user could troubleshoot failed Heat stacks end to end, from the perspective of a high-level advanced tool. That tool is IBM UrbanCode Deploy with Patterns. Remember the last scenario that Winnie showed us; we're going to revisit it in this part of the talk from a different perspective. Let me first say a few words about the IBM UrbanCode Deploy with Patterns tool. It's really a web-based development environment for HOT templates, and it provides you with both a graphical editor and a text editor.
The user can easily drag resources from the target cloud into the template they are creating and trigger the provisioning of the stack from there. It provides the typical features you would expect in a full-fledged integrated development environment for a traditional programming language, such as syntax checking as you edit your HOT template. It also verifies whether the Heat types the template references actually exist in the underlying Heat engine, and it gives you warnings if you define a parameter in your template and don't use it anywhere else. Basically, all the typical features you would expect in an IDE for a programming language. I don't know if you know, but this tool was presented at the OpenStack Summit in Atlanta earlier this year; it was announced around that time, and it's being demonstrated here at the IBM booth. So you may want to stop by the booth and talk to our colleagues over there if you want to learn more about this tool.

What you're seeing here is a screenshot of the tool, showing the diagram view: a graphical representation of a HOT template being created. On the right you see a number of palettes; each palette corresponds to a type of cloud resource, for instance images, networks, storage, security groups, et cetera. You can drag those resources into your template, and the palettes are populated based on the resources actually available on your target cloud. You can switch between the diagram view and the source view. Here is the actual HOT code corresponding to the previous graphical representation. You can switch back and forth, and the tool makes the experience very fluid as you go from one view to the other. We're going to show a demo toward the end of the presentation. So we have an editor for HOT templates. How about troubleshooting?
I haven't said anything about troubleshooting yet. In this regard, we have been experimenting with combining this web-based HOT editor with an analytics service. We call this service PDAS; it stands for Problem Determination as a Service. It's a service that can be installed, if you want it, alongside the IBM UrbanCode Deploy with Patterns tool, and it provides some UI extensions to the tool so that the results of the analytics performed on your failed stack can be shown right there alongside the web-based HOT editor. Let me give you a little more detail about this analytics service. It leverages the trio of Elasticsearch, Logstash, and Kibana to collect logs from the OpenStack hosts, parse them on the fly, extract semantic information out of the log events, and index them in Elasticsearch for future reference. In addition, the service has the ability to take snapshots of your cloud state at any point in time and index those snapshots for future reference as well. Today, those two kinds of information can be used to try to pinpoint suspicious events that are relevant to the context of the failed Heat stack. Okay? The end goal here is really to allow the end user to diagnose problems faster. Hopefully. That's the idea. Okay?

Now, let's delve a little into the architecture of the service so we can have a better idea of how it works under the covers. As I said, each OpenStack host is instrumented with a Logstash agent that parses log events on the fly, extracts semantic information from them, and sends those annotated events to the service. We use a Redis key-value store as a buffer, and then a Logstash indexer indexes those events into Elasticsearch. In response to a REST call, the service can take a snapshot of the cloud state at any point in time, and by state I really mean the status of the various resources that are available in the cloud at that time.
So what images are there? What is the metadata of those images? What security groups have been defined? What are the security rules in each security group, et cetera? When that happens, the cloud inspector crawls your OpenStack cloud state and indexes it into Elasticsearch as well, through our internal search engine.

Now let's look at this in the context of the web-based HOT editor. From the editor I showed you before, the user can create templates, open templates that were previously created, and trigger the provisioning of a Heat stack right from the tool. Suppose we provision a stack and OpenStack resources are being created as a result, and imagine that a failure happens right here, at this point in the stack provisioning. At this point, from the tool, the user can invoke an analytics function that, under the covers, makes the web-based HOT editor talk to the analytics service and ask: hey, what's wrong with my stack provisioning? And the service replies with suspicious log events and cloud-state changes that the user should focus their attention on. Those events are rendered in a Kibana dashboard embedded in the tool itself, and the contents and layout of this dashboard are customized for that particular stack failure. That's the idea here.

Okay, let's take a look at that in action. I'm going to show a quick demo revisiting the last scenario that Winnie walked through; remember the Nova server and the storage volume. Okay. You're seeing here the UrbanCode Deploy with Patterns tool, showing the graphical diagram of a template that contains one image connected to a network called private. The name of the instance to be created from this image is going to be summit-vm, and as you can see here, we are requesting that a storage volume 12 gigabytes large be created and attached to this instance when the stack is provisioned. Okay.
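As a rough sketch (assumed, not the actual demo source), the HOT code behind this diagram might look something like:

```yaml
heat_template_version: 2013-05-23
description: Sketch of the demo template (details assumed)

parameters:
  key_name:
    type: string

resources:
  summit-vm:
    type: OS::Nova::Server
    properties:
      image: <demo-image>         # the image dragged in from the palette
      flavor: m1.small
      key_name: { get_param: key_name }
      networks:
        - network: private

  volume:
    type: OS::Cinder::Volume
    properties:
      size: 12                    # 12 GB, more than the cloud has free

  volume_attachment:
    type: OS::Cinder::VolumeAttachment
    properties:
      volume_id: { get_resource: volume }
      instance_uuid: { get_resource: summit-vm }
```

The 12 GB request in the volume resource is what will fail against a backend with only 7 GB free.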
Now, if we switch to the source view, you see the HOT code corresponding to that template: the storage volume, the volume attachment, the Nova server, everything is there. Now let's provision a stack from this. We click on Provision and provide a name for our stack; let's call it summit-example-stack. Here I'm providing values for the parameters defined in the HOT template. I'm selecting a key; the values there are actually automatically populated based on what's in your target cloud. I'm selecting the Fabio key, providing default values for the other parameters, and clicking Provision. At this point, the tool is contacting the Heat engine to create the stack. Let's look at the list of stacks shown by the tool. You see at the top that our stack, summit-example-stack, is being created; the creation is in progress. Let's refresh until it's done and see what happens. Now you see there was a problem, a failure, and the feedback we got from the Heat engine was just that the creation failed, without much detail. So let's look at Horizon to see if we can get more information about this. If we switch to Horizon, you can see that the VM, summit-vm, was created; it's running and active. That's good. But if you look at the list of stacks in Horizon, you see that our stack, summit-example-stack, is in a failed state. Not surprising, right? So let's go back to the tool and try to get more information that would help the user diagnose this problem. Back in the tool, notice that each row corresponds to a stack that is either in an okay state or failed, and there are two actions next to each stack: one is to delete it, and the other is to perform analytics on it.
So when the user clicks on that little icon I'm about to click, the tool calls out to the analytics service to try to get some useful information to help the user; that's what happens under the covers. What you should expect as a result, if there is anything to show, is a Kibana dashboard embedded here in the tool, with information that hopefully will be useful to the user. So we click on the analytics function, and now you see a Kibana dashboard right in the tool. What you're seeing here is that, out of all the OpenStack subsystems and all the log files on all the OpenStack hosts, the tool selected six log events from Heat and 18 log events from Cinder, and you see that it detected six clusters of events, shown with the yellow and green bars. Each cluster contains three log events from Cinder and one log event from Heat, all within the time interval during which the stack provisioning was taking place. As you're going to see later, each of those clusters corresponds to one attempt by the Heat engine to create the storage volume, and failing at it. Now, if we scroll down and look at the actual log events that were returned, you see that the tool decided to show the Heat events in one table next to a table for the Cinder log events. Events are sorted by time: the earlier an event happened, the closer it is to the top of its table, because typically early log events are more strongly correlated with root causes than later ones.
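The kind of time-window correlation just described can be sketched as follows; the event format and the five-second window are assumptions of mine, not the PDAS implementation:

```python
from datetime import datetime, timedelta


def cluster_events(events, window=timedelta(seconds=5)):
    """Group log events whose timestamps fall within `window` of the
    previous event in the cluster; within each cluster the earliest
    event (the most likely root cause) comes first."""
    ordered = sorted(events, key=lambda e: e["time"])
    clusters = []
    for ev in ordered:
        if clusters and ev["time"] - clusters[-1][-1]["time"] <= window:
            clusters[-1].append(ev)   # close in time: same cluster
        else:
            clusters.append([ev])     # gap too large: start a new cluster
    return clusters


if __name__ == "__main__":
    t0 = datetime(2014, 11, 5, 10, 0, 0)
    evs = [
        {"svc": "cinder-scheduler", "time": t0,
         "msg": "Insufficient free space for volume creation"},
        {"svc": "cinder-scheduler", "time": t0 + timedelta(seconds=1),
         "msg": "No weighed hosts available"},
        {"svc": "heat-engine", "time": t0 + timedelta(seconds=2),
         "msg": "CREATE failed on volume"},
        {"svc": "heat-engine", "time": t0 + timedelta(seconds=60),
         "msg": "Retrying volume create"},
    ]
    groups = cluster_events(evs)
    assert len(groups) == 2                           # the retry a minute later is separate
    assert groups[0][0]["svc"] == "cinder-scheduler"  # earliest event first
```

Sorting earliest-first inside each cluster is what puts the Cinder scheduler's "insufficient free space" message, the actual root cause, above Heat's downstream failure.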
Now look at the top event among the Cinder log events. You see all the semantic information we have annotated it with: the actual path of the log file it came from, and the OpenStack component that logged it. You can see it was the Cinder scheduler log, along with more information about it. Now look at the actual log message: it shows there was insufficient free space for volume creation. It tells me the request asked for 12 GB but there are only 7 GB available to fulfill it. Now let's look at the log events from Heat. Each Heat log event corresponds to an attempt to create the storage volume; basically, the Heat log events are Python stack traces from failed attempts to create the volume. Each event here corresponds to three events in the other table; they are correlated by time.

Okay, so now that we have an idea of what the problem is, let's go back to the editor. We switch back to the editor view, fix the template, and apply the change to the failed stack. What I'm doing now is removing that 12 GB volume from the instance and requesting the creation of a new volume, and this time around let's make it smaller. Let's give the volume a name, call it vol-fixed, and give it a size of 6 GB. Okay, now that we have saved this, let's apply the template to the failed stack. Here we are leveraging the new Heat engine feature, available in the Juno release, that allows us to update a failed stack, as I mentioned before. We select one of our stacks, the summit-example-stack that failed, provide values again for the parameters of the template, and apply it to the failed stack. Now let's look at what's going on in the Heat engine. Looking at the list of stacks again, you see that the update of this stack is in
progress. Let's refresh until it's done, and now you see that the stack was apparently updated successfully. Let's go to Horizon to double-check. In Horizon, if you look at the list of stacks and at the resources of our stack, you see that the stack went from a failed state to complete, which is a good sign. And if you look at the resources, at the volumes, you see that the vol-fixed volume whose creation I requested is actually attached to the summit-vm that was part of the template.

With that, let's wrap up the talk with a few final remarks. We have briefly talked about new capabilities of the Heat engine, some available in the Juno release, some coming in future releases, that we see as basic building blocks for the sophisticated HOT debuggers that could be written. We also showed the integration of IBM UrbanCode Deploy with Patterns with an analytics service to provide end-to-end analytics to help users troubleshoot Heat stack failures. We see the combination of a HOT-debugger kind of approach and end-to-end analytics as emerging elements of a more robust approach to troubleshooting failed stacks. As a follow-up on our talk, you may want to learn more about the blueprints for these new Heat engine features, and if you want, you can stop by our booth and talk to our colleagues there to learn more about IBM UrbanCode Deploy with Patterns. So that's the end of our talk. If you have any questions, we would be happy to entertain them. Thank you for your time.

Question: can I use Heat now, or should I wait for Juno? Can I use the Icehouse version as well, or does Juno have features I would not want to miss? What do you think: is Icehouse Heat ready, or should I wait for Juno and later releases for the features? I saw a lot of presentations about Heat, and they were all Juno-plus or Juno-based. So what about Icehouse, is it ready or not?

I'll take a first pass and then I'll
pass it on. As far as the end-to-end analytics capabilities go, we are agnostic to which Heat engine is used underneath, so that's one thing. Now, if you don't use Juno, you won't be able to update a failed stack; you have to basically delete it and start from scratch. I'll let Ton elaborate more on the question.

So that's the key point: Juno has that very nice feature that Zane put in. The other comment is that, as we move on, the validation and error messages keep getting better; some of the error messages you now see in Horizon used to be much more cryptic.

I wasn't going to comment on that; I had a question. The question is: you are exposing the OpenStack logs to the end user, so how do you ensure that the user is not seeing things they shouldn't see, from other tenants or whatever, in the logs?

Right now, at this point in time, we don't filter anything. The plan is to filter based on the tenant: if we have information about tenant-specific logs that we can collect, then we will not show anything the user should not see. That's where we are going. That's a good question.

Yeah, and I think this is an area we really need to address, because that's a security issue: you shouldn't be peeking at someone else's stack. The other thing is that one direction we're planning to look at is, in addition to the ability to show the logs directly, to try to extrapolate: based on the returned suspicious events, give a higher-level idea of what the problem is, rather than showing the raw log messages. That would be even more helpful: you don't show any logs at all, but you give the user the idea, hey, this looks like a problem of this type. So that's another direction to pursue here. Any more questions?

So again, whatever capability is supported by the underlying Heat engine, the tool should easily leverage. The tool is typically installed on a separate host, and the analytics service, if it's used along with it, can be installed on the
same host as the tool itself or on a separate host. Right. And regarding plans, I cannot commit to anything here on stage, but yeah, that's something to keep in mind. More questions? Okay, thank you.