Good morning everyone. My name is Vipin. I work as a staff engineer at VMware, and in OpenStack I contribute to Cinder and OSProfiler. This talk is about OSProfiler, a library under Oslo meant for profiling OpenStack API requests. Here's the agenda: we first discuss some use cases where OSProfiler applies, followed by an overview of OSProfiler. Then we cover some internals of OSProfiler; if you're planning to contribute to OSProfiler, or to integrate it into your project, that part will be useful. Finally, we show a small demo that illustrates the things we discussed.

Let's start with the use cases. I hope you can see the image clearly. This is part of a call graph generated for a single command: nova boot, booting an instance from a block device with an image source. This request propagates through Nova, Cinder, Glance and Neutron, so the call graph contains methods from all four projects. Let's take a closer look at it. As you can see, it's pretty huge, and note that it doesn't even contain all the methods: it only contains the API calls, RPC calls, and the public methods of the drivers and managers. A lot of methods are not in this call graph, so you can imagine how big the actual call graph would be.

Now assume we have a performance issue with one of these commands. How do we troubleshoot it? We don't know which service the problem is in, because a lot of services are involved, and the services run on multiple hosts, so we don't even know where to start. The problem may be in a single method; it's often the case that some method doesn't scale well, and that is the reason for the performance issue.
Identifying the performance bottleneck is the first step in diagnosing these kinds of performance issues. In other cases we need to benchmark two drivers, for example LVM and Ceph for Cinder, using a tool such as Rally, and often we wonder why, say, delete takes a lot more time on LVM than on Ceph. We want to understand why it's slower. So that's another use case: understanding the performance characteristics of certain APIs.

What's the solution for this? The solution is distributed tracing. Distributed tracing helps you understand how a request flows through a distributed system, and it also gathers timing information: how much time is spent in each method. OSProfiler is the distributed tracing tool for OpenStack.

So how do we use OSProfiler to troubleshoot these kinds of issues, and how do we generate a distributed trace? It's really simple: we just need to use the --profile option with the command. For example, to create a Cinder volume we can specify --profile and provide a secret key. The secret key is configured by the admin, to make sure that only certain users can enable distributed tracing. When you run this Cinder command, it prints an ID for your distributed trace, which you can later use to generate a report. Similarly, for Nova we can specify the same --profile option with the secret key. Once we have generated a trace, we can view it with the osprofiler trace show command, passing the trace ID that was printed when we used the --profile option. We were going to look at a sample report here, but the slide isn't cooperating, so we will cover this again in the demo. Let's move on.
A brief overview of OSProfiler. As I mentioned, it's a distributed tracing tool for OpenStack, and it generates a single trace for an API request. The trace consists of multiple trace elements; a trace element corresponds to a method invoked during the API request. Using OSProfiler, we can generate the trace report in different formats such as HTML, JSON and DOT. DOT is a language used to describe graphs; OSProfiler can print the trace report in DOT format, and you can then use any DOT renderer to actually display the graph. The call graph I showed in the first slide was generated using this DOT option. The trace report includes the call hierarchy, the time spent in each method, the project and service each method is part of, and some context for debugging. It also categorizes the trace elements into different types, such as WSGI, RPC, DB or driver calls, depending on what type of call it is.

A natural question is: why OSProfiler? Why can't we use Zipkin, or say cProfile? The short answer is that the scope is totally different. We don't want any tight coupling with a non-OpenStack solution. At the same time, we should be able to enable or disable tracing without restarting the service, and in some cases we should be able to enable tracing for a specific API request only; we don't want tracing turned on for every request. As I mentioned earlier, we also don't want all users to be able to run tracing and put a lot of stress on the system, and it should be easy to integrate with OpenStack projects. Finally, there's no need to include every method in the trace: if we did, the trace report would be too noisy and difficult to analyze. Those were the goals, and that's why we came up with a new library for OpenStack: OSProfiler. So which projects are supported by OSProfiler?
You can see the complete list here. Some integrations are still under review, but for most of the core projects the integration is already merged, so we can straight away start using OSProfiler with those commands.

Okay, let's move on to some of the internals of OSProfiler. These are the different modules in OSProfiler. I'll start with the profiler module. That module is responsible for generating the trace elements: when a method gets invoked, it generates a trace element before entering the method, and once the method exits it generates another trace element, so that we can calculate the time difference, that is, the execution time of that method. Once the profiler generates a trace element, it needs to publish and store it somewhere; we store it in what we call a trace data store, and we support different types of trace data stores. The profiler uses the notifier module to publish the trace to the trace data store. The notifier module supports a plugin mechanism, which means we can have different kinds of plugins to support different types of trace data stores. The web module is the one responsible for validating the user request, i.e. whether it's a valid request to enable profiling. It basically checks whether the secret key supplied by the user is correct, though not by comparing the key directly; it uses an HMAC signature. That's the purpose of the web module. The initializer, as the name indicates, initializes OSProfiler: it initializes the web module with the secret key specified in the config file by the admin, and it also initializes the notifier.
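The notifier's plugin mechanism can be pictured as a small registry that maps a connection-string scheme to a driver class. This is only a sketch of the idea, not OSProfiler's actual code, and the driver class names here are invented:

```python
from urllib.parse import urlparse

# Hypothetical driver classes standing in for OSProfiler's real notifier plugins.
class MessagingDriver:
    def __init__(self, url):
        self.url = url

class MongoDBDriver(MessagingDriver):
    pass

class RedisDriver(MessagingDriver):
    pass

# Registry mapping a connection-string scheme to the driver that handles it.
DRIVERS = {
    "messaging": MessagingDriver,
    "mongodb": MongoDBDriver,
    "redis": RedisDriver,
}

def get_driver(connection_string):
    """Pick and instantiate the trace-store driver named by the URL scheme."""
    scheme = urlparse(connection_string).scheme
    if scheme not in DRIVERS:
        raise ValueError("no trace-store driver for scheme %r" % scheme)
    return DRIVERS[scheme](connection_string)

driver = get_driver("mongodb://127.0.0.1:27017")
print(type(driver).__name__)  # MongoDBDriver
```

The point of the registry is that adding a new trace data store only means registering one more class; nothing else in the pipeline changes.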
The initializer essentially tells the notifier: these are my trace data store connection parameters, initialize the trace data store. The notifier then uses the appropriate driver to create a driver instance. Those are the main components of OSProfiler.

As I said, the profiler generates a trace element when it enters and exits a method. What's the structure of a trace element? The most important part is the base ID, which is the ID of the trace; this is what is printed when you run a command with the --profile option, the ID representing the entire trace. The trace ID is the ID of that particular trace element, and the parent ID is the ID of the parent function in the call hierarchy. The project and the service are pretty obvious: the project can be cinder or nova, and the service can be cinder-volume or nova-compute. The name is the type of that trace element: WSGI, RPC, DB or driver. Before entering the method we suffix "-start" to the type, and when it exits we suffix "-stop", so that we can calculate the time difference. Then there's the timestamp, used to find out the time spent in the method, and info, which is debugging information. It varies with the type of method: for WSGI we have the path, query, method and scheme; for RPC and driver calls we have the function name, arguments and keyword arguments; for DB we have the statement and its parameters.

I mentioned that we support different types of trace data stores, so that we can generate the report at a later point. Up through OSProfiler 1.3 we had a dependency on Ceilometer: we published trace data as events to the message bus, which was consumed by Ceilometer and persisted in its store. That is not a good dependency, because anyone who wants to run tracing then needs to enable Ceilometer.
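As a rough illustration of this structure (field names follow the talk; OSProfiler's exact wire format may differ), here is a matched start/stop pair for one driver call, with the execution time recovered from the two timestamps:

```python
from datetime import datetime

# A matched start/stop pair for one traced driver call.  Field names follow
# the structure described in the talk; the IDs are shortened placeholders.
start_element = {
    "base_id": "8e8f2f5e",            # ID of the whole trace
    "trace_id": "c1d2e3f4",           # ID of this trace element
    "parent_id": "a0b1c2d3",          # caller's trace element ID
    "project": "cinder",
    "service": "cinder-volume",
    "name": "driver-start",           # type + "-start" on entry
    "timestamp": "2017-05-10T10:00:00.000000",
    "info": {"function": {"name": "create_volume"}},
}
stop_element = dict(start_element, name="driver-stop",
                    timestamp="2017-05-10T10:00:01.250000")

def duration_ms(start, stop):
    """Time spent in the method, from the paired start/stop timestamps."""
    fmt = "%Y-%m-%dT%H:%M:%S.%f"
    delta = (datetime.strptime(stop["timestamp"], fmt)
             - datetime.strptime(start["timestamp"], fmt))
    return delta.total_seconds() * 1000.0

print(duration_ms(start_element, stop_element))  # 1250.0
```

This pairing is exactly why every element carries both the "-start"/"-stop" suffix and a timestamp: the report generator joins the two halves to compute per-method durations.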
So in 1.4 we introduced a driver framework for trace data persistence, so that we could have, say, Log Insight or MongoDB as the store for persisting trace data. The current version of OSProfiler is 1.9, and it supports Ceilometer, Elasticsearch, Log Insight, MongoDB and Redis. So if you're not happy with Ceilometer and oslo.messaging, which is the default trace store, you can replace it with, say, Redis or MongoDB.

So how is a trace generated? The first step is the initialization of OSProfiler: we initialize it with a secret key and a connection string, both specified by the admin in the config file. In the connection string, the scheme indicates what type of backend it is (MongoDB, Redis and so on), followed by the connection parameters appropriate for that backend; the connection string is interpreted by the appropriate driver class. So first we initialize OSProfiler with these config options, and if OSProfiler is initialized, we patch the manager and driver classes. Once we patch them, they are capable of generating the trace elements.

How does a client initiate a trace request? How does it tell the server that it wants tracing enabled? The user specifies the secret key, and once the secret key is given at the client side, we initialize a profiler with it. Once the profiler is initialized, we send two additional headers: one carries the data, which includes a base ID generated for this trace, and the other carries a hash signature computed from that data and the secret key. We send this data and the hash signature as two additional headers whenever the profiler is initialized. At the server side, how do we verify whether a trace request is valid or not?
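The client-side signing and the server-side check can be sketched with the standard library. The header names mirror the ones OSProfiler's web module uses, but treat the payload layout and the hash choice here as illustrative assumptions rather than the library's exact implementation:

```python
import base64
import hashlib
import hmac
import json
import uuid

def signed_headers(hmac_key):
    """Client side: pack the trace info and sign it with the secret key."""
    trace_info = {"base_id": str(uuid.uuid4()), "parent_id": None}
    data = base64.urlsafe_b64encode(json.dumps(trace_info).encode())
    sig = hmac.new(hmac_key.encode(), data, hashlib.sha1).hexdigest()
    return {"X-Trace-Info": data.decode(), "X-Trace-HMAC": sig}

def validate(headers, hmac_key):
    """Server side: recompute the signature and compare before enabling tracing."""
    data = headers["X-Trace-Info"].encode()
    expected = hmac.new(hmac_key.encode(), data, hashlib.sha1).hexdigest()
    if not hmac.compare_digest(expected, headers["X-Trace-HMAC"]):
        return None  # wrong secret key: silently ignore the trace request
    return json.loads(base64.urlsafe_b64decode(data))

headers = signed_headers("SECRET_KEY")
print(validate(headers, "SECRET_KEY"))  # trace info dict: tracing is enabled
print(validate(headers, "wrong-key"))   # None: request rejected
```

Note that the secret key itself never travels over the wire; only the signature does, which is why both sides must be configured with the same key.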
We verify the hash signature: we recompute the signature using the data sent by the client and the secret key configured on the server, and we compare the signatures. If they match, it means the user entered the correct key, so we initialize the profiler with the base ID. Once a class is patched, it's capable of generating the trace elements: patching basically applies the decorator shown here, profiler.trace, to each method, so before the method is entered it generates one trace element, and when the method exits it generates another. For RPC calls, during request-context serialization, before making the RPC call, we construct a trace-info dictionary containing the current trace element ID of the function and the base ID, which is the ID of the trace. During deserialization, we check whether this trace info is set in the context; if it is, we initialize a profiler with the base ID and the parent ID.

Okay, let's go through a short demo. This is a recorded one, not a live demo. We will walk through these steps: configuring OSProfiler, running various commands with profiling enabled, and generating the trace report. The first step is to configure OSProfiler. We go to the config file of the project in which we want to enable OSProfiler. There are basically four options: "enabled" must be set to true; "trace_sqlalchemy" is false by default, and if it's turned on then tracing is enabled for DB queries as well, but that generates a lot of noise; "hmac_keys" holds the secret key, and we can specify more than one key; and "connection_string" holds the connection parameters for your trace data store. In this case I'm using Log Insight. We could use MongoDB or Redis here, or we can just leave it empty and it defaults to Ceilometer over oslo.messaging. We need similar settings in the other projects as well.
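A stripped-down stand-in for the decorator pattern described above might look as follows. The real profiler.trace also tracks parent IDs and hands elements to the notifier; in this sketch the elements simply accumulate in a list:

```python
import functools
import time

collected = []  # stand-in for the notifier / trace data store

def trace(kind):
    """Emit a '<kind>-start' element on entry and a '<kind>-stop' on exit."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            collected.append({"name": kind + "-start",
                              "timestamp": time.time(),
                              "info": {"function": func.__name__}})
            try:
                return func(*args, **kwargs)
            finally:
                # finally: the stop element is emitted even if func raises,
                # which is how exceptions can show up in the trace report
                collected.append({"name": kind + "-stop",
                                  "timestamp": time.time()})
        return wrapper
    return decorator

@trace("driver")
def create_volume(size_gb):
    time.sleep(0.01)  # pretend to do driver work
    return {"size": size_gb, "status": "available"}

create_volume(1)
print([e["name"] for e in collected])  # ['driver-start', 'driver-stop']
```

Patching a manager or driver class is then just applying this decorator to each public method, which is why tracing can be enabled without touching the method bodies themselves.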
We have to use the same secret key and connection string everywhere. Now I'm going to run a command to attach a Cinder volume to a Nova instance with profiling enabled. You can see in the help that the --profile option is clearly documented. My secret key is "full", so I run the Nova command with the --profile option and pass that key. It prints the command to generate the trace report; you can see that the last part, the UUID, is the base ID of the trace. The command has completed, so we can generate the trace report. By default the tool assumes that Ceilometer is your trace store, so it tries to connect to Ceilometer, and since the Ceilometer client was not installed on my system it gives an error. But my trace store is actually Log Insight, so how do I tell OSProfiler which tracer to use? This tool runs on the client side, not on the server side; normally these are two different machines, and the config file we saw earlier is server-side configuration. So we need to tell OSProfiler where to connect to get the trace information: we have to specify the connection string, either on the tracer command line or via an environment variable. You can see the connection string help here. We can either give it along with the command or set the environment variable OSPROFILER_CONNECTION_STRING. I'm going to set the environment variable here. Now it generates a trace report, but it's printed to the console, so let's redirect it to a file. We can then open that HTML file. This is the default trace report generated by OSProfiler; we can collapse and expand the various nodes, which are the different methods.
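The lookup order just described (an explicit option first, then the environment variable, then the Ceilometer default) can be sketched like this; the default scheme string and the exact precedence are assumptions for illustration:

```python
import os

DEFAULT = "messaging://"  # assumed default: Ceilometer via oslo.messaging

def resolve_connection_string(cli_value=None):
    """Pick the trace store the way the demo describes: an explicit
    command-line value wins, then the environment variable, then the default."""
    return (cli_value
            or os.environ.get("OSPROFILER_CONNECTION_STRING")
            or DEFAULT)

os.environ["OSPROFILER_CONNECTION_STRING"] = "loginsight://user:pass@host"
print(resolve_connection_string())               # loginsight://user:pass@host
print(resolve_connection_string("redis://h:1"))  # redis://h:1
```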
We can see the total time spent at the top, at level zero, and for each element we can see the type of the trace element, the project, the service, and the host that generated it; if the services are running on different hosts, that tells you where each element was generated. We can also click the details link, which shows the debugging context. In this case it's a driver method, so it has the function name, arguments and keyword arguments; that is helpful for debugging.

Let's see another capability of OSProfiler. I'm going to create a Cinder volume using a volume type. The volume type has a storage policy associated with it; I'm going to use the volume type "gold", which has a storage policy named "bar", and this is an invalid storage policy, so the volume creation will obviously error out. Let's see whether we can get that error information from the trace. I run this cinder create command with profiling enabled. The volume we just created is in an error state, because the extra spec has an invalid value. Here is the trace report that was generated, and if you look at the trace element corresponding to that method, you can see the exception information as well.

We support some other formats for the trace report; let's look at the DOT format. We also support JSON, so if you want to inspect the trace report as JSON, there's an option for that. Now I'm going to show the DOT option. When you use it, the trace report is generated in DOT format, and if needed we can render it: you need a renderer to actually create a call graph from the DOT output, and there's an option to render it within OSProfiler itself. It generates a PDF file showing the graph, but we need to install the Graphviz package for that. The volume is now created.
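To give a feel for what the DOT text looks like before a renderer turns it into a picture, here is a toy generator (not OSProfiler's own code) that emits one node per trace element and one edge per parent/child link:

```python
def to_dot(elements):
    """Render the parent->child relationships of trace elements as a DOT digraph."""
    lines = ["digraph trace {"]
    for e in elements:
        label = "%s\\n%s" % (e["name"], e.get("service", ""))
        lines.append('  "%s" [label="%s"];' % (e["trace_id"], label))
        if e.get("parent_id"):
            lines.append('  "%s" -> "%s";' % (e["parent_id"], e["trace_id"]))
    lines.append("}")
    return "\n".join(lines)

# A tiny made-up trace: WSGI entry point -> scheduler RPC -> driver call.
elements = [
    {"trace_id": "1", "parent_id": None, "name": "wsgi", "service": "osapi_volume"},
    {"trace_id": "2", "parent_id": "1", "name": "rpc", "service": "cinder-scheduler"},
    {"trace_id": "3", "parent_id": "2", "name": "driver", "service": "cinder-volume"},
]
dot = to_dot(elements)
print(dot.splitlines()[0])  # digraph trace {
```

Feeding text like this to Graphviz produces the kind of call graph shown at the start of the talk; the waterfall detail lives in the HTML report, while DOT is better for seeing the overall shape.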
So we run the trace show command with the DOT option to generate the output in graph format: we specify DOT as the output format, and we also pass the render option along with a file name to render the output to. We can see the DOT text here, and we could use any third-party DOT renderer to display this graph. Since we used the render option, OSProfiler rendered it into a file called trace, so we can open that and see the output. This is a simple trace; we cannot put a lot of information here, because it would be difficult to display, unlike the waterfall-style model in the HTML format, but it's useful for smaller graphs.

Okay, with that the demo is over. These are the folks who are core reviewers on OSProfiler; they helped me prepare this presentation, so thanks to them, and thank you everyone for attending the talk. Questions?

Q: I'm just wondering how much overhead there is when you have a trace turned on. On a production cloud, is that going to be dangerous?

A: Yeah, I was expecting that question. While implementing the driver for Log Insight I did some study, but we haven't published it, and it's not a formal study; I did it on my laptop, so we cannot say whether the results are valid for a production system. I ran three setups: one with no profiling and no patching, one with the default option, that is oslo.messaging and Ceilometer, and one with Log Insight, which is what I use. There is definitely an overhead, but it's in the order of milliseconds.
I was running a few commands like cinder list and cinder create in a batch and taking the average, not just running one test. Comparing Log Insight and Ceilometer, I found that publishing the trace elements was much faster with Log Insight, but the report generation was comparatively slower than with Ceilometer. Publishing the study is something we have discussed, and we are planning to do it.

Q: Two questions. Is there a command that shows you a list of the profiles that have been captured and the commands that generated them? And is there a way to aggregate the results from running the same command a few times?

A: So you want to know which traces have been generated in the past. Unfortunately that doesn't exist yet, but we have added an API to the driver framework to extend it, so that is something we could do. I'll take a note of it.

Q: That would be useful. I have two questions as well. First, how does this support work: are the hooks already added in Nova, Cinder and the other components, and how many hooks are added, do you know? And do they already exist in the latest release?

A: Okay, so you basically want to know how many hooks were added in Nova, Cinder, and so on.
I don't have the complete picture of how many places, but I can at least speak for Cinder, because I contribute to Cinder. In Cinder we added hooks to all the public methods in the manager classes, that is the scheduler manager and the volume manager, and to all the WSGI APIs; and depending on your driver, all the public methods in the driver will also be patched. It's similar for Nova, but I don't have the exact number. And yes, the hooks already exist in the latest release.

Q: The second question: if you use Rally, or Horizon, to do a batch of operations, will there be just one trace, or will each operation have its own trace? How do you differentiate? You mean triggering it from Rally, right?

A: Yeah. The Rally integration is not completed yet; the spec is still in progress. To do the integration with Rally, every time before you invoke an API you should initialize a profiler in the current thread. Whenever you initialize it, a new trace ID is created, so it's up to the client: if Rally initializes it once, all the APIs it invokes in that thread will form a single trace. If it initializes a profiler and then runs API one and API two, all of these will be part of the same trace. But if after completing API one it re-initializes the profiler, we will have a different trace. So basically each trace corresponds to one OpenStack operation.

Q: What do you mean by an OpenStack operation? If you use Rally to create 100 VMs, how many traces will be generated?

A: You may be running the nova boot command repeatedly, in a for loop or something. If you initialize a profiler once and then in a for loop create a hundred VMs, all these operations will be part of the same trace. But at the end of creating each VM, we can re-initialize the profiler.
Then we will have a different trace for each. And you can query the profiler module to get the current trace ID, if you're doing this programmatically.

Q: You mentioned that you've effectively got a shared secret to authorize whether profiling is turned on. Is there anything you're doing to expand on that? Because you can see how, in a production or production-intent cloud, being able to hand out one-time keys to users with problems might be a useful thing.

A: Yeah, this is something the core team has discussed in the past. It is definitely on the goal list, and something we plan to address.

Q: Sorry, two questions. Is the recording of a profile synchronous or asynchronous?

A: The notifier is synchronous, actually. We could do some optimizations there, like caching the profile elements and sending them in a batch, but those optimizations are not done yet; it's synchronous.

Q: If Ceilometer has crashed, does the API process stop?

A: Sorry, what was the question?

Q: As a question: should I set the key to the same value across components?

A: Yes, you should set the same value, because one service will call another service, and the client will send the secret key, so they should all use the same key.

Q: One question over here; a two-part question, I guess. First, would you recommend, or recommend against, using this as a distributed tracing tool outside of OpenStack? And related to that, you mentioned you didn't want to tightly couple distributed tracing to some other tool; were there other distributed tracing tools that you looked at?

A: Yeah, so currently there is a blueprint proposed to make the trace element information we generate compatible with the OpenTracing API.
Once we generate our trace elements in that format, we could use a lot of other visualization tools. That is the plan, rather than putting OSProfiler into other projects, because our use cases may not match; it may not be generic enough. So the plan is to make the trace elements compatible with OpenTracing and then integrate with the other tooling that supports OpenTracing.

Q: Thank you. And would you recommend this? I mean, do you know of anyone that has used this in production?

A: Yeah, I know some of our customers have used it to diagnose performance issues, and we as developers use it whenever we hit performance issues. Not on a regular basis, but whenever there is a performance issue.

Q: I think the question that got missed over here before was: since sending the trace data is synchronous, if you're sending it to Ceilometer and Ceilometer has crashed, does that stop your Cinder API process?

A: Let me see. I think it depends on the driver implementation, on how the driver catches the exception. If it ignores the failure, then the request will continue; if the driver propagates the exception back, then your OpenStack request itself will fail. So that's something we need to fix, that's right. Okay, I think that's it. Thank you for attending.