Yes, thank you for coming. I know it's late and we're all tired. My name is Martin, I work for Amazon, for Prime Video, and I want to talk a little bit about the profiler that we've built at Amazon.

Before I start talking about our profiler, I'll do a quick introduction to profilers in general, since probably some of you have never used one. Then I'll explain why we decided to write yet another profiler even though there is already a bunch of them, talk a little about the HawkTracer features, do a demo, and hopefully we'll have time for questions at the end.

A profiler is basically a tool that allows you to measure the performance of an application. You can roughly split existing profilers into two groups. A sampling-based profiler runs periodically and reads some information from your application. A very simple sampling profiler could be a while loop that checks, for example, the call stack of a specific process by process ID; later you gather that data and generate statistics, for example which function was called most frequently. This is not very accurate; the accuracy depends on the sampling frequency, and in this example we sample once per second. It's a good method to find out why your application is slow in general. But if you want very detailed data, for example you had a performance spike and you want to understand what's going on, it's better to do instrumentation-based profiling. There you modify your source code by putting in trace points, so you know exactly how long something took, or how many resources you used, at that point in time.
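As a sketch of what such a trace point boils down to, here is a minimal timing harness in C++; the workload function `foo` is made up for illustration:

```cpp
#include <chrono>

// Hypothetical workload; stands in for the function we want to measure.
long long foo(int n) {
    long long sum = 0;
    for (int i = 0; i < n; ++i) sum += i;
    return sum;
}

// Instrumentation-based measurement: read the clock before the call,
// read it again after, and report the difference in microseconds.
long long time_foo_us(int n, long long* out_result) {
    auto start = std::chrono::steady_clock::now();
    long long result = foo(n);
    auto end = std::chrono::steady_clock::now();
    if (out_result) *out_result = result;
    return std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
}
```

The same idea scales to any trace point: capture a timestamp at a known place in the code, so the accuracy no longer depends on a sampling frequency.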
For example, here we have very simple instrumentation-based profiling: we measure the time spent in the function by saving a timer before calling it and reading the timer again after it returns, and then we print the result. Apparently there is also another profiling methodology, the guessing-based profiler, which I learned about from Alex today: a developer looks at the code and tries to guess why it is slow. Well, it's a guess, but it's profiling. I'm not going to talk about that one. HawkTracer is an instrumentation-based profiler, which means you need to modify your code, you need to put trace points in the code, to measure the metrics you care about.

So why did we create another profiler? There is a bunch of profilers already: perf, ftrace, LTTng, ETW for Windows, and many others. But we have a very specific environment at Amazon. My team is responsible for delivering the Prime Video app on living-room devices: streaming sticks, smart TVs, game consoles, and so on. Some of those devices have very limited capabilities in terms of development. Basically, all we can do is generate a package, an executable with some assets, and upload it to the device. There's no way to SSH to the device, no way to run another process, and so on, so we can't do much; all we can do is run the application that we built. From the language point of view, we have a native stack and a scripted stack, and we wanted to be able to profile both stacks at the same time. We couldn't really find a good profiler for this kind of use case, so we decided to build our own. Before we did that, we gathered requirements for what we actually wanted to achieve by building a new profiler.

These are the requirements. First of all, we only target user-space profiling.
On those limited platforms we couldn't even load anything into the kernel, so user space only. We needed to build it as a library because, as I said, we can't run another process, so it needs to be embedded in the application itself. Those devices are sometimes very low-end, like a single-core CPU at 600 or 900 MHz; well, for some people that's not really low-end, but for us it is. So we tried to make the profiler's overhead insignificant. On some devices we also don't have access to persistent storage, so we can't save results to disk. Sometimes we can, but we cannot access the data afterwards, because the persistent storage is only available from the application's point of view: you can't log into the device and fetch the data. So we decided to assume that we don't have persistent storage at all. And since we're running on different platforms from different manufacturers, we wanted the profiler to be as portable as possible, so we can build it once, our developers learn the tool once, and they can use it on all the platforms we support. And it should be easy to use, of course, so everybody can instrument their code and gather data easily.

We came up with a very basic design. As you can see, there's a user application layer, which is the application running on the device, and this application links against the HawkTracer library. The HawkTracer library may have a bunch of timelines. Timelines are basically buffers to which the user sends events, so whenever you want to trace something.
For example, say we want to know how much memory we use at this point in time. We generate an event that contains the information about the memory usage; it goes to the timeline, and the timeline accumulates those events. Once the timeline is full, it calls the flush method, and the flush method sends all the events we've gathered to a listener. For the listener you can either use one that already exists in the library or define your own. What can a listener do? Basically, listeners should save the data somehow. There is a file listener that saves it to a file, and there's a TCP listener that streams the data over the network.

The data is a binary stream, and you can't really analyze a binary stream; you need to convert it to some human-readable format. To do that, we created a library called the HawkTracer parser library, which converts the byte stream into structures that can then be converted to another format. There are two options. You can use the HawkTracer converter, an application that converts this byte stream to one of the well-known formats; we currently support the Trace Event Format that is understood by the Google Chrome trace viewer, and we also generate flame graphs. If that's not enough for you, the parser library can be used for writing your own client, your own converter, and it is available for C++ and Python. I'll show later how to do it in Python; it's quite easy.
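The event flow just described, push into a buffering timeline, timestamp on push, flush to a listener when the buffer fills, can be sketched roughly like this; the class and field names are illustrative, not HawkTracer's actual API:

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Illustrative sketch of the timeline idea, not HawkTracer's real API.
struct Event {
    std::uint64_t timestamp_ns;  // filled in by the timeline on push
    std::uint64_t value;         // payload, e.g. memory usage in bytes
};

class Timeline {
public:
    using Listener = std::function<void(const std::vector<Event>&)>;

    Timeline(std::size_t capacity, Listener listener)
        : capacity_(capacity), listener_(std::move(listener)) {}

    // Timestamp the event, buffer it, and flush when the buffer is full.
    void push(Event ev) {
        ev.timestamp_ns = static_cast<std::uint64_t>(
            std::chrono::steady_clock::now().time_since_epoch().count());
        buffer_.push_back(ev);
        if (buffer_.size() >= capacity_) flush();
    }

    // Hand all buffered events to the listener and clear the buffer.
    void flush() {
        if (!buffer_.empty()) {
            listener_(buffer_);
            buffer_.clear();
        }
    }

private:
    std::size_t capacity_;
    Listener listener_;
    std::vector<Event> buffer_;
};
```

A file listener and a TCP listener would then just be two different callbacks handed to the timeline: one writes the batch to a file, the other writes it to a socket.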
So, I've mentioned the three components already. The event is the thing that carries the information we want to have in the results, for example the time spent in a function, or the memory or CPU usage at that point in time. Events support inheritance, so you can have an event and then inherit from it to add more fields. The timeline is basically a buffer that you push events into. The timeline also timestamps the events: every time you push an event to the timeline, it gets timestamped, so you know exactly when the event happened. A timeline can be thread-safe or not; if you know a timeline is only used from a single thread, there is no point introducing mutexes and so on, and it can stay lock-free. But if you want to use the same timeline across different threads and push events from different threads, then you need to enable the thread-safety feature, which obviously introduces some extra overhead because you need to lock a mutex every time you push an event. And the timeline listener is basically a C function that the user defines; it gets all the events and does something with them.

Now I'll show you how to define your own event class. Let's say you want to trace memory usage and CPU usage.
How do you do that? You use the HT declare-event-class macro. The first argument of the macro is the name of your event. The second one is the base event class, because as I said events support inheritance, and all events must at least inherit from HT_Event, the base class. Then you define the fields, and each field is defined by three properties: the type, which is one of integer, string, struct, float, double, or pointer; the C type, so uint64_t as here, or char, or something else; and last, the name of the field.

That macro generates quite a lot of code. The most important part is that it generates a C structure; you can see here that it generated the resource-usage-event C structure with those fields, plus the base fields of type HT_Event. The HT_Event type has an ID (each event class has a unique identifier, so we can distinguish them), it has a timestamp, as I mentioned, and a pointer to a class that describes the data structure. The macro also generates a few helper functions. The first one is for serializing the event, so it serializes the event to the byte stream. The second one is for getting the event class instance, so if at runtime you need to know the structure of that event, you can use that function. And the last one is the most important, I think, because it's used for registering the event class in the HawkTracer system. Before you use the event class you need to call that function; otherwise HawkTracer doesn't know about the class and will probably crash if you forget about it.
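Conceptually, what such a declaration macro generates looks something like the following; the names and layout here are illustrative, not the code HawkTracer actually emits:

```cpp
#include <cstddef>
#include <cstdint>

// Illustrative sketch of the base event fields (not HawkTracer's real layout).
struct BaseEvent {
    std::uint32_t klass_id;      // unique identifier of the event class
    std::uint64_t timestamp_ns;  // when the event happened
};

// A declaration like
//   DECLARE_EVENT_CLASS(ResourceUsageEvent, BaseEvent,
//       (INTEGER, std::uint64_t, memory_usage),
//       (INTEGER, std::uint64_t, cpu_usage))
// would conceptually expand to the base fields plus the user fields...
struct ResourceUsageEvent {
    BaseEvent base;  // inherited fields come first
    std::uint64_t memory_usage;
    std::uint64_t cpu_usage;
};

// ...plus helpers such as one that serializes the event to a byte stream,
// field by field, matching the layout announced in the metadata stream.
inline std::size_t serialize(const ResourceUsageEvent& ev, std::uint8_t* out) {
    std::size_t off = 0;
    auto put = [&](const void* p, std::size_t n) {
        const auto* b = static_cast<const std::uint8_t*>(p);
        for (std::size_t i = 0; i < n; ++i) out[off++] = b[i];
    };
    put(&ev.base.klass_id, sizeof ev.base.klass_id);
    put(&ev.base.timestamp_ns, sizeof ev.base.timestamp_ns);
    put(&ev.memory_usage, sizeof ev.memory_usage);
    put(&ev.cpu_usage, sizeof ev.cpu_usage);
    return off;
}
```

The registration helper mentioned above would then announce this class's ID, name, and field list, which is exactly what the metadata stream carries.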
Now, about the byte stream that is sent to a client: you can define events in your application and then have a client that receives the stream, but the client doesn't know which events you defined. So before we send any events, we send a metadata stream, which contains information about all the event classes you registered: the class name, the class identifier, and all the fields defined in the class. Then, when we actually send the event stream, as shown below, we first send the class ID and then all the other fields. The parser then knows: I got this class ID, and I know how to parse it because I already got the metadata stream. I know that the next field is, for example, a timestamp field of eight bytes, and the field after that is the CPU usage, also eight bytes, an integer. So there is no need to recompile the client if you add a new event class to your application.

As for timelines, most of the time we use the global timeline. The global timeline is actually not a single timeline but a timeline per thread, so we don't have to lock everything every time we push something to it. But it shares the listeners.
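As an aside, a decoder following the scheme just described, class ID first, then each field sized according to the metadata stream, might look roughly like this (illustrative, not the real hawktracer-parser API):

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Metadata received up front: for each class ID, the ordered field names
// and sizes. Real HawkTracer metadata also carries type info; this is a
// simplified illustration.
struct FieldInfo {
    std::string name;
    std::size_t size;
};
using Metadata = std::map<std::uint32_t, std::vector<FieldInfo>>;

// Read a little-endian unsigned integer of up to 8 bytes.
std::uint64_t read_le(const std::uint8_t* p, std::size_t n) {
    std::uint64_t v = 0;
    for (std::size_t i = 0; i < n; ++i)
        v |= static_cast<std::uint64_t>(p[i]) << (8 * i);
    return v;
}

// Decode one event: class ID first, then each field per the metadata.
std::map<std::string, std::uint64_t> decode_event(
        const Metadata& meta, const std::uint8_t* buf, std::size_t* consumed) {
    std::size_t off = 0;
    auto klass_id = static_cast<std::uint32_t>(read_le(buf, 4));
    off += 4;
    std::map<std::string, std::uint64_t> fields;
    for (const auto& f : meta.at(klass_id)) {  // metadata gives the layout
        fields[f.name] = read_le(buf + off, f.size);
        off += f.size;
    }
    if (consumed) *consumed = off;
    return fields;
}
```

Because the layout comes entirely from the metadata, adding a new event class on the producer side needs no change on the consumer side.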
So if you register a listener on the global timeline in one thread, it is effectively registered for the global timelines of all the other threads as well. If you really need to create your own timeline, you can do that, but as I said, it's a very uncommon use case, so I just recommend using the global timeline. To access it, call ht_global_timeline_get(); it returns a pointer to the timeline.

Our most common use case is measuring the time spent in a function or in a scope, so we've also introduced some helper macros for that. There is a function-tracing macro that takes the timeline pointer as an argument and measures the time spent in the whole function. The output, if you look at the bottom of the slide, is a new event with a duration, which is how long we spent in that scope. It also adds the thread identifier, so you know that the function was called in this particular thread, and a label; in this case the function-tracing macro sets the function name as the label. If you want to trace a custom scope, you can do that too; it's just another macro, and you also need to pass a custom label. Those macros are only available in C++ and with the GNU C compiler, because standard C does not have anything like a destructor, so we don't have a way to call a callback at the end of a scope.
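Since C has no destructors, these scope macros rely on C++ RAII (or GCC's cleanup attribute): an object constructed at the top of the scope records the start time, and its destructor fires when the scope exits. A minimal sketch of that mechanism, with made-up names rather than the actual HawkTracer macros:

```cpp
#include <chrono>
#include <string>
#include <utility>
#include <vector>

// Where finished traces end up; a real tracer would push to a timeline.
std::vector<std::pair<std::string, long long>> g_trace_sink;

class ScopedTracer {
public:
    explicit ScopedTracer(std::string label)
        : label_(std::move(label)), start_(std::chrono::steady_clock::now()) {}

    // The destructor runs when the scope exits, so the elapsed time covers
    // exactly the enclosing scope. This is the part plain C cannot express.
    ~ScopedTracer() {
        auto end = std::chrono::steady_clock::now();
        auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
                      end - start_).count();
        g_trace_sink.emplace_back(label_, static_cast<long long>(ns));
    }

private:
    std::string label_;
    std::chrono::steady_clock::time_point start_;
};

// A function-level macro can use __func__ as the label automatically.
#define TRACE_FUNCTION() ScopedTracer _scoped_tracer_(__func__)

void foo() {
    TRACE_FUNCTION();  // measures the whole function body
    volatile int x = 0;
    for (int i = 0; i < 1000; ++i) x = x + i;
}
```

A custom-scope variant is the same object declared inside a pair of curly braces with an explicit label instead of `__func__`.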
If you want to measure arbitrary code, not necessarily a scope, there is also a pair of functions, a callstack-start that takes a string label, and a stop; they measure the time between the two calls and again generate the same kind of event, with the label you specified. This works like the scope of a C/C++ variable: in the earlier example the scope of the function trace point is the whole function, whereas here it is only the inside of these curly braces. So that defines the scope.

That was more or less it about the HawkTracer internals. Now, how can you integrate it with your project? If you download the source code and do make and make install, it installs a pkg-config file, so you can just use pkg-config to compile it into your project. HawkTracer itself uses CMake as its build system, so you can also use it as an external project very easily; there is an example of how to do that, and I recommend copying the HawkTracer CMake file into your repository and including it in your project. If you installed it as a system library, you can use find_package for HawkTracer with a specific version and then link it to your project. The third option, the one I actually recommend, is the amalgamated build: like SQLite, HawkTracer has many source files, but in the end they all get merged into a single file. Eventually we have three files: an HT config file where you can modify the configuration, a header file, and the .cpp file containing the whole implementation. The .cpp file can actually be compiled with a C compiler; you might just not get all the features, but it is possible. And sorry, I should have mentioned this before: the license is MIT, so you should be able to use it as you want.

OK, so the first demo. I've implemented a sorting algorithm; it sorts 400 numbers, and it turns out the algorithm is very slow.
If you watch, I press enter now and... yeah, it took a while to sort those numbers. It was only 400 numbers; it should be very quick. So let's look at the source code. We call quicksort here. Now we want to know why it is slow, and in order to find out, we need to instrument our code.

I'll show you how to set up the code to work with HawkTracer. At the beginning you need to call the ht_init function. This is very important; it initializes some internal buffers and registers some base classes, so don't forget about it. Then we create the listener; as I said, the timeline needs a listener, otherwise no one will be able to access the events we generate. For this purpose we use a file listener, which saves all the events to a file; we save them to a sort.htdump file, so it will generate a binary file. There is a check whether the listener was created correctly. Then we register the listener with the timeline, because so far we only created an instance, and we use the global timeline, as that is the most convenient thing to do. This is the callback function, and this is the user data, so the callback knows its context. Then we initialize our input: we generate 400 random numbers, and then we quicksort the array. If you look at the functions, I've added this trace point to all of them, so whenever a function gets called we generate an event recording how long it took to execute that particular function. We have a partition function, a quicksort function, and also a swap function. Now I run this (it was already instrumented), and if you look here, it has already generated this sort.htdump file. This is a binary file.
If you try to open it, you'll see it really is a binary file. So what we can do is convert that binary stream into something we can actually read, and I mentioned the HawkTracer converter program. If you look at it, this program takes three mandatory parameters. The first is the format, the output format we convert our data to, which can be a flame graph, a Chrome-tracing file, or debug, which basically prints everything as it is. We'll use the Chrome-tracing format. Another parameter is the output file, let's say sort_slow.json. And the source is very important, as the program needs to know where to get the data from; the converter tool supports two kinds of sources, either a file name or a TCP/IP address, in case we stream the data over the network, for example (it can listen on a specific port and receive the data). Our source is sort.htdump. We run it, it processes the data, and it completes successfully.

So we have sort_slow.json, and now we can use the Chrome tracing viewer to actually see why our application is slow. I load the JSON file that we generated, and this shows the call stacks of our application; the X axis is time. Since quicksort is a recursive algorithm, we see quicksort calling quicksort and so on. But if we look closer, we see that apart from calling itself, it also calls something else, and if we zoom in more, we see that most of the time is actually spent in the swap function. I don't know if you can see it, but the blue one is the quicksort function, the green one is partition, and the light green one is the swap method. So we spend quite a lot of time in the swap method.
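For reference, swapping two integers should be almost free; a plain temp-variable swap is the obvious shape of the fix, and the XOR trick that comes up in a moment also works, with a caveat. This is an illustrative sketch, not the demo's actual source:

```cpp
// The straightforward fix: swap through a temporary variable.
void swap(int* a, int* b) {
    int tmp = *a;
    *a = *b;
    *b = tmp;
}

// The XOR trick: works without a temporary, but note it breaks if a and b
// point to the same element (x ^ x == 0), so it needs an aliasing guard,
// and the straightforward version is usually the better choice anyway.
void swap_xor(int* a, int* b) {
    if (a == b) return;  // guard against aliasing
    *a ^= *b;
    *b ^= *a;
    *a ^= *b;
}
```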
That's probably the function we should look into. If we look at how the swap method is implemented: it takes a and b as arguments, and we want to swap the values. How was it done? Well... someone implemented it by serializing the values to a string, saving them to a temporary file, and then reading them back. That's probably why it's slow. So if we just change the implementation to something simpler... [Audience remark about a trick with XOR.] Yeah, there is a trick with XOR; I'd have to think about it, but that's definitely possible. If we fix that, then our trace is going to look completely different, believe me; saving temporary data to a file and reading it back afterwards is a waste of time. I'm not going to recompile it and rerun it; I guess you all believe me that this is going to fix our performance problem and make it much, much faster. Sorry... absolutely, thanks. Yeah, live demos never work.

So that was the first demo, and now the second demo. As I mentioned, I showed you earlier how to create your own event class, which was for memory-usage and CPU-usage tracking, and we couldn't actually visualize that in the Chrome tracing viewer. So how can we see custom data? For that, as I mentioned, we can build our own client to process events. There's a HawkTracer module for Python, and it's very simple to use: you create a client and start the connection, so I say 127.0.0.1 on port 8765. Then, whenever we receive an event, we check whether the event name is the resource-usage event; if it is, we take the CPU usage and the memory usage, print the CPU usage, and draw the memory usage on a graph. I'm not a Python developer, so maybe that's not the best way to plot something, but it works. And the code is very similar to the previous demo: instead of the file listener...
...we created a TCP listener, which basically creates a server that streams the data. We also have the event class that I showed before; this is the definition of our class. Then, every second in the while loop, we allocate some memory, half a kilobyte, then we make the CPU busy, sleep for one second, and report the resource usage. How do we report resource usage? There is a macro for pushing events to a timeline: it takes the timeline as the first argument, then the event type, then the values we want to attach to the event, and it pushes the event. We also flush the timeline manually instead of waiting for the buffer to fill up, because we want the data to reach the client immediately, so the drawing is smooth.

So we can run the demo. I run the resource-usage program and I run my Python client, and you can see that it draws the memory usage here. Basically what we've done is build our own custom event converter and decide how to visualize the data, in Python, in, I don't know, 20 lines of code. The CPU usage is printed here too; it's around one to three percent. So that was the second demo.

There's a bunch of things we want to do in the future. There are some missing features: we don't support floating-point numbers at the moment, and optional fields are something we'd also like to have; for example, you have an event class with some fields, but sometimes some of the fields should not be included.
That would be nice to have. We also want more converters in the HawkTracer converter, like CTF, the Common Trace Format, supported by, for example, Trace Compass and the LTTng tools. We use this profiler to profile a C++ and a Lua or JavaScript stack at the same time, but the Lua bindings and JavaScript bindings are not open-sourced yet, so we want to do that too. And lots of documentation improvements; even though the documentation is quite OK, I think we can improve it more. So if you want to help, just go to hawktracer.org/community and you'll find out how to contact us, and we can work together. There's a bunch of links. There are also Rust bindings; Alex had a talk this afternoon, so if you search for FOSDEM 2019 profiling Rust, you should find that presentation. It was pretty cool. OK, thank you. I think we have time for questions.

[Audience question, off-mic, about creating events from another language.] So, that depends. Having the ability to create a new event class in another language might be quite tricky. But assuming you already have all the event classes defined in your C or C++ code base, and you run the other language on top of that, it should be fairly easy, because you just need to expose a few functions: for measuring time you need to expose two functions, start and stop, plus maybe starting HawkTracer and registering a listener. So it should be fairly easy for us; I don't know how much work it was for Alex. It depends what you want to do, but if you just want to measure time, it should be fairly easy. Yeah, thank you.