Hi everybody, thanks for coming. I know for some of you it's quite early. My name is Marcin, I work for Prime Video, and this talk is about the profiler we built to fix performance issues in the Prime Video app that runs on living-room devices. I'm not going to talk a lot about the internals of the profiler; I'd rather focus on how to use it and how you can adapt it to your project. But if you want to know more about the internals, I'll be around until the end of the day, so you can speak to me anytime you want.

I'll start with explaining why we actually started a new profiler, because as you might know, there are quite a lot of them on the market already: LTTng, ftrace, uftrace, strace, perf, and many others. The thing is that the development environment we had to work in is quite different from what you might be used to. We build applications that run on living-room devices such as game consoles, smart TVs, and streaming sticks. Some of those platforms are very friendly for developers; game consoles are amazing, they provide lots of tools, debugging utilities, and so on. But smart TVs are not that developer-friendly. Very often all we have is a compiler, and we can't really deploy any tool to the device; it's very locked down. All we can do is basically deploy a single app, our Prime Video app, and that's it. We can't, for example, deploy strace or LTTng, and we can't load kernel modules. So it's a very, very limited environment.

Also, the Prime Video application is built using three different languages. We have a native layer, which is C++, and that native layer exposes two scripting engines, one for Lua and one for JavaScript, and we run them at the same time.
So there are basically three different languages, and because some of those platforms are very low-end in terms of CPU and memory, we had quite a few performance issues, especially on the cheaper platforms. We wanted to debug them, but as I said, there weren't many tools for debugging. There were some tools, but they didn't really meet our requirements. So we said, okay, maybe we can quickly prototype something and see if it works. As you can see, this presentation is about the profiler, so it actually worked.

We came up with a list of features that we wanted from our profiler. This is just a short list; there are quite a few more features, I just listed the more important ones. It should be user-space and instrumentation-based: we don't have access to the kernel by any means, so user space was the only option for us, and we can't really deploy a second daemon that, say, samples traces every second or so, so everything had to be built into the application. Since our application is written in C++, JavaScript, and Lua, we decided we would write the profiler in C++, but it should be available to those other languages as well by providing a bindings layer. Also, because we port to many different platforms, which are very different from each other, with different operating systems, different CPUs, even different endianness, it should be easy to port the profiler to all the platforms that Prime Video supports. Low overhead, I think, is something all profilers try to achieve. We wanted to measure timings, because that's what people usually think of when they profile: how long does it take to execute a function? But apart from that, we also wanted to measure other metrics like memory usage, CPU usage, how many HTTP calls we've made, and so on.
And the other thing we wanted to achieve: we had a group of people who were very into performance and were responsible for fixing performance issues, but we also wanted other developers to use profiling tools on a daily basis, so that while they're writing code they can focus on performance as well and we don't have to do the job afterwards. So we wanted to provide a consistent user experience across different platforms. No matter what device you're developing for, whether it's a streaming stick, the web, a game console, or a smart TV, you always have the same tool, so developers only need to learn one tool and can use it for all the different projects.

So those are the features, and we came up with a very simple, high-level design. We have a profiled device that runs the application, with the HawkTracer library attached to it. That generates a binary stream of events; we call it the HTDUMP stream. Then we have a client on the developer's desktop that converts this byte stream into some human-readable form, for example the Chrome trace format or flame graphs, or, as I'll show you later in the demo, you can very easily build your own client to do visualizations that our client doesn't provide.

This is a more detailed diagram. There's an app, the app has the HawkTracer library linked in, and HawkTracer has a timeline. A timeline is essentially a buffer, and the application pushes events to this buffer. The events are serialized, and once the buffer is full we push it to a listener. The listener is basically a function callback, and it decides what to do with the data. HawkTracer provides two listeners by default: one stores the stream to a file, the other sends it over TCP/IP. But you can extend it; if, say, you want to send the stream over a serial port, you can do that.
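The buffer-plus-listener pipeline just described can be sketched in a few lines of Python. All names here are illustrative stand-ins, not the real HawkTracer C API:

```python
class Timeline:
    """A bounded buffer of serialized events with a pluggable listener.

    When the buffer fills up, it is flushed to the listener callback,
    which decides what to do with the bytes: write them to a file,
    send them over TCP, push them through a serial port, and so on.
    """

    def __init__(self, buffer_size, listener):
        self._buffer = bytearray()
        self._buffer_size = buffer_size
        self._listener = listener  # any callable taking a bytes object

    def push_event(self, serialized_event: bytes):
        self._buffer += serialized_event
        if len(self._buffer) >= self._buffer_size:
            self.flush()

    def flush(self):
        if self._buffer:
            self._listener(bytes(self._buffer))
            self._buffer.clear()


# A "file listener" would append the chunk to a file and a "TCP
# listener" would call socket.sendall(); here we just collect chunks.
chunks = []
timeline = Timeline(buffer_size=8, listener=chunks.append)
timeline.push_event(b"\x01\x02\x03\x04")
timeline.push_event(b"\x05\x06\x07\x08")  # buffer now full -> flushed
timeline.flush()  # nothing left to flush
```

Because the listener is just a callback, extending the transport really is a one-function job, which matches the serial-port example above.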
The stream then goes to the developer's desktop, where again we provide a library, written in C++ with Python bindings, that deserializes the stream of events and gives you nice data structures in your language, so you can build your own client. We also already provide the hawktracer-converter client, which converts the binary stream into common trace formats. There's a version of that library written in Rust as well. It's labelled experimental, but I know people already use it, and I'm going to use it today during the demo; if that works, then it won't be experimental anymore. Hopefully it will work.

So I mentioned that there is a timeline, and conceptually the timeline is a buffer, but it's actually a little more than that, and you need to provide some configuration for it. To simplify the process of creating a timeline, HawkTracer provides a global timeline, which is quite efficient. It's actually multiple timelines: there is a timeline per thread, so when you push events to the timeline we don't require any locks, because each thread has its own instance and there's no problem with data races. You can easily access it by calling the ht_global_timeline_get() function. I recommend just using that, to be honest; I don't think I've ever used a different approach in my real projects, so it's probably good enough.

I mentioned that the basic data unit in HawkTracer is an event. We have quite a few events defined in HawkTracer's event types, but you can define your own.
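The lock-free trick behind the global timeline is simply thread-local storage: each thread lazily creates its own timeline the first time it asks for the "global" one, so the hot path never touches shared state. A hypothetical Python sketch of the idea (the real ht_global_timeline_get() is a C function; the list stands in for a timeline object):

```python
import threading

_tls = threading.local()

def global_timeline_get():
    """Return this thread's timeline, creating it on first use.

    Every thread gets its own buffer, so events can be pushed without
    taking any lock -- there is no shared mutable state on the push path.
    """
    if not hasattr(_tls, "timeline"):
        _tls.timeline = []  # stand-in for a per-thread Timeline
    return _tls.timeline

results = {}

def worker(name):
    # Push an event, then snapshot what this thread's timeline contains.
    global_timeline_get().append(f"event-from-{name}")
    results[name] = list(global_timeline_get())

threads = [threading.Thread(target=worker, args=(n,)) for n in ("a", "b")]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Each thread only ever saw its own events -- no data race possible.
```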
An event is basically a C structure, and you define it using a C macro. You provide the event class name, which could be MyEvent for example. It supports inheritance, so all events must eventually inherit from HT_Event, which is the very base class. Then you provide the set of fields that you want to have in this event, and you define each field with three properties: the field type, which is integer, string, struct, float, double, or pointer; the actual C type; and the name. That expands to a C structure, and additionally it automatically generates a few helper methods, for example a method for serializing the event, so when you have an event and you want to push it to the byte stream, the macro has already generated a function that does it for you. Then, when you want to push an instance of this event type to the timeline, you just call the HT_TIMELINE_PUSH_EVENT macro: you pass the timeline as the first parameter, the second parameter is the name of your event type, and then all the field values. That's pretty much it.

I said before that we provide a client that parses the binary stream, but then I said you can write your own event types, so you might be wondering how the client knows how to deserialize these new event types. For that purpose we divided the event stream into two sub-streams: there's a metadata stream that describes all the types, and there's the actual event stream with all the values. Since everything in HawkTracer is an event, even the definition of an event is an event as well, so we have special events that describe your type. The first event describes the name and the number of fields that you have, and then there is one event for each field of the class. So in this case we'd expect one class info event and three class field info events, and in each class field info event we provide information like the field type, the field name, the size of the field, and the data type: whether it's a string, an integer, a struct, and so on. Both streams are serialized as a byte stream, so you can see it's just 30 or 40 bytes for those events. Then eventually you have the actual event, and this event has a type: it's number nine, as you can see, the class ID is nine here, and in the info event the class ID is nine again, so they're all connected to each other by this identifier. It's important that you first send the definition of the event type before you send the first event of that type, because otherwise the parser doesn't know how to parse the event. But HawkTracer does all of this for you.

Q: If the parser loses sync with the byte stream, is there a way for it to resync later?

A: No, at the moment that's not possible. I'm working on an improvement that will make it possible, but at the moment, if you lose the metadata stream, yes, it's a problem. I'm aware of it and we'll be working on that, okay?
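To make the self-describing stream concrete, here is a toy Python version of the two sub-streams: a klass-info event and field-info events describe a type under a numeric ID, and value events then refer to that ID. The wire format below is invented for illustration; it is not HawkTracer's actual layout, but it shows the same ID-linking and the same "definition must come before first use" rule:

```python
import struct

# Toy wire format (illustrative only):
#   klass-info : 0x01 | klass_id | name_len | name | field_count
#   field-info : 0x02 | klass_id | name_len | name | field_size
#   value event: 0x03 | klass_id | one little-endian u32 per field

def encode_klass(klass_id, name, fields):
    out = bytes([0x01, klass_id, len(name)]) + name.encode() + bytes([len(fields)])
    for fname in fields:
        out += bytes([0x02, klass_id, len(fname)]) + fname.encode() + bytes([4])
    return out

def encode_event(klass_id, *values):
    return bytes([0x03, klass_id]) + b"".join(struct.pack("<I", v) for v in values)

def parse(stream):
    """Yield (klass_name, {field: value}) pairs from the byte stream."""
    klasses, i = {}, 0
    while i < len(stream):
        tag = stream[i]
        if tag == 0x01:  # register a new klass under its numeric ID
            kid, nlen = stream[i + 1], stream[i + 2]
            klasses[kid] = (stream[i + 3:i + 3 + nlen].decode(), [])
            i += 4 + nlen
        elif tag == 0x02:  # append a field name to an existing klass
            kid, nlen = stream[i + 1], stream[i + 2]
            klasses[kid][1].append(stream[i + 3:i + 3 + nlen].decode())
            i += 4 + nlen
        else:  # 0x03: a value event referring back to its klass ID
            kid = stream[i + 1]
            if kid not in klasses:
                raise ValueError("event arrived before its klass definition")
            name, fields = klasses[kid]
            vals = struct.unpack_from("<" + "I" * len(fields), stream, i + 2)
            yield name, dict(zip(fields, vals))
            i += 2 + 4 * len(fields)

# Metadata sub-stream first, then the actual event -- as required.
stream = (encode_klass(9, "AllocEvent", ["timestamp", "alloc_count"])
          + encode_event(9, 1000, 42))
```

Sending the value event without its klass definition raises an error in the sketch, which is exactly the failure mode behind the resync question above.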
So I mentioned that we also wanted to measure time, and that was actually quite an important requirement for us. To make it even easier, we have predefined event types specifically for measuring time: they take a label, a duration, and a timestamp, and we have macros that automatically generate those events and push them to the timeline. For C++, it depends on what you want to do: if you want to trace a whole function, you just put the trace-function macro at the beginning of it, and if you want to trace an arbitrary scope, there's a scoped-trace macro. There are a few more macros that are more optimized; for example, some use hash maps so you don't have to send the string for every event, and there's an even more optimized version that uses some tricks with static thread-local objects so even the hash map isn't necessary. It's all documented, so I don't want to spend much time on that.

We did bindings for a few languages. The public ones for now are Python and Rust; we have one for JavaScript as well, which isn't ready to be published yet but will be in the future, and one for Lua too. This is how you do it in Python: you import the trace decorator and you just put it on the function that you want to trace. The same for Rust: there are some macros with which you can trace a scope. I'm not the author of the Rust bindings, so if you want to know more about them, ask Alexandru; he's at FOSDEM, I'm not sure if he's in this room today, but he's the author.

Okay, so that was it, and now I just want to show you some demos so you can see how to use it. In the first one, we have a C++ application that allocates memory, and we want to see on a graph how the memory grows; we also want to know how many allocations we've done in this program. We'll write a simple Python client that receives the HawkTracer event stream and does the visualization. I'll start with the program itself,
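The label/timestamp/duration triple those macros record is easy to picture with a Python decorator, and once you have such records, converting them to the Chrome trace format (viewable in chrome://tracing or Perfetto) is a small step. This is an illustrative sketch, not the actual Python bindings:

```python
import functools
import json
import time

trace_events = []  # stand-in for pushing events to a timeline

def trace(fn):
    """Record a label, a start timestamp, and a duration per call."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter_ns()
        try:
            return fn(*args, **kwargs)
        finally:
            trace_events.append({
                "label": fn.__qualname__,
                "timestamp_ns": start,
                "duration_ns": time.perf_counter_ns() - start,
            })
    return wrapper

def to_chrome_trace(records):
    """Emit Chrome trace JSON: ph "X" is a complete (duration) event;
    ts and dur are expressed in microseconds."""
    return json.dumps({"traceEvents": [
        {"name": r["label"], "ph": "X", "pid": 1, "tid": 1,
         "ts": r["timestamp_ns"] / 1000, "dur": r["duration_ns"] / 1000}
        for r in records
    ]})

@trace
def rotate_image():
    time.sleep(0.01)  # pretend to do some work

rotate_image()
trace_json = to_chrome_trace(trace_events)
```

The real macros are much cheaper than a Python decorator (string interning, thread-local buffers), but the shape of the data they produce is the same.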
so at the very beginning we need to initialize the HawkTracer library. Then this is the event type that we define: as I said, there's a name, the base type, and we have two fields, memory usage and allocation count, both integers of type size_t. Before you use the event, you need to register it. The registration method is auto-generated by the declare-event-type macro I showed before; you just need to call it before you use the event, and it registers the type information with the HawkTracer system. Then you create a listener for the timeline. I said the global timeline is the best option, so we're using the global timeline here. There are some parameters: this one is the buffer size, which we set to zero so we don't actually buffer anything, we stream events directly to the client; and we say that we want to use the TCP listener, again so we can stream directly to the client.

To get the real data, HawkTracer already provides functions to get the memory usage, and it also provides functions to trace allocations: you just need to provide callbacks. There's pre-malloc, post-malloc, pre-realloc, post-realloc, and so on. We're only interested in the pre-malloc hook, so we register that. And this is basically our function: we get the memory usage, and we get the allocation count, which is updated in the other callback. Here, this is our malloc callback, so we increase the allocation count every time somebody calls malloc. We get all of that through the context object, and then we push the event: this is our type, those are our values, this is the virtual memory usage, and this is the allocation count.

And this is the client, written in Python. All we need to do is start the HawkTracer client and say: listen on this IP address and port. This is the animate method that's drawing the graph, but the most interesting bit is this one, so this is
basically how we read events from HawkTracer in our client. We wait for the end of the stream; while the stream isn't finished, we poll for an event. If there is an event in the queue, we check it: it's a tuple whose first element is the event name, so we check whether this is the event we want, and if so, we get the timestamp, the allocation count, and the memory usage; then we put the memory usage on the graph and print the allocation count. So it's very, very simple. Let's run it.

We run the client, and now we run the program that we want to trace. You can see that the memory usage is growing, and you can also see the allocation count going up. You might be wondering why the memory usage is growing: if we go back to our application, we see that we have a loop and we call malloc, 100,000 times. We can see how many mallocs we actually got; it's more than 100,000, so there are probably some other allocations going on. And this is the graph. So that's how you can get real data out of HawkTracer by defining your own event; it doesn't have to be memory usage, you could for example trace the number of HTTP calls, and so on.

So that was the first demo, and the second one is going to be a little more complex. I want to show you how you can trace multiple languages at the same time, because that was actually our real use case: we had an application written in C++ and we ran Lua and JavaScript on top of that. I'll do a slightly different example here. I have a main application written in Rust. It downloads an image using a downloader library written in C, which uses curl. Then I rotate this image in Python, using some image library (I think it's OpenCV; maybe not OpenCV, I actually changed it), so I run a Python interpreter from Rust. And at the same time, while I'm rotating the image, I also upload the image that I downloaded in the first step to an S3 backend. And I want to know how much time I
spend in each particular operation. So again I start with the code. This is our Rust client; even if you're not familiar with Rust, I think it's pretty simple. We start with download_file, in download_file.rs, which is basically just a wrapper for the download_file function from the library written in C. Once we have the file downloaded, we spawn two threads: one does the rotation in Python (it starts the Python interpreter and does the rotation), and the other one uploads the file to S3 at the same time. You can see that all of the functions here are decorated with those macros, and we also have some other trace points here. We can look at the downloader file; it's a pretty simple one, just a wrapper around curl. We trace the function here at the beginning, and we also want to see how long it takes to call curl_easy_perform, so we trace that whole scope too. The last bit is the Python one: we import the decorator and decorate all the functions. We basically load the image, rotate the image, and then save the image to a file; no rocket science here. It's important that if you want to run this with Python, this environment variable needs to be set; otherwise Python bypasses all the trace decorators. When it's disabled, the decorators have very little performance impact, so you can even leave them in production code.

Okay, let's run this. This is our program: it downloaded the file successfully, it's doing the rotation, and it's doing the upload. You can see that it actually rotated the image; all the angles are here. Besides the images, we also have rotate.htdump, which is the binary file with all the events, all the traces. Oh, this is not going to work, because I forgot the environment variable here, so we wouldn't have the Python traces. I'll run it again and delete the images. So now we have the htdump file, which is
the binary stream, and we can now use hawktracer-converter. This is the one that's experimental, so I hope it works. We generate the Chrome trace format; we say the output will be output.json. Okay, it converted, so now you can use the tool that ships with Chromium. You see first that we have three threads here, one, two, three; that's exactly what we had, because we had the main thread and then we spawned two extra threads, for rotate and for upload. This first bit is in C or C++: we download the file, there's the HTTP request. Just after that we start rotating in Python, and those trace points here, save image, rotate, main, are from Python, while this one, upload to S3, is in Rust. So as you can see, you can have everything in one trace file and easily analyze what's going on in the program. You can also convert it to flame graphs, if you want the flame graph view. This is a very simple program, so the flame graph is also very simple.

So that's it. We have some plans for improving this; these are some of the items, but not all of them. As I mentioned, we want to add an extra protection layer in case we lose some events on the way. And that's it, thank you. There's a bunch of links; feel free to contact me. There is documentation on the website, and there are tutorials that show how to integrate it with your project. HawkTracer can be amalgamated into a single C++ source file plus header file, so it's easy to link into your project if you want. There are other links for the bindings and for the converter. So thank you, and I think we have one minute for questions; if that's not enough, I'll be around.

Q: You showed TCP just now, right? So in addition to TCP, what do you currently support natively?
A: TCP, which is what we used in the first demo, and the file listener, which was used in the second demo. Those are the ones natively supported at the moment. If you want something like a serial listener, you need to write that listener yourself. Sorry, the question was what listener types we support, and my answer is that currently we support file and TCP listeners. Thank you.