I'm Shailen. I'm a technical consulting engineer at Intel in the Developer Products Division, based in Munich, Germany, and today's focus will be performance analysis of Python applications. There's no denying that Python is getting a lot of traction and a lot of importance these days. If you look at what our friends at CodeEval have published, Python has indeed grown in popularity over the last years: in 2016 Python remained the number one most used language, and, what is more surprising, Python remains the number one programming language in hiring demand. So being proficient in Python is a great skill to have this decade.

When it comes to performance analysis, there are certain fields driving the technologies of the future, technologies that are really important right now, and I would say those fields are mathematics and data science. To get my facts straight and my numbers correct, I went to Stack Overflow, our favorite website when we have problems, and Stack Overflow shows that Python is indeed the most used language in the fields of mathematics and data science. Now, you may think the math doesn't make sense, because if you add up all the percentages they don't sum to 100. That's because the roughly 50k people who responded to the survey could choose several languages, and more than 50% of them chose Python, which is quite impressive.

Math and data science drive high performance computing, or HPC, and other fields like artificial intelligence, machine learning and deep learning. Intel realizes that these fields are going to define the future, so we have worked really hard to release a software distribution we call the Intel Distribution for Python. It comes out of the box with highly optimized libraries that let you develop high-performance applications in Python. We made it super easy to use and super easy to install: packages can be downloaded through Anaconda, or even via yum, since we provide RPMs. Our distribution comes with highly optimized libraries like NumPy, SciPy and scikit-learn, which at the base leverage Intel MKL, short for Math Kernel Library. If you've never heard of MKL: it is written in assembly, and its super-optimized mathematical routines have been designed to make the most of the Intel architecture. Whatever calls you make use the latest instruction set architecture available, for instance AVX or AVX-512, whatever you have, and get the most out of vectorization, so you don't have to worry about any of this when you use the Intel Distribution for Python.

Now, performance is really important, so how do we actually measure the performance of a Python application? Enter Intel VTune Amplifier. VTune Amplifier is a code profiler: a tool that lets you find where the performance problems in your software are.
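As a quick aside on the MKL point: if you want to check whether the NumPy you are running is actually MKL-backed, a minimal sketch looks like this (the exact report format varies between NumPy versions):

    import numpy as np

    # MKL-backed builds list "mkl" in the BLAS/LAPACK sections of this report
    np.show_config()
    print(np.__version__)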
VTune Amplifier has been in development for over 15 years and is still evolving; we ship a lot of improvements, and our engineers are working hard every day. Over the last four years we have also worked on profiling Python. What is great is that it comes with a low-overhead sampling technology that is unrivaled: no other profiler gets performance data as good as Intel VTune Amplifier does. There are techniques that let us collect performance data with low overhead, so basically, when Big Brother is watching, there is no big impact on the performance of your real application.

With Intel VTune Amplifier we are able to get precise line-level information. Some profilers let you do that, but others you may use only give you data at the function level, so if you have a big function you basically have to guess where the performance problem is. With VTune you get right to the source line where the bottlenecks are. A bottleneck is, as the name suggests, like the neck of a bottle: the place where performance is capped. Our goal is to find those spots in your code and optimize them in order to eventually increase the performance of your application. What is also great is that we can analyze not only Python performance but also Cython and, if applicable, any C code that your Python code is calling. Essentially you can analyze your whole system and get data not just about the Python binary and the Python files being called, but also about other modules built in C or C++.

In the coming 10 to 15 minutes I'll be talking about why Python optimization is important and how we find those bottlenecks, give a very short overview of the various profilers available on the market, then a quick demo of what the GUI looks like and what you see in the tool, and a few words about mixed-mode profiling.

So why do we need Python optimization? Well, there's no denying Python is everywhere, and Python is used in a lot of applications that need a lot of performance today. Look at web frameworks: Django, TurboGears, Flask; all of these require that things be done really, really fast. There are build systems like SCons and Buildbot; I don't know if you use it in your company, but we actually use Buildbot to build the packages for Intel VTune Amplifier and other tools across Intel. For scientific calculations there are tools like FreeCAD, a 3D modeling application that has large sections built in Python; these require high performance. There are also games: titles like Civilization IV or The Sims 4 are partly built on Python, and obviously you want your game to be efficient and run fast, right?

So how do we measure performance? There are a couple of techniques. One is code examination: you open the editor and check the code. This can be very tedious if you don't own the code, if you haven't written it yourself, or if the code base is huge; how would you check everything? But that's one way. Another way is logging: you insert pieces of code into your Python script that say, OK, print this timestamp here, then tell me at the end of my function how long that function ran. This is also tedious, manual work. And then there is profiling. Profiling is basically the core of how Intel VTune Amplifier works: we gather metrics from the system as the application is running, and at the end of the run we analyze all those metrics and make sense of the data we get.
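A minimal sketch of that manual logging approach (the function under test here is hypothetical) shows why it gets tedious: you end up wrapping every function you care about by hand.

    import time

    def timed(func):
        # wrap a function so it prints its own wall-clock duration
        def wrapper(*args, **kwargs):
            start = time.time()
            result = func(*args, **kwargs)
            print("%s took %.3f s" % (func.__name__, time.time() - start))
            return result
        return wrapper

    @timed
    def my_function():  # hypothetical function under test
        return sum(range(10 ** 7))

    my_function()

It works, but it only measures what you remembered to wrap, which is exactly why profiling is the better approach.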
We are going to focus on CPU hotspot profiling: finding places where your code is spending a lot of time on the CPU, or wasting a lot of time, or, if you have a threaded application, where one thread is waiting on a lock and not doing anything, essentially stalling. Finding those issues and removing them is the way to go.

Now, there are a couple of types of profiling. There is event-based profiling, which essentially collects data when certain events happen, for instance entering a function, exiting a function, or loading a class; at those events we get performance data. There is instrumentation, where the target application is modified and basically profiles itself. And then there is sampling, or statistical profiling, and this is how VTune works: VTune is a statistical performance profiler. There are some caveats to bear in mind. As with any statistical method, the more data you have, the longer your application runs, the more accurate the result, which is why I underlined "approximate" on the slide; but I also put "much less intrusive" in bold. With the statistical method we employ to measure the performance of Python applications, we get low-overhead performance profiles, and the longer your application runs, the better the results.

Here is a short overview of the various profilers. You may have seen them or not, and there may be others, but these are the most common ones. Intel VTune Amplifier comes with a rich, highly advanced, highly customizable GUI viewer, so you can see quickly and visually where the problems are; it works on Linux and Windows; and, also very nice, it does line-level profiling, not at the function level but right at the source line where your problems are. Overhead is very important in the interpreted Python world: only a 1.1x performance hit, which is a really low number compared to other line profilers like line_profiler itself, which has a 10x performance hit. With a slowdown like that, line profilers are basically unusable; you get bogus data. cProfile gets you data at the function level with relatively low overhead, but then again the granularity is very coarse. There are also other Python tools bundled in IDEs like Visual Studio: again function-level, with roughly a 2x performance hit. Our tool works with basically every Python distribution you may be using, even the Python supplied by Ubuntu or whatever system you're on, or obviously our own Intel Distribution for Python, which is built with ICC. There is support for Python 2.7.x and Python 3, and remote collection over SSH, so you can sit at a Windows machine and remotely profile a Linux machine where your Python code is running, which is really great. You can also attach to a running process: if your Python code cannot be stopped, you just attach to the PID and get performance data.

Analyzing performance is actually really simple, just a few basic steps: create a project in our tool, configure the various settings, run, and interpret.
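For comparison, the function-level data from the standard library's cProfile mentioned a moment ago takes only a few lines; a minimal sketch (the main() entry point and file names are hypothetical):

    import cProfile
    import pstats

    # profile a call and dump the raw stats to a file;
    # the same can be done from the shell: python -m cProfile -o profile.out myscript.py
    cProfile.run("main()", "profile.out")

    # print the ten most expensive functions by cumulative time
    stats = pstats.Stats("profile.out")
    stats.sort_stats("cumulative").print_stats(10)

Note that this attributes time to whole functions, so the coarse-granularity caveat above applies.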
I did a small test to show you how it works. I have some Python code doing something very simple; I'll show you this piece of code, I hope it's not too small. Can you see it? Is it good enough? Yeah, OK, I get your thumbs up. The code is very simple, not a lot of lines of code, just one script, but it does some heavy computation, so imagine seeing this in some high-performance kernels. There is a small main script with two parts. One part uses multiprocessing to create two subprocesses and calls multiply, which, as the name says, multiplies two matrices, a times b, and stores the result in c, using a quite badly written triple-nested-loop implementation. If you're tempted to write it this way, don't; it's a really bad implementation. The other part is a method that uses NumPy out of the box: this is the BLAS multiply, basic linear algebra.

Then we run the code. To save time I already ran it in my Linux virtual machine and collected the results, and opened them in VTune here on Windows. This is what it looks like. On my summary page I have an overview of the time the application ran. There is also the CPU time, which is basically the time summed per CPU core; here I see about 130, which looks right because I have a dual-core system, and the elapsed time, my wall-clock time, was 57, and 57 times 2 roughly matches that, so my code was actually quite well parallelized. You can also see in the CPU usage histogram that my CPU concurrency was 2, which is great, plus some collection and platform information. Although I launched two processes because I have a two-core system, that doesn't mean everything was great; remember, it's a triple-nested loop, which is not so nice, and that's why in this script I'm also profiling the performance of the BLAS NumPy code.

Actually, one more thing before the bottom-up view: the Top Hotspots section of the summary already lists where you need to spend time optimizing your code. If I go into the bottom-up view, it has sorted all the methods called in my Python script, and you can see that multiply, the aggregation of those two multiply calls, contributed most of the time. Because I also collected call stacks, I can drill down into how my method was called in my Python script. I can double-click, and VTune will open the source file and point at the source line where most of the time was spent. That's what I've done here: I double-clicked on that call-stack line and it automatically opened the source script (let me move that line a bit). We can see that most of the time was spent doing this matrix multiplication, 26 percent of the CPU usage.

Going back to the bottom-up view, the timeline shows how active my CPU was over the whole runtime of the application. You can see that for the two processes the multiprocessing package created, my CPU was active; both processes were busy computing the matrix multiplication. At the end of my naive multiplication comes the BLAS one, and it can be seen as the very tiny part here at the end. I can zoom in and filter in by selection; this is the zoomed-in timeline, and there is a very tiny little piece on the main thread, thread ID 3345, which was the BLAS version using NumPy. We can zoom in further: filter in, zoom in, and filter in again. What this does is take the timeline I zoomed into and filter on it, telling me which methods were being called during that time window, so you get even more control and more power over what you see. For this little part here, for instance, a matrix product routine was called; it's a shared object built by NumPy, and we can see its call stack, too.
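The talk doesn't show the full listing, so here is a hedged reconstruction of what that demo script might have looked like; the matrix size and all names are my assumptions, but the structure matches what was described: two subprocesses running the naive multiply, then the BLAS version via NumPy.

    import multiprocessing

    import numpy as np

    N = 256  # hypothetical matrix size; the talk does not say

    def multiply(a, b, c):
        # deliberately naive triple-nested-loop matrix multiply: don't do this
        for i in range(len(a)):
            for j in range(len(b[0])):
                for k in range(len(b)):
                    c[i][j] += a[i][k] * b[k][j]

    def naive_worker():
        a = [[1.0] * N for _ in range(N)]
        b = [[1.0] * N for _ in range(N)]
        c = [[0.0] * N for _ in range(N)]
        multiply(a, b, c)

    if __name__ == "__main__":
        # part one: two subprocesses each grinding through the naive multiply
        procs = [multiprocessing.Process(target=naive_worker) for _ in range(2)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()

        # part two: the BLAS multiply; NumPy dispatches to optimized routines
        # (MKL in the Intel Distribution for Python)
        a = np.ones((N, N))
        b = np.ones((N, N))
        c = np.dot(a, b)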
Going back to my slides: you are also able to run mixed-mode analysis, meaning you get performance information about your Python code and also about Cython and/or native code called by your application, be it C or C++. You get all of this; for instance, here one module is a shared object, so a native library, and the other one is a .py file, a Python script.

To summarize: tuning your application is obviously a good thing, everybody has to do it, there are ways to do it, and VTune is a tool for it. And because I was asked earlier by Tomas, who's sitting in front, maybe this is interesting for you: Intel VTune Amplifier is a commercial product, but there are ways to get it for free. It's free in the beta program, so if you sign up for the 2018 beta, which comes with more advanced profiling capabilities for VTune, for instance detailed information about threaded applications and also memory consumption, it's free for testing and evaluation for a long period of time. It's also free for everyone in academia: students, professors, universities, anybody from academia. Only companies that work on real projects and generate money require a license. Just a small word about that; I'm an engineer, I don't talk business, but I think it might be relevant for you. You can get more information in two talks by my colleague David Liu: there is "Infrastructure Design Patterns with Python" on Wednesday, but probably more relevant to this talk is the workshop on Thursday, which is all hands-on about how to tune your application with our tools. On this, thank you very much for your attention.

Hi, and thanks for your talk. If I understand well, you can annotate the Python source and see line by line the time of execution. Would it also be possible to annotate Cython source directly, and not the C or C++ source that Cython generated?

What do you mean by annotate? Because there is instrumentation, but tell me more about annotation in your case.

I mean, just as we saw in the diagram: you can see the source lines and the time they took to execute, the cumulative time, that kind of profiling. Instead of showing the C source that was generated from the Cython, can we see that directly with the lines of Cython?

Yes, actually you will see it directly from the line of Cython.

(The next question was asked without a microphone, so let me repeat it: the application is already running, it has a process ID; how do we actually attach to it?) Well, there are mechanisms. You already know the PID if it's running, and if it has a PID you can also find out the name of the application. In the GUI you can provide either the name or the process ID, and VTune will attach to it.

One other question: when you have C extension modules, do you also need that module to be compiled with the debug flag so that you can sample from it? And what if you don't have access to that, if it's just the binary that came with your distribution?

That's a very good question. In that case you will basically see func@0xdeadbeef, a hex address for functions whose names are unknown. Our Python binary provided by our distribution is built with ICC with the debug flag, so you can see, deep down in Python itself, the method names being used; that's for Python. For an external library, you would obviously want to build with -g to get the debug information for your own code.
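If you build the extension yourself, keeping that debug information is a one-line change; a minimal setuptools sketch (module and file names here are hypothetical):

    from setuptools import setup, Extension

    ext = Extension(
        "mymodule",                 # hypothetical extension name
        sources=["mymodule.c"],
        extra_compile_args=["-g"],  # keep symbols so the profiler can
                                    # resolve function names and source lines
    )

    setup(name="mymodule", version="0.1", ext_modules=[ext])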
As for getting our Python distribution: it comes through the Anaconda distribution, but there are other channels too. One way is to just add our repository and then do yum install, but Anaconda is the preferred way; there is Anaconda, conda, and so on. Thank you.

Hello. You mentioned VTune is a statistical-type profiler, and we've seen some results from the code you were running, the matrix multiplication. I was wondering whether the results we saw came from running the code a number of times, say 10,000 times, and taking statistics over those runs, or was it just one run whose results were displayed?

That's an excellent question. In this case it was run once; this is what you get right away. But to confirm for yourself that the data you got makes sense and is true, you run it many times. You can have another Python script that runs your script many times, and our tool also comes with a command-line interface, so you can have one line that does the profiling for you and saves the results. So you can automate running your program many times with VTune wrapping your application through its command-line interface; this is how you can hook it into your own build system or regression-testing system and gather data.

And in that case, how much time do you have to spend? Is it quite slow to run this kind of analysis, say running your code 10,000 times and drawing statistics from it? Do you have any metrics on that?

That depends on the resolution of your analysis. In my case I did an analysis with a sampling resolution of 10 milliseconds, which is actually quite coarse. If you want more data, more resolution, you can lower this interval; but obviously, the shorter the sampling interval, the more data you collect, the larger the potential overhead, and the less accurate your result may be. So it's a matter of playing around; in general, anything running longer than two or three seconds is good enough.

Hi, a couple of questions. Can you attach the profiler to a running process, and does that process have to be built in a special way for that? So you can do profiling in production, that kind of thing.

I think that question was asked already. Sorry, OK. So the answer is yes, you can attach to the running process.
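(A hedged sketch of the automation described two answers back, assuming the amplxe-cl command name mentioned later in this talk and a hypothetical myscript.py; flags may differ between VTune versions:)

    import subprocess

    # collect ten independent hotspot profiles, one result directory per run
    for run in range(10):
        subprocess.check_call([
            "amplxe-cl",                    # VTune Amplifier's CLI tool
            "-collect", "hotspots",
            "-result-dir", "r%03d_hs" % run,
            "--", "python", "myscript.py",  # hypothetical target script
        ])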
The second question: you had an earlier slide showing the time taken on a particular line of code, and that line of code had two calls; it was something like logging.info(template.format(...)), so there are two function calls in there. Can you decompose that, in the browser, into those two function calls and the time each one took? Because you're just showing the sum of the total for that line.

So your question is about the two processes you created with multiprocessing?

No, no: I'm making two function calls in one line, two method calls in one line, something like logging.info(template.format(...)), so you're calling info and format. Can you decompose that in the browser?

Well, in this case it will aggregate the time and show you the whole time on that one source line. I think it's bad practice to write it that way anyway, for code readability, in my opinion; I don't know how you do it. But wait, hold on, I will add one more thing: in this case you see a single source line because we're associating time with source lines in your source code, but in the bottom-up view you will see two different functions. When you click on either function you will land on the same source line, but you will know the time for each function.

Next question here: which interpreter do you use in your distribution, and have you applied modifications to the interpreter to make it faster?

Oh, the acoustics are not so great. What I got is: how is memory alignment done?

No, no: what interpreter do you use, and have you made any changes to the interpreter to optimize it?

Can anybody rephrase this for me, please? OK, thank you, Adam. Well, our interpreter is built from scratch and compiled with ICC. There were some changes; I don't know in detail what changed, but there were minor changes in the interpreter. However, all the libraries making heavy use of mathematics have been redesigned completely to make use of MKL. This is the benefit we bring with our Intel Distribution for Python: when you do HPC-style applications in Python, or machine learning and deep learning, even through SDKs and frameworks like TensorFlow and Caffe, or the autonomous-driving SDK or the computer-vision SDK from Intel that leverage the Python distribution, you get the performance out of the box. You don't have to be a math genius who codes perfectly, or a super software engineer with great code-optimization skills, to get high performance; it's done out of the box for you.

Thanks. You're welcome. It may already be lunchtime, but just one thing: if you have really interesting questions that you really want answered, our workshop on exactly this topic could be very useful for you. It's on Thursday; check it out.

I have a question for cluster users, because I see that on my machine it can connect to the process, but if I have a cluster, how can I measure the performance of all the worker machines? Is that possible?

Great question. Yes, it is possible. You're probably using MPI, right?

No, I'm not using MPI, I'm using just...

OK, let me take MPI as an example. You have a cluster with several nodes and your Python code is running on all of them. You will have the VTune Amplifier sampling driver on all those nodes, and with the MPI gtool option, for instance, you just pass mpirun -gtool with amplxe-cl, our command-line interface tool, and then your Python script, and it will do the job for you and get you the results. It's magic; it's really nice.

Very interesting, thanks.

OK, thank you.