Right. Good morning everyone. Thanks for joining my talk today. In the next 35-40 minutes I'm going to talk about Python on Arm platforms. This is the agenda for today: after a brief introduction to set the scene, I will cover Python on the Linux and Windows operating systems, and since Python is one of the main languages for machine learning, I thought that deserved some coverage as well. At the end, we will see what gaps we have today and what the next steps are.

Let me start by introducing myself. My name is Diego Russo. For my entire career I've been using Python in a variety of environments. For the last 12 years I've been working at Arm in Cambridge. During this time I've been in different departments, and since 2020 I've been in a machine learning group. Almost three years ago I also started a Python Guild, which is an internal Python community where we organise talks, events, summits and so on. In the last few months I have been doing a secondment within the company, which is a kind of rotation, in the open source group. The goal of this secondment is to investigate CPython on Arm platforms, and in today's presentation I want to share with you what I have discovered so far.

So, let's check something. Can you please raise your hand if you know Arm or what Arm does? Okay, that's good. I'm going to cover what Arm does anyway. Arm creates the designs of microprocessors, GPUs and other technology, and then sells these designs to partners, who create silicon chips from them. Besides the designs, Arm provides software and tools to support the ecosystem. As an intellectual property company, an IP company, the business model is based around licences: a customer takes a licence to use a specific IP, and once the customer ships a product, Arm receives a fee for every product sold. These fees are called royalties.

Finally, I want to mention the huge ecosystem of partners. This is very important to highlight because, as you will see throughout the presentation, ecosystem partners play a critical role in the success of Arm. In the picture you can recognise some of the big names out there, and some of them are here at EuroPython as well. A big part of the engagement with our partners happens through the open source group. The open source group works at every level of the software stack, from the lower levels with the software/hardware interfaces up to high-level languages and applications. We have our own open source projects, but we also contribute to hundreds of third-party open source projects. As examples you can see LLVM, TensorFlow, GNU and Linux, and of course projects in the Python ecosystem.

And Arm is everywhere: think about a piece of technology and very likely there is Arm IP implemented in it. You can find Arm IP in microcontrollers, smartphones, laptops, cars, appliances and so on. But in this presentation I want to focus on two main markets. The first one is the cloud, where the infrastructure requires a performant and power-efficient compute foundation, built for instance on Arm Neoverse. We are going to cover what Neoverse is later; Neoverse is in all major public clouds nowadays, and every developer across the world can get access to a modern cloud based on Arm. The second market is Windows on Arm. For instance, the Windows Dev Kit 2023, or Project Volterra, is a mini desktop form factor to help developers build Windows applications that leverage the power of an Arm processor. This is the first Windows on Arm developer kit.
Inside there is a Snapdragon SoC, and it's ready for installing a comprehensive Arm-native developer toolchain. Nowadays you can also find other devices that support Windows on Arm, like the Surface Pro 9, the Surface Pro X, the Lenovo ThinkPad X13 and so on.

So let's look at Python on AArch64 first. First of all, I want to recognise all the effort that the upstream community has put into enabling Arm architectures. Arm architectures and Python share a long history together: I've dug around the CPython history and it seems that the first ever commit related to an Arm architecture was back in 2001, while AArch64 support was added in 2014. But before continuing, let me define what AArch64 is. AArch64 is the name of the 64-bit execution state of the Arm architecture, first introduced with Armv8.0-A in 2011; before that, Arm processors were 32-bit only.

But what are the changes related to CPython? CPython is written in a way that can be compiled for multiple architectures with very little effort, so the changes are mostly related to the tooling around it, like build and installation scripts. Then in 2019 PEP 599 was created and accepted. This PEP defines the manylinux2014 platform tag, and it is important because it officially introduces support for Arm platforms: armv7l, which is the 32-bit variant, and aarch64, which is the 64-bit one.

Now let's see what Arm and its partners have done to enable the Python ecosystem on Arm. As part of the first launches of Arm instances on the public cloud, we wanted developers to have a smooth experience when dealing with the Python ecosystem: if a developer installs a built distribution package (a wheel), it should just work, without falling back to a recompilation step from the source distribution. This was done in collaboration with an Arm partner; it started in 2020 and lasted for a couple of years. During this time frame almost 2,900 packages were analysed by testing them on x86 and AArch64 and seeing what their issues were. The failing ones were then sorted into a priority list and fixed. After a couple of years, more than 200 Python projects had been enabled by generating AArch64 built distributions.

Very quickly, I want to mention conda-forge. Although we have acted on just a dozen packages there, conda-forge seems to be a healthier ecosystem, as they have a migration plan for AArch64 packages, although the migration is coupled with the PowerPC64LE packages. You can see the progress of the migrated packages over time, and the migration seems very close to the end.

So far we have seen the enablement of the Python ecosystem on AArch64, but what about performance? If you go on speed.python.org you see a series of data and graphs of benchmarks. It's all good except for one thing: there is no reporting of AArch64 results. The website is powered by Codespeed, which is a Django application that provides the web interface, the API and the database. The benchmarks themselves are part of the pyperformance benchmark suite, and pyperformance can upload benchmark data directly into Codespeed, so they are very well integrated. So I could take this software and replicate the infrastructure internally, which I did: I set up an internal instance of Codespeed and, via Jenkins, I scheduled pyperformance runs on different machines that we have within the company. Then I ran a series of experiments, which are listed on the slide and which I'm going to talk about next.
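For reference, each scheduled run boils down to something like the following minimal sketch, assuming pyperformance and pyperf are installed for every interpreter under test; the interpreter paths and output file names here are only illustrative:

    import subprocess

    # Run the full pyperformance suite under each interpreter and save JSON results.
    for python, out in [("python3.10", "py310.json"), ("python3.11", "py311.json")]:
        subprocess.run([python, "-m", "pyperformance", "run", "-o", out], check=True)

    # Summarise the difference between the two runs as a table.
    subprocess.run(["python3", "-m", "pyperf", "compare_to",
                    "py310.json", "py311.json", "--table"], check=True)

In my setup, Jenkins runs jobs of this kind and the resulting data is then pushed into the internal Codespeed instance.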
One of my goals is to make these metrics publicly available, and for that I have started a conversation on the public Python forum.

So let's see how 3.10 and 3.11 compare to each other on the N1. Neoverse is Arm's line of infrastructure CPUs, which provides a balanced CPU design optimised for delivering both performance per watt and performance per dollar. You can find Neoverse N1 processors in the AWS Graviton 2 instances, and it is based on the Armv8.2-A specification. By now you should be aware of the performance benefits that Python 3.11 brings compared to Python 3.10; what you see in the picture is the manifestation of that on AArch64. Apologies if the text is small, but what is important is to have the full picture of the whole benchmark suite. On the x-axis there are all the micro-benchmarks belonging to pyperformance, and on the y-axis there is the speed improvement expressed as a percentage; they are ordered by speed improvement. On the far left there is the first benchmark, deltablue, which gets a 14.5% speed improvement if you run it with Python 3.11 compared to 3.10, while on the other side the last benchmark, coverage, is 35% slower with Python 3.11. So except for the last five benchmarks, we can see that most of them run faster with Python 3.11. The same thing happens on Neoverse N2, which is the next generation of the N series and the first processor to be based on Armv9; these CPUs can be found, for instance, on Alibaba Cloud. I ran experiments on Neoverse V1 as well and it shows consistent behaviour; V1 can be found in the AWS Graviton 3 instances and it is based on the Armv8.4 specification. So, bottom line, nothing to worry about: Python 3.11 performs as expected on AArch64.

I want to touch briefly on the distribution channels of Python, because Python can be installed in different ways: you can use the distribution package, the deadsnakes PPA, pyenv, or you can compile it from source yourself. These binaries are compiled in different ways, so don't assume they are all the same. I want to highlight this because in my experiments I've seen that they have an impact on the performance metrics, although a minimal one. The bottom line is to use whatever you have available; if you really care about performance you can try different combinations to see what works best for your workload, or compile it with different flags, so some investigation is required. But this is only if you care about performance; otherwise the standard routes are more than okay.

Then, in order to preserve compatibility and portability, distribution packages on AArch64 are all compiled against Armv8.0, which is the first version of the 64-bit architecture; this is also the default behaviour in GCC. For Armv8 we have versions from 8.0 to 8.9, and for Armv9 we currently go up to 9.4. This means that packages in distributions, and when I say packages I mean all of them, not just Python, don't make use of all the new features of these specifications unless the user recompiles the application. And this is exactly what I did: I host-compiled CPython on different systems with different CPUs, like N1, V1 and N2, passed the optimisation flag -mcpu=native, and checked whether the performance was affected. With the -mcpu flag you can specify the target processor to optimise and tune for, and with "native" it picks the architecture of the host system; for this reason I host-compiled CPython.
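If you want to check what your own interpreter was built with, and what the host CPU actually exposes, a quick sketch like this works on Linux (the exact feature names, such as asimd or sve, come from the kernel and vary by CPU):

    import platform
    import sysconfig

    print("machine:", platform.machine())                       # e.g. aarch64
    print("CFLAGS :", sysconfig.get_config_var("CFLAGS"))       # shows -mcpu/-march if any were used
    print("config :", sysconfig.get_config_var("CONFIG_ARGS"))  # the ./configure arguments

    # On Linux the kernel lists the CPU features it detected.
    try:
        with open("/proc/cpuinfo") as f:
            for line in f:
                if line.startswith("Features"):
                    print("features:", line.split(":", 1)[1].strip())
                    break
    except FileNotFoundError:
        print("/proc/cpuinfo not available on this platform")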
I've also tried different compilers: GCC 11, 12 and 13 (I think 13 came out this year). The result is that using the -mcpu flag doesn't really affect performance, except in one case: GCC 11 versus GCC 12, both with -mcpu=native passed to the compiler, on Neoverse V1. In that case I got a 10% improvement. I think the reason is that Neoverse CPUs are relatively new and the compiler support is not fully optimised and tuned for these CPUs yet, so I expect this to improve over time with newer releases of the compiler.

Now let's see some benchmark results from runs on AWS. I ran them on different flavours: C6g, which is Graviton 2, C7g, which is Graviton 3, and a comparable x86 one, which is C6i. For the sake of brevity, let's look just at Graviton 3 versus the x86 one. The meaning of the data in the picture is the same as earlier: micro-benchmarks on the x-axis and the speed improvement on the y-axis. We can see that C7g performs better, and C6i wins only on the last nine benchmarks on the far right, so this means there is room for improvement in single-process performance. I've also done a cost analysis for the whole workload: I measured how much it would cost to run the whole pyperformance suite, and Graviton 3 turns out to be more cost-effective, because of course it's cheaper by the hour than the x86 one. As I was saying at the booth with some people yesterday, when it's about performance there is always a trade-off, so you need to understand what's important for your workload, whether you want to save some money on it, and so on.

I've also run benchmarks measuring multi-process performance. The way I did it is to run an instance of pyperformance for every core present on the machine (there is a rough sketch of this just below), and for this kind of experiment I chose the biggest flavours available: the 16xlarge has 64 CPUs and the 32xlarge has 128 CPUs. I also tweaked pyperformance to run the benchmarks for longer; the whole experiment lasted about 30 hours. The motivation for this experiment is to simulate a cloud workload: use multiple cores for longer, in order to avoid the boost clock spikes that you get with shorter workloads. What we can see is that Graviton 3 requires almost half of the time to run the whole workload.
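The driver for the multi-process experiment is conceptually very simple; a rough sketch of it, with illustrative output file names, looks like this:

    import os
    import subprocess

    # Start one pyperformance instance per available core and wait for all of them.
    ncores = os.cpu_count() or 1
    procs = [
        subprocess.Popen(["python3", "-m", "pyperformance", "run", "-o", f"worker_{i}.json"])
        for i in range(ncores)
    ]
    for proc in procs:
        proc.wait()

On the 16xlarge and 32xlarge flavours this keeps every core busy for the whole duration of the run.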
Can you guess why? Well, I think it's written there. It's because the C6i flavours have SMT, simultaneous multi-threading, enabled: the operating system sees two virtual cores for every physical one, so two processes can compete with each other for one physical core. This doesn't happen on Graviton 3: when you have 64 vCPUs, you have 64 physical cores. So if your workload requires multiple processes, Graviton 3 machines perform much, much better compared to the x86 ones.

Now let's have a look at Windows on Arm. Windows on Arm support has been present in Python since 3.8, but no official builds existed until recently: Python 3.11 is the first version to officially support Windows on Arm, and this has been possible thanks to a joint effort between Arm, Microsoft, Linaro and other partners. The overall goal of this partnership is to have an ecosystem that supports native development on Windows on Arm, and when I say ecosystem it's not just Python but all the software that you see there; of course Python is part of this ecosystem. As I mentioned earlier, there are devices on the market nowadays where you can buy a Windows on Arm machine.

This is how they did it: similarly to the AArch64 enablement work, the top 520 packages were tested on Windows on Arm, and the good thing is that over 70% of them succeeded. So 135 packages failed, due to the lack of direct or indirect Windows on Arm support. They acted on the Python packages directly, but also on the dependencies needed to build such packages, like third-party libraries and toolchain packages, so it's not just Python. Linaro has also been hosting a Surface Pro X as an official CPython buildbot instance to enable builds of the ARM64 Python version. And of course, if you are a Windows user with a Windows on Arm machine, if you have any feedback or you want to contribute, there is a collaborative space where you can read all the latest updates and get in touch with Linaro, who is driving these activities.

Then, if you want to build a Python package for Windows on Arm, don't worry, because this is possible even if you don't have access to a Windows on Arm platform: you can build it using cross-compilation, which is still the best approach, and then you can test it using, for instance, QEMU, Wine and Docker on an AArch64 Linux machine. On the slide there is a link to a guide on how to do this, which Linaro wrote.

So now the question is: why do we want to build native packages for Windows on Arm? The answer can be found in the performance. Windows 11 on Arm can run x64 binaries, that is 64-bit x86 binaries, via emulation, but this has an impact on performance. In the case of pyperformance, I ran the suite using the AMD64 version of Python, so the x64 version, and also with the official 3.11 ARM64 Python, and the ARM64 Python is almost twice as fast as the AMD64 version, which runs emulated. The non-native binaries work, but they are much slower than the native ones, and of course this has an impact on the user experience of the platform and of the applications. So my recommendation is to start building native packages for Windows on Arm; users of these platforms will really appreciate the effort.
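If you are not sure which flavour of Python you are actually running on a Windows on Arm machine, a minimal check like this tells you, and it also tells you which wheels pip will pick, since pip follows the interpreter's own architecture:

    import platform
    import sysconfig

    print(platform.machine())        # "ARM64" for a native build, "AMD64" for an emulated one
    print(sysconfig.get_platform())  # e.g. "win-arm64" versus "win-amd64"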
Right, as the last part of the presentation, let's see what has been done in the machine learning world. At the beginning of the presentation I said that since 2020 I've been in the machine learning group; in fact I work in the same team that collaborates with Google on TensorFlow Lite and the TensorFlow Model Optimization Toolkit. TensorFlow Lite is a library for deploying models on mobile, microcontrollers and other edge devices, and the contributions we have been making are all around quantisation support. The TensorFlow Model Optimization Toolkit instead is a suite of tools for optimising machine learning models for deployment and execution; we have contributed different optimisation techniques and ways to apply them together without affecting performance. This is an ongoing engagement that Arm has, and the main goal is to run neural network models efficiently on Arm IP.

But what about TensorFlow packages for Arm platforms? Thanks to a collaboration between AWS, Arm, Google and Linaro, starting from TensorFlow 2.10 AArch64 packages are available on PyPI.org. So if you go on AWS, spin up an EC2 machine and pip install TensorFlow, it just works. Also from TensorFlow 2.10 there is an integration of the Arm Compute Library (ACL) through the oneDNN API; this is done to accelerate performance on AArch64 CPUs, so the package is also optimised for that platform. So again, yes, you can just pip install it and it will be optimised (there is a small sanity-check sketch at the end of this section). Windows on Arm, though, doesn't have an official package yet, but in the slide I put a link on how to build it, so if you are a Windows on Arm user you have a way to build TensorFlow for Windows on Arm.

For PyTorch instead, the AArch64 packages are available since version 1.8, and if you want to test the bleeding-edge version, PyTorch provides Docker images that contain both OpenBLAS, which is the default backend, and the oneDNN plus ACL backend, the same backend that TensorFlow has. Like TensorFlow, there is no Windows on Arm package available yet, but Linaro is keeping track of it; I've seen that there is an issue open on the PyTorch GitHub.

In this slide I also want to highlight all the contributions we have been making in the machine learning space. The first one is Apache TVM, which is a machine learning compiler framework for CPUs, GPUs and machine learning accelerators: we have added support for ACL, for instance, contributed to the packaging and the CLI, and added support for many microcontroller platforms. In OpenBLAS the contributions were all around Neoverse cores and SVE, where SVE stands for Scalable Vector Extension. And last but not least, we have implemented some faster maths routines in NumPy.
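As a quick sanity check of the AArch64 TensorFlow wheels mentioned above, something like this is enough to confirm that the package imports and runs; it assumes TensorFlow 2.10 or later installed with pip on an AArch64 machine:

    import tensorflow as tf

    print(tf.__version__)
    print(tf.config.list_physical_devices())

    # A small computation to confirm the build actually executes.
    a = tf.random.uniform((512, 512))
    b = tf.random.uniform((512, 512))
    print(tf.reduce_sum(tf.matmul(a, b)).numpy())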
So we have seen Python on AArch64, on Windows on Arm, and the machine learning side; I guess now it's time to understand what's next and how the whole Python story can be improved on Arm platforms. Let's start by analysing the gaps we have today. First, PEP 11, which covers CPython platform support, defines the support tiers for CPython, which are three in total; currently the AArch64 platforms are in tier 2. It would be great to start a conversation about moving these to tier 1, next to x86. There are requirements to be satisfied, so we can take some actions to move these platforms to tier 1. Second, as we saw earlier, speed.python.org doesn't have any AArch64 benchmark metrics, and as I said, I'm personally working on getting this fixed. The Windows on Arm story is still in its early stages, so it's quite recent, but it's progressing very quickly, so I'm sure the situation will improve over time, and you can help: if you provide Python packages, you can start building packages for these platforms so users don't have blockers when using them. Last but not least, single-process performance: while multi-process performance is excellent compared to x86, we have seen that single-process performance lags a bit behind, and it could be improved, so maybe we can do something about that. In the rest of my secondment I will dig a bit deeper into the reasons why these benchmarks don't perform as well as on x86. These are the gaps I think we have today, and now we need to understand how we can work together to make the whole Python on Arm story a better one.

As a first step, try migrating your workloads to Arm: if you have any workloads in your company, try migrating them. The migration at this point in time should be seamless and painless, but if you see any issues, I think the best thing to do is to raise them with the upstream communities of the projects you are using, so we can get them fixed. And as I said earlier, nowadays every developer can access Arm platforms on all the clouds, so give it a try. Also, if you provide a package, consider building and validating the package on Arm platforms: we saw that there are real benefits in terms of performance on Windows on Arm, and more and more people are migrating workloads to AArch64. And lastly, we are here to help: feel free to engage directly with us during the conference, we are available for a chat and more than happy to help you out. Of course, another way to engage after the conference is through the Arm Developer Program. This was launched last year, and there are about 5,000 developers on the platform, where you can ask questions and share your experience. Basically it's a Discord server where you have access to Arm experts, who are actual Arm employees; we have office hours where you can ask questions, and there is a portal with guides, and we keep adding guides on how to do things. So go on arm.com, find the developer program and sign up, or feel free to come to our booth and you can sign up there. Thanks for your patience, thank you. That finishes the presentation, and now if you have any questions I'll do my best to answer them.

Thank you, Diego, for the presentation about the state of Python on different Arm platforms. Now, if you have any questions, there's a microphone in the back, so feel free to go there and ask.

Hi, thank you for the talk. I have a question regarding the optimisations. You mentioned in the last slides that you contributed some improved routines, for instance to NumPy, and a bit earlier there was the fact that right now wheels use manylinux2014 or manylinux2010. So the optimisations you contributed, are they backwards compatible with Armv8, or is it basically that you can enable more if you recompile with native flags?

So I think the optimisation you are talking about is this one, is it this one? Yes, it's related to this. What I did, basically: GCC targets Armv8.0, which is the baseline AArch64 specification, and this means that all the later specifications, 8.2, 8.3, 8.4, 8.5, which introduce new hardware features, cannot be used by the software until you recompile it. So it was an experiment to see what happens if I recompile CPython passing this -mcpu flag, basically trying to target that specific CPU and tune for it, and see if it has an effect. Then, once I had the binary compiled for that platform, I ran pyperformance, and I've seen that there are no real benefits, except for one specific case, which was GCC 11 versus GCC 12 with the flag passed, on V1 I think, yes, on V1. And the reason why, I think, is that V1 was one of the first of these CPUs to be implemented, so the compiler team had these optimisations added at an early stage, so it's more optimised for it. Alright.
And in terms of the software contributions you made on the machine learning side of things, are those then bound to particular compiler flags?

The contributions that we make, for instance to TensorFlow, are fairly generic, which means that if you run this stuff on x86 it will benefit x86 as well; they are not specific to any architecture. Unless you are talking about ACL, the Arm Compute Library, which is a library that works only on AArch64 and uses all the new features of the hardware. So there are different levels of optimisation. The ones in TensorFlow are very high level and related specifically to machine learning; in fact, over there you can see, maybe it's not that slide, let me see, there are different optimisations like pruning: if you have a neural network model, you start taking out nodes without affecting performance, and this is pruning. Then there is clustering as well, where you cluster nodes into different sets. And then there is quantisation, where basically you convert a floating-point model to an integer model, which could be int8 or int16 (there is a short sketch of this below). Once you have the quantised model, it can be used on accelerators: for instance, one of the lines of processors that we have is an NPU, a neural processing unit, which is optimised to run quantised models, so it can run inference much faster. So there are different levels of optimisation that we are acting on. Thank you.

Thank you for your talk. I had maybe two questions, but first: in my circles I usually hear that the benefit of Arm is mainly around power consumption, and I feel like I missed that in your slides. Do you have a comment on saving power and, as a result, saving cloud costs?

It was kind of implicit in what I said. Let me see, hold on, yes, in here: it's more cost-effective in the sense that the cost per hour of Graviton 3 is much cheaper than the x86 one, and the reason why is that for the cloud provider it's much cheaper to run, because it consumes less power. So that's kind of implicit: the cost benefit is passed on to the customer, because the cloud provider needs to pay less for electricity, and less electricity for cooling the server farm, all these kinds of things. Okay, thank you.
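To make the quantisation step described above a bit more concrete, this is a minimal sketch of post-training quantisation with the TensorFlow Lite converter; the saved-model path and output file name are placeholders:

    import tensorflow as tf

    # Convert a SavedModel to TensorFlow Lite with default post-training quantisation.
    converter = tf.lite.TFLiteConverter.from_saved_model("path/to/saved_model")
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    tflite_model = converter.convert()

    with open("model_quantized.tflite", "wb") as f:
        f.write(tflite_model)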
Another question, on the Windows on Arm part: at the end of the slide I think you mentioned it's almost twice the performance, almost as an afterthought. Is that again because of the multi-process splitting?

No, this is because of emulation. Basically, as you can see here, the graph uses ARM64 as the baseline, so in this case less is better: I put the ARM64 baseline at one, and as you can see, the AMD64 results are much higher in terms of the time to run the benchmarks of the pyperformance suite. So it works: if you go on Windows on Arm, download the Python for AMD64 and install it, it runs, you can pip install everything and it works, but it's slower because it goes through binary translation, it runs in emulation. There is also a difference in how you use Python: if you use the AMD64 Python, the x86 version, then when you pip install a built distribution it will fetch the AMD64 one, because that is what matches the AMD64 Python version. If you use the ARM64 Python and you request a built distribution, it will fetch the ARM64 one, and here is where you can potentially have issues, because not everyone is producing ARM64 wheels. So my recommendation is to start building these packages for the platform.

There are no remote questions, and no more questions in the room. Thank you very much, Diego, for your talk, and please give him a big thank you.