And please join me in welcoming Ralph de Wargny, who's going to be talking about HPC on Intel hardware. Can you hear me? Or maybe I'll take this mic. That works. Okay, now you can hear me. All right. Good afternoon. Welcome to this talk. My name is Ralph de Wargny, from Intel. Actually, from Intel Software, because everybody knows Intel as a hardware manufacturer, a semiconductor company, but it's also a very large software company. We are about 15,000 software developers at Intel who develop everything from low-level BIOS up to high-end big data solutions. And my talk today is going to be about high-performance Python on Intel architecture. Actually, I come more from high-performance computing, where the traditional language is, of course, Fortran. At Intel, we have a Fortran compiler. Who is using the Fortran compiler? Nobody. That's good. Thank you. Normally, when I go to specialized high-performance computing conferences, a lot of people still do Fortran, because the code hasn't changed since 1960. It's still running, it runs really fast, and it runs on the latest architecture. But more and more, of course, people do C and C++. And what we've seen in the past couple of years, since I've been in high-performance computing, is that more and more people are doing Python. So that's why I'm here, actually. That's why Intel is in Python. It's no wonder. For the last couple of years, we also love Python at Intel. So who here in the room is from high-performance computing, or does some kind of high-performance data analysis, scientific computing, something like that? All right. Who is using Intel processors? Very good. Who is using GPUs? Don't worry, don't worry. I like GPUs too. Okay, so today I'm going to talk first a little bit about Intel architecture: where are we at the moment?
And after that, I'll go into what we are doing for the Python community to bridge the gap between software development and our high-performance processors. I will mostly talk about high-performance processors like Intel Xeon or Xeon Phi, or even the regular Core i7 in your Mac laptop, not the embedded stuff. All right. So, wow. What's that? Who knows? What is it? Xeon Phi. Wow, you can get a T-shirt later from me. Xeon Phi. Who knows Xeon Phi? Who has heard about Xeon Phi? All right. Half of the room. Great. So, this is the latest Xeon Phi. We just released it at the end of June. It was called KNL, Knights Landing. That's a code name; we love code names at Intel. So, that's Knights Landing. It has 72 cores. You can count them. Every core has four threads. And they are all connected, with the Omni-Path fabric on the package. So, we hope it's going to be a really fast system. How does it look in reality? Here on the left you have the version with the integrated Omni-Path fabric. There is also a version without the integrated fabric. It's like a small cluster, right? You can imagine it as a cluster with a high-performance interconnect between all the cores. And the good thing, the new thing with this Xeon Phi, is that it's not a co-processor anymore. This is the third generation, and until now it was a co-processor: you had to plug it into a PCI slot and then pass all your data through this small slot, which gave you a lot of limitations. The next one will also have a co-processor version, but the version we just launched is actually a bootable version. You don't need a host system anymore; it's the host and the co-processor in one. You can boot Linux on it. It can run many more workloads than the co-processor could; it has fewer limitations. It has MCDRAM on the package: 16 gigabytes of MCDRAM, which is really fast RAM. And it can run normal IA (Intel Architecture) code, so Python runs on it like on any other IA machine.
So, Intel Architecture code is really easy to program; that's why it says programmability. It's power efficient. It has a large memory: it can use up to 384 gigabytes per server. And it's really scalable; you can build a cluster out of it. Like I said, there is a regular version without the integrated fabric, and this one here is the integrated-fabric one. And it is its own host processor, so you can use it as a regular workstation. Currently, we have a lot of customers, software developers, who have a regular PC — it looks like a PC — and instead of a Xeon or Core i7, they have the Xeon Phi processor in there. And at a later stage, it will also be available as a co-processor, like a GPU card. Just to go back a little bit more to hardware: at the high end, we have two processor families. We have Xeon; Xeon is the regular processor, which powers most of the servers at the moment. All the cloud, the internet, everything runs on Xeon. Xeon Phi is our new processor targeted at high-performance computing, but also big data analytics and machine learning. And the good thing is, it features 72 cores with 288 threads. And it has vector units that are 512 bits wide, so it can run the AVX-512 extensions. That gives you really great scalability if you're doing mathematical operations with single-instruction, multiple-data vectorization. So this is what we have now, and the future will be parallel. We will have still more cores, so at least in the next five years, we'll go to more cores, more threads, and wider vectors. That's going to be really essential if you want performance in your application, at least for numerical mathematics and high-performance computing. And the current version of Xeon is Broadwell, this one here. It still doesn't have AVX-512; it has AVX2 with 256-bit-wide SIMD.
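To make the SIMD point concrete, here is a small sketch (not from the talk; the function names are illustrative) of the same multiply-add written twice. The pure-Python loop does one scalar operation per interpreter step, while the NumPy array expression is the form that MKL-backed builds and vectorizing compilers can map onto AVX2/AVX-512 lanes.

```python
import numpy as np

# Pure-Python loop: one scalar multiply-add per interpreter step,
# so the CPU's SIMD lanes (AVX2/AVX-512) sit idle.
def saxpy_loop(a, x, y):
    return [a * xi + yi for xi, yi in zip(x, y)]

# The same multiply-add as one array expression; the underlying
# compiled kernels can process many elements per instruction.
def saxpy_numpy(a, x, y):
    return a * x + y

x = np.arange(4, dtype=np.float64)
y = np.ones(4, dtype=np.float64)
print(saxpy_numpy(2.0, x, y))  # same values as the loop version
```

The two functions compute identical results; only the second leaves the per-element work to vectorized native code.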
The next version will be Skylake, with AVX-512 on the server. So, what does this have to do with Python? A lot of Python developers are using these platforms; they want to use the newest and latest, they want to have performance. Of course, you are paying for all these transistors. If you buy a Xeon Phi, you are buying 5 billion transistors. But if you run regular Python code on it, you will not use those 5 billion transistors, even though you paid for them. We want to help you get more performance out of it and use all these transistors, especially for production, right? Not for prototyping, but for production. But we are also seeing that for normal coders, it's really difficult to use these high-performance extensions. Even for us, it's sometimes really tricky. So it's really hard to combine Python and those high-performance extensions. That's why this year, two months from now, we are releasing our own distribution of Python, which will be called the Intel Distribution for Python. At the moment, it's still in beta, and it's going to be released in the first week of September. Let me check the timing. All right. So, our aim is to give you, as a Python programmer, easy access to high performance in Python, of course. It's based on CPython. We recompiled it with our low-level libraries. For instance, the most important one is MKL. Who knows MKL already? Great. MKL is always at the forefront of performance; it's always optimized for the latest processor technology. We have been able to build it into our distribution, but we are not only using MKL. We're also using other libraries like DAAL, which is a new library, the Data Analytics Acceleration Library; its Python interface is going to be called PyDAAL. That's also in there, and we are also including TBB for parallel programming. So what is required for making Python performance closer to native code? Of course, in HPC, you always want native code.
That's why most programmers there are using C++ or C or Fortran. And here is an example of what's required. There is a very interesting book that came out two years ago from one of my colleagues about high-performance computing on Xeon Phi. It's called High Performance Parallelism Pearls. A lot of people in that book write an article about how they parallelized their code for high performance. We took a very simple example from it, which is the optimization of Black-Scholes pricing. It's really easy to parallelize that formula. Here it's measured as the number of options, in thousands of options per second, that can be calculated. If you use pure Python, in one second you can maybe calculate something like 100,000. If you move this to a native C implementation, you can get a 55 times performance improvement with static compilation. But if you really use the hardware — all these vector units, all these cores, whatever is included in the Xeon Phi processor — with vectorization, threading, and data-locality optimization, you could get up to 350 times more performance. So what are we putting into Python to make that happen for you, without you having to code everything in C? First of all, we are accelerating the numerical packages of Python with our libraries: MKL, as I said, DAAL, and a little bit of IPP, which is a smaller-scale library. We are integrating TBB for better parallelism, to get rid of oversubscription, for instance, and in some cases also MPI if you're using the small-cluster version. We also have VTune, which is a profiler; I'm going to show you quickly what that means. We are also optimizing other extensions in Python like Cython, Numba, et cetera. And we are also working on big data and machine learning platforms and frameworks like Spark, Theano, Caffe, et cetera. So what is in there?
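The book's benchmark itself isn't reproduced here, but the Black-Scholes formula mentioned above is easy to sketch. This is a hypothetical NumPy version (not the talk's code): every option in the batch is priced by the same arithmetic, which is why the formula vectorizes and threads so well.

```python
import math
import numpy as np

# Vectorized standard-normal CDF. A production version would call
# scipy.special.ndtr or MKL's vector math; np.vectorize here just
# keeps the sketch dependency-free (it loops internally).
_erf = np.vectorize(math.erf)

def norm_cdf(x):
    return 0.5 * (1.0 + _erf(x / math.sqrt(2.0)))

def black_scholes_call(S, K, T, r, sigma):
    """European call price for arrays of spots S and strikes K."""
    d1 = (np.log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * np.sqrt(T))
    d2 = d1 - sigma * np.sqrt(T)
    return S * norm_cdf(d1) - K * np.exp(-r * T) * norm_cdf(d2)

# Price a large batch of options as one array expression -- the shape
# of computation that vectorizing compilers and MKL accelerate.
S = np.full(100_000, 100.0)
K = np.linspace(80.0, 120.0, 100_000)
prices = black_scholes_call(S, K, T=1.0, r=0.01, sigma=0.2)
```

There is no data dependence between options, so the same code also splits cleanly across threads.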
It's going to be MKL, so I'm repeating myself, but we are also optimizing NumPy, SciPy, scikit-learn, PyTables, et cetera, all that stuff. I can give you the really long list of which packages we're optimizing. We are providing a specific interface for DAAL called PyDAAL. It's going to be available through this distribution, and also from Anaconda, so from Continuum Analytics, as a conda package. And of course, we like the open-source community, the Python community. It's amazing; I really like this community. It's amazing what happens here. And of course, we want to bring all the good things back to the community, and eventually we are going to also optimize all the other packages of Python. So, a quick overview of MKL. What is in MKL? For the first version, we are going to include BLAS, LAPACK, multidimensional FFTs, some vector math functions, and RNGs — random number generators, very strong random number generators, which can be used in Monte Carlo simulation. Here's an example of what can happen if you use our beta version of Python. This is an FFT implementation; you can see on the right side a comparison. If you use our Python instead of vanilla Python, whether on one thread or on 32 threads, you can get up to 10 times acceleration in one case, and up to 5 times in the other. Okay, random number generators. Is this interesting to you? Who uses random number generators? A few. We really can get very nice results on random number generators, up to more than 50 times more performance than regular Python. Okay, DAAL. DAAL is optimized for machine learning, statistics, and big data analytics. It has a lot of components, and we are currently working on really making it available to PyData as a real Python library. So let's have a look at VTune.
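The point of the FFT and RNG acceleration is that no code changes are needed: the ordinary NumPy calls are what the MKL-backed build reroutes. A small sketch (my own example, not a slide from the talk) of both kinds of workload:

```python
import numpy as np

# FFT of a pure sine: with an MKL-backed build, these same numpy.fft
# calls go to MKL's optimized multidimensional FFTs.
n = 1024
t = np.arange(n) / n
signal = np.sin(2 * np.pi * 50 * t)
spectrum = np.fft.rfft(signal)
peak_bin = int(np.argmax(np.abs(spectrum)))  # the 50-cycle component

# A Monte Carlo workload for the random number generators: estimate pi
# by sampling points in the unit square and counting those inside the
# quarter circle.
rng = np.random.default_rng(seed=0)
xy = rng.random((1_000_000, 2))
pi_estimate = 4.0 * np.mean(np.sum(xy**2, axis=1) <= 1.0)
print(peak_bin, pi_estimate)
```

Both loops live entirely inside library calls, which is exactly where swapping in faster MKL kernels pays off.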
VTune is a low-level profiler. Who is using VTune, or who knows VTune? VTune is an old product from Intel which we keep current every year. It's mostly used by C, C++, and Fortran programmers to find hotspots in the code. Where is my hotspot? Where is my application running slow? Is there any performance gain I can get? Am I using all my cores, all my threads? Am I really using the processor the right way at a low level? It's a very visual tool. It costs very little performance, and it gives you visibility down to the source code: it pinpoints the source line where your hotspot is. That worked until now for C, C++, and Fortran, but we have now made it available for Python too, because up to now, if you profiled Python, it would show you the C code of the interpreter and its libraries, and that was not the intent, right? So here is how it works. You have some Python code, something very simple. You have two routines: one is slow and one is fast. That's easy, right? And we want to see if VTune, running at the same time as this small program, can find the code that is slow, where we already know which one is slow. So we start it. It runs at the same time as your code. It can even read the performance monitoring units and see what's happening in the processor directly. And this is the result. It's very visual; normally you have it on a much bigger screen, so this is a small screen. And it shows you where in your code the hotspot is. So here, the slow function — surprise — it's here, right? The fast function is okay, it runs fast; the slow function runs slow. And if you click on this, you get directly to the Python source code. That's the goal. So VTune is now available for Python too, and recognizes Python code. And VTune is really a low-level profiling tool which doesn't add a lot of overhead, something like 1.1x to 1.6x. It runs on Windows and Linux, and supports Python 3.4, 3.5 — all the versions, basically. It has a really rich graphical user interface.
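The demo script itself isn't shown in the transcript, so this is a hypothetical reconstruction of the kind of two-routine program described: one deliberately slow pure-Python loop and one fast vectorized equivalent. Any line-level profiler (VTune with Python support, or the standard library's cProfile as a stand-in) should pin the hotspot on the first function.

```python
import numpy as np

def slow_sum_of_squares(n):
    # Pure-Python loop: this is the hotspot a profiler will flag,
    # because every iteration goes through the interpreter.
    total = 0.0
    for i in range(n):
        total += i * i
    return total

def fast_sum_of_squares(n):
    # Same computation as a single vectorized dot product.
    a = np.arange(n, dtype=np.float64)
    return float(np.dot(a, a))

if __name__ == "__main__":
    # To see the hotspot without VTune, e.g.:
    #   python -m cProfile -s cumtime this_script.py
    n = 1_000_000
    print(slow_sum_of_squares(n), fast_sum_of_squares(n))
```

Both functions return the same value; only their time profiles differ, which is exactly what the slide's visual comparison shows.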
I would not say it's easy to use, but we try to make it more and more intuitive. It supports different workflows: you can start your application and VTune at the same time, wait for VTune to finish, and analyze the results. Or you attach it to a running application and only profile a certain area of your code. All right, so that's the end of my talk. I still have 30 seconds. You can download our version of Python from Intel Software, software.intel.com. It's still in beta; like I said, at the beginning of September it's going to be released. It really supports the full stack for high-performance computing and big data and machine learning, whatever you need. All right, thank you very much. Thank you so much, Ralph. Any questions? In the latest processors, you are starting to embed FPGAs. Do you have something on that? Yeah. So the question was: what about FPGAs? Intel bought Altera, and since this year Altera is part of Intel, and our plan is of course to put some FPGA technology into our Xeon processors going forward. At the moment, I have nothing specific to say about that; it's still in the works. You mentioned machines that have Xeon Phi as the main processor. Who's building these? Who's building these, and where can I get one? You can get one. You can go to your local OEM and order them; they can take orders. At the moment, on the market, we have the software development platform, so when I say it's like a workstation: you can buy it from Colfax or from a small German OEM, I don't recall the name now. You simply go on the Xeon Phi webpage and you have the list; there are two small OEMs who build this workstation. You can order them; it's something like $6,000 for one workstation. But other than that, if you want more information, you have to go to HP or Dell; they are currently building their offerings. Is the Intel Distribution for Python only for the Xeon series of processors, or also for others?
It's for Intel architecture. So you can use it on a Core i3 in a laptop. i3, i5... it's all the same, basically. A Core i3 is basically the same architecture as a Xeon E5 or a Xeon Phi. The configuration is different, but the basic architecture is the same. That's why it's really good for programmability. More questions? We have a couple more minutes. There you go. So I'll have one. So, is the basic idea of the special Intel Python distribution that you've taken the exact same CPython source code and just compiled it with different flags, which magically optimize the way Python works, or do you then also have to rewrite things like NumPy slightly? As far as I know — I'm not the full Python expert — we recompiled CPython using our low-level libraries, like MKL and DAAL. So you should not... I'm not promising anything, right? Don't record me now. It should work out of the box. Thanks. It's compiled with ICC, right? It's compiled with ICC, the Intel compiler? That's the plan. Okay. ICC is the Intel C and C++ compiler. It's our compiler; we optimize it to the maximum of what is doable, and of course we use it for ourselves in this case. But we also work with GCC, so we also optimize GCC. It's not that we don't... we love the open-source community. The other question is: if I have a library that's compiled with GCC, will I have any issues interfacing with it from the ICC-compiled code? Theoretically, ICC and GCC are binary compatible, so you should not have any problem. You can mix code built with both compilers. Any more questions? We probably have time for one more, so it might be the best question yet. It will have to fit into 45 seconds, including the answer. 42 seconds. So, one thing: is the distribution free? Of course, it's free. It's not open-source, but it's free. MKL is not open-source; it's our proprietary library. But the distribution will be free to download, free to use, and with community support.
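Since the answer boils down to "the same NumPy API, relinked against MKL", one practical check (my suggestion, not mentioned in the talk) is NumPy's own build-information call, which reports which BLAS/LAPACK the installed build was linked against:

```python
import io
from contextlib import redirect_stdout

import numpy as np

# numpy.show_config() prints the BLAS/LAPACK libraries this NumPy
# build was compiled against. An MKL-backed build (such as the Intel
# distribution's) mentions "mkl"; a vanilla wheel typically shows
# OpenBLAS instead.
buf = io.StringIO()
with redirect_stdout(buf):
    np.show_config()
config_text = buf.getvalue().lower()

uses_mkl = "mkl" in config_text
print("NumPy linked against MKL:", uses_mkl)
```

The same user code runs either way; this only tells you which backend is doing the work underneath.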
Paid support is available through the Intel Parallel Studio package. Great. Thank you, Ralph. Thank you very much.