Host: Our keynote speaker for the scientific Python and PyData track today is someone I think most of you know already: Gaël. He's one of the core members, one of the main contributors to the scientific stack. Please welcome Gaël.

Gaël: Okay, am I on? Good. Screen is working, mic is working, slides are working. Cool. Thank you everybody for coming here, thanks to the organizers, and thanks, Alex, for the introduction. I think we'll all agree that EuroPython is pretty cool, right? The cider event yesterday was really cool, so I hope you got coffee this morning. I did.

What I'd like to do in this talk is address the very diverse community that we have here. This talk tries to be a reflection on what we have in common, which is Python. So I'll be talking about things you don't understand, which is my science, and about things that I don't understand, which is web development. I don't know how I get into these horrible situations.

Anyhow, I did at some point a PhD in quantum physics, so I think I qualify as a scientist, but these days I do computer science for neuroscience. What we try to do is link neural activity, the firing of neurons basically, to thought and cognition: what you do when you drive a car, say. The way we do this is with brain imaging, and specifically we pitch it as a machine learning problem. This is what I do. And we've developed Python software for it, of course. If you want to try it, you can actually predict things like visual stimuli from recordings of brain activity, using this open-source software and open data. You can go online, it's there, but I won't be talking about that today.

On the way, we created a machine learning library, which is known as scikit-learn. I say "we" because it was many people.
It was of course not only me or my lab. And it was a huge success. We suddenly became cool, because data science, as you might have noticed, is a fairly cool thing these days, and these days Python is the go-to language for data science. So I'd like to think a bit about how that happened. We did build scikit-learn, and others built pandas and other tools, but these were built on a solid foundation, and Python is really what gives us that foundation.

To set the picture: scientists have a reputation for being a bit different in the Python community. At least historically, you might say that they come from Jupyter. But to us, web developers are the different ones. Actually, most scientists do not know what a devop is; I've seen these kinds of conversations: "What do you do?" "I'm a devop." "What does that mean?" Okay, so we're different. For instance, web developers worry about strings; we worry about numbers in arrays, of course. Web developers care about databases; we think in terms of arrays of numbers, of course. You might think of object-oriented programming; no, arrays are good enough. Flow control? We can actually do that with arrays too, right? So there's a bit of a culture gap.

All right, so let's do something together. How about we sort the EuroPython website? I mean, there are too many abstracts: 205, I can't read them all. And they're hugely varied; they go from OpenStack to making ten million dollars with a startup. So let's sort them out using data science. The way we'll do this is we'll do a bit of web scraping to get the data from the website (I could have asked the conference organizers, but that would have been boring, right?), then we'll do a bit of text analysis, and then we'll do data science and extract topics. The nice thing about this example is that it walks us through a good part of the whole Python stack.
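The scraping step he sketches here can be illustrated with Beautiful Soup. The HTML below is a made-up stand-in for a schedule page; the talk fetched the real EuroPython site with urllib, and the actual markup and CSS classes differed.

```python
from bs4 import BeautifulSoup

# A made-up stand-in for one schedule page (the real site's markup differed).
html = """
<html><body>
  <div class="session"><a href="/talk/1">Scaling NumPy</a></div>
  <div class="session"><a href="/talk/2">Async web apps</a></div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
# Match on the DOM tree: every link inside a "session" div,
# keeping the talk title and its URL.
talks = [(a.get_text(), a["href"]) for a in soup.select("div.session a")]
print(talks)
```

From each talk URL, the same kind of selector would then pull out the abstract text.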
That's why I like it. We're going to be using things like urllib and Beautiful Soup, but also scikit-learn, matplotlib, and wordcloud for plotting.

The first thing we do is crawl the website. Our goal is to get the schedule, and from the schedule to retrieve the list of titles and URLs. Then we crawl the pages and retrieve the abstracts. I've been doing this with Beautiful Soup. If you've never used it, it's an awesome library that lets you do matching on the Document Object Model tree of an HTML page. That's really awesome; scientists would never have developed that.

Then we vectorize the text. The idea is that a text is a bunch of words, right, or characters. So for each document we count how many times each word appears, and we put this in a table. We call this the frequency for each term, so we have a term-frequency vector describing my document. You can see that the most common word is "a", and then "python" is very common. Maybe that's not a very good description, because some of these terms appear all over the documents. So what we can do is take the ratio between the frequency of the term in the document and the frequency of the term over the whole corpus. We call this TF-IDF, term frequency-inverse document frequency. You can do this with scikit-learn using what's called the TfidfVectorizer.

Okay, so now I feel a bit more in my comfort zone: I've gone from text, which I don't understand, to vectors of numbers. Feels better. If we look at all the documents together, we have a matrix, a 2D array, that gives us the terms in each document: the term-document matrix. This can be represented as a sparse matrix, because most terms are present in very few documents.
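As a minimal sketch of this vectorization step, assuming scikit-learn is available (the abstracts here are made up), note that the result comes out directly as a sparse matrix:

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up abstracts standing in for the scraped ones.
abstracts = [
    "python is fast and python is fun",
    "machine learning in python",
    "testing web applications",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(abstracts)

# One row per document, one column per term; terms that appear in
# every document (like "python") get down-weighted by the IDF part.
print(X.shape)      # (3 documents, vocabulary size)
print(issparse(X))  # stored as a scipy sparse matrix
```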
So we can use the SciPy stack for sparse matrices. The good news is that the scientific community, and not even specifically the scientific Python community, has developed lots of fast operations for sparse matrices. So we're doing text mining with things developed by people who do partial differential equations and the like. Cool.

Then we want to extract topics. What we do here is matrix factorization: we take the term-document matrix and factorize it into two matrices, one that gives the loadings of documents on what we're going to call topics, and the other that gives the loadings of topics on terms. So the first matrix tells me which documents are in a given topic, and the second matrix tells me which terms are in a given topic. This is a matrix factorization algorithm, so once again I'm back to things I know. In text mining we often do this with non-negativity constraints, because the fact that a term is negatively loaded on a topic might or might not mean something. You can do this in scikit-learn with sklearn.decomposition.NMF, non-negative matrix factorization. That's where the magic happens.

We run this and we get word clouds. That's the representation of the first topic, and what is it about? It's about the Python language. Good news. The second topic is about, well, science and machine learning. The third topic is something like testing. Then we can look at all the topics, and there's a bunch of different things: you've got async, a topic about the community, one about basically conference organization, Internet of Things, best practices, and one I'm not showing here, which is talks in Spanish or Basque.

Okay. So, Python is not only a numerical language.
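The factorization step described above can be sketched like this. The four toy "abstracts" are made up; in the talk, each row of the topics-by-terms matrix H is then rendered with the wordcloud package.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up abstracts: two rough themes (python/ML vs. testing).
abstracts = [
    "python language syntax python features",
    "machine learning models training learning",
    "python tools machine learning",
    "testing unit tests continuous testing",
]

X = TfidfVectorizer().fit_transform(abstracts)  # term-document matrix

# Factorize X ~ W @ H under non-negativity constraints:
# W = loadings of documents on topics, H = loadings of topics on terms.
nmf = NMF(n_components=2, init="nndsvd")
W = nmf.fit_transform(X)
H = nmf.components_

print(W.shape, H.shape)  # (4 documents, 2 topics) and (2 topics, n terms)
print(bool((W >= 0).all() and (H >= 0).all()))
```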
We can also output a website from all this, using a templating engine, and if you work at it a bit, I think you can get a reasonably usable website. It's on the web, you can have a look at it, and there's a link to the code that actually generates all this, so you can run it if you're interested.

So, you want to try it? Okay: pip install scikit-learn. Ah, no, it complains that NumPy is not installed. All right: pip install numpy. Bang, it wants a C compiler. Now you're starting to get angry at me, right? This is back to the fact that we're different. Historically, we've had a lot of problems with, well, people not having Fortran compilers. Why don't you guys have Fortran compilers? Why are you laughing? Fortran gives us really, really fast libraries. Between a naive C implementation of matrix operations and a Fortran-optimized one, you can get a factor of 70 of difference, and a factor of 70 is something, right?

So packaging has historically been a major roadblock for scientific Python, and the reason is that we rely on a lot of compiled code and shared libraries. We've been hitting problems like libraries not being there, or ABI compatibility issues. Now, the good news is that there has been a huge amount of progress, for two reasons. The first is wheels, and specifically, recently, manylinux wheels, the idea being that you rely only on a conservative core set of libraries. That basically solves it: the problem I showed shouldn't happen anymore. It should just work; you can try it, and tell me if it doesn't. The other reason is this thing called OpenBLAS, which is linear algebra not written in Fortran. So that's good news. By the way, Fortran is a very modern language that is super performant, because it allows automatic vectorization, which C cannot do because it has different semantics. So don't think that Fortran is something from the 70s.
Well, it is, but still. Okay, so we're different, but if we work together we can get really awesome things. For instance, I hope you can take this example and get text mining into any of your websites. It should be easy to do, really. It's magic, but you can use it.

All right, so now let me help you think a bit more like a scientist, in how we code. And you know what? It's mostly about numerics. We really love NumPy. You know NumPy, right? It's the numerical Python library: matrix operations, array operations. The reason we really love NumPy is that it's fast. Let's try, for instance, to compute the product of term frequencies and inverse document frequencies over a hundred thousand terms. We can do this with a list comprehension, and it takes six milliseconds. Now, six milliseconds may not sound like a lot, but when I run, say, a non-negative matrix factorization algorithm, I do these operations many, many times, and actually a hundred thousand terms is not big data, it's tiny data. This is a toy example. Now, if we do this with NumPy, the code is slightly different, and we get 70 microseconds. That's almost a factor of a hundred speedup.

Another thing we really like is that, once you're used to it, it's actually much more readable. Array computing requires learning, but once you've learned it, it's extremely readable: compare `tf * idf`, to compute tf times idf, with the list comprehension.

It's important to realize that arrays are, to us, nothing but pointers. What defines a NumPy array is a memory address, a data type, a shape, and strides. The shape and the strides tell you how you can move through the array, and basically you move through the array by pointer arithmetic.
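Both the speedup and the address/dtype/shape/strides view of an array can be checked with a quick sketch. Exact timings are machine-dependent; the 6 ms and 70 µs figures from the talk are illustrative.

```python
import numpy as np
from timeit import timeit

n = 100_000
rng = np.random.default_rng(0)
tf = rng.random(n)
idf = rng.random(n)
tf_list, idf_list = tf.tolist(), idf.tolist()

# Pure Python: one interpreted multiplication per element.
t_py = timeit(lambda: [t * i for t, i in zip(tf_list, idf_list)], number=10)
# NumPy: one vectorized call over the whole buffer.
t_np = timeit(lambda: tf * idf, number=10)
print(f"list comprehension: {t_py / 10 * 1e3:.2f} ms per run")
print(f"numpy tf * idf:     {t_np / 10 * 1e6:.1f} us per run")

# The array descriptor: dtype, shape, strides (8 bytes per float64 step).
print(tf.dtype, tf.shape, tf.strides)
```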
You're just moving from one point to another by computing offsets. So what an array represents is regular data, laid out in a structured way. This is really important, because it matches the memory model of just about every numerical library, whether it's in C, C++, or Fortran, and actually, I believe, most other languages. So it allows copyless interactions across the compiled-language border. For me, the value of NumPy is really that it's a memory model.

Let's look a bit at why it's fast. If you're computing tf times idf, one thing is that you're not doing any type checking during the operation: the dynamic typing all happens up front, to figure out what `tf * idf` will do, but then it's compiled code that runs the operation. But maybe most importantly, you're using direct, regular, sequential memory access. You're just grabbing your data; there's no pointer dereferencing (well, there's one, but after that you're done). You're just grabbing chunks of data from RAM or from the cache, and that's really fast. Then your CPU, or your math kernel library, can implement things like vector operations using, for instance, SIMD instructions. That's what really makes NumPy fast; the type checking is part of it, but it's not all of it.

All right, so it's much faster. Cool. Now look at this: once the array gets big enough, suddenly we get a factor-of-two cost in compute time per element. Do you have an idea what this may be due to? Excellent: it's the cache. Ten to the five elements is approximately the size of a CPU cache. You can do the computation yourself; these are probably float64 values,
so they're 8 bytes each. The problem is that memory is much slower than the CPU, so your goal, when you want fast computation, is to get things into the CPU as fast as possible, and here you're starting to fall out of the cache. That's bad news for array computing. But there's worse.

If we do a slightly more complex operation, tf times idf minus one, then the cost actually starts increasing. What's going on here? Well, if we look at what's happening, Python computes tf times idf and creates an array we don't see. I'm going to call it a temporary array. Then it subtracts one from this temporary array. So what we're really doing is moving things in and out of the cache, hugely. We get pretty bad cache invalidation here, and this is because of the Python computation model; it's just the way Python works. We can profile this, and we see there's a huge cost, in terms of computation, to subtracting that one.

Okay, we can play a trick: we can unroll this and do things slightly better by using an in-place operation for the second step. The idea is that we reuse the allocation of the temporary array; we don't allocate arrays twice. If we do this, it gets much, much faster, and the reason is that we've become much better with the cache: we invalidate less of it. If we look at our graph with the in-place version, the cost still goes up with the number of elements, but because of the in-place operation, it's faster.

So what we have here is really a compilation problem, right? We want to go from one expression to the other. We want to do things like removing or reusing temporaries, or we might want to chunk operations: if I could write a for loop over chunks of the right size, then it would be fast. For instance numexpr, which is mostly developed by Francesc Alted, can do this using string expressions.
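The temporary-removal trick described above looks like this in plain NumPy; it's a sketch of the same idea that numexpr automates.

```python
import numpy as np

tf = np.random.rand(1_000_000)
idf = np.random.rand(1_000_000)

# Naive: "tf * idf" allocates a hidden temporary array, and "- 1"
# then streams over that temporary a second time, evicting cache lines.
naive = tf * idf - 1

# Unrolled: reuse the temporary with an in-place subtraction,
# so no second array is allocated.
result = tf * idf
result -= 1

print(bool(np.allclose(naive, result)))
```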
With numexpr, for example, you evaluate the string "tf * idf - 1", and without you being clever, numexpr is clever for you: you get the speedup, the same speedup as NumPy in-place.

All right. Have you heard of Numba? Numba is basically a just-in-time compiler, well, a compiler that does these kinds of things via bytecode inspection. Another approach is a nice package called lazyarray, which basically builds an expression but doesn't evaluate it; it evaluates only when you call it. So basically it's going around the Python evaluation model. And I'd like to point out that this is actually not a problem specific to scientific computing: it's similar to things like grouping and paginating SQL queries. Now I'm talking about things I don't know, right?

So, just to summarize, here's the kind of thing you could give to your CTO or your CEO: if it's too small, you get overhead, the overhead of Python and of creating arrays. If it's too big, you fall out of cache. So your optimum lies in the middle. We probably want to be over here, because that's where big data is; that's where the magic is, where the money is. I see people taking pictures. Sorry. This part, right.

Okay. What if we need flow control? For instance, we don't want to divide by idf when idf is zero. I told you we don't use flow control, so what we do instead is write a test expression: "where idf is zero", which returns an array of booleans, and there I set tf-idf to zero.
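That boolean test expression can be written with np.where; the numbers here are made up:

```python
import numpy as np

tf = np.array([3.0, 1.0, 4.0, 1.0])
idf = np.array([2.0, 0.0, 1.0, 0.0])

# "Where idf is zero" is itself an array of booleans...
mask = idf == 0
# ...which lets us zero out those entries instead of branching.
tfidf = np.where(mask, 0.0, tf * idf)
print(tfidf)
```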
That way we don't need flow control. Cool.

Now suppose we're looking at ages in a population, and I want to compute the mean age of males versus females. I can select the age array with a gender array: where gender equals male, I compute the mean, and I subtract the mean where gender equals female. ("Where gender equals gender": that's a typo on the slide.)

Now, this is really starting to look like a database, right? We're starting to do selections. On top of NumPy, and parallel to NumPy, there's a library called pandas, which is really something in between arrays and an in-memory database. It's been hugely adopted by the PyData community, because it's fantastic for these queries and this data munging. For numerical algorithms it's maybe less fantastic, because, anyhow, we fall back to NumPy.

Okay. So what you're going to tell me is: you're not really doing Python, right? You're doing a bit of beautiful Python code that sits on top of lots of ugly Fortran and C++ routines. And that gives you scalability, but also installation problems. But then I realized that most web development is actually some beautiful Python code sitting on top of services like a database, which could be in C++, in Java, in Erlang, in God knows what, in Node.js. And that gives you deployment problems. So you don't have compilation problems, you have deployment problems. We're not that different: we're just struggling with similar things, instantiated in different ways. These days I like to think of NumPy as the scientist's equivalent of an ORM. And I don't use ORMs,
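The age/gender selection he describes works like this with boolean masks (the numbers are made up):

```python
import numpy as np

age = np.array([25, 32, 47, 51, 38, 29])
gender = np.array(["M", "F", "F", "M", "M", "F"])

# Boolean masks instead of explicit flow control:
mean_male = age[gender == "M"].mean()
mean_female = age[gender == "F"].mean()
gap = mean_male - mean_female

print(mean_male, mean_female, gap)
```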
so I don't know what I'm talking about. Numerics, as we've seen, are really efficient because we apply them to regularly spaced data, but NumPy, the way it works, creates cache misses for bigger arrays, so we need to fight to remove temporaries and maybe to chunk data. Queries are going to be really efficient if we can use indexes or trees, which is typically what databases do, but we're going to need to group queries. All of these are compilation problems, but compilation is un-Pythonic.

We could, for instance, think of a computation and query language; that's a bit what numexpr does. But I really hate domain-specific languages, and every time I try to use SQL, because I'm not a web developer, I get it wrong and I get annoyed. The other problem is that NumPy is actually extremely expressive: the range of things you can do with NumPy or related tools is extremely varied. So I don't think that's a good way to go, and anyhow, I like Python; I want to be writing Python.

One approach is to hack Python, and a really cool example is Pony ORM. Who knows Pony ORM? It's web development; you should be better at this than me. What Pony ORM does is compile Python generators into optimized SQL queries. You write something that looks like a Python generator; it does bytecode inspection, AST inspection I believe, grabs the AST, builds a SQL query on top of it, and optimizes it via compilation and grouping. That's really cool. It's no longer really pure Python, but it's really cool.

I'd also like to draw your attention to something that's happening a lot in the big data world, something known as Spark, which is a rising star. It's in Scala,
so basically on top of the JVM, on top of the Java world. And it combines two things; people don't usually realize this. It combines a distributed store, which is some form of database-like store, with a computing model, and plugs them together, and that allows it to do distributed computing in a reasonably efficient way. Now, the thing is that we, the PyData world, are actually much faster when the data fits in RAM, and the reason is that we represent data as regularly spaced arrays, so we go extremely fast, whereas the Java world has a lot of pointer dereferences.

Okay, so if we want to scale up, maybe we have to do operations on chunks. Maybe we need to chunk the data and then, in parallel or in series, it doesn't matter, compute things on arrays that fit in RAM, or fit in cache. Now, this is great for certain computing patterns, things known for instance as extract, transform and load. But if you're doing multivariate statistics, which is what machine learning is about, you're really combining information from all over the arrays. You're really learning that, say, the interaction between the term "machine" and the term "learning", those two together, makes a topic. So the compute graphs you get are horrible, and it means that things like out-of-core operations, which is basically what we're doing when we chunk data, are not efficient.
There's no data locality. One approach is algorithm development, which is what I do, so I'm happy. The idea is that you use online algorithms: basically, you don't use the same algorithm, you use an algorithm that works on a stream, and then you do the chunking of the data inside the algorithm. If you've heard of deep learning: the number one algorithm used in deep learning is stochastic gradient descent, and that's how it works. That's how people can apply deep learning, which is extremely computationally expensive, to huge datasets.

So, back to data science. I've shown you how we can go from a term-document matrix to a factorization, and there's magic there, right? There's an algorithm whose workings I did not discuss; we just imported it from scikit-learn. What the scikit-learn devs do is take horrible papers full of math expressions and, drinking a lot of coffee, turn them into this code. It's actually really hard, by the way. People were asking me yesterday: why do we still use code written forty or twenty years ago in Fortran? Because writing stable numerical code is extremely hard, and no better code has been written so far. The reason we, scikit-learn and PyData, have been able to do this is the high-level syntax of Python and everything I've presented here. The reason all this is important is that it reduces our cognitive load and lets us do the math.

All right, let's talk a bit about something other than numerics. Let's talk about the future, and about what's going to make PyData great again, maybe. I think we've been seeing recently that data flow and computation flow are crucial. You can have the simple data-parallel problems, you can have the messy compute graphs, you can have online algorithms. And so data flow engines are popping up everywhere.
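The online-learning idea above, chunking the data inside the algorithm, can be sketched with scikit-learn's SGDClassifier and its partial_fit method. The streaming data here is a toy synthetic problem (the label is just the sign of the first feature):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
clf = SGDClassifier(random_state=0)
classes = np.array([0, 1])

# Stream the data chunk by chunk: the full dataset never has to fit in RAM.
for _ in range(50):
    X_chunk = rng.normal(size=(100, 5))
    y_chunk = (X_chunk[:, 0] > 0).astype(int)  # toy target
    clf.partial_fit(X_chunk, y_chunk, classes=classes)

X_test = rng.normal(size=(500, 5))
y_test = (X_test[:, 0] > 0).astype(int)
print(clf.score(X_test, y_test))
```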
One data flow engine you may have heard of is Dask. Dask is a pure-Python static graph compiler: it represents a set of function calls on data as a graph, compiles it, and then uses a dynamic scheduler on it to do parallel and distributed computing. It's really nice, except it's basically static, which means I can't add things to my graph. Another tool people use in deep learning is Theano; people probably don't realize it, but it does expression analysis in pure Python, builds a graph of operations, and optimizes it. TensorFlow is a C++ library, I believe, developed by Google to do deep learning, and it also builds a graph of operations. So graphs of operations are there, under the hood, in many different libraries.

I believe Python should really shine here, because it's reflective, so we can do some form of metaprogramming, and because of the recent async developments. I think the future is parallel and distributed computing. As Nathaniel Smith, who is a NumPy developer, said: Python is the best numerical language out there because it's not a numerical language. I believe this is extremely true.

Now we have a bit of a problem here: the API is really challenging, because we're doing algorithm design, and we can't really do what you've been doing in something like Django, where there's basically an inversion of control and you're no longer writing imperative code as you would; you're buying into a framework. I don't believe we can write really complex algorithms that way; there's just too much cognitive overload. But it's just API design; we'll solve it.

In terms of ingredients for future data flows, I think distributed computation and runtime analysis are really important, and for this, reflexivity is central. It's really useful for debugging.
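The graphs-of-operations idea that Dask, Theano, and TensorFlow share can be illustrated with a toy evaluator over a dict-of-tasks graph. The format below is modeled on Dask's task-graph convention, but the evaluator is a naive recursive sketch, not Dask's scheduler:

```python
from operator import add, mul

# A static task graph: key -> literal value, or key -> (function, *dep-keys).
graph = {
    "x": 1,
    "y": 2,
    "sum": (add, "x", "y"),
    "result": (mul, "sum", "sum"),
}

def get(graph, key):
    """Naively evaluate one key of the graph. A real scheduler would
    topologically sort, cache shared results, and run independent
    tasks in parallel."""
    task = graph[key]
    if isinstance(task, tuple):
        func, *deps = task
        return func(*(get(graph, d) for d in deps))
    return task

print(get(graph, "result"))  # (1 + 2) * (1 + 2)
```

Because the graph is plain data, a compiler can inspect it before running anything, which is exactly what makes optimizations like temporary removal or distributed scheduling possible.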
By the way, if I'm not in Python, the number one thing I miss is the ability to debug, and to debug in a high-level way, which means I can debug things like numerical instability in my algorithms. That's really hard to do: you've got something that blows up somewhere in terms of numerical precision, and Python is fantastic for debugging that. I can work interactively, which is how most data scientists work. Reflexivity also enables, already enables and will enable more, code analysis, which is going to be really important for being efficient. And it gives us persistence, which is extremely important for parallel computing: when you're doing parallel and distributed computing, you need to move data, well, you need to move objects around between different computers, and you need to move code, and for this you need reflexivity.

So, we've been relying on pickle. Distributed computing has relied hugely on pickle; the idea is that it's used to distribute the code and the data between the different workers. But we can also use it to serialize intermediate results. That's one way of doing computation on data where all the intermediate results might not fit in RAM, and it can be done very easily in Python. Another thing we do is use pickle to get a deep hash, in the sense of a cryptographic hash, of any data structure. That's really nice because it lets you see whether things have changed or not, to avoid recomputation.

Now, the problem is that pickle, as implemented in the standard library, is actually quite limited. For instance, there's no support for lambdas. These are not fundamental limitations; they're trade-offs, basically, and so there are variants of pickle, like dill or cloudpickle. And I must say I'd really like one of those two, or at least ideas from one of those two, to go into the standard library, because it hugely limits parallel computing not to be
able to pickle everything. Now, I realize we're never going to be able to pickle absolutely everything. I also realize that I can write code that always pickles; that's what I do. But when I give this to a not-very-advanced user, at some point he will write code that doesn't pickle. For me, by the way, this is more important than the GIL. That may be surprising, but when you get to know distributed computing well, these things are the problem: data exchange, basically.

Now, we have this small library we call joblib, which gives us ingredients for data flow computing. One thing it does is a very simple parallel-computing syntax, basically syntactic sugar for parallel for loops; under the hood it uses threading or multiprocessing, or just about any backend, and you can plug your own backend in there. It also does fast persistence: it's basically a subclass of pickle that does clever things for NumPy arrays. And it gives primitives for out-of-core computation. The reason I'm pointing this out is that its syntax and paradigms are very non-invasive. With a library like joblib we can write algorithms, and it's actually used inside scikit-learn, even though you may not know it. It's fast, it's been designed to be fast on NumPy arrays, and it's getting more and more of an extendable backend system. I'm looking forward to a world where we can use things like Celery to distribute computation from scikit-learn in more of a web development environment. I don't know if that's a good or a bad idea, but I'd like to try it.

I think the Python VM is great. It's awesome. One of the reasons it's great is that it's simple, which is exactly what a lot of people have been criticizing. For instance, the Java world tells us they have software transactional memory, and it's really cool, and it would be nice for Python. But I personally really need to use foreign memory.
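Stepping back to joblib for a moment: the parallel syntactic sugar mentioned above looks like this (it mirrors joblib's canonical example; joblib must be installed):

```python
from math import sqrt
from joblib import Parallel, delayed

# A parallel for loop as syntactic sugar: delayed() captures the call,
# Parallel() dispatches the calls to a pluggable backend
# (threads, processes, ...).
results = Parallel(n_jobs=2)(delayed(sqrt)(i ** 2) for i in range(10))
print(results)
```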
I need it. And interestingly, Java has recently gained jemalloc to allocate, basically, foreign memory. We'd like better garbage collection, we really would, but just about every C extension relies on reference counting, and the reason is that it's actually very easy to manipulate the reference counts when you're not sitting inside the VM. So basically, the Python VM is something I can manipulate without being inside it, which means it's really great for connecting to compiled languages. Talking to people at the conference, many people actually use this; many people use libraries developed in another language, through Python.

I'd like to draw a bit of attention to Cython. Who knows Cython? Good. Who uses Cython? Good. It really gives us the best of C and Python. You can add types for speed, and they've done things so right that when you type a NumPy array, it basically becomes a float*, a float array in C, so it's superfast. But you can also use it to bind external libraries, and it's surprisingly easy. The good thing is that suddenly you're working with C libraries, or with C-like code, without any malloc, free, or pointer arithmetic, which for me is the number one problem of those languages. I see Cython as an adaptation layer between the Python VM and C, and it's really a fantastic tool. By the way, I think everybody should be writing C extensions using Cython, because it's an abstraction over the CPython API. For instance, you can write code that's very readable and that works with both Python 3 and Python 2, even though there have been a lot of changes in the CPython API. I believe it's also good for the CPython core developers: they'd like to change things in the CPython API, and if everybody writes Cython, they'll be able to, because Cython will do the impedance matching.

Okay. So we scientists can work with web developers, and we really, actually, get to love each other,
I believe. Actually, I'm really serious here. I really enjoy the people who are not doing science in the Python community. First, they teach me things. Second, they make fantastic tools that I can use. And so I'd like our tools to be useful to you too.

I'd like to point out that scikit-learn is actually really easy machine learning. It's a very simple syntax: basically, you import an object, a magic object that will do classification, recognition of things; you instantiate it; and then you give it data. It's basically matrices, right? We only do matrices, so you have to figure out how to convert your own data to matrices. Then you call fit, and then you call predict. One of the successes of scikit-learn is this encapsulation: people have really loved the fact that a classifier is a semi-black-box, so they can use it without fully understanding it. That's another thing Python has given us: a really cool object model that lets you do object-oriented programming without, say, crazy class diagrams.

Another thing we've used hugely is what people have called documentation-driven development (there was a talk about this), to try to make the API as simple as possible. What I'm trying to get at here is that we try to give you a high-level, simple API to reduce your cognitive load, just like Python and NumPy reduce our cognitive load when we're implementing these algorithms. We're all doing very different things here, and we can all benefit from each other, but we can do this only if we're really careful to reduce each other's cognitive load on what the other does not understand.
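The import / instantiate / fit / predict pattern, in full, with a toy classifier (any scikit-learn estimator follows the same shape; the data here is made up):

```python
from sklearn.neighbors import KNeighborsClassifier

# Data as matrices: rows are samples, columns are features.
X_train = [[0, 0], [0, 1], [1, 0], [1, 1]]
y_train = [0, 0, 1, 1]

clf = KNeighborsClassifier(n_neighbors=1)  # import and instantiate the "magic object"
clf.fit(X_train, y_train)                  # learn from the data
print(clf.predict([[0.9, 0.9], [0.1, 0.2]]))
```

Swapping KNeighborsClassifier for any other estimator leaves the rest of the code unchanged; that encapsulation is the point.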
I think that's extremely important. So it's important to be didactic outside of one's own community, and actually Python is really good at this: the Django documentation is known as being really excellent, and the Python community worries about syntax being beautiful.

To do this we need to do things like avoiding jargon. Machine learning is really bad, it's full of jargon; in scikit-learn we try not to have too much. We need to prioritize information. For instance, students that are applied-math students learning about numerics, I hate to tell you, they don't care about Unicode. Even the French ones that have umlauts in their first names.

One recommendation I have for people that do API design is: build your documentation upon very simple examples, and examples that run. So one thing that we do is this thing called Sphinx-Gallery, which basically uses Sphinx (Sphinx is awesome) to build our documentation while running all the examples. So it means that the examples must run, they must run fast, and they must be small enough to run. And I think that's helped a lot, both the documentation and the API design of scikit-learn.

All right, to wrap up: I think it's because of the interaction between people like scientists and people who are not scientists, whether they're web developers or DevOps or anything... Have I been censored? Oh, okay, cool. Yeah. Um, what was I saying?
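The "examples must run" discipline described above can be enforced in plain Python with the standard `doctest` module; Sphinx-Gallery goes further by executing whole example scripts, but the principle is the same. A minimal sketch, with a hypothetical `scale` function:

```python
import doctest

def scale(values, factor):
    """Multiply every value in ``values`` by ``factor``.

    The example below is executed as a test, so the documentation
    cannot silently drift out of sync with the code:

    >>> scale([1, 2, 3], 10)
    [10, 20, 30]
    """
    return [v * factor for v in values]

# Collect and run the docstring examples; any failure means broken docs.
runner = doctest.DocTestRunner()
for test in doctest.DocTestFinder().find(scale, "scale"):
    runner.run(test)
assert runner.failures == 0
```

Wiring this into continuous integration makes the documentation a test suite: examples that stop running break the build instead of quietly rotting.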
Well, anyhow, the Python language and its VM are the perfect tool to manipulate low-level concepts, whether they're arrays or, actually, things like trees in C, with high-level wording. And I personally think (it's a personal opinion) that this has been key to the recent success of Python. Python has been growing hugely, and when you look at how people are using it, very often at some point they're plugging into something low-level.

Dynamism and reflexivity are crucial because they enable metaprogramming and debugging. But we also find that we need compilation for speed, so there's this tension between dynamism and compilation. And I have the feeling it's everywhere; it's also in web development with, say, compiling SQL queries. And I'm extremely excited about the PEPs that Victor Stinner is pushing forward, like guards on internal structures to allow checking at runtime for modifications. That will allow any kind of hacks that we do on the code to be invalidated if the environment changes. Or the PEP for function specialization.

Finally, I think that Python has gained and will gain hugely from the database world and from the concurrency work developed in the web and DevOps worlds. But I think it can also give back things like knowledge engineering and AI, which are really, you know, growing hugely. And just in case you haven't noticed, data science is disrupting just about every job that you're doing, so it's cool that there is data science in Python. All right, that's all I have. Thank you.

Thank you very much, Gaël. Great keynote, great insights, and a little different world. So, questions? Raise your hands. Give the mic to Mike.

Thanks, very interesting keynote. One thing, it's not a question, just a statement: the scientific world was a very early adopter of Python 3, I think. Already several years ago, most of the scientific stack was in Python 3, which is a great thing.
I can actually use pretty much any good scientific package in Python 3. That's something I wanted to add.

Yeah, and I agree. And the biggest cost of Python 3 at first was the change of the CPython API, and so actually people in niche applications still have code that doesn't run on Python 3 because of the CPython API. But all the main libraries, by a vast margin, run on 3, and everything I do runs on 3 and 2. Question?

Okay, you probably get that a lot, but I will ask anyway: have you heard about PyPy?

I was actually trolling a bit in my talk. Yeah, I know a lot about PyPy. To give a bit of background, my brother studied language theory, so we've had crazy discussions all the time. So yeah, I know a lot about these things. And part of what I wanted to get at in my talk was the fact that it's not only about type checking; NumPy is not only about type checking, it's about the memory model. And I think, by the way, PyPy has progressed hugely, in the sense that it is no longer trying to say "I'm going to control the memory for everything", which historically was a big roadblock for us. I mean, I could not believe that PyPy would be useful for scientific computing, because for a long time I heard that the end goals of PyPy were things like software transactional memory, which is really cool, by the way, but would cost us a lot in our world. And the other thing is, we're not going to get rid of the compiled code, because there is so much history in making those algorithms really good, and it's extremely hard. But I do believe that what the PyPy world is doing, which is a lot of analysis on the code, is extremely, extremely useful. That I absolutely believe. Thank you very much.

Any more questions? The one really in the back? Okay, sorry, Daria's faster. I'm sorry.

Yeah, go ahead. You keep referring to "our world", "your Python world". Is the division that clear?

Not for me, not for me at all. I've got personal friends in all the communities.
I use all kinds of different tools, but I'm afraid there is a division. And I'd like to think that it's fueled by different trade-offs. And I'd like to fight it, by the way; I don't want it, I don't think it's useful. But you hear, you know, things like conda, which is a separate packaging ecosystem for Python and other things. The reason it was created, the way I think of it, was basically that the scientific crowd was unable to explain the struggles it was having with the packaging tools in Python, and just went off and did its own stuff. Now, the good thing is that some people actually came back and worked on it, and now I believe the two should be able to work fine together. But that's one example of the division, and I think it exists. And I think we need to fight it, because our value (that's something I really believe) is the fact that we're diverse and we're able to work together.

Great question. Um, how do you see the scenario in five, ten years, Python compared to other languages like R, or the old Wolfram, or new things?

So, R, or the Wolfram language. You're talking about the scientific world? Yeah. All right, I'm going to be extremely opinionated: I think R will die. Well, to give you background, when we started scikit-learn, almost seven years ago, everybody would walk up to us and say: you guys are crazy, everybody does R for machine learning, everybody does MATLAB. Okay, you know, seven years down the line, nobody's mentioning this. By the way, R is awesome. Not as a language, it's a horrible language, but in terms of libraries. As I told you, numerical algorithms are really hard; well, R has a crazy amount of them, and for me as a statistician, R is the reference. But the value of data analysis is not only in numerics, it's in combining things, and I think we have an edge here. So, MATLAB.
Yeah, I think we're eating MATLAB slowly, and they're fighting back. I'm getting emails on a monthly basis: get a training, come to MathWorks, see how we're cooler than Python. But the fact that we're going up while they're pouring money into fighting us is telling me something. Maybe it's going to take a bit of time. But in the scientific world, I mean, the strong contender would be Julia. Julia is a typed language that is able to do fantastic type inference and compile to extremely fast code; it actually uses LLVM. I really don't like it. I mean, it's a fantastic language, it's awesome, fantastic language design. I really don't like it because it's a numerical language, and they don't think of it that way, but the whole community is a numerical community, and I'm worried that it's going to paint itself into a corner.

Gaël, thanks for the fantastic talk and the fantastic library. scikit-learn is only one of the libraries in the scikit family; there are also scikit-image and scikit-bio. What is your relationship with this scikit family?

So, that's very historical. Back in something like 2008, there used to be scikits with a namespace package, if you guys remember namespace packages (they're one of my nightmares), in SciPy. And that's how we all started. Then we took it out of SciPy because SciPy was getting too big, and then we got rid of the namespace package: it used to be called scikits.learn, and we turned it into scikit-learn. And it means "scientific kit". It's very historical, but what's our relationship? I guess we're friends. We're good friends.

Okay, um, last question. It's sort of a question, and to point out one specific thing about conda that is beyond Python and beyond pip, which is where pip comes to struggle: people struggle with non-Python-specific stuff. So if you want a database, or you want to install a stack with Node.js and Python, you can actually do that.
So it actually sits on top of Python; it's more like apt-get than pip. So in this case, I'm not really sure Python should have something in the standard library that actually does that. What's your opinion on that?

Oh, so I completely agree. So the comment is: conda is more than Python, basically. And I know this, by the way, but historically it's not been marketed like this. I mean, I've heard way too much "don't use pip, use conda", which, let me say, I even hear in my lab, by the way, and I fight it. And the other thing is, I haven't seen much work go back from conda. I'm not even talking about contributing back to pip, but I'm talking about explaining what was learned, and I think that's extremely important. I would really like, I'm going to make a bold statement, but I would like conda-forge, for Python, to either die or to push automatically to PyPI. Pushing automatically to PyPI would be awesome. But we need one place where we can tell everybody "go and get your stuff", and we need this place to be good, and we need to work together. And in a sense conda has achieved this, because one thing it has created, and maybe there's an insight here: at least it has shown that you can do things better. But you need to go all the way back and bring the improvements back into the wider Python ecosystem, because it's all going to benefit us.

Okay, so, um, we have one more thing to announce, so please don't run away after you've given a fantastic, enthusiastic applause for Gaël's keynote. Thank you very much, Gaël.